
MMX Arithmetic Instructions
The MMX technology supports both saturating
and wraparound arithmetic. In wraparound mode, results that overflow
or underflow are truncated, and only the lower (least significant) bits
of the result are returned. In saturation mode, results that overflow
or underflow are clipped (saturated) to the data-range limit
of the data type: a result that exceeds the range of the data type
saturates to the maximum value of the range, while a result
that falls below the range saturates to the minimum value
of the range. This method of handling overflow and underflow is useful
in many applications, such as color calculations.
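The difference between the two modes can be seen on a single unsigned byte. The following is a minimal sketch in portable C, not MMX code; the function names add_wrap_u8 and add_sat_u8 are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Wraparound: keep only the low 8 bits of the sum. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturation: clip the sum to the unsigned byte range 0..255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}
```

For a color channel holding 200, adding 100 gives 44 in wraparound mode (a bright value turns dark) but 255 in saturation mode (the channel stays at full intensity).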
PADDB mm,mm/m64
PADDW mm,mm/m64
PADDD mm,mm/m64 
The PADD (Packed Add) instructions
add the data elements of the source operand to the data elements of the
destination register and write the result to the destination register.
If a result exceeds the data-range limit for the data type, it wraps
around. The PADD instructions support packed byte (PADDB), packed word (PADDW), and packed
doubleword (PADDD) data types.

PADDB instruction with 64-bit operands:
DEST[7..0] ← DEST[7..0] + SRC[7..0];
* repeat add operation for 2nd through 7th byte *;
DEST[63..56] ← DEST[63..56] + SRC[63..56];
PADDW instruction with 64-bit operands:
DEST[15..0] ← DEST[15..0] + SRC[15..0];
* repeat add operation for 2nd and 3rd word *;
DEST[63..48] ← DEST[63..48] + SRC[63..48];
PADDD instruction with 64-bit operands:
DEST[31..0] ← DEST[31..0] + SRC[31..0];
DEST[63..32] ← DEST[63..32] + SRC[63..32];
PADDB __m64 _mm_add_pi8(__m64 m1, __m64 m2)
PADDW __m64 _mm_add_pi16(__m64 m1, __m64 m2)
PADDD __m64 _mm_add_pi32(__m64 m1, __m64 m2) 
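The key property of a packed add is that each element is added independently, so a carry out of one element never reaches its neighbor. The following portable C sketch models PADDW on a 64-bit integer; paddw_model is a hypothetical name, and the code illustrates the semantics only. It is not how the instruction is implemented.

```c
#include <stdint.h>

/* Model of PADDW: four independent 16-bit wraparound additions. */
static uint64_t paddw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = lane * 16;
        uint16_t d = (uint16_t)(dest >> shift);
        uint16_t s = (uint16_t)(src >> shift);
        uint16_t sum = (uint16_t)(d + s);   /* carry out of the lane is dropped */
        result |= (uint64_t)sum << shift;
    }
    return result;
}
```

With the word 0xFFFF in the third lane, adding 1 wraps that lane to 0x0000 and leaves the fourth lane unaffected, whereas an ordinary 64-bit addition would propagate the carry upward.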
PADDSB mm, mm/m64
PADDSW mm, mm/m64 
The PADDS (Packed Add with Saturation)
instructions add the packed signed data elements of the source operand
to the packed signed data elements of the destination operand and saturate
the result. The PADDS instructions support packed byte (PADDSB) and packed word (PADDSW)
data types.

PADDSB instruction with 64-bit operands:
DEST[7..0] ← SaturateToSignedByte(DEST[7..0] + SRC[7..0]);
* repeat add operation for 2nd through 7th bytes *;
DEST[63..56] ← SaturateToSignedByte(DEST[63..56] + SRC[63..56]);
PADDSW instruction with 64-bit operands:
DEST[15..0] ← SaturateToSignedWord(DEST[15..0] + SRC[15..0]);
* repeat add operation for 2nd and 3rd words *;
DEST[63..48] ← SaturateToSignedWord(DEST[63..48] + SRC[63..48]);
PADDSB __m64 _mm_adds_pi8(__m64 m1, __m64 m2)
PADDSW __m64 _mm_adds_pi16(__m64 m1, __m64 m2) 
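The SaturateToSignedWord step used in the operation listing can be sketched in portable C as follows. The names saturate_to_signed_word and paddsw_model are hypothetical, and the conversion from uint16_t to int16_t assumes the usual two's complement behavior of current compilers.

```c
#include <stdint.h>

/* Clip a wide intermediate result to the signed word range. */
static int16_t saturate_to_signed_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of PADDSW: four independent signed saturating additions. */
static uint64_t paddsw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = lane * 16;
        int16_t d = (int16_t)(uint16_t)(dest >> shift);
        int16_t s = (int16_t)(uint16_t)(src >> shift);
        int16_t r = saturate_to_signed_word((int32_t)d + s);
        result |= (uint64_t)(uint16_t)r << shift;
    }
    return result;
}
```

In this model 32767 + 1 stays at 32767 and -32768 + (-1) stays at -32768, while sums that fit in a word are returned unchanged.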
PADDUSB mm, mm/m64
PADDUSW mm, mm/m64 
The PADDUS (Packed Add Unsigned with
Saturation) instructions add the packed unsigned data elements of the
source operand to the packed unsigned data elements of the destination
operand and saturate the results. The PADDUS instructions support packed byte (PADDUSB)
and packed word (PADDUSW) data types.
PADDUSB instruction with 64-bit operands:
DEST[7..0] ← SaturateToUnsignedByte(DEST[7..0] + SRC[7..0]);
* repeat add operation for 2nd through 7th bytes *;
DEST[63..56] ← SaturateToUnsignedByte(DEST[63..56] + SRC[63..56]);
PADDUSW instruction with 64-bit operands:
DEST[15..0] ← SaturateToUnsignedWord(DEST[15..0] + SRC[15..0]);
* repeat add operation for 2nd and 3rd words *;
DEST[63..48] ← SaturateToUnsignedWord(DEST[63..48] + SRC[63..48]);
PADDUSB __m64 _mm_adds_pu8(__m64 m1, __m64 m2)
PADDUSW __m64 _mm_adds_pu16(__m64 m1, __m64 m2)
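A typical use of unsigned saturating addition is brightening eight 8-bit pixel values at once. The sketch below models PADDUSB in portable C; paddusb_model is a hypothetical name and the pixel values are made up for the illustration.

```c
#include <stdint.h>

/* Model of PADDUSB: eight independent unsigned saturating byte additions. */
static uint64_t paddusb_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        int shift = lane * 8;
        unsigned sum = (unsigned)((dest >> shift) & 0xFF)
                     + (unsigned)((src >> shift) & 0xFF);
        if (sum > 255u)
            sum = 255u;                     /* clip to the unsigned byte maximum */
        result |= (uint64_t)sum << shift;
    }
    return result;
}
```

Adding 0x20 to every byte of 0xF01080FF007FC801 gives 0xFF30A0FF209FE821 in this model: the bytes 0xF0 and 0xFF clip to 0xFF, and the others simply increase by 0x20.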
PSUBB mm, mm/m64
PSUBW mm, mm/m64
PSUBD mm, mm/m64 
The PSUB (Packed Subtract) instructions
subtract the data elements of the source operand from the data elements
of the destination operand. If a result is larger or smaller than the
data-range limit for the data type, it wraps around. The PSUB instructions support packed
byte (PSUBB), packed word (PSUBW), and packed doubleword (PSUBD) data types.

PSUBB instruction with 64-bit operands:
DEST[7..0] ← DEST[7..0] − SRC[7..0];
* repeat subtract operation for 2nd through 7th byte *;
DEST[63..56] ← DEST[63..56] − SRC[63..56];
PSUBW instruction with 64-bit operands:
DEST[15..0] ← DEST[15..0] − SRC[15..0];
* repeat subtract operation for 2nd and 3rd word *;
DEST[63..48] ← DEST[63..48] − SRC[63..48];
PSUBD instruction with 64-bit operands:
DEST[31..0] ← DEST[31..0] − SRC[31..0];
DEST[63..32] ← DEST[63..32] − SRC[63..32];

PSUBB __m64 _mm_sub_pi8(__m64 m1, __m64 m2)
PSUBW __m64 _mm_sub_pi16(__m64 m1, __m64 m2)
PSUBD __m64 _mm_sub_pi32(__m64 m1, __m64 m2) 
PSUBSB mm, mm/m64
PSUBSW mm, mm/m64 
The PSUBS (Packed Subtract with Saturation)
instructions subtract the signed data elements of the source operand from
the signed data elements of the destination operand. The results are
saturated to the limits of a signed data element and written to the destination
operand. The PSUBS instructions support packed byte (PSUBSB) and packed word (PSUBSW)
data types.

PSUBSB instruction with 64-bit operands:
DEST[7..0] ← SaturateToSignedByte(DEST[7..0] − SRC[7..0]);
* repeat subtract operation for 2nd through 7th bytes *;
DEST[63..56] ← SaturateToSignedByte(DEST[63..56] − SRC[63..56]);
PSUBSW instruction with 64-bit operands:
DEST[15..0] ← SaturateToSignedWord(DEST[15..0] − SRC[15..0]);
* repeat subtract operation for 2nd and 3rd words *;
DEST[63..48] ← SaturateToSignedWord(DEST[63..48] − SRC[63..48]);
PSUBSB __m64 _mm_subs_pi8(__m64 m1, __m64 m2)
PSUBSW __m64 _mm_subs_pi16(__m64 m1, __m64 m2) 
PSUBUSB mm, mm/m64
PSUBUSW mm, mm/m64 
The PSUBUS (Packed Subtract Unsigned
with Saturation) instructions subtract the unsigned data elements of
the source operand from the unsigned data elements of the destination operand.
The results are saturated to the limits of an unsigned data element
and written to the destination operand. The PSUBUS instructions support packed byte (PSUBUSB)
and packed word (PSUBUSW) data types.
PSUBUSB instruction with 64-bit operands:
DEST[7..0] ← SaturateToUnsignedByte(DEST[7..0] − SRC[7..0]);
* repeat subtract operation for 2nd through 7th bytes *;
DEST[63..56] ← SaturateToUnsignedByte(DEST[63..56] − SRC[63..56]);
PSUBUSW instruction with 64-bit operands:
DEST[15..0] ← SaturateToUnsignedWord(DEST[15..0] − SRC[15..0]);
* repeat subtract operation for 2nd and 3rd words *;
DEST[63..48] ← SaturateToUnsignedWord(DEST[63..48] − SRC[63..48]);
PSUBUSB __m64 _mm_subs_pu8(__m64 m1, __m64 m2)
PSUBUSW __m64 _mm_subs_pu16(__m64 m1, __m64 m2)
As an example of saturated arithmetic,
consider the absolute difference of two arrays of unsigned bytes. MMX offers
no per-element IF statement, yet the following algorithm must be
implemented:
if (a > b)
then c = a − b
else c = b − a
This algorithm can be coded using unsigned saturating
subtractions. Subtracting b from a and a from b yields, for each element, the
desired absolute difference in one result and zero in the other, because
the subtraction that would go negative saturates to zero. Since it is not known
in advance which is which, the final result is obtained by ORing them together:
c = (a − b) OR (b − a)
Assuming that the MMX registers MM0
and MM1 hold the source vectors, the following code computes the absolute
difference and stores it in MM0:
MOVQ MM2, MM0 ; make a copy of MM0
PSUBUSB MM0, MM1 ; compute the difference one way
PSUBUSB MM1, MM2 ; compute the difference the other way
POR MM0, MM1 ; OR them together
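The same sequence can be checked with a portable C model. This is a sketch of the technique under the assumption that the inputs are unsigned bytes packed into 64-bit integers; psubusb_model and abs_diff_u8x8 are hypothetical names introduced for the illustration.

```c
#include <stdint.h>

/* Model of PSUBUSB: eight unsigned byte subtractions that saturate at 0. */
static uint64_t psubusb_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        int shift = lane * 8;
        unsigned d = (unsigned)((dest >> shift) & 0xFF);
        unsigned s = (unsigned)((src >> shift) & 0xFF);
        unsigned diff = d > s ? d - s : 0u;  /* negative results clip to zero */
        result |= (uint64_t)diff << shift;
    }
    return result;
}

/* Absolute difference: one of the two differences is zero in every lane,
   so ORing them selects the non-zero one without any branch. */
static uint64_t abs_diff_u8x8(uint64_t a, uint64_t b)
{
    uint64_t a_minus_b = psubusb_model(a, b);  /* PSUBUSB MM0, MM1 */
    uint64_t b_minus_a = psubusb_model(b, a);  /* PSUBUSB MM1, MM2 */
    return a_minus_b | b_minus_a;              /* POR MM0, MM1 */
}
```

The result is symmetric in its arguments, as an absolute difference should be: abs_diff_u8x8(a, b) equals abs_diff_u8x8(b, a).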
PMULHW mm, mm/m64
PMULLW mm, mm/m64 
The PMULHW (Packed Multiply High)
and PMULLW (Packed Multiply Low) instructions multiply the four
signed words of the source and destination operands and write the high-order
or low-order 16 bits of each 32-bit intermediate result to the destination
operand.

PMULHW instruction with 64-bit operands:
TEMP0[31..0] ← DEST[15..0] * SRC[15..0]; * Signed multiplication *
TEMP1[31..0] ← DEST[31..16] * SRC[31..16];
TEMP2[31..0] ← DEST[47..32] * SRC[47..32];
TEMP3[31..0] ← DEST[63..48] * SRC[63..48];
DEST[15..0] ← TEMP0[31..16];
DEST[31..16] ← TEMP1[31..16];
DEST[47..32] ← TEMP2[31..16];
DEST[63..48] ← TEMP3[31..16];
PMULLW instruction with 64-bit operands:
TEMP0[31..0] ← DEST[15..0] * SRC[15..0]; * Signed multiplication *
TEMP1[31..0] ← DEST[31..16] * SRC[31..16];
TEMP2[31..0] ← DEST[47..32] * SRC[47..32];
TEMP3[31..0] ← DEST[63..48] * SRC[63..48];
DEST[15..0] ← TEMP0[15..0];
DEST[31..16] ← TEMP1[15..0];
DEST[47..32] ← TEMP2[15..0];
DEST[63..48] ← TEMP3[15..0];
PMULHW __m64 _mm_mulhi_pi16(__m64 m1, __m64 m2)
PMULLW __m64 _mm_mullo_pi16(__m64 m1, __m64 m2)
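Since each instruction returns only half of a product, the full 32-bit product of two words is obtained by running both and combining the halves. The following portable C sketch models one element; the names mul_high_word, mul_low_word and combine_halves are hypothetical, and the casts assume two's complement conversions.

```c
#include <stdint.h>

/* High-order 16 bits of the signed 32-bit product (one PMULHW element). */
static int16_t mul_high_word(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(uint16_t)((uint32_t)p >> 16);
}

/* Low-order 16 bits of the signed 32-bit product (one PMULLW element). */
static int16_t mul_low_word(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(uint16_t)((uint32_t)p & 0xFFFFu);
}

/* Rebuild the full 32-bit product from its two halves. */
static int32_t combine_halves(int16_t high, int16_t low)
{
    return (int32_t)(((uint32_t)(uint16_t)high << 16) | (uint16_t)low);
}
```

For 300 * 400 = 120000 (0x0001D4C0) the high half is 1 and the low half is 0xD4C0, which reads as -11072 when interpreted as a signed word; combining the two halves restores 120000.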
PMADDWD mm, mm/m64 
The PMADDWD (Packed Multiply and Add)
instruction multiplies the four signed words of the destination operand
by the four signed words of the source operand, producing four signed
doubleword products. The two high-order products
are summed and stored in the upper doubleword of the destination operand,
and the two low-order products are summed and stored in the lower doubleword
of the destination operand.

PMADDWD instruction with 64-bit operands:
DEST[31..0] ← (DEST[15..0] * SRC[15..0]) + (DEST[31..16] * SRC[31..16]);
DEST[63..32] ← (DEST[47..32] * SRC[47..32]) + (DEST[63..48] * SRC[63..48]);
PMADDWD __m64 _mm_madd_pi16(__m64 m1, __m64 m2)
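The operation amounts to two independent two-term dot products. The following portable C sketch models it on arrays of four words, with element 0 standing for bits 15..0; madd_pair and pmaddwd_model are hypothetical names. The sum is formed in 64 bits and then truncated to 32 bits, which is how this sketch handles the single case (all four inputs of a pair equal to -32768) where the sum does not fit.

```c
#include <stdint.h>

/* Sum of two signed word products, truncated to 32 bits. */
static int32_t madd_pair(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    int64_t sum = (int64_t)a0 * b0 + (int64_t)a1 * b1;
    return (int32_t)(uint32_t)sum;
}

/* Model of PMADDWD: out[0] is the lower doubleword, out[1] the upper. */
static void pmaddwd_model(const int16_t dest[4], const int16_t src[4],
                          int32_t out[2])
{
    out[0] = madd_pair(dest[0], src[0], dest[1], src[1]);
    out[1] = madd_pair(dest[2], src[2], dest[3], src[3]);
}
```

With dest = {1, 2, 3, 4} and src = {5, 6, 7, 8}, the lower doubleword is 1*5 + 2*6 = 17 and the upper doubleword is 3*7 + 4*8 = 53. Adding the two doublewords gives the four-term dot product, 70.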
Complex multiplication is an operation
that requires four multiplications and two additions, which leads naturally
to the use of the PMADDWD instruction. To use this instruction,
the data must be formatted as four 16-bit values, each holding
a real or imaginary component; the constant vector can be laid out as [Re
−Im Im Re].
The following code fragment multiplies
the complex number stored in the MMX register MM0 by the complex constant
held in register MM1 in the layout described above. The real component
of the complex product is given by
Re(Data)*Re(Const) − Im(Data)*Im(Const)
and the imaginary component of the complex
product by
Re(Data)*Im(Const) + Im(Data)*Re(Const).
PUNPCKLDQ MM0, MM0 ; convert the data to the [Re Im Re Im] format
PMADDWD MM0, MM1 ; perform the complex multiply
Note that the output is a packed doubleword,
so a pack instruction may be used to convert the result to 16 bits, matching
the format of the input.
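The data flow of this fragment can be followed in a portable C model. This is a sketch under stated assumptions: element 0 of each array stands for the lowest word of the register, the type complex32 and the function complex_mul_model are hypothetical, and the imaginary part of the constant must not be -32768, since its negation would not fit in a word.

```c
#include <stdint.h>

/* Hypothetical result type: one 32-bit value per component. */
typedef struct {
    int32_t re;
    int32_t im;
} complex32;

static complex32 complex_mul_model(int16_t d_re, int16_t d_im,
                                   int16_t c_re, int16_t c_im)
{
    /* After PUNPCKLDQ the data register holds Re, Im, Re, Im. */
    const int16_t data[4] = { d_re, d_im, d_re, d_im };
    /* Constant laid out as Re, -Im, Im, Re (assumes c_im != -32768). */
    const int16_t konst[4] = { c_re, (int16_t)-c_im, c_im, c_re };
    complex32 r;

    /* PMADDWD: lower doubleword is the real part, upper the imaginary. */
    r.re = (int32_t)data[0] * konst[0] + (int32_t)data[1] * konst[1];
    r.im = (int32_t)data[2] * konst[2] + (int32_t)data[3] * konst[3];
    return r;
}
```

For example, (3 + 4i) * (1 + 2i) gives a real part of 3*1 − 4*2 = −5 and an imaginary part of 3*2 + 4*1 = 10, each held in a 32-bit doubleword.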

