SSE Primer

The Intel Streaming SIMD Extensions (SSE) are a set of extensions to the Intel x86 architecture designed to greatly enhance the performance of advanced media and communication applications. This section closely examines the SSE integer instructions that extend the MMX instruction set. They can be very useful, because emulating them in standard MMX code typically requires a long sequence of instructions. The programmer should be aware, however, that using them restricts the program to processors that support SSE, such as the Pentium III. The best solution is to write two versions of the same routine, one in standard MMX code and one SSE-enhanced, and to add a CPU detection function that decides which one to execute: the program then runs on all MMX-equipped processors and gains the performance benefits of SSE where it is available.
Note: SSE integer instructions are also supported on the AMD Athlon processor.
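
The two-version approach can be sketched in C as a one-time selection through a function pointer. This is only an illustration: the routine names and the cpu_has_sse flag are hypothetical, the flag is assumed to come from whatever CPU detection function the program already uses, and the "SSE" body is a placeholder that a real program would replace with SSE-enhanced code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routine type: rounded average of two byte arrays, in place. */
typedef void (*avg_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Baseline version; stands in for the standard MMX code path. */
void avg_bytes_baseline(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((dst[i] + src[i] + 1) >> 1);
}

/* Placeholder for the SSE-enhanced path; a real one would use PAVGB. */
void avg_bytes_sse(uint8_t *dst, const uint8_t *src, size_t n)
{
    avg_bytes_baseline(dst, src, n); /* same result, placeholder body */
}

/* Pick the routine once; cpu_has_sse is assumed to be supplied by the
 * program's own CPU detection function (not shown here). */
avg_fn select_avg(int cpu_has_sse)
{
    return cpu_has_sse ? avg_bytes_sse : avg_bytes_baseline;
}
```

The program would call select_avg once at start-up and use the returned pointer everywhere, so the detection cost is not paid inside the inner loops.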

PAVGB mm, mm/m64
PAVGW mm, mm/m64

PAVG (Packed Average) adds the unsigned data elements of the source operand to the unsigned data elements of the destination register, along with a carry-in. The results of the add are then each independently right-shifted by one bit position. The high order bits of each element are filled with the carry bits of the corresponding sum. PAVGB operates on packed unsigned bytes, and PAVGW operates on packed unsigned words.

PAVGB instruction with 64-bit operands:
DEST[7-0] ← (SRC[7-0] + DEST[7-0] + 1) >> 1; * temp sum before shifting is 9 bits *
* repeat operation for bytes 1 through 6 *
DEST[63-56] ← (SRC[63-56] + DEST[63-56] + 1) >> 1;
PAVGW instruction with 64-bit operands:
DEST[15-0] ← (SRC[15-0] + DEST[15-0] + 1) >> 1; * temp sum before shifting is 17 bits *
* repeat operation for words 1 and 2 *
DEST[63-48] ← (SRC[63-48] + DEST[63-48] + 1) >> 1;

PAVGB __m64 _mm_avg_pu8(__m64 a, __m64 b)
PAVGW __m64 _mm_avg_pu16(__m64 a, __m64 b)
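
As a rough illustration of the rounding behaviour, the following portable C functions model PAVGB and PAVGW on a 64-bit value. They are scalar sketches for checking results, not the instructions themselves; the names pavgb_model and pavgw_model are made up for this example.

```c
#include <stdint.h>

/* Scalar model of PAVGB: rounded average of eight unsigned bytes. */
uint64_t pavgb_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned a = (unsigned)(dest >> (i * 8)) & 0xFFu;
        unsigned b = (unsigned)(src >> (i * 8)) & 0xFFu;
        unsigned avg = (a + b + 1u) >> 1; /* 9-bit sum, then shift */
        result |= (uint64_t)avg << (i * 8);
    }
    return result;
}

/* Scalar model of PAVGW: rounded average of four unsigned words. */
uint64_t pavgw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t a = (uint32_t)(dest >> (i * 16)) & 0xFFFFu;
        uint32_t b = (uint32_t)(src >> (i * 16)) & 0xFFFFu;
        uint32_t avg = (a + b + 1u) >> 1; /* 17-bit sum, then shift */
        result |= (uint64_t)avg << (i * 16);
    }
    return result;
}
```

Because of the added 1, an average that falls halfway between two integers rounds up: averaging the bytes 2 and 5 gives 4, not 3.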

Table look-up operations, such as the ones found in histogram-related algorithms, have always been critical to the performance of digital imaging. In such cases, each of the computed R, G, and B values is used as an index into its respective color look-up table. Such operations have been difficult to implement in MMX technology due to the fact that the computed RGBA values would be residing in an MMX register, which cannot be used directly for addressing. Extracting each of them into the appropriate registers for addressing, fetching the contents from the table, and inserting them back into MMX registers was cumbersome and harmful to performance. SSE includes a pair of instructions (PEXTRW and PINSRW) that help enhance the performance of such algorithms.

PEXTRW r32, mm, imm8

PEXTRW (Packed Extract Word) moves the word in an MMX register selected by the two least significant bits of the immediate operand to the lower half of a 32-bit integer register; the upper word in the integer register is cleared.

PEXTRW instruction with 64-bit source operand:
SEL ← COUNT AND 3H;
TEMP ← (SRC >> (SEL * 16)) AND FFFFH;
r32[15-0] ← TEMP[15-0];
r32[31-16] ← 0000H;

PEXTRW int _mm_extract_pi16(__m64 a, int n)
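
The word selection can be modelled in a few lines of portable C. This is a scalar sketch of the pseudocode above, with the hypothetical name pextrw_model; it is not a replacement for the instruction.

```c
#include <stdint.h>

/* Scalar model of PEXTRW: return word (imm8 & 3) of src, zero-extended. */
uint32_t pextrw_model(uint64_t src, unsigned imm8)
{
    unsigned sel = imm8 & 3u; /* only the two low bits are used */
    return (uint32_t)((src >> (sel * 16)) & 0xFFFFu);
}
```

Since only the two least significant bits of the immediate count, an immediate of 7 selects the same word as an immediate of 3.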

PINSRW mm, r32/m16, imm8

PINSRW (Packed Insert Word) moves the lower word in a 32-bit integer register or 16-bit word from memory into one of the four word locations in an MMX register, selected by the two least significant bits of the immediate operand. The insertion is done in such a way that the three other words from the destination register are left untouched.

PINSRW instruction with 64-bit source operand:
SEL ← COUNT AND 3H;
CASE (determine word position) OF
SEL = 0: MASK ← 000000000000FFFFH;
SEL = 1: MASK ← 00000000FFFF0000H;
SEL = 2: MASK ← 0000FFFF00000000H;
SEL = 3: MASK ← FFFF000000000000H;
ESAC;
DEST ← (DEST AND NOT MASK) OR ((SRC << (SEL * 16)) AND MASK);

PINSRW __m64 _mm_insert_pi16 (__m64 a, int d, int n)
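
The extract, look up, insert pattern described earlier can be sketched with a scalar model of PINSRW. Everything here is illustrative: pinsrw_model and lookup_words are made-up names, and the 256-entry table indexed by the low byte of each word is an assumption chosen only to keep the example small.

```c
#include <stdint.h>

/* Scalar model of PINSRW: replace word (imm8 & 3) of dest with the
 * low 16 bits of src, leaving the other three words untouched. */
uint64_t pinsrw_model(uint64_t dest, uint32_t src, unsigned imm8)
{
    unsigned sel = imm8 & 3u;
    uint64_t mask = (uint64_t)0xFFFFu << (sel * 16);
    return (dest & ~mask) | (((uint64_t)src << (sel * 16)) & mask);
}

/* Replace each word of pixels by table[low byte of that word].
 * The table layout is a hypothetical example, not from the text. */
uint64_t lookup_words(uint64_t pixels, const uint16_t table[256])
{
    for (unsigned i = 0; i < 4; i++) {
        /* extract step (what PEXTRW does): move the word to an index */
        uint32_t index = (uint32_t)((pixels >> (i * 16)) & 0xFFu);
        /* insert step (what PINSRW does): write the table entry back */
        pixels = pinsrw_model(pixels, table[index], i);
    }
    return pixels;
}
```

With the real instructions, each of the two steps is a single PEXTRW or PINSRW, and the table access itself is an ordinary indexed load.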

PMAXSW mm, mm/m64

PMAXSW (Packed Signed Word Maximum) compares each of the four signed words in the destination operand with the corresponding word in the source operand and writes the larger of each pair to the destination.

PMAXSW instruction for 64-bit operands:
IF (DEST[15-0] > SRC[15-0]) THEN DEST[15-0] ← DEST[15-0];
ELSE DEST[15-0] ← SRC[15-0];
* repeat operation for 2nd and 3rd words in source and destination operands *
IF (DEST[63-48] > SRC[63-48]) THEN DEST[63-48] ← DEST[63-48];
ELSE DEST[63-48] ← SRC[63-48];

PMAXSW __m64 _mm_max_pi16(__m64 a, __m64 b)
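
The signed comparison is the important detail: 0xFFFF is -1, so it loses against 0. A portable scalar sketch (hypothetical name pmaxsw_model, assuming the usual two's-complement conversion to int16_t) shows this:

```c
#include <stdint.h>

/* Scalar model of PMAXSW: per-word signed maximum of four words.
 * Assumes the conversion uint16_t -> int16_t wraps (two's complement). */
uint64_t pmaxsw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int16_t a = (int16_t)(uint16_t)(dest >> (i * 16));
        int16_t b = (int16_t)(uint16_t)(src >> (i * 16));
        int16_t m = (a > b) ? a : b; /* signed compare */
        result |= (uint64_t)(uint16_t)m << (i * 16);
    }
    return result;
}
```

PMINSW follows the same pattern with the comparison reversed.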

PMAXUB mm, mm/m64

PMAXUB (Packed Unsigned Byte Maximum) compares each of the eight unsigned bytes in the destination operand with the corresponding byte in the source operand and writes the larger of each pair to the destination.

PMAXUB instruction for 64-bit operands:
IF (DEST[7-0] > SRC[7-0]) THEN
DEST[7-0] ← DEST[7-0];
ELSE
DEST[7-0] ← SRC[7-0];
* repeat operation for 2nd through 7th bytes in source and destination operands *
IF (DEST[63-56] > SRC[63-56]) THEN
DEST[63-56] ← DEST[63-56];
ELSE
DEST[63-56] ← SRC[63-56];

PMAXUB __m64 _mm_max_pu8(__m64 a, __m64 b)

PMINSW mm, mm/m64

PMINSW (Packed Signed Word Minimum) compares each of the four signed words in the destination operand with the corresponding word in the source operand and writes the smaller of each pair to the destination.

PMINSW instruction for 64-bit operands:
IF (DEST[15-0] < SRC[15-0]) THEN
DEST[15-0] ← DEST[15-0];
ELSE
DEST[15-0] ← SRC[15-0];
* repeat operation for 2nd and 3rd words in source and destination operands *
IF (DEST[63-48] < SRC[63-48]) THEN
DEST[63-48] ← DEST[63-48];
ELSE
DEST[63-48] ← SRC[63-48];

PMINSW __m64 _mm_min_pi16 (__m64 a, __m64 b)

PMINUB mm, mm/m64

PMINUB (Packed Unsigned Byte Minimum) compares each of the eight unsigned bytes in the destination operand with the corresponding byte in the source operand and writes the smaller of each pair to the destination.

PMINUB instruction for 64-bit operands:
IF (DEST[7-0] < SRC[7-0]) THEN
DEST[7-0] ← DEST[7-0];
ELSE
DEST[7-0] ← SRC[7-0];
* repeat operation for 2nd through 7th bytes in source and destination operands *
IF (DEST[63-56] < SRC[63-56]) THEN
DEST[63-56] ← DEST[63-56];
ELSE
DEST[63-56] ← SRC[63-56];

PMINUB __m64 _mm_min_pu8(__m64 a, __m64 b)

Within 3D geometry and rasterization, color clamping is an example that benefits from the PMIN instructions. Also, a fundamental component of many speech recognition engines is the evaluation of a Hidden Markov Model (HMM), which can take up to 80% of execution time: the PMIN instructions improve the performance of this kernel by 33%, giving a 19% application-level gain.
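
Color clamping can be sketched as a minimum against the upper bound followed by a maximum against the lower bound, applied to eight bytes at once. The functions below are portable scalar models with made-up names (pminub_model, pmaxub_model, clamp_bytes); the bounds are passed as packed bytes, which is an assumption about how a caller would prepare them.

```c
#include <stdint.h>

/* Scalar model of PMINUB: per-byte unsigned minimum. */
uint64_t pminub_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t a = (dest >> (i * 8)) & 0xFFu;
        uint64_t b = (src >> (i * 8)) & 0xFFu;
        result |= (a < b ? a : b) << (i * 8);
    }
    return result;
}

/* Scalar model of PMAXUB: per-byte unsigned maximum. */
uint64_t pmaxub_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t a = (dest >> (i * 8)) & 0xFFu;
        uint64_t b = (src >> (i * 8)) & 0xFFu;
        result |= (a > b ? a : b) << (i * 8);
    }
    return result;
}

/* Clamp every byte of pixels to [lo, hi], with lo and hi given per byte. */
uint64_t clamp_bytes(uint64_t pixels, uint64_t lo, uint64_t hi)
{
    return pmaxub_model(pminub_model(pixels, hi), lo);
}
```

With the real instructions the whole clamp is two operations per eight bytes, with no branches.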

PMOVMSKB r32, mm

PMOVMSKB (Move Byte Mask To Integer) builds an 8-bit mask from the most significant bit of each byte of its source operand, an MMX register, and writes it to an x86 integer register; the upper 24 bits of the integer register are cleared.

PMOVMSKB instruction with 64-bit source operand:
r32[0] ← SRC[7];
r32[1] ← SRC[15];
* repeat operation for bytes 2 through 6;
r32[7] ← SRC[63];
r32[31-8] ← 000000H;

PMOVMSKB int _mm_movemask_pi8(__m64 a)
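
A portable scalar sketch of the mask construction (the name pmovmskb_model is made up for this example) may help when checking expected results by hand:

```c
#include <stdint.h>

/* Scalar model of PMOVMSKB: bit i of the result is the most
 * significant bit of byte i of src; bits 8..31 are zero. */
uint32_t pmovmskb_model(uint64_t src)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (uint32_t)((src >> (i * 8 + 7)) & 1u) << i;
    return mask;
}
```

A typical use is to follow a packed compare, which sets each byte to all ones or all zeros, with this instruction, so that ordinary integer code can test or branch on the per-byte results.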

PMULHUW mm, mm/m64

PMULHUW (Packed Multiply High Unsigned) multiplies the four unsigned words in the destination operand with the four unsigned words in the source operand, then writes the high-order 16 bits of the 32-bit intermediate results to the destination operand.

PMULHUW instruction with 64-bit operands:
TEMP0[31-0] ← DEST[15-0] * SRC[15-0]; * Unsigned multiplication *
TEMP1[31-0] ← DEST[31-16] * SRC[31-16];
TEMP2[31-0] ← DEST[47-32] * SRC[47-32];
TEMP3[31-0] ← DEST[63-48] * SRC[63-48];
DEST[15-0] ← TEMP0[31-16];
DEST[31-16] ← TEMP1[31-16];
DEST[47-32] ← TEMP2[31-16];
DEST[63-48] ← TEMP3[31-16];

PMULHUW __m64 _mm_mulhi_pu16(__m64 a, __m64 b)

The lack of an unsigned multiply in the original MMX instruction set was a source of inefficiency in 3D rasterization: this function inherently works with unsigned pixel data, while the existing PMULHW instruction operates only on signed data. Providing an unsigned PMULHUW eliminates the fix-up overhead required when using the signed version, creating an application-level performance gain of 8%-10%.
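
The following portable C function is a scalar sketch of PMULHUW (the name pmulhuw_model is made up for this example). It keeps the high 16 bits of each unsigned 16 x 16 bit product.

```c
#include <stdint.h>

/* Scalar model of PMULHUW: high 16 bits of four unsigned products. */
uint64_t pmulhuw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t a = (uint32_t)(dest >> (i * 16)) & 0xFFFFu;
        uint32_t b = (uint32_t)(src >> (i * 16)) & 0xFFFFu;
        uint32_t product = a * b; /* at most 0xFFFE0001, fits in 32 bits */
        result |= (uint64_t)(product >> 16) << (i * 16);
    }
    return result;
}
```

The difference from the signed PMULHW shows up as soon as a word has its top bit set: here 0xFFFF times 0xFFFF yields 0xFFFE, whereas a signed multiply would treat both inputs as -1 and produce a high word of 0.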
Although the MPEG video encode pipeline is quite complex and involves many stages, the bulk of the execution time (40% to 70%) is spent in the motion-estimation function: this stage compares a sub-block of the current frame with those in the same neighborhood of the previous and next frames in order to find the best match. Consequently, only a vector representing the position of this match, and the residual difference sub-block, needs to be included in the compressed output stream as opposed to the entire original sub-block.

PSADBW mm, mm/m64

PSADBW (Packed Sum of Absolute Differences) computes the absolute value of the difference of unsigned bytes for mm1 and mm2/m64. These differences are then summed to produce a word result in the lower 16-bit field; the upper three words are cleared.

PSADBW instructions when using 64-bit operands:
TEMP0 ← ABS(DEST[7-0] − SRC[7-0]);
* repeat operation for bytes 1 through 6 *
TEMP7 ← ABS(DEST[63-56] − SRC[63-56]);
DEST[15-0] ← SUM(TEMP0...TEMP7);
DEST[63-16] ← 000000000000H;

PSADBW __m64 _mm_sad_pu8(__m64 a, __m64 b)

PSADBW retains byte-level parallelism of execution, working on 8 bytes at a time, and the accumulation does not exceed a 16-bit word. This single instruction replaces seven MMX instructions in the motion-estimation inner loop, increasing performance by a factor of two.
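
A portable scalar sketch of the sum of absolute differences follows (psadbw_model is a made-up name). It also shows why the result always fits in a word: eight differences of at most 255 each sum to at most 2040.

```c
#include <stdint.h>

/* Scalar model of PSADBW: sum of absolute differences of eight
 * unsigned bytes, returned in the low word; upper bits are zero. */
uint64_t psadbw_model(uint64_t dest, uint64_t src)
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int a = (int)((dest >> (i * 8)) & 0xFFu);
        int b = (int)((src >> (i * 8)) & 0xFFu);
        sum += (uint32_t)(a > b ? a - b : b - a);
    }
    return (uint64_t)sum; /* at most 8 * 255 = 2040 */
}
```

In a motion-estimation loop, the results for the successive rows of a sub-block would be accumulated, and the candidate position with the smallest total taken as the best match.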
When implementing a SIMD imaging algorithm, there is often a need to rearrange data within an MMX register.

PSHUFW mm, mm/m64, imm8

PSHUFW (Packed Shuffle Word) uses the imm8 operand to select which of the four words in the source operand will be placed in each of the words in the destination operand. Bits 1 and 0 of imm8 encode the source for destination word 0, bits 3 and 2 encode for word 1, bits 5 and 4 encode for word 2, and bits 7 and 6 encode for word 3. Each two-bit field is the number of the source word to use, i.e. a binary encoding of 10 indicates that source word 2 will be used.

DEST[15-0] ← (SRC >> (ORDER[1-0] * 16) )[15-0]
DEST[31-16] ← (SRC >> (ORDER[3-2] * 16) )[15-0]
DEST[47-32] ← (SRC >> (ORDER[5-4] * 16) )[15-0]
DEST[63-48] ← (SRC >> (ORDER[7-6] * 16) )[15-0]

PSHUFW __m64 _mm_shuffle_pi16(__m64 a, int n)

As an example, an efficient SIMD implementation of alpha saturation would compare all of the R, G, and B components with the corresponding alpha value in parallel. To be able to do so, the alpha value itself needs to be replicated in a different MMX register: while this requires three instructions in MMX technology, with SSE it would need just one.
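
The alpha replication can be sketched with a portable scalar model of PSHUFW. The name pshufw_model is made up, and the example assumes that alpha sits in word 3 of the register, which the text does not specify; with that layout an immediate of 0xFF (binary 11 11 11 11) copies word 3 into all four positions.

```c
#include <stdint.h>

/* Scalar model of PSHUFW: destination word i takes the source word
 * selected by bits (2i+1, 2i) of imm8. */
uint64_t pshufw_model(uint64_t src, unsigned imm8)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (i * 2)) & 3u;
        uint64_t word = (src >> (sel * 16)) & 0xFFFFu;
        result |= word << (i * 16);
    }
    return result;
}

/* Replicate word 3 (assumed here to hold alpha) into all four words. */
uint64_t broadcast_alpha(uint64_t pixel)
{
    return pshufw_model(pixel, 0xFFu);
}
```

Other immediates follow the same encoding: 0xE4 (binary 11 10 01 00) leaves the words in place, and 0x1B (binary 00 01 10 11) reverses their order.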