
The Intel Streaming SIMD
Extensions (SSE) are a set of extensions to the Intel x86 architecture
designed to enhance the performance of advanced media and
communication applications. This section closely examines the SSE integer
instructions that extend the MMX instruction set.
They can be very useful, because achieving the same effect in standard
MMX code typically requires a long sequence of instructions. The programmer
should be aware, however, that using these instructions restricts the program
to processors that support them, such as the Pentium III. A
good solution is to write two versions of the same routine, the first
using standard MMX code, the second SSE-enhanced, and to add a CPU detection
function that decides which one to execute: this way the program will run
on all MMX-equipped processors, with the performance benefits of SSE
where it is available.
Note: SSE integer instructions
are also supported on the AMD Athlon processor.
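The two-version approach can be sketched as a function pointer chosen once at start-up. This is only an illustration: the names `avg_fn`, `avg_bytes_mmx`, `avg_bytes_sse` and `select_avg` are hypothetical, both bodies are written in portable C as stand-ins for the real MMX and SSE code, and the `has_sse` flag is assumed to come from a CPU detection routine that is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Signature shared by both versions of the routine (hypothetical). */
typedef void (*avg_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Stand-in for the standard MMX version (portable C in this sketch). */
static void avg_bytes_mmx(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((dst[i] + src[i] + 1) >> 1);
}

/* Stand-in for the SSE-enhanced version; real code would use PAVGB. */
static void avg_bytes_sse(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((dst[i] + src[i] + 1) >> 1);
}

/* Pick the routine once; has_sse is assumed to come from CPU detection. */
avg_fn select_avg(int has_sse)
{
    return has_sse ? avg_bytes_sse : avg_bytes_mmx;
}
```

Calling `select_avg` once and storing the result keeps the detection cost out of the inner loops; both versions must of course produce identical results.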
PAVGB mm, mm/m64
PAVGW mm, mm/m64 |
PAVG (Packed Average) adds the
unsigned data elements of the source operand to the unsigned data elements
of the destination register, along with a carry-in. The results of the
add are then each independently right-shifted by one bit position. The
high order bits of each element are filled with the carry bits of the corresponding
sum. PAVGB operates on packed unsigned bytes, and PAVGW operates on packed
unsigned words.
 |
PAVGB instruction with 64-bit operands:
DEST[7-0] ← (SRC[7-0] + DEST[7-0] + 1) >> 1; * temp sum before shifting is 9 bits *
* repeat operation for 2nd through 7th bytes *
DEST[63-56] ← (SRC[63-56] + DEST[63-56] + 1) >> 1;
PAVGW instruction with 64-bit operands:
DEST[15-0] ← (SRC[15-0] + DEST[15-0] + 1) >> 1; * temp sum before shifting is 17 bits *
* repeat operation for 2nd and 3rd words *
DEST[63-48] ← (SRC[63-48] + DEST[63-48] + 1) >> 1; |
PAVGB __m64 _mm_avg_pu8 (__m64 a, __m64 b)
PAVGW __m64 _mm_avg_pu16 (__m64 a, __m64 b) |
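The rounding behaviour is easy to check with a scalar model. The sketch below is not how the instruction is used in practice; it simply mirrors the operation in portable C, with a `uint64_t` standing in for an MMX register and the hypothetical names `pavgb_model` and `pavgw_model`.

```c
#include <stdint.h>

/* Scalar model of PAVGB: rounded average of eight unsigned bytes. */
uint64_t pavgb_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned a = (unsigned)(dest >> (8 * i)) & 0xFFu;
        unsigned b = (unsigned)(src >> (8 * i)) & 0xFFu;
        unsigned sum = a + b + 1u;               /* 9-bit temporary sum */
        result |= (uint64_t)(sum >> 1) << (8 * i);
    }
    return result;
}

/* Scalar model of PAVGW: rounded average of four unsigned words. */
uint64_t pavgw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        unsigned long a = (unsigned long)(dest >> (16 * i)) & 0xFFFFul;
        unsigned long b = (unsigned long)(src >> (16 * i)) & 0xFFFFul;
        unsigned long sum = a + b + 1ul;         /* 17-bit temporary sum */
        result |= (uint64_t)(sum >> 1) << (16 * i);
    }
    return result;
}
```

Because of the added 1, an average that falls between two integers is rounded up: averaging 1 and 2 gives 2, not 1.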
Table look-up operations, such as the ones
found in histogram-related algorithms, have always been critical to the
performance of digital imaging. In such cases, each of the computed R,
G, and B values is used as an index into its respective color look-up table.
Such operations have been difficult to implement in MMX technology due
to the fact that the computed RGBA values would be residing in an MMX register,
which cannot be used directly for addressing. Extracting each of them into
the appropriate registers for addressing, fetching the contents from the
table, and inserting them back into MMX registers was cumbersome and harmful
to performance. SSE includes a pair of instructions (PEXTRW and PINSRW)
that help enhance the performance of such algorithms.
PEXTRW r32, mm, imm8 |
PEXTRW (Packed Extract Word) moves
the word in an MMX register selected by the two least significant bits
of the immediate operand to the lower half of a 32-bit integer register;
the upper word in the integer register is cleared.
 |
PEXTRW instruction with 64-bit source operand:
SEL ← COUNT AND 3H;
TEMP ← (SRC >> (SEL * 16)) AND FFFFH;
r32[15-0] ← TEMP[15-0];
r32[31-16] ← 0000H; |
PEXTRW int _mm_extract_pi16 (__m64 a, int n) |
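As a quick check of the selection logic, the operation can be modeled in portable C. The function name `pextrw_model` is hypothetical, and a `uint64_t` stands in for the MMX register.

```c
#include <stdint.h>

/* Scalar model of PEXTRW: return the selected word, zero-extended. */
uint32_t pextrw_model(uint64_t src, unsigned imm8)
{
    unsigned sel = imm8 & 3u;                  /* only two bits are used */
    return (uint32_t)((src >> (sel * 16)) & 0xFFFFu);
}
```

Only the two least significant bits of the immediate matter, so an immediate of 7 selects the same word as 3.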
PINSRW mm, r32/m16, imm8 |
PINSRW (Packed Insert Word) moves
the lower word in a 32-bit integer register or 16-bit word from memory
into one of the four word locations in an MMX register, selected by the
two least significant bits of the immediate operand. The insertion is done
in such a way that the three other words from the destination register
are left untouched.
 |
PINSRW instruction with 64-bit source operand:
SEL ← COUNT AND 3H;
CASE (determine word position) OF
SEL = 0: MASK ← 000000000000FFFFH;
SEL = 1: MASK ← 00000000FFFF0000H;
SEL = 2: MASK ← 0000FFFF00000000H;
SEL = 3: MASK ← FFFF000000000000H;
DEST ← (DEST AND NOT MASK) OR ((SRC << (SEL * 16)) AND MASK); |
PINSRW __m64 _mm_insert_pi16 (__m64 a, int d, int n) |
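The extract, fetch and insert pattern for table look-ups can be sketched with a scalar model. Everything here is illustrative: `pinsrw_model` and `lookup_words` are hypothetical names, a `uint64_t` stands in for the MMX register, and each word is assumed to hold an 8-bit index into a 256-entry table.

```c
#include <stdint.h>

/* Scalar model of PINSRW: replace one word, leave the other three alone. */
uint64_t pinsrw_model(uint64_t dest, uint32_t src, unsigned imm8)
{
    unsigned sel = imm8 & 3u;
    uint64_t mask = (uint64_t)0xFFFFu << (sel * 16);
    return (dest & ~mask) | (((uint64_t)src << (sel * 16)) & mask);
}

/* Replace each of the four words with table[low byte of that word]. */
uint64_t lookup_words(uint64_t pixel, const uint16_t table[256])
{
    for (unsigned i = 0; i < 4; i++) {
        /* extract step (PEXTRW): pull the index into an integer */
        uint32_t index = (uint32_t)((pixel >> (i * 16)) & 0xFFu);
        /* insert step (PINSRW): write the fetched entry back in place */
        pixel = pinsrw_model(pixel, table[index], i);
    }
    return pixel;
}
```

With the real instructions, each iteration of the loop corresponds to one PEXTRW, one memory fetch and one PINSRW, and PINSRW can take the table entry directly from memory.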
PMAXSW mm, mm/m64 |
PMAXSW (Packed Signed Word Maximum)
compares the four signed words in the source and destination operands pairwise
and writes the larger of each pair to the destination operand. |
PMAXSW instruction for 64-bit operands:
IF (DEST[15-0] > SRC[15-0]) THEN DEST[15-0] ← DEST[15-0];
ELSE DEST[15-0] ← SRC[15-0];
* repeat operation for 2nd and 3rd words in source and destination operands *
IF (DEST[63-48] > SRC[63-48]) THEN DEST[63-48] ← DEST[63-48];
ELSE DEST[63-48] ← SRC[63-48]; |
PMAXSW __m64 _mm_max_pi16(__m64 a, __m64 b) |
PMAXUB mm, mm/m64 |
PMAXUB (Packed Unsigned Byte Maximum)
compares the eight unsigned bytes in the source and destination operands pairwise
and writes the larger of each pair to the destination operand.
 |
PMAXUB instruction for 64-bit operands:
IF (DEST[7-0] > SRC[7-0]) THEN
DEST[7-0] ← DEST[7-0];
ELSE
DEST[7-0] ← SRC[7-0];
* repeat operation for 2nd through 7th bytes in source and destination operands *
IF (DEST[63-56] > SRC[63-56]) THEN
DEST[63-56] ← DEST[63-56];
ELSE
DEST[63-56] ← SRC[63-56]; |
PMAXUB __m64 _mm_max_pu8(__m64 a, __m64 b) |
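The difference between the signed and unsigned comparisons is easiest to see side by side. The following scalar models are a sketch only; the names `pmaxsw_model` and `pmaxub_model` are hypothetical, a `uint64_t` stands in for the MMX register, and the cast to `int16_t` assumes the usual two's complement conversion.

```c
#include <stdint.h>

/* Scalar model of PMAXSW: per-word maximum, words treated as signed. */
uint64_t pmaxsw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int16_t a = (int16_t)(uint16_t)(dest >> (16 * i));
        int16_t b = (int16_t)(uint16_t)(src >> (16 * i));
        int16_t m = (a > b) ? a : b;
        result |= (uint64_t)(uint16_t)m << (16 * i);
    }
    return result;
}

/* Scalar model of PMAXUB: per-byte maximum, bytes treated as unsigned. */
uint64_t pmaxub_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(dest >> (8 * i));
        uint8_t b = (uint8_t)(src >> (8 * i));
        uint8_t m = (a > b) ? a : b;
        result |= (uint64_t)m << (8 * i);
    }
    return result;
}
```

In the signed model the word FFFFH is -1 and therefore loses against 0000H, whereas in the unsigned byte model FFH is the largest possible value.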
PMINSW mm, mm/m64 |
PMINSW (Packed Signed Word Minimum)
compares the four signed words in the source and destination operands pairwise
and writes the smaller of each pair to the destination operand. |
PMINSW instruction for 64-bit operands:
IF (DEST[15-0] < SRC[15-0]) THEN
DEST[15-0] ← DEST[15-0];
ELSE
DEST[15-0] ← SRC[15-0];
* repeat operation for 2nd and 3rd words in source and destination operands *
IF (DEST[63-48] < SRC[63-48]) THEN
DEST[63-48] ← DEST[63-48];
ELSE
DEST[63-48] ← SRC[63-48]; |
PMINSW __m64 _mm_min_pi16 (__m64 a, __m64 b) |
PMINUB mm, mm/m64 |
PMINUB (Packed Unsigned Byte Minimum)
compares the eight unsigned bytes in the source and destination operands pairwise
and writes the smaller of each pair to the destination operand.
 |
PMINUB instruction for 64-bit operands:
IF (DEST[7-0] < SRC[7-0]) THEN
DEST[7-0] ← DEST[7-0];
ELSE
DEST[7-0] ← SRC[7-0];
* repeat operation for 2nd through 7th bytes in source and destination operands *
IF (DEST[63-56] < SRC[63-56]) THEN
DEST[63-56] ← DEST[63-56];
ELSE
DEST[63-56] ← SRC[63-56]; |
PMINUB __m64 _mm_min_pu8 (__m64 a, __m64 b) |
Within 3D geometry and rasterization, color
clamping is an example that benefits from the use of the PMIN instructions. Also, a fundamental
component of many speech recognition engines is the evaluation of a Hidden
Markov Model (HMM), which takes up to 80% of execution time: the PMIN instructions
improve the performance of this kernel by 33%, giving a 19% application gain.
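Color clamping can be sketched as a minimum against the upper limit followed by a maximum against the lower limit. The code below is a scalar model in portable C, not the instructions themselves: `word_at`, `select_sw` and `clamp_words` are hypothetical names, a `uint64_t` stands in for the MMX register, and the limits are assumed to be packed four to a register like the data.

```c
#include <stdint.h>

/* Read word i of a modeled MMX register as a signed 16-bit value. */
static int16_t word_at(uint64_t reg, int i)
{
    return (int16_t)(uint16_t)(reg >> (16 * i));
}

/* Scalar model of PMINSW (want_max == 0) or PMAXSW (want_max != 0). */
static uint64_t select_sw(uint64_t dest, uint64_t src, int want_max)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int16_t a = word_at(dest, i);
        int16_t b = word_at(src, i);
        int16_t pick = want_max ? (a > b ? a : b) : (a < b ? a : b);
        result |= (uint64_t)(uint16_t)pick << (16 * i);
    }
    return result;
}

/* Clamp four signed words to [lo, hi]: one PMINSW, then one PMAXSW. */
uint64_t clamp_words(uint64_t v, uint64_t lo, uint64_t hi)
{
    uint64_t capped = select_sw(v, hi, 0);   /* min with the upper limit */
    return select_sw(capped, lo, 1);         /* max with the lower limit */
}
```

Clamping the words 256, -5, 128 and 255 to the range 0 to 255 this way yields 255, 0, 128 and 255 with two operations and no branches.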
PMOVMSKB r32, mm |
PMOVMSKB (Move Byte Mask To Integer)
builds an 8-bit mask from the most significant bit of each byte
of its source operand, an MMX register, and writes it to an x86 integer
register; the upper 24 bits of the integer register are cleared.
 |
PMOVMSKB instruction with 64-bit source operand:
r32[0] ← SRC[7];
r32[1] ← SRC[15];
* repeat operation for bytes 2 through 6;
r32[7] ← SRC[63];
r32[31-8] ← 000000H; |
PMOVMSKB int _mm_movemask_pi8(__m64 a) |
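A scalar model makes the bit gathering explicit. This is a sketch in portable C with the hypothetical name `pmovmskb_model`; a `uint64_t` stands in for the MMX register.

```c
#include <stdint.h>

/* Scalar model of PMOVMSKB: collect the top bit of each of the 8 bytes. */
uint32_t pmovmskb_model(uint64_t src)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (uint32_t)((src >> (8 * i + 7)) & 1u) << i;
    return mask;                     /* bits 31-8 are always zero */
}
```

A typical use is to follow a packed compare, which sets each byte to 00H or FFH, with this operation so that the result can be tested with ordinary integer instructions.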
PMULHUW mm, mm/m64 |
PMULHUW (Packed Multiply High Unsigned)
multiplies the four unsigned words in the destination operand with the
four unsigned words in the source operand, then writes the high-order 16
bits of the 32-bit intermediate results to the destination operand.
 |
PMULHUW instruction with 64-bit operands:
TEMP0[31-0] ← DEST[15-0] * SRC[15-0]; * Unsigned multiplication *
TEMP1[31-0] ← DEST[31-16] * SRC[31-16];
TEMP2[31-0] ← DEST[47-32] * SRC[47-32];
TEMP3[31-0] ← DEST[63-48] * SRC[63-48];
DEST[15-0] ← TEMP0[31-16];
DEST[31-16] ← TEMP1[31-16];
DEST[47-32] ← TEMP2[31-16];
DEST[63-48] ← TEMP3[31-16]; |
PMULHUW __m64 _mm_mulhi_pu16(__m64 a, __m64 b) |
The lack of an unsigned multiply operation
in the original MMX instruction set was a source of inefficiency in 3D
rasterization: this function inherently works with unsigned
pixel data, and the existing PMULHW instruction operates only on signed
data. Providing an unsigned PMULHUW eliminates the fix-up overhead required
when using the signed version, creating an application-level performance
gain of 8%-10%.
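The unsigned multiply can be modeled in portable C as follows. This is a sketch only; `pmulhuw_model` is a hypothetical name and a `uint64_t` stands in for the MMX register.

```c
#include <stdint.h>

/* Scalar model of PMULHUW: high 16 bits of four unsigned 16x16 products. */
uint64_t pmulhuw_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t a = (uint32_t)(dest >> (16 * i)) & 0xFFFFu;
        uint32_t b = (uint32_t)(src >> (16 * i)) & 0xFFFFu;
        uint32_t product = a * b;    /* at most FFFE0001H, fits in 32 bits */
        result |= (uint64_t)(product >> 16) << (16 * i);
    }
    return result;
}
```

For the operands FFFFH and FFFFH the unsigned result is FFFEH, whereas the signed PMULHW treats both as -1 and returns 0000H; this is the kind of discrepancy the fix-up code had to correct.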
Although the MPEG video encode pipeline
is quite complex and involves many stages, the bulk of the execution time
(40% to 70%) is spent in the motion-estimation function: this stage compares
a sub-block of the current frame with those in the same neighborhood of
the previous and next frames in order to find the best match. Consequently,
only a vector representing the position of this match, and the residual
difference sub-block, needs to be included in the compressed output stream
as opposed to the entire original sub-block.
PSADBW mm, mm/m64 |
PSADBW (Packed Sum of Absolute Differences)
computes the absolute value of the difference of each pair of unsigned bytes
in the destination and source operands. These differences are then summed to
produce a word result in the lower 16-bit field; the upper three words are cleared.
 |
PSADBW instruction with 64-bit operands:
TEMP0 ← ABS(DEST[7-0] − SRC[7-0]);
* repeat operation for bytes 1 through 6 *
TEMP7 ← ABS(DEST[63-56] − SRC[63-56]);
DEST[15-0] ← SUM(TEMP0...TEMP7);
DEST[63-16] ← 000000000000H; |
PSADBW __m64 _mm_sad_pu8(__m64 a, __m64 b) |
PSADBW retains byte-level parallelism of
execution, working on 8 bytes at a time, and the accumulation does not
exceed a 16-bit word. This single instruction replaces seven MMX instructions
in the motion-estimation inner loop, increasing performance by a factor
of two.
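The role of this operation in a motion-estimation inner loop can be sketched with a scalar model. The names `psadbw_model` and `sad_8xn` are hypothetical, a `uint64_t` stands in for the MMX register, and the block is assumed to be 8 pixels wide with both images sharing the same row stride.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of PSADBW: sum of absolute differences of 8 byte pairs. */
uint32_t psadbw_model(uint64_t dest, uint64_t src)
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int a = (int)((dest >> (8 * i)) & 0xFF);
        int b = (int)((src >> (8 * i)) & 0xFF);
        sum += (uint32_t)(a > b ? a - b : b - a);
    }
    return sum;                      /* at most 8 * 255 = 2040 */
}

/* Match cost of an 8-pixel-wide block: one modeled PSADBW per row. */
uint32_t sad_8xn(const uint8_t *cur, const uint8_t *ref, int stride, int rows)
{
    uint32_t total = 0;
    for (int r = 0; r < rows; r++) {
        uint64_t a, b;
        memcpy(&a, cur + (size_t)r * (size_t)stride, 8);
        memcpy(&b, ref + (size_t)r * (size_t)stride, 8);
        total += psadbw_model(a, b);
    }
    return total;
}
```

The candidate position with the smallest total is the best match; since each row contributes at most 2040, the per-row result always fits in the 16-bit field.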
When implementing a SIMD imaging algorithm,
there is often a need to rearrange data within an MMX register.
PSHUFW mm, mm/m64, imm8 |
PSHUFW (Packed Shuffle Word) uses the
imm8 operand to select which of the four words in the source operand will
be placed in each of the words in the destination operand. Bits 1 and 0
of imm8 encode the source for destination word 0, bits 3 and 2 encode
for word 1, bits 5 and 4 encode for word 2, and bits 7 and 6 encode for
word 3. Each two-bit field is the binary index of the source word
to be used, i.e. a binary encoding of 10 indicates that source word 2 will
be used.
 |
DEST[15-0] ← (SRC >> (ORDER[1-0] * 16) )[15-0]
DEST[31-16] ← (SRC >> (ORDER[3-2] * 16) )[15-0]
DEST[47-32] ← (SRC >> (ORDER[5-4] * 16) )[15-0]
DEST[63-48] ← (SRC >> (ORDER[7-6] * 16) )[15-0] |
PSHUFW __m64 _mm_shuffle_pi16(__m64 a, int n) |
As an example, an efficient SIMD implementation
of alpha saturation would compare all of the R, G, and B components with
the corresponding alpha value in parallel. To be able to do so, the alpha
value itself needs to be replicated in a different MMX register: while
this requires three instructions in MMX technology, with SSE it would need
just one.
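The replication step can be checked with a scalar model of the shuffle. This is a sketch in portable C: `pshufw_model` is a hypothetical name, a `uint64_t` stands in for the MMX register, and the alpha value is assumed to sit in word 3 of the pixel.

```c
#include <stdint.h>

/* Scalar model of PSHUFW: each 2-bit field of imm8 picks a source word. */
uint64_t pshufw_model(uint64_t src, unsigned imm8)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;        /* field for word i */
        uint64_t word = (src >> (sel * 16)) & 0xFFFFu;
        result |= word << (16 * i);
    }
    return result;
}
```

Under that assumption an immediate of FFH (all four fields equal to 11) copies the alpha word into every position, E4H leaves the register unchanged, and 1BH reverses the order of the four words.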
