
      128-bit MMX

      I’m quite sure that Intel would not like to see SSE2 named 128-bit MMX. In fact, MMX has a bad reputation: the Intel marketing hype pushed it as a universal solution to multimedia requirements, but at the same time the gaming industry switched from mostly 2D games to Virtual Reality-like 3D games that were not accelerated by MMX. Bad press coverage spread the news that MMX was meaningless because it did not improve the Quake frame-rate. That would be correct if the only applications worth running were 3D games, but the overly simplified vision of the world shared by most hardware sites missed several points: in fact, MMX instructions are constantly used to perform a wide array of tasks. PhotoShop users surely remember the performance boost given by MMX, but it should be made clear that each time you play an MP3, view a JPEG image in your browser or play an MPEG video, a lot of MMX instructions are executed. Today all multimedia applications are built on MMX instructions, and they are the key to running computing-intensive tasks such as speech recognition on commonplace PCs.
      Writing MMX code is still very hard, as you have to go back to assembler, but the performance benefits are rewarding. The support offered by current compilers is bare-bones. There are a few attempts to write C++ compilers that can automatically turn normal C code into vector MMX code, but they handle only limited-complexity loop vectorization and place too many constraints on the parallelizable code; in general, they appear notably less mature than the vectorizing compilers available in the supercomputing domain.
      So we cannot expect to have SSE2-enabled compilers anytime soon. This will not stop large companies that sell shrink-wrap software from exploiting SSE2 instructions, as they can afford the required development time, but small-scale software firms are not likely to use SSE2 until better development tools appear. In my opinion, the Pentium 4 scenario closely resembles the Pentium MMX one, where lack of software support made the additional investment for the Pentium MMX over the plain old Pentium quite useless.
      We have just analyzed the dark side of SSE2, i.e. difficult programming; now we can go on and delve into the technical details.
      SSE2 extends MMX by using 128-bit registers instead of 64-bit ones, effectively doubling the level of parallelism. We may be tempted to take existing MMX code, replace the MMX register names with SSE2 ones (e.g. turning MM0 into XMM0), recompile it and see it run at twice the speed. Unfortunately, this would not work; in fact, it would not even compile. These are the steps required to migrate MMX code to SSE2:
      1) replace MMX register names with SSE2 ones, e.g. MM0 becomes XMM0;
      2) replace MOVQ instructions with MOVAPD (if the memory address is 16-byte aligned) or MOVUPD (if the memory address is not aligned);
      3) replace PSHUFW, which is a SSE extension to MMX, by a combination of the following instructions: PSHUFHW, PSHUFLW, PSHUFD;
      4) replace PSLLQ and PSRLQ with PSLLDQ and PSRLDQ respectively;
      5) update loop counters and numeric memory offsets, since we work on 128 bits at once instead of 64.
      Looks easy, doesn’t it? Actually, it is not that simple. Replacing 64-bit shifts with 128-bit ones is trivial, but SSE2 expects memory references to be 16-byte aligned: while the MOVUPD instruction lets you load unaligned memory blocks at the expense of poor performance (so it should not be used unless strictly necessary), every instruction that uses a memory source operand, e.g. a PADDB XMM0,[EAX], is a troublesome spot. Using unaligned memory references raises a General Protection fault, and avoiding GPFs requires quite a lot of work. First of all, the memory allocators used by current compilers do not align data blocks on 16-byte boundaries, so you will have to build a wrapper function around the malloc() function that allocates a slightly larger block than required and correctly aligns the resulting pointer (note: the Processor Pack for Visual C++ features an _aligned_malloc() function that supports user-definable alignment of allocated blocks). Then you will have to find all the places in your source code where the data blocks that are processed with SSE2 instructions get allocated, and replace the standard allocation call with an invocation of your wrapper function: this is fairly easy if you have access to all the source code of your app, but impossible when third-party libraries allocate misaligned memory blocks; in this case, contact the software vendor and ask for an update.
      If your MMX routine spills some variables onto the stack, we are in for more trouble, as we have to force the alignment of the stack, which requires modifying the entry and exit code of the routine.
      The easiest way to fix a PSHUFW instruction is splitting it into two instructions, a PSHUFHW and a PSHUFLW, operating respectively on the high and low 64-bit halves of the 128-bit register.
      Here is the list of SSE2 instructions that extend MMX (adapted from Intel’s documentation):

       

      PADDQ mm1,mm2/m64

      PADDQ xmm1,xmm2/m128

      Adds the first operand (destination operand) to the second operand (source operand) and stores the result in the destination operand. The source operand can be a quadword integer stored in an MMX register or a 64-bit memory location, or it can be two packed quadword integers stored in an XMM register or a 128-bit memory location. The destination operand can be a quadword integer stored in an MMX register or two packed quadword integers stored in an XMM register. When packed quadword operands are used, a SIMD add is performed. When a quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination element (that is, the carry is ignored).

      PADDQ instruction with 64-Bit operands:
      DEST[63-0] ← DEST[63-0] + SRC[63-0];


      PADDQ instruction with 128-Bit operands:
      DEST[63-0] ← DEST[63-0] + SRC[63-0];
      DEST[127-64] ← DEST[127-64] + SRC[127-64];

      PADDQ __m64 _mm_add_si64 (__m64 a, __m64 b)
      PADDQ __m128i _mm_add_epi64 ( __m128i a, __m128i b)

       

      PSUBQ mm1,mm2/m64

      PSUBQ xmm1,xmm2/m128

      Subtracts the second operand (source operand) from the first operand (destination operand) and stores the result in the destination operand. The source operand can be a quadword integer stored in an MMX register or a 64-bit memory location, or it can be two packed quadword integers stored in an XMM register or a 128-bit memory location. The destination operand can be a quadword integer stored in an MMX register or two packed quadword integers stored in an XMM register. When packed quadword operands are used, a SIMD subtract is performed. When a quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination element (that is, the carry is ignored).

      PSUBQ instruction with 64-Bit operands:
      DEST[63-0] ← DEST[63-0] − SRC[63-0];


      PSUBQ instruction with 128-Bit operands:
      DEST[63-0] ← DEST[63-0] − SRC[63-0];
      DEST[127-64] ← DEST[127-64] − SRC[127-64];

      PSUBQ __m64 _mm_sub_si64(__m64 m1, __m64 m2)
      PSUBQ __m128i _mm_sub_epi64(__m128i m1, __m128i m2)

       
      PMULUDQ mm1,mm2/m64

      PMULUDQ xmm1,xmm2/m128

      Multiplies the first operand (destination operand) by the second operand (source operand) and stores the result in the destination operand. The source operand can be an unsigned doubleword integer stored in the low doubleword of an MMX register or a 64-bit memory location, or it can be two packed unsigned doubleword integers stored in the first (low) and third doublewords of an XMM register or a 128-bit memory location. The destination operand can be an unsigned doubleword integer stored in the low doubleword of an MMX register or two packed unsigned doubleword integers stored in the first and third doublewords of an XMM register. The result is an unsigned quadword integer stored in the destination MMX register or two packed unsigned quadword integers stored in the destination XMM register.

      PMULUDQ instruction with 64-Bit operands:
      DEST[63-0] ← DEST[31-0] ∗ SRC[31-0];


      PMULUDQ instruction with 128-Bit operands:
      DEST[63-0] ← DEST[31-0] ∗ SRC[31-0];
      DEST[127-64] ← DEST[95-64] ∗ SRC[95-64];

      PMULUDQ __m64 _mm_mul_su32 (__m64 a, __m64 b)
      PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)

       
      PSHUFHW xmm1, xmm2/m128, imm8
      Copies words from the high quadword of the source operand (second operand) and inserts them in the high quadword of the destination operand (first operand) at word locations selected with the order operand (third operand). For the PSHUFHW instruction, each 2-bit field in the order operand selects the contents of one word location in the high quadword of the destination operand. The binary encodings of the order operand fields select words (0, 1, 2, or 3) from the high quadword of the source operand to be copied to the destination operand. The low quadword of the source operand is copied to the low quadword of the destination operand.
      DEST[63-0] ← SRC[63-0]
      DEST[79-64] ← (SRC >> (ORDER[1-0] * 16) )[79-64]
      DEST[95-80] ← (SRC >> (ORDER[3-2] * 16) )[79-64]
      DEST[111-96] ← (SRC >> (ORDER[5-4] * 16) )[79-64]
      DEST[127-112] ← (SRC >> (ORDER[7-6] * 16) )[79-64]
      PSHUFHW __m128i _mm_shufflehi_epi16(__m128i a, int n)

       
      PSHUFLW xmm1, xmm2/m128, imm8
      Copies words from the low quadword of the source operand (second operand) and inserts them in the low quadword of the destination operand (first operand) at word locations selected with the order operand (third operand). For the PSHUFLW instruction, each 2-bit field in the order operand selects the contents of one word location in the low quadword of the destination operand. The binary encodings of the order operand fields select words (0, 1, 2, or 3) from the low quadword of the source operand to be copied to the destination operand. The high quadword of the source operand is copied to the high quadword of the destination operand.

      The source operand can be an XMM register or a 128-bit memory location. The destination
      operand is an XMM register. The order operand is an 8-bit immediate.

      Note that this instruction permits a word in the low quadword of the source operand to be copied
      to more than one word location in the low quadword of the destination operand.

      DEST[15-0] ← (SRC >> (ORDER[1-0] * 16) )[15-0]
      DEST[31-16] ← (SRC >> (ORDER[3-2] * 16) )[15-0]
      DEST[47-32] ← (SRC >> (ORDER[5-4] * 16) )[15-0]
      DEST[63-48] ← (SRC >> (ORDER[7-6] * 16) )[15-0]
      DEST[127-64] ← SRC[127-64]
      PSHUFLW __m128i _mm_shufflelo_epi16(__m128i a, int n)

       
      PSHUFD xmm1, xmm2/m128, imm8
      Copies doublewords from the source operand (second operand) and inserts them in the destination operand (first operand) at locations selected with the order operand (third operand). Each 2-bit field in the order operand selects the contents of one doubleword location in the destination operand. For example, bits 0 and 1 of the order operand select the contents of doubleword 0 of the destination operand. The encoding of bits 0 and 1 of the order operand determines which doubleword from the source operand will be copied to doubleword 0 of the destination operand.

      DEST[31-0] ← (SRC >> (ORDER[1-0] ∗ 32) )[31-0]
      DEST[63-32] ← (SRC >> (ORDER[3-2] ∗ 32) )[31-0]
      DEST[95-64] ← (SRC >> (ORDER[5-4] ∗ 32) )[31-0]
      DEST[127-96] ← (SRC >> (ORDER[7-6] ∗ 32) )[31-0]
      PSHUFD __m128i _mm_shuffle_epi32(__m128i a, int n)

       
      PSLLDQ xmm1, imm8
      Shifts the destination operand (first operand) to the left by the number of bytes specified in the count operand (second operand). The empty low-order bytes are cleared (set to all 0s). If the value specified by the count operand is greater than 15, the destination operand is set to all 0s. The destination operand is an XMM register. The count operand is an 8-bit immediate.

      TEMP ← COUNT;
      if (TEMP > 15) TEMP ← 16;
      DEST ← DEST << (TEMP * 8);
      PSLLDQ __m128i _mm_slli_si128 ( __m128i a, int imm)

       
      PSRLDQ xmm1, imm8
      Shifts the destination operand (first operand) to the right by the number of bytes specified in the count operand (second operand). The empty high-order bytes are cleared (set to all 0s). If the value specified by the count operand is greater than 15, the destination operand is set to all 0s. The destination operand is an XMM register. The count operand is an 8-bit immediate.
      TEMP ← COUNT;
      if (TEMP > 15) TEMP ← 16;
      DEST ← DEST >> (TEMP * 8);
      PSRLDQ __m128i _mm_srli_si128 ( __m128i a, int imm)

       
      PUNPCKHQDQ xmm1, xmm2/m128
      Interleaves the high quadword of the source operand and the high quadword of the destination operand and writes them to the destination register. The low quadwords of the source operands are ignored.

      PUNPCKHQDQ instruction:
      DEST[63-0] ← DEST[127-64];
      DEST[127-64] ← SRC[127-64];
      PUNPCKHQDQ __m128i _mm_unpackhi_epi64 ( __m128i a, __m128i b)

       
      PUNPCKLQDQ xmm1, xmm2/m128
      Interleaves the low quadwords of the source operand and the low quadwords of the destination operand and writes them to the destination register. The high quadwords of the source operands are ignored.

      PUNPCKLQDQ instruction:
      DEST[63-0] ← DEST[63-0];
      DEST[127-64] ← SRC[63-0];
      PUNPCKLQDQ __m128i _mm_unpacklo_epi64 (__m128i m1, __m128i m2)

       
      MOVQ2DQ xmm, mm
      Moves the quadword from the source operand (second operand) to the low quadword of the destination operand (first operand). The high 64 bits of the destination register are cleared to zero. The source operand is an MMX register and the destination operand is an XMM register.

      DEST[63-0] ← SRC[63-0];
      DEST[127-64] ← 0000000000000000H;
      MOVQ2DQ __m128i _mm_movpi64_epi64 (__m64 a)

       
      MOVDQ2Q mm, xmm
      Moves the low quadword from the source operand (second operand) to the destination operand (first operand). The source operand is an XMM register and the destination operand is an MMX register.

      DEST ← SRC[63-0]
      MOVDQ2Q __m64 _mm_movepi64_pi64 ( __m128i a)


      An interesting opportunity for optimization comes from the separate MMX and SSE2 register files: it may be possible to unroll a loop by two and code one iteration using MMX and the other one using SSE2, thereby having 16 available registers instead of only 8. The scarcity of the MMX register set severely hampers the performance of unrolled loops, as the number of live variables typically exceeds that of available registers. So mixing MMX and SSE2 may yield almost painless 2x loop unrolling.

      We can only speculate on the level of performance achievable with SSE2. The Intel Pentium III breaks 128-bit SSE instructions down into two 64-bit operations, each working on half of the data values. The Pentium 4 processor is likely to feature a similar implementation, so each 128-bit SSE2 instruction will be broken into two 64-bit operations. So will it be faster than issuing two 64-bit MMX instructions? In my opinion, yes, but that is due more to register usage than to microarchitectural reasons: two independent MMX instructions use 4 different registers to elaborate 128 bits of data, while a single SSE2 instruction uses just 2 registers, so it is possible to unroll a loop twice as much with SSE2 as with MMX; a heavily unrolled loop increases the instruction-level parallelism and maximizes the usage of available hardware resources, i.e. execution units, by giving the out-of-order execution core a larger number of independent instructions. Under the assumption that the latencies of MMX and SSE2 instructions are equal, properly written SSE2 code should be faster than MMX code, especially when running multiplication-intensive routines: the multiplier has a latency of 3 cycles, and it is quite hard to put 5 MMX instructions between a multiply and a dependent instruction so that it does not stall, while with SSE2 only 2 instructions would be needed.
       
       
       
       

      Home
      Next: 64-bit floating-point >>
Copyright Stefano Tommesani 2000/03 - All trademarks belong to their respective holders