128-bit MMX
I’m quite sure that Intel would not like to
see SSE2 named 128-bit MMX. In fact, MMX has
a bad reputation: the Intel marketing hype pushed it as a universal solution
to multimedia requirements, but at the same time the gaming industry switched
from mostly 2D games to Virtual Reality-like 3D games that were not accelerated
by MMX. Bad press coverage spread the news that MMX was meaningless as
it did not improve the Quake frame rate. That would be correct if the only
applications worth running were 3D games, but the overly simplified vision
of the world shared by most hardware sites missed several points: in fact,
MMX instructions are constantly used to perform a wide array of tasks.
Photoshop users surely remember the performance boost given by MMX, but
it should be made clear that each time you play an MP3, view a JPEG image
in your browser or play an MPEG video, a lot of MMX instructions are executed.
Today all multimedia applications are built on MMX instructions, and they
are the key to running computing-intensive tasks such as speech recognition
on commonplace PCs.
Writing MMX code is still very hard, as
you have to go back to assembler, but the performance benefits are rewarding.
The support offered by current compilers is bare-bones. There are a few attempts
to write C++ compilers that can automatically turn normal C code into vector
MMX code, but they handle only loop vectorization of limited complexity
and place too many constraints on the parallelizable code; in general,
they appear notably less mature than the vectorizing compilers available in
the supercomputing domain.
So we cannot expect to have SSE2-enabled
compilers anytime soon. This will not stop large companies that sell shrink-wrapped
software from exploiting SSE2 instructions, as they can afford the required
development time, but small-scale software firms are not likely to use
SSE2 until better development tools appear. In my opinion, the
Pentium 4 scenario closely resembles the Pentium MMX one, where lack of
software support made the additional investment for the Pentium MMX over
the plain old Pentium quite useless.
We have just analyzed the dark side of
SSE2, i.e. difficult programming; now we can go on and delve into the technical
details.
SSE2 extends MMX by using 128-bit registers
instead of 64-bit ones, effectively doubling the level of parallelism.
We may be tempted to take MMX code, replace the MMX register names
with SSE2 ones (e.g. turning
MM0 into XMM0), recompile it and see it running at twice the speed. Unfortunately,
it would not work; actually, it would not even compile. These are the steps
required to migrate MMX code to SSE2:
1) replace MMX register names with
SSE2 ones, e.g. MM0 becomes XMM0;
2) replace MOVQ instructions with MOVAPD
(if the memory address is 16-byte aligned) or MOVUPD (if the memory
address is not aligned);
3) replace PSHUFW, which is an SSE
extension to MMX, with a combination of the following instructions: PSHUFHW,
PSHUFLW, PSHUFD;
4) replace PSLLQ and PSRLQ with PSLLDQ
and PSRLDQ respectively;
5) update loop counters and numeric
memory offsets, since we work on 128 bits at once instead of 64.
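Step 5 can be pictured with a scalar C sketch (hypothetical helper names, no actual SIMD): only the loop stride and the resulting trip count change when moving from 64-bit to 128-bit chunks.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical byte-wise add over a buffer whose length is a multiple
   of 16.  add_bytes_64 models the MMX version (8 bytes, one quadword,
   per iteration); add_bytes_128 models the SSE2 version (16 bytes per
   iteration).  Only the stride changes -- exactly the loop-counter and
   offset update required by step 5. */
static void add_bytes_64(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 8)        /* stride: 64 bits */
        for (size_t j = 0; j < 8; ++j)
            dst[i + j] = (uint8_t)(dst[i + j] + src[i + j]);
}

static void add_bytes_128(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 16)       /* stride: 128 bits */
        for (size_t j = 0; j < 16; ++j)
            dst[i + j] = (uint8_t)(dst[i + j] + src[i + j]);
}
```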
Looks easy, doesn’t it? Actually, it is
not that simple. Replacing 64-bit shifts with 128-bit ones is trivial,
but SSE2 expects memory references to be 16-byte aligned: while the MOVUPD
instruction lets you load unaligned memory blocks at the expense of poor
performance (so it should not be used unless strictly necessary), every
instruction that uses a memory source operand, e.g. a PADDB MM0,[EAX],
is a troublesome spot. Using an unaligned memory reference raises a General
Protection fault, and avoiding GPFs requires quite a lot of work. First
of all, the memory allocators used by current compilers do not align data
blocks on 16-byte boundaries, so you will have to build a wrapper function
around the malloc() function that allocates a slightly larger block than
required and correctly aligns the resulting pointer (note: the Processor Pack for
Visual C++ features an aligned_malloc() function that supports user-definable
alignment of allocated blocks). Then you will have to find all
the lines in your source code where the memory blocks that are processed
with SSE2 instructions get allocated, and replace the standard allocation
call with an invocation of your wrapper function: this is fairly easy if
you have access to all the source code of your app, but impossible when
third-party libraries allocate misaligned memory blocks; in this case,
contact the software vendor and ask for an update.
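Such a wrapper can be sketched in portable C; the names aligned_malloc16 and aligned_free16 are my own, not a standard API.

```c
#include <stdlib.h>
#include <stdint.h>

/* Sketch of the malloc() wrapper described above.  We over-allocate by
   15 padding bytes plus room for one pointer, and stash the pointer
   returned by malloc() just below the aligned block so that it can be
   handed back to free() later. */
void *aligned_malloc16(size_t size)
{
    void *raw = malloc(size + 15 + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + 15) & ~(uintptr_t)15;   /* round up to a 16-byte boundary */
    ((void **)addr)[-1] = raw;             /* remember the real block */
    return (void *)addr;
}

void aligned_free16(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);            /* retrieve and free the real block */
}
```

Every allocation that feeds an SSE2 routine would then go through aligned_malloc16 instead of malloc.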
If your MMX routine spills some variables
onto the stack, you are in for more trouble, as you have to force the alignment
of the stack, and that requires modifying the entry and exit code
of the routine.
The easiest way to fix a PSHUFW instruction
is to split it in two, a PSHUFHW and a PSHUFLW, operating respectively
on the high and low 64-bit halves of the 128-bit register.
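To see what the split does, here is a scalar model (my own sketch, not Intel code) of the two shuffles on a 128-bit value held as eight 16-bit words, word 0 being bits 15..0:

```c
#include <stdint.h>
#include <string.h>

/* PSHUFLW permutes words 0-3 according to the four 2-bit fields of the
   imm8 order operand and passes words 4-7 through unchanged; PSHUFHW is
   the mirror image.  dst and src must be distinct arrays. */
static void pshuflw(uint16_t dst[8], const uint16_t src[8], uint8_t order)
{
    for (int i = 0; i < 4; ++i)
        dst[i] = src[(order >> (2 * i)) & 3];       /* shuffle low quadword */
    memcpy(dst + 4, src + 4, 4 * sizeof(uint16_t)); /* copy high quadword */
}

static void pshufhw(uint16_t dst[8], const uint16_t src[8], uint8_t order)
{
    memcpy(dst, src, 4 * sizeof(uint16_t));         /* copy low quadword */
    for (int i = 0; i < 4; ++i)
        dst[4 + i] = src[4 + ((order >> (2 * i)) & 3)]; /* shuffle high */
}
```

Issuing both with the same order operand reproduces the effect of the old PSHUFW on each half independently.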
Here is the list of SSE2 instructions
that extend MMX (adapted from Intel’s documentation):
PADDQ mm1,mm2/m64
PADDQ xmm1,xmm2/m128 |
Adds the first operand
(destination operand) to the second operand (source operand) and stores the
result in the destination operand. The source operand can be a quadword
integer stored in an MMX register or a 64-bit memory location, or it can be
two packed quadword integers stored in an XMM register or a 128-bit memory
location. The destination operand can be a quadword integer stored in an MMX
register or two packed quadword integers stored in an XMM register. When
packed quadword operands are used, a SIMD add is performed. When a quadword
result is too large to be represented in 64 bits (overflow), the result is
wrapped around and the low 64 bits are written to the destination element
(that is, the carry is ignored).
 |
PADDQ instruction with 64-Bit
operands:
DEST[63-0] ← DEST[63-0] + SRC[63-0];
PADDQ instruction with 128-Bit operands:
DEST[63-0] ← DEST[63-0] + SRC[63-0];
DEST[127-64] ← DEST[127-64] + SRC[127-64]; |
PADDQ __m64 _mm_add_si64
(__m64 a, __m64 b)
PADDQ __m128i _mm_add_epi64 ( __m128i a, __m128i b) |
PSUBQ mm1,mm2/m64
PSUBQ xmm1,xmm2/m128 |
Subtracts the second operand
(source operand) from the first operand (destination operand) and stores the
result in the destination operand. The source operand can be a quadword
integer stored in an MMX register or a 64-bit memory location, or it can be
two packed quadword integers stored in an XMM register or a 128-bit memory
location. The destination operand can be a quadword integer stored in an MMX
register or two packed quadword integers stored in an XMM register. When
packed quadword operands are used, a SIMD subtract is performed. When a
quadword result is too large to be represented in 64 bits (overflow), the
result is wrapped around and the low 64 bits are written to the destination
element (that is, the carry is ignored).
 |
PSUBQ instruction with 64-Bit
operands:
DEST[63-0] ← DEST[63-0] − SRC[63-0];
PSUBQ instruction with 128-Bit operands:
DEST[63-0] ← DEST[63-0] − SRC[63-0];
DEST[127-64] ← DEST[127-64] − SRC[127-64]; |
PSUBQ __m64 _mm_sub_si64(__m64
m1, __m64 m2)
PSUBQ __m128i _mm_sub_epi64(__m128i m1, __m128i m2) |
PMULUDQ mm1,mm2/m64
PMULUDQ xmm1,xmm2/m128 |
Multiplies the first operand
(destination operand) by the second operand (source operand) and stores the
result in the destination operand. The source operand can be an unsigned
doubleword integer stored in the low doubleword of an MMX register or a
64-bit memory location, or it can be two packed unsigned doubleword integers
stored in the first (low) and third doublewords of an XMM register or a
128-bit memory location. The destination operand can be an unsigned
doubleword integer stored in the low doubleword of an MMX register or two
packed doubleword integers stored in the first and third doublewords of an
XMM register. The result is an unsigned quadword integer stored in the
destination MMX register or two packed unsigned quadword integers stored
in an XMM register.
 |
PMULUDQ instruction with
64-Bit operands:
DEST[63-0] ← DEST[31-0] ∗ SRC[31-0];
PMULUDQ instruction with 128-Bit operands:
DEST[63-0] ← DEST[31-0] ∗ SRC[31-0];
DEST[127-64] ← DEST[95-64] ∗ SRC[95-64]; |
PMULUDQ __m64 _mm_mul_su32
(__m64 a, __m64 b)
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b) |
PSHUFHW xmm1, xmm2/m128, imm8 |
Copies words from the high
quadword of the source operand (second operand) and inserts them in the high
quadword of the destination operand (first operand) at word locations
selected with the order operand (third operand). For the PSHUFHW
instruction, each 2-bit field in the order operand selects the contents of
one word location in the high quadword of the destination operand. The
binary encodings of the order operand fields select words (0, 1, 2, or 3)
from the high quadword of the source operand to be copied to the destination
operand. The low quadword of the source operand is copied to the low
quadword of the destination operand. |
DEST[63-0] ← SRC[63-0]
DEST[79-64] ← (SRC >> (ORDER[1-0] * 16) )[79-64]
DEST[95-80] ← (SRC >> (ORDER[3-2] * 16) )[79-64]
DEST[111-96] ← (SRC >> (ORDER[5-4] * 16) )[79-64]
DEST[127-112] ← (SRC >> (ORDER[7-6] * 16) )[79-64] |
PSHUFHW __m128i
_mm_shufflehi_epi16(__m128i a, int n) |
PSHUFLW xmm1, xmm2/m128, imm8 |
Copies words from the low
quadword of the source operand (second operand) and inserts them in the low
quadword of the destination operand (first operand) at word locations
selected with the order operand (third operand). For the PSHUFLW
instruction, each 2-bit field in the order operand selects the contents of
one word location in the low quadword of the destination operand. The binary
encodings of the order operand fields select words (0, 1, 2, or 3) from the
low quadword of the source operand to be copied to the destination operand.
The high quadword of the source operand is copied to the high quadword of
the destination operand.
The source operand can be an XMM register or
a 128-bit memory location. The destination
operand is an XMM register. The order operand is an 8-bit immediate.
Note that this instruction permits a word in
the low quadword of the source operand to be copied
to more than one word location in the low quadword of the destination
operand.
 |
DEST[15-0] ← (SRC >>
(ORDER[1-0] * 16) )[15-0]
DEST[31-16] ← (SRC >> (ORDER[3-2] * 16) )[15-0]
DEST[47-32] ← (SRC >> (ORDER[5-4] * 16) )[15-0]
DEST[63-48] ← (SRC >> (ORDER[7-6] * 16) )[15-0]
DEST[127-64] ← SRC[127-64] |
PSHUFLW __m128i
_mm_shufflelo_epi16(__m128i a, int n) |
PSHUFD xmm1, xmm2/m128, imm8 |
Copies doublewords from the source
operand (second operand) and inserts them in the destination operand (first
operand) at locations selected with the order operand (third operand). Each
2-bit field in the order operand selects the contents of one doubleword
location in the destination operand. For example, bits 0 and 1 of the order
operand select the contents of doubleword 0 of the destination operand: their
encoding determines which doubleword from the source operand will be copied
to doubleword 0 of the destination operand.
 |
DEST[31-0] ← (SRC >>
(ORDER[1-0] ∗ 32) )[31-0]
DEST[63-32] ← (SRC >> (ORDER[3-2] ∗ 32) )[31-0]
DEST[95-64] ← (SRC >> (ORDER[5-4] ∗ 32) )[31-0]
DEST[127-96] ← (SRC >> (ORDER[7-6] ∗ 32) )[31-0] |
PSHUFD __m128i
_mm_shuffle_epi32(__m128i a, int n) |
PSLLDQ xmm1, imm8 |
Shifts the destination operand
(first operand) to the left by the number of bytes specified in the count
operand (second operand). The empty low-order bytes are cleared (set to all
0s). If the value specified by the count operand is greater than 15, the
destination operand is set to all 0s. The destination operand is an XMM
register. The count operand is an 8-bit immediate.
 |
TEMP ← COUNT;
if (TEMP > 15) TEMP ← 16;
DEST ← DEST << (TEMP * 8); |
PSLLDQ __m128i _mm_slli_si128
( __m128i a, int imm) |
PSRLDQ xmm1, imm8 |
Shifts the destination operand
(first operand) to the right by the number of bytes specified in the count
operand (second operand). The empty high-order bytes are cleared (set to all
0s). If the value specified by the count operand is greater than 15, the
destination operand is set to all 0s. The destination operand is an XMM
register. The count operand is an 8-bit immediate. |
TEMP ← COUNT;
if (TEMP > 15) TEMP ← 16;
DEST ← DEST >> (TEMP * 8); |
PSRLDQ __m128i _mm_srli_si128
( __m128i a, int imm) |
PUNPCKHQDQ xmm1, xmm2/m128 |
Interleaves the high quadword
of the source operand and the high quadword of the destination operand and
writes them to the destination register. The low quadwords of the source
operands are ignored.
 |
PUNPCKHQDQ instruction:
DEST[63-0] ← DEST[127-64];
DEST[127-64] ← SRC[127-64]; |
PUNPCKHQDQ __m128i
_mm_unpackhi_epi64 ( __m128i a, __m128i b) |
PUNPCKLQDQ xmm1, xmm2/m128 |
Interleaves the low quadwords
of the source operand and the low quadwords of the destination operand and
writes them to the destination register. The high quadwords of the source
operands are ignored.
 |
PUNPCKLQDQ
DEST[63-0] ← DEST[63-0];
DEST[127-64] ← SRC[63-0]; |
PUNPCKLQDQ __m128i
_mm_unpacklo_epi64 (__m128i m1, __m128i m2) |
MOVQ2DQ xmm, mm |
Moves the quadword from the
source operand (second operand) to the low quadword of the destination
operand (first operand). The high 64 bits of the destination register are
cleared to zero. The source operand is an MMX register and the destination
operand is an XMM register.
 |
DEST[63-0] ← SRC[63-0];
DEST[127-64] ← 0000000000000000H; |
MOVQ2DQ __m128i
_mm_movpi64_epi64 ( __m64 a) |
MOVDQ2Q mm, xmm |
Moves the low quadword from
the source operand (second operand) to the destination operand (first
operand). The source operand is an XMM register and the destination operand
is an MMX register.
 |
DEST ← SRC[63-0] |
MOVDQ2Q __m64
_mm_movepi64_pi64 ( __m128i a) |
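Two of the semantics in the table above — PADDQ’s ignored carry and PSLLDQ’s behavior with byte counts above 15 — can be checked with a scalar model; the xmm_t struct and helper names below are my own sketch, not real instructions.

```c
#include <stdint.h>

/* A 128-bit value modeled as two uint64_t halves:
   lo = bits 63..0, hi = bits 127..64. */
typedef struct { uint64_t lo, hi; } xmm_t;

/* PADDQ: each 64-bit lane wraps modulo 2^64; the carry out of bit 63
   is simply dropped, which is exactly what unsigned C arithmetic does. */
static xmm_t paddq(xmm_t a, xmm_t b)
{
    xmm_t r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* PSLLDQ: shift left by COUNT bytes; a count above 15 zeros the value. */
static xmm_t pslldq(xmm_t a, unsigned count)
{
    xmm_t r = { 0, 0 };
    if (count > 15)
        return r;                          /* all bytes shifted out */
    unsigned bits = count * 8;
    if (bits == 0) {
        r = a;
    } else if (bits < 64) {
        r.lo = a.lo << bits;
        r.hi = (a.hi << bits) | (a.lo >> (64 - bits));
    } else {                               /* low half moves into high half */
        r.lo = 0;
        r.hi = a.lo << (bits - 64);
    }
    return r;
}
```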
An interesting opportunity for optimization
comes from the separate MMX and SSE2 register files: it may be possible
to unroll a loop by two and code one iteration using MMX and the other
using SSE2, thus having 16 available registers instead of only 8.
The scarcity of the MMX register set severely hampers the performance of
unrolled loops, as the number of live variables typically exceeds the number
of available registers. So mixing MMX and SSE2 may yield almost painless
2x loop unrolling.
We can only speculate on the level of
performance achievable with SSE2. The Intel Pentium III breaks 128-bit
SSE instructions down into two 64-bit instructions, each operating on half
of the data values. The Pentium 4 processor is likely to feature a similar
implementation, so each 128-bit SSE2 instruction will be broken into two
64-bit instructions. So will it be faster than issuing two 64-bit MMX
instructions? In my opinion, yes, but that is due more to register usage
than to microarchitectural reasons: two independent MMX instructions use 4
different registers to process 128 bits of data, while a single SSE2
instruction uses just 2 registers, so it is possible to unroll a loop
twice as many times with SSE2 as with MMX; a heavily unrolled loop increases
the instruction-level parallelism and maximizes the usage of the available
hardware resources, i.e. execution units, by giving the out-of-order execution
core a larger number of independent instructions. Under the assumption
that the latencies of MMX and SSE2 instructions are equal, properly written
SSE2 code should be faster than MMX code, especially when running multiplication-intensive
routines: the multiplier has a latency of 3 cycles, and it is quite hard
to put 5 instructions between a multiply and a dependent instruction so
that it does not stall, while using SSE2 only 2 instructions would be needed.