
1. Introduction
The MMX technology is
designed to accelerate
multimedia and communications applications by including new
instructions
and data types that allow applications to achieve a new level of
performance.
It exploits the parallelism inherent in many multimedia and
communications
algorithms, yet maintains full compatibility with existing operating
systems
and applications.
A wide range of software applications, including graphics, MPEG video,
music synthesis, speech compression and recognition, image processing,
games, video conferencing and more, share several fundamental
characteristics:
- small integer data types (for example: 8-bit pixels, 16-bit audio samples)
- small, highly repetitive loops
- frequent multiplies and accumulates
- compute-intensive algorithms
- highly parallel operations
The MMX technology is designed as a set of general-purpose integer
instructions that can be applied to the needs of a wide variety of
multimedia and communications applications. The highlights of the
technology are:
- Single Instruction, Multiple Data (SIMD) technique
- 57 new instructions
- eight 64-bit wide MMX registers, named mm0 through mm7
- four new data types
MMX technology introduces four new data types: three packed data types
(bytes, words and doublewords, whose elements are 8, 16 and 32 bits wide
respectively) and a new 64-bit entity. Each element within the packed
data types is an independent fixed-point integer. The architecture does
not specify the position of the fixed point within the elements: the
developer controls its position within each element throughout the
calculation. This adds a burden on the developer, but it also leaves a
large amount of flexibility to choose and change the precision of
fixed-point numbers during the course of the application in order to
fully control the dynamic range of values.
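To make the fixed-point convention concrete, here is a minimal sketch in portable C (not MMX code). The helper names to_fixed, from_fixed and fixed_mul are hypothetical and exist only for this illustration. It shows that the hardware only ever sees integers, while the number of fractional bits is bookkeeping done by the developer.

```c
#include <stdint.h>

/* Convert a real number to a 16-bit fixed-point value with `frac_bits`
   fractional bits. The position of the point is purely a software
   convention; the stored value is an ordinary integer. */
int16_t to_fixed(double x, int frac_bits)
{
    double scaled = x * (double)(1 << frac_bits);
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Interpret an integer as a fixed-point value with `frac_bits`
   fractional bits. */
double from_fixed(int32_t v, int frac_bits)
{
    return (double)v / (double)(1 << frac_bits);
}

/* Multiply two fixed-point values. The raw product carries fa + fb
   fractional bits; shifting right brings it back to `out_frac` bits.
   Requires out_frac <= fa + fb. Assumes the compiler implements >> on
   negative values as an arithmetic shift (true of mainstream compilers). */
int32_t fixed_mul(int16_t a, int fa, int16_t b, int fb, int out_frac)
{
    int32_t wide = (int32_t)a * (int32_t)b;
    return wide >> (fa + fb - out_frac);
}
```

For instance, 1.5 with 8 fractional bits is stored as 384 and 0.25 with 12 fractional bits as 1024. Their product can be kept with 8 or with 12 fractional bits simply by choosing a different final shift, which is the flexibility described above.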
The four MMX technology data
types are:
- Packed byte -- eight bytes packed into one 64-bit quantity
- Packed word -- four 16-bit words packed into one 64-bit quantity
- Packed doubleword -- two 32-bit doublewords packed into one 64-bit quantity
- Quadword -- one 64-bit quantity
As an
example, graphics
pixel data are generally
represented in 8-bit integers, or bytes. With MMX technology, eight of
these pixels are packed together in a 64-bit quantity and moved into an
MMX register; when an MMX instruction executes, it takes all eight of
the
pixel values at once from the MMX register, performs the arithmetic or
logical operation on all eight elements in parallel, and writes the
result
into an MMX register. The degree of parallelism that can be achieved
with
the MMX technology depends on the size of data, ranging from 8 when
using
8-bit data to 1, i.e. no parallelism, when using 64-bit data.
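The pixel example can be modelled in plain C. The sketch below is a scalar emulation, not real MMX code, and the function names pack_bytes and packed_add_bytes are hypothetical. It mimics what a wraparound packed byte add (PADDB in the MMX instruction set) does to one 64-bit quantity: each of the eight lanes is added independently and no carry crosses from one byte into the next.

```c
#include <stdint.h>

/* Pack eight 8-bit pixels into one 64-bit quantity.
   Pixel 0 lands in the least significant byte. */
uint64_t pack_bytes(const uint8_t px[8])
{
    uint64_t q = 0;
    for (int i = 0; i < 8; i++)
        q |= (uint64_t)px[i] << (8 * i);
    return q;
}

/* Scalar model of a packed byte add with wraparound.
   The loop visits the lanes one at a time; the hardware
   would process all eight in a single instruction. */
uint64_t packed_add_bytes(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (uint8_t)(dest >> (8 * i));
        uint8_t b = (uint8_t)(src >> (8 * i));
        uint8_t sum = (uint8_t)(a + b);      /* wraps modulo 256 */
        result |= (uint64_t)sum << (8 * i);
    }
    return result;
}
```

Adding 10 to the pixels 1, 2, 3, 4, 5, 6, 7, 250 gives 11 through 17 in the first seven lanes and 4 in the last one, because 260 wraps around modulo 256 without disturbing its neighbours.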
The MMX technology is integrated into the Intel x86 architecture in a
way that maintains full compatibility with existing operating systems.
This is achieved by aliasing the MMX registers and state onto the x86
floating-point registers and state. No new registers or state are
therefore added to support MMX technology, so the operating system uses
its standard mechanisms for interacting with the floating-point state to
save and restore the MMX state: floating-point instructions that
save/restore the floating-point state also handle the MMX state (for
example, during context switching).
Aliasing the MMX state onto the floating-point state does not preclude
applications from executing both MMX routines and floating-point
routines, but the developer cannot freely interleave MMX and
floating-point instructions: an EMMS instruction must be executed at the
end of an MMX code sequence, before any floating-point code runs.
2. Instruction set
The MMX instructions cover several functional areas including:
- basic arithmetic operations such as add, subtract, multiply, arithmetic shift and multiply-add
- comparison operations
- conversion instructions to convert between the new data types: pack data together, and unpack from small to larger data types
- logical operations such as AND, AND NOT, OR, and XOR
- shift operations
- data transfer instructions for MMX register-to-register transfers, or 64-bit and 32-bit load/store to memory
- state management instruction to handle MMX to floating point transitions
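Of the arithmetic operations listed above, multiply-add is probably the least obvious. The following portable C sketch is a scalar model, not MMX code, with the hypothetical names word_lane and packed_multiply_add. It emulates the behaviour of the MMX multiply-add instruction PMADDWD: the four signed 16-bit words of the two operands are multiplied lane by lane, and adjacent products are summed into two 32-bit results.

```c
#include <stdint.h>

/* Extract signed 16-bit lane i (0..3) from a 64-bit quantity. */
int16_t word_lane(uint64_t q, int i)
{
    return (int16_t)(uint16_t)(q >> (16 * i));
}

/* Scalar model of packed multiply-add on four signed words:
   result doubleword 0 = d0*s0 + d1*s1
   result doubleword 1 = d2*s2 + d3*s3 */
uint64_t packed_multiply_add(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int pair = 0; pair < 2; pair++) {
        int32_t lo = (int32_t)word_lane(dest, 2 * pair) *
                     (int32_t)word_lane(src, 2 * pair);
        int32_t hi = (int32_t)word_lane(dest, 2 * pair + 1) *
                     (int32_t)word_lane(src, 2 * pair + 1);
        /* Add as unsigned so the sum wraps instead of overflowing. */
        uint32_t sum = (uint32_t)lo + (uint32_t)hi;
        result |= (uint64_t)sum << (32 * pair);
    }
    return result;
}
```

With the words (1, 2, 3, 4) and (5, 6, 7, 8) the two results are 1*5 + 2*6 = 17 and 3*7 + 4*8 = 53, which is one step of a 16-bit dot product.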
Arithmetic, comparison and
shift instructions
are designed to support the different packed integer data types: these
instructions have a different opcode for each supported data type. As a
result, the MMX technology instructions are implemented with 57 opcodes.
All MMX instructions,
except the EMMS
instruction, reference and operate on two operands: the source and the
destination operand. The first operand is the destination and the
second
operand is the source. The instruction overwrites the destination
operand
with the result. For example, a two-operand instruction
OPERATION DEST, SRC
would be decoded as:
DEST = DEST OPERATION SRC
A typical MMX instruction has this syntax:
- Prefix: P for Packed
- Instruction operation: for example ADD, CMP, or XOR
- Suffix:
  - US for Unsigned Saturation
  - S for Signed Saturation
  - B, W, D, Q for the data type: packed byte, packed word, packed doubleword, or quadword.
As an example, PADDSB is an MMX instruction (P) that adds (ADD) the
eight signed bytes (B) of the source operand to those of the
destination operand and saturates each result (S).
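What saturation means for PADDSB can be seen in a small scalar model written in portable C. This is an emulation for illustration, not MMX code, and the names saturate_s8 and paddsb_model are hypothetical. It follows the DEST = DEST OPERATION SRC convention described earlier.

```c
#include <stdint.h>

/* Clamp an integer to the signed 8-bit range [-128, 127]. */
int8_t saturate_s8(int v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Scalar model of PADDSB: add eight signed bytes lane by lane,
   clamping each sum instead of letting it wrap around. */
uint64_t paddsb_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        int8_t a = (int8_t)(uint8_t)(dest >> (8 * i));
        int8_t b = (int8_t)(uint8_t)(src >> (8 * i));
        int8_t r = saturate_s8((int)a + (int)b);
        result |= (uint64_t)(uint8_t)r << (8 * i);
    }
    return result;
}
```

In this model 100 + 100 yields 127 rather than wrapping to -56, and -100 + -100 yields -128, while sums that fit in a signed byte are unchanged.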
Instructions that have
different input
and output data elements have two data-type suffixes: for example, the
conversion instruction converts from one data type to another, so it
has
two suffixes, one for the original data type and the second for the
converted
data type.
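PACKSSWB is one such two-suffix instruction: it packs (PACK) with signed saturation (SS) from words (W) to bytes (B). The portable C sketch below is a scalar model under that reading, not MMX code, and the function names are hypothetical. The four words of the destination become the low four bytes of the result and the four words of the source become the high four bytes.

```c
#include <stdint.h>

/* Clamp a signed 16-bit value to the signed 8-bit range. */
int8_t saturate_word_to_s8(int16_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Scalar model of PACKSSWB: eight signed words (four from each
   operand) are narrowed to eight signed bytes with saturation. */
uint64_t packsswb_model(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        int16_t d = (int16_t)(uint16_t)(dest >> (16 * i));
        int16_t s = (int16_t)(uint16_t)(src >> (16 * i));
        /* destination words fill bytes 0..3, source words bytes 4..7 */
        result |= (uint64_t)(uint8_t)saturate_word_to_s8(d) << (8 * i);
        result |= (uint64_t)(uint8_t)saturate_word_to_s8(s) << (8 * (i + 4));
    }
    return result;
}
```

A word holding 300 is narrowed to 127 and a word holding -300 to -128, whereas values already in the byte range, such as 1 or -1, pass through unchanged.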
The next pages describe in depth the full set of MMX instructions,
grouped by functional area. The box on the right side represents the
syntax of each instruction; here is a list of the symbols used to
represent operands in the instruction statements:
- imm8: an immediate byte value; imm8 is a signed number between -128 and +127 inclusive.
- r/m32: a doubleword register or memory operand used for instructions whose operand-size attribute is 32 bits.
- mm/m32: indicates the lowest 32 bits of an MMX register or a 32-bit memory location.
- mm/m64: indicates a 64-bit MMX register or a 64-bit memory location.
As an example,
OP mm, mm/m64
means that the destination
operand of
the OP instruction is an MMX register, while the source operand can
either
be an MMX register or a 64-bit memory operand.
3. Examples and benchmarks
The Intel MMX Application Notes give a broad overview of the benefits
achievable by using MMX instructions. All performance data was taken
from the Application Notes, and it generally refers to the Pentium MMX
microarchitecture.
Before starting to code in
assembly for MMX, you should take a look at Quexal,
the visual development environment for MMX and ISSE coding that will
make your life a lot easier!
Here is a list of currently available Application Notes, grouped by
topic. The column on the right shows the speed-up obtained by moving
from scalar C code to MMX code.
Title | Speed-up |
Audio | |
Audio Echo Effects | 5.9x |
MPEG1 Audio Kernels | |
G.728 Code Book Search | 2.7x |
Levinson-Durbin Filter | |
Schur-Weiner Filter | |
Communications | |
Passband Echo Canceller | |
Baseband Echo Canceller | |
1/3 T Equalizer | |
2/3 T Spaced Equalizer | |
DSP Kernels | |
Efficient Vector/Matrix Multiply Routine | 14.6x |
Matrix Transpose | 2x |
Real 16-bit FFT | |
Dot Product - 16x16 -> 32 | 5x |
Real FIR - 16 bit | 5x |
Vector Arithmetic and Logic Operations | 6x |
High Precision Multiply | |
Data Alignment | |
Graphics (2D) | |
Fractals with MMX Technology | 1.5x |
Sprite Overlay | |
Graphics (3D) | |
Advanced Procedural Texturing | 10x |
AGP and 3D Graphics Software | |
MMX Technology for 3D Rendering | |
3D Bilinear Texture Mapping | 7x |
Gouraud Shading | |
3D Transform | 3.1x |
Image Processing | |
YUV12 to RGB Color Conversion | |
2X 8-bit Image Scaling | 13.5x |
Bilinear Interpolation | 3.9x |
Median Filter | 3.8x |
Row Filter - 8 bit | |
Column Filter | |
Alpha Blending | 8x |
24 to 16 bit Conversion | |
RGB -> YUV | > 10x |
Speech Recognition | |
Viterbi Decoding | 2x |
L1 Distance Measure | 3.3x |
L2 Norm Distance Measure | 7.3x |
Video | |
IDCT 2D 8x8 | 3.5x |
Motion Compensation | |
Absolute Difference | 5x |
Haar Transform - 2x2 | 2.2x |
Get Bits | 2.4x |
Video Loop Filter | 1.9x |