MMX Primer

1. Introduction

2. Instruction set

3. Examples and benchmarks

1. Introduction
The MMX technology is designed to accelerate multimedia and communications applications by including new instructions and data types that allow applications to achieve a new level of performance. It exploits the parallelism inherent in many multimedia and communications algorithms, yet maintains full compatibility with existing operating systems and applications.
A wide range of software applications, including graphics, MPEG video, music synthesis, speech compression and recognition, image processing, games, video conferencing and more, share several fundamental characteristics:

small integer data types (for example: 8-bit pixels, 16-bit audio samples)

small, highly repetitive loops

frequent multiplies and accumulates

compute-intensive algorithms

highly parallel operations

The MMX technology is designed as a set of general purpose integer instructions that can be applied to the needs of the wide diversity of multimedia and communications applications. The highlights of the technology are:

Single Instruction, Multiple Data (SIMD) technique

57 new instructions

Eight 64-bit wide MMX registers, named mm0 through mm7

4 new data types

MMX technology introduces four new data types: three packed data types (bytes, words and doublewords, with elements 8, 16 and 32 bits wide respectively) and a new 64-bit entity. Each element within the packed data types is an independent fixed-point integer. The architecture does not specify the position of the fixed point within the elements: the developer controls its position within each element throughout the calculation. This adds a burden on the developer, but it also leaves a large amount of flexibility to choose and change the precision of fixed-point numbers during the course of the application in order to fully control the dynamic range of values.
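As a rough illustration of what managing the fixed point means in practice, the C sketch below multiplies two fixed-point values stored in 16-bit words. The helper name fixed_mul and its frac_bits parameter are hypothetical, introduced only for this example; this is scalar C, not MMX code. The point is that the arithmetic sees plain integers, and the position of the fixed point lives entirely in the shift count chosen by the programmer.

```c
#include <stdint.h>

/* Hypothetical helper (not part of MMX): multiply two fixed-point values
   that both carry frac_bits fractional bits in a 16-bit word.
   The 32-bit product carries 2 * frac_bits fractional bits, so shifting
   right by frac_bits returns to the original format.
   Assumptions: the compiler shifts negative values arithmetically (true
   of mainstream compilers, but implementation-defined in C), and the
   result fits in 16 bits; no saturation is applied here. */
static int16_t fixed_mul(int16_t a, int16_t b, int frac_bits)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product >> frac_bits);
}
```

With 15 fractional bits, 16384 represents 0.5 and fixed_mul(16384, 16384, 15) yields 8192, i.e. 0.25. With 8 fractional bits, 384 represents 1.5, 512 represents 2.0, and fixed_mul(384, 512, 8) yields 768, i.e. 3.0. The same integer code serves both formats; only the shift changes.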
The four MMX technology data types are:

Packed byte -- 8 bytes packed into one 64-bit quantity

Packed word -- 4 16-bit words packed into one 64-bit quantity

Packed doubleword -- 2 32-bit doublewords packed into one 64-bit quantity

Quadword -- one 64-bit quantity

As an example, graphics pixel data are generally represented in 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register; when an MMX instruction executes, it takes all eight of the pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register. The degree of parallelism that can be achieved with the MMX technology depends on the size of data, ranging from 8 when using 8-bit data to 1, i.e. no parallelism, when using 64-bit data.
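The lane-by-lane behaviour described above can be sketched in portable C. The function below is only a scalar emulation of a wraparound packed-byte add over a 64-bit value (the name packed_add_bytes is hypothetical); real MMX hardware performs the eight additions in a single instruction. What the sketch shows is that each 8-bit element is independent: a carry out of one element never propagates into its neighbour.

```c
#include <stdint.h>

/* Scalar emulation (an assumption for illustration, not MMX code) of a
   packed-byte add without saturation: eight independent 8-bit lanes
   held in one 64-bit quantity. */
static uint64_t packed_add_bytes(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        int shift = 8 * lane;
        uint8_t a = (uint8_t)(dest >> shift);
        uint8_t b = (uint8_t)(src >> shift);
        uint8_t sum = (uint8_t)(a + b);      /* wraps modulo 256 */
        result |= (uint64_t)sum << shift;    /* no carry into next lane */
    }
    return result;
}
```

Adding 0x0101010101010101 to 0x01020304050607FF this way gives 0x0203040506070800: the lowest byte wraps from 0xFF to 0x00 and its neighbour is unaffected, whereas an ordinary 64-bit addition would carry into the next byte and give 0x0203040506070900.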
The MMX technology is integrated into the Intel x86 architecture in a way that maintains full compatibility with existing operating systems. This is achieved by aliasing the MMX registers and state onto the x86 floating-point registers and state. No new registers or state are added to support MMX technology, so the operating system uses its standard mechanisms for handling the floating-point state to save and restore the MMX state: floating-point instructions that save/restore the floating-point state also handle the MMX state (for example, during context switching).
Aliasing the MMX state onto the floating-point state does not preclude applications from executing both MMX routines and floating-point routines, but the developer cannot freely interleave MMX and floating-point instructions: an EMMS instruction must be executed at the end of an MMX code sequence, before any floating-point code runs.

2. Instruction set
The MMX instructions cover several functional areas including:

basic arithmetic operations such as add, subtract, multiply, arithmetic shift and multiply-add

comparison operations

conversion instructions to convert between the new data types: pack data together, and unpack from small to larger data types

logical operations such as AND, AND NOT, OR, and XOR

shift operations

data transfer instructions for MMX register-to-register transfers, or 64-bit and 32-bit load/store to memory

state management instruction to handle MMX to floating point transitions

Arithmetic, comparison and shift instructions are designed to support the different packed integer data types: these instructions have a different opcode for each supported data type. As a result, the MMX technology instructions are implemented with 57 opcodes.
All MMX instructions, except the EMMS instruction, reference and operate on two operands: the source and the destination operand. The first operand is the destination and the second operand is the source. The instruction overwrites the destination operand with the result. For example, a two-operand instruction
OPERATION DEST, SRC

would be decoded as:

DEST = DEST OPERATION SRC

A typical MMX instruction has this syntax:

Prefix: P for Packed

Instruction operation: for example - ADD, CMP, or XOR

Suffix:

US for Unsigned Saturation

S for Signed saturation

B, W, D, Q for the data type: packed byte, packed word, packed doubleword, or quadword.

As an example, PADDSB is a packed (P) instruction that adds (ADD) the 8 bytes (B) of the source and destination operands element by element, with signed saturation (S) of each result.
Instructions that have different input and output data elements have two data-type suffixes: for example, a conversion instruction converts from one data type to another, so it has two suffixes, the first for the original data type and the second for the converted data type.
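To make the effect of the S suffix concrete, here is a minimal scalar sketch in C of what a signed saturating packed-byte add computes. The names saturate_s8 and paddsb_emulated are hypothetical, and the loop only emulates the per-element semantics; it says nothing about how the processor implements the instruction. Saturation means that a result outside the signed 8-bit range is clamped to +127 or -128 instead of wrapping around.

```c
#include <stdint.h>

/* Clamp an integer to the signed 8-bit range [-128, +127]. */
static int saturate_s8(int value)
{
    if (value > 127)  return 127;
    if (value < -128) return -128;
    return value;
}

/* Scalar emulation (illustrative assumption, not MMX code) of a signed
   saturating add on eight packed bytes: DEST = DEST + SRC per element. */
static uint64_t paddsb_emulated(uint64_t dest, uint64_t src)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        int shift = 8 * lane;
        /* Extract each byte and reinterpret it as two's complement. */
        int a = (int)((dest >> shift) & 0xFF);
        int b = (int)((src  >> shift) & 0xFF);
        if (a >= 128) a -= 256;
        if (b >= 128) b -= 256;
        int sum = saturate_s8(a + b);
        result |= (uint64_t)((unsigned)sum & 0xFFu) << shift;
    }
    return result;
}
```

For instance, an element holding 0x7F (+127) plus 0x01 stays at 0x7F, and 0x80 (-128) plus 0xFF (-1) stays at 0x80, while in-range sums such as 0x10 + 0x20 = 0x30 are unaffected.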
The next pages describe in depth the full set of MMX instructions, grouped by functional areas. The box on the right side represents the syntax of each instruction; here is a list of the symbols used to represent operands in the instruction statements:

imm8: an immediate byte value; imm8 is a signed number between -128 and +127 inclusive.

r/m32: a doubleword register or memory operand used for instructions whose operand-size attribute is 32 bits.

mm/m32: indicates the lowest 32 bits of an MMX register or a 32-bit memory location.

mm/m64: indicates a 64-bit MMX register or a 64-bit memory location.

As an example,
OP mm, mm/m64
means that the destination operand of the OP instruction is an MMX register, while the source operand can either be an MMX register or a 64-bit memory operand.

3. Examples and benchmarks
The Intel MMX Application Notes offer a broad overview of the benefits achievable by using MMX instructions. All performance data was extracted from the Application Notes, and it generally refers to the Pentium MMX microarchitecture.
Before starting to code in assembly for MMX, you should take a look at Quexal, the visual development environment for MMX and ISSE coding that will make your life a lot easier!
Here is a list of currently available Application Notes, grouped by topic. The column on the right shows the speed-up obtained moving from scalar C code to MMX code; entries without a figure have no speed-up listed.

Title                                         Speed-up

Audio
  Audio Echo Effects                          5.9x
  MPEG1 Audio Kernels
  G.728 Code Book Search                      2.7x
  Levinson-Durbin Filter
  Schur-Weiner Filter

Communications
  Passband Echo Canceller
  Baseband Echo Canceller
  1/3 T Equalizer
  2/3 T Spaced Equalizer

DSP Kernels
  Efficient Vector/Matrix Multiply Routine    14.6x
  Matrix Transpose                            2x
  Real 16-bit FFT
  Dot Product - 16x16 -> 32                   5x
  Real FIR - 16 bit                           5x
  Vector Arithmetic and Logic Operations      6x
  High Precision Multiply
  Data Alignment

Graphics (2D)
  Fractals with MMX Technology                1.5x
  Sprite Overlay

Graphics (3D)
  Advanced Procedural Texturing               10x
  AGP and 3D Graphics Software
  MMX Technology for 3D Rendering
  3D Bilinear Texture Mapping                 7x
  Gouraud Shading
  3D Transform                                3.1x

Image Processing
  YUV12 to RGB Color Conversion
  2X 8-bit Image Scaling                      13.5x
  Bilinear Interpolation                      3.9x
  Median Filter                               3.8x
  Row Filter - 8 bit
  Column Filter
  Alpha Blending                              8x
  24 to 16 bit Conversion
  RGB -> YUV                                  > 10x

Speech Recognition
  Viterbi Decoding                            2x
  L1 Distance Measure                         3.3x
  L2 Norm Distance Measure                    7.3x

Video
  IDCT 2D 8x8                                 3.5x
  Motion Compensation
  Absolute Difference                         5x
  Haar Transform - 2x2                        2.2x
  Get Bits                                    2.4x
  Video Loop Filter                           1.9x
