
This section describes example uses of the MMX instruction set to implement basic coding structures.
Conditional Select
Operating on multiple data operands with a single instruction raises an
interesting issue: what happens when a computation should be performed only
if the operand value passes some conditional check? For example, an absolute
value calculation performs a 2’s complement on a number only if that number
is negative:
for i = 1 to 100
    if a[i] < 0
        then b[i] = - a[i]
        else b[i] = a[i]
Several approaches are possible, and some are simpler than others. A
branch-based approach does not work well for two reasons: it is slowed by
the inherent branch misprediction penalty, and it requires converting packed
data types to scalars. Direct conditional execution support does not suit
the x86 Intel Architecture (IA) either, since it requires three independent
operands (source, source/destination, and predicate vector).
The MMX technology adopts a simpler design: a conditional execution is
converted into a conditional assignment. MMX compare operations result in a
bit mask corresponding to the length of the operands: for example, a compare
operation on packed byte operands produces byte-wide masks. These masks can
then be used in conjunction with logical operations to achieve conditional
assignment. Consider the following example:
If True
    then Ra := Rb
    else Ra := Rc
Assuming that register Rx contains all
1’s if the condition is true and all 0’s if the condition is false, Ra
can be computed with the following logical expression:
Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)
This approach works for operations with
a register as the destination. Conditional assignment to memory can be
implemented as a sequence of load, conditional assignment, and store.
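The conditional assignment above can be sketched in portable C. This is only an illustration, not MMX code: a `uint64_t` stands in for a 64-bit MMX register holding four signed 16-bit lanes, and the helper names (`mm64`, `pack4`, `lane`, `cmpgt_w`, `sub_w`, `select_w`, `abs_w`) are hypothetical. The compare helper mimics a packed signed greater-than compare that yields all 1's or all 0's per lane, and `abs_w` applies the select expression to the absolute value loop shown earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit MMX register: four 16-bit lanes. */
typedef uint64_t mm64;

/* Pack four signed 16-bit values, lane 0 in the low bits. */
static mm64 pack4(int16_t w0, int16_t w1, int16_t w2, int16_t w3) {
    return (mm64)(uint16_t)w0 | ((mm64)(uint16_t)w1 << 16) |
           ((mm64)(uint16_t)w2 << 32) | ((mm64)(uint16_t)w3 << 48);
}

/* Read lane i as a signed value. */
static int lane(mm64 r, int i) {
    unsigned u = (unsigned)((r >> (16 * i)) & 0xFFFF);
    return (u & 0x8000u) ? (int)u - 65536 : (int)u;
}

/* Packed signed compare a > b: each lane becomes 0xFFFF or 0x0000. */
static mm64 cmpgt_w(mm64 a, mm64 b) {
    mm64 r = 0;
    for (int i = 0; i < 4; i++)
        if (lane(a, i) > lane(b, i))
            r |= (mm64)0xFFFF << (16 * i);
    return r;
}

/* Packed wrap-around subtract a - b on 16-bit lanes. */
static mm64 sub_w(mm64 a, mm64 b) {
    mm64 r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t d = (uint16_t)((a >> (16 * i)) - (b >> (16 * i)));
        r |= (mm64)d << (16 * i);
    }
    return r;
}

/* Ra = (Rb AND Rx) OR (Rc ANDNOT Rx) */
static mm64 select_w(mm64 rb, mm64 rc, mm64 rx) {
    return (rb & rx) | (rc & ~rx);
}

/* Branch-free absolute value: negate only the lanes where 0 > a[i]. */
static mm64 abs_w(mm64 a) {
    mm64 is_negative = cmpgt_w(0, a);  /* the mask Rx */
    mm64 negated = sub_w(0, a);        /* 2's complement of every lane */
    return select_w(negated, a, is_negative);
}

/* Usage: four elements are processed at once, with no branch per element. */
static void abs_w_example(void) {
    assert(abs_w(pack4(-3, 7, 0, -100)) == pack4(3, 7, 0, 100));
}
```

Both candidate results (the negated and the original value) are computed for every lane, and the mask decides which one is kept. Because the subtract wraps around, a lane holding -32768 stays -32768 in this sketch.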
The Chroma Keying example demonstrates how conditional selection using the
MMX instruction set removes branches, in addition to performing multiple
selection operations in parallel. Text overlay on a picture or video
background and sprite overlays in games are some of the other operations
that would benefit from this technique.
Most people have seen a television weathercaster overlaid on the image of a
weather map. In this example, a blue screen is used to overlay an image of a
woman on a background picture.
PCMPEQ (packed compare for equality) is performed on the weathercaster and
blue-screen images, yielding a bitmask that traces the outline of the
weathercaster.

This bitmask image is PANDNed (packed and not) with the weathercaster image,
yielding the first intermediate image: now the weathercaster has no
background behind her.

The same bitmask image is PANDed (packed and) with the weather map image,
yielding the second intermediate image.

The two intermediate images are PORed (packed or) together, resulting in the
final composite of the weathercaster over the weather map.
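The four steps can be sketched in portable C as follows. This is an illustrative emulation under stated assumptions, not MMX code: pixels are assumed to be 8 bits wide, so eight of them fit in one 64-bit value, and the names `cmpeq_b`, `chroma_key`, and the parameter `key` (the blue-screen pixel value) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packed byte compare for equality: each byte lane becomes 0xFF or 0x00. */
static uint64_t cmpeq_b(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        if (x == y)
            r |= (uint64_t)0xFF << (8 * i);
    }
    return r;
}

/* Overlay fg on bg wherever fg differs from the key color.
   n is the pixel count and is assumed to be a multiple of 8. */
static void chroma_key(const uint8_t *fg, const uint8_t *bg, uint8_t *out,
                       size_t n, uint8_t key) {
    assert(n % 8 == 0);
    /* Broadcast the key color into all eight byte lanes. */
    uint64_t key8 = 0x0101010101010101ULL * key;
    for (size_t i = 0; i < n; i += 8) {
        uint64_t f, b, r;
        memcpy(&f, fg + i, 8);
        memcpy(&b, bg + i, 8);
        uint64_t mask = cmpeq_b(f, key8);  /* compare: 1's where blue screen */
        uint64_t person = ~mask & f;       /* and-not: foreground, background removed */
        uint64_t scene = mask & b;         /* and: background where the key was */
        r = person | scene;                /* or: final composite */
        memcpy(out + i, &r, 8);
    }
}
```

For example, with key value 9, the foreground row `{9,9,5,6,9,7,9,9}` over the background row `{1,2,3,4,5,6,7,8}` gives `{1,2,5,6,5,7,7,8}`. The loop body contains no per-pixel branch; the only branch left is the loop itself.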

Vector Dot Product
The vector dot product is one of the most basic algorithms used in signal
processing of multimedia data such as images, audio, and video. The
following example shows how the PMADD instruction helps speed up algorithms
using vector dot products.
The PMADD instruction handles four multiplies and two additions at a time.
It takes packed 16-bit operands and produces a packed 32-bit result: it
multiplies all the corresponding elements, generating four 32-bit products,
then adds the two products on the left together for one result and the two
products on the right together for the other result. To complete a
multiply-accumulate operation, the results would then be added to another
register, which is used as the accumulator.
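The arithmetic of one PMADD step, followed by the accumulate, can be written out in portable C. This is a sketch of the semantics only, with hypothetical names (`wrap32`, `pmadd`, `accumulate`); the instruction set reference lists the instruction as PMADDWD. The sketch assumes 32-bit wrap-around when a sum does not fit, which for a single step can only happen when all four inputs of a pair are -32768.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a 64-bit value to a signed 32-bit value with wrap-around. */
static int32_t wrap32(int64_t v) {
    uint32_t u = (uint32_t)v;  /* modulo 2^32 */
    return (u & 0x80000000u) ? (int32_t)((int64_t)u - 4294967296LL)
                             : (int32_t)u;
}

/* Four 16-bit multiplies, then two 32-bit adds of adjacent products. */
static void pmadd(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    int64_t p0 = (int64_t)a[0] * b[0];
    int64_t p1 = (int64_t)a[1] * b[1];
    int64_t p2 = (int64_t)a[2] * b[2];
    int64_t p3 = (int64_t)a[3] * b[3];
    out[0] = wrap32(p0 + p1);  /* first pair of products */
    out[1] = wrap32(p2 + p3);  /* second pair of products */
}

/* Multiply-accumulate: add the two results into a two-lane accumulator. */
static void accumulate(int32_t acc[2], const int32_t r[2]) {
    acc[0] = wrap32((int64_t)acc[0] + r[0]);
    acc[1] = wrap32((int64_t)acc[1] + r[1]);
}

/* Usage: (1*5 + 2*6, 3*7 + 4*8) = (17, 53). */
static void pmadd_example(void) {
    const int16_t a[4] = {1, 2, 3, 4};
    const int16_t b[4] = {5, 6, 7, 8};
    int32_t r[2];
    pmadd(a, b, r);
    assert(r[0] == 17 && r[1] == 53);
}
```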
Assuming that the precision supported by the PMADD instruction is
sufficient, this dot-product example on eight-element vectors can be
completed using eight MMX instructions: 2 PMADDs, 2 PADDs, 2 shifts (if
needed to fix the precision after the multiply operation), and 2 memory
moves to load one of the vectors (the other vector is loaded by the PMADD
instruction, which can take one of its operands from memory).
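One plausible reading of that sequence, written as portable C rather than MMX code, is sketched below. The mapping of C statements to instructions in the comments is an assumption, the optional precision-fixing shifts are left out, and the names `pmadd4` and `dot8` are hypothetical. The assertions encode the stated assumption that the available precision is sufficient.

```c
#include <assert.h>
#include <stdint.h>

/* One PMADD step on four 16-bit elements: two 32-bit pair sums. */
static void pmadd4(const int16_t *a, const int16_t *b, int32_t out[2]) {
    int64_t s0 = (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1];
    int64_t s1 = (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3];
    /* Precision assumption: each pair sum fits in 32 bits. */
    assert(s0 <= INT32_MAX && s1 <= INT32_MAX);
    out[0] = (int32_t)s0;
    out[1] = (int32_t)s1;
}

/* Dot product of two eight-element vectors of 16-bit values. */
static int32_t dot8(const int16_t a[8], const int16_t b[8]) {
    int32_t p[2], q[2];
    pmadd4(a, b, p);          /* load a[0..3], PMADD with b[0..3] from memory */
    pmadd4(a + 4, b + 4, q);  /* load a[4..7], PMADD with b[4..7] from memory */
    int64_t lo = (int64_t)p[0] + q[0];  /* first packed add: combine the */
    int64_t hi = (int64_t)p[1] + q[1];  /* two PMADD results lane by lane */
    int64_t total = lo + hi;            /* second add: fold the two lanes */
    /* Precision assumption: the final sum fits in 32 bits. */
    assert(total >= INT32_MIN && total <= INT32_MAX);
    return (int32_t)total;
}
```

For instance, `dot8` of `{1,2,3,4,5,6,7,8}` and `{8,7,6,5,4,3,2,1}` is 120. A scalar loop would need a load, a multiply, and an add per element, which is where the instruction-count saving comes from.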
Comparing instruction counts for this operation with and without MMX
technology shows that the MMX version needs only one third as many
instructions.
