
The latest trend in PC games is 3D graphics:
during the past few years, almost all kinds of games have turned to 3D
graphics, greatly increasing the demand of processors with strong floatingpoint
performance, because the front end of a typical 3D game software pipeline
must perform object physics, geometry transformations, clipping and lighting
calculations. On the screen, 3D objects consist of thousands of polygons,
usually triangles. The smaller and more numerous the polygons, the more
detailed the object. When an object moves on the screen, rotates to a different
viewing angle, or moves forward or backward in virtual space, the program
must recalculate every vertex of every polygon. This is a geometry transformation,
and it requires heavyduty matrix multiplication: the program must multiply
a 1x4 vector of coordinates against a 4x4 transform matrix. This requires
up to 16 FP multiplies and 12 FP additions for each vertex, so the floatingpoint
unit of the processor is usually the performance bottleneck of 3D games.
Intel’s processors have always had strong numeric performance, given that
they were designed not only for typical business applications, but also
for scientific programs that are usually floatingpoint intensive. AMD
processors have been designed to run business software fast, but they lack
the floatingpoint performance of Intel chips. To gain a share of the rapidlygrowing
gaming market, AMD has introduced the 3D Now! instructions, designed to
exceed the performance of the Intel Pentium II when running 3D games.
The AMD
3D Now! technology provides 21 additional instructions to support highperformance
3D graphics and audio processing. The 3D Now! instructions are vector instructions
that operate on 64bit registers, divided into two 32bit singleprecision
floatingpoint words that are compatible with the IEEE754, singleprecision
format. This format comprises a 1bit sign, an 8bit biased exponent, and
a 23bit significand with one hidden integer bit for a total of 24 bits
in the significand. In contrast to the IEEE standard that dictates four
rounding modes, the AMD K6 processors support only the roundtonearest
mode.
The AMD K6/Athlon processors implement
eight 3D Now! registers that are mapped onto the floatingpoint registers
just like MMX registers: aliasing the 3D Now!
registers onto the floatingpoint stack provides a safe method to introduce
this technology, because it does not require modifications to existing
operating systems; furthermore it is possible to write x86 programs containing
both integer, MMX and SIMD floating point instructions with no performance
penalty for switching between the integer MMX and the floatingpoint 3D
Now! units.
The following sections give a brief overview
of each group of instructions in the 3D Now! set and the instructions within
each group.

Arithmetic instructions
PFADD adds the source operand to the destination
operand.
PFACC accumulates the two words of the
destination operand and the source operand and stores the results in the
low and high words of destination operand respectively.
PFSUB subtracts the source operand from
the destination operand, while PFSUBR subtracts the destination operand
from the source operand.
PFMUL multiplies the source and destination
operands.

Comparison instructions
PFCMPEQ compares the destination operand
and the source operand and generates all one bits or all zero bits based
on the result of the corresponding comparison.
PFCMPGT tests if the destination operand
is greater than the source operand, and PFCMPGE tests if the destination
operand is greater or equal to the source operand.
PFMAX returns the larger of the two singleprecision
floatingpoint operands, PFMIN the smaller.

Conversion instructions
PI2FD converts a MMX register containing
signed 32bit integers to singleprecision floatingpoint operands.
PF2ID converts a 3D Now! register containing
singleprecision floatingpoint operands to 32bit signed integers using
truncation.

Reciprocal instructions
PFRCP returns a lowprecision estimate
of the reciprocal of the source operand. Increased accuracy requires the
use of two additional instructions (PFRCPIT1 and PFRCPIT2). As an example,
consider the quotient q = a/b. The PFRCP instruction quickly produces
a 1415 bit precision approximation of 1/b using an onchip, ROMbased
table lookup. A fullprecision can then be computer from this approximation
using the NewtonRaphson algorithm: X[i+1] = X[i] * (2  b*X[i]).
Given that the initial approximation X[0]
is accurate to at least 14 bits, and that the IEEE single precision contains
24 bits of mantissa, only one NewtonRaphson iteration is required. The
following code sample shows the 3D Now! instructions that produce the initial
approximation of the reciprocal, then compute the fullprecision reciprocal
from this, and lastly complete the required division:
X0 = PFRCP(b)
X1 = PFRCPIT1(b,X0)
X2 = PFRCPIT2(X1,X0)
q = PFMUL(a,X2)
The 24bit final reciprocal value is X2.
The quotient is formed in the last step by multiplying the reciprocal by
the dividend a.
PFRSQRT returns a lowprecision estimate
of the reciprocal square root of the source operand. Increased accuracy
requires the use of two additional instructions (PFRSQIT1 and PFRCPIT2).
The general NewtonRaphson reciprocal
square root recurrence is:
X[i+1] = 0.5 * X[i] * (3  b*X[i]^2)
To reduce the number of iterations, X0
is an initial approximation read from a table. The 3D Now! reciprocal square
root approximation is accurate to at least 15 bits. Accordingly, to obtain
a single precision 24bit reciprocal square root of an input operand b,
one NewtonRaphson iteration is required using the following 3D Now! instructions:
X0 = PFRSQRT(b)
X1 = PFMUL(X0 ,X0)
X2 = PFRSQIT1(b,X1)
X3 = PFRCPIT2(X2 ,X0)
X4 = PFMUL(b,X3)
The 24bit final reciprocal square root
value is X3.

Integer instructions
These operations extend the MMX
instruction set. Developers should be aware that using these instructions
in generic MMX code will reduce the potential market to AMD 3D Now! processors
only.
PAVGUSB produces the rounded averages
of the eight unsigned 8bit integer values in the source operand and the
eight corresponding unsigned 8bit integer values in the destination operand.
This instruction can be used for pixel averaging in MPEG2 motion compensation
and video scaling operations.
PMULHRW multiplies the four signed 16bit
integer values in the source operand by the four corresponding signed 16bit
integer values in the destination, then adds 8000h to the lower 16 bits
of the 32bit result, which results in the rounding of the highorder 16bit
result. The PMULHRW instruction provides a numerically more accurate result
than the PMULH instruction, which truncates
the result instead of rounding.

Other instructions
In applications where a large number of
data sets must be processed, the PREFETCH instruction can preload the
next data set into the data cache while the processor is simultaneously
operating on the present set of data. When the present set of data values
is completed, the next set is already available in the data cache, greatly
reducing the delays due to cache misses. An example of a concurrent operation
is vertices processing in 3D transformations, where the next set of vertices
can be prefetched into the data cache while the present set is being transformed.
FEMMS offers a faster context switch at
the end of an MMX routine than EMMS.
Update:
there are rumours that the next Athlon core, codenamed Palomino, will
support SSE instructions. Current AMD processors already support integer
SSE instructions, but the adoption of floatingpoint SSE
instructions means the end of 3DNow!. AMD's market share is steadily rising,
due to its excellent price / performance ratio, but it still does not grab
more than one quarter of the market; this means that developer are more
likely to write SSE code than 3DNow! code. At the same time, SSE is a
more powerful instruction set, designed without the compatibility problems
that limited the level of parallelism of 3DNow!.
Update:
the Athlon XP processor supports SSE instructions (curiously named 3DNow!
Pro)

