
64-bit floating-point instructions
In the introduction we outlined the applications
that require 64-bit precision, scientific simulations and CAD/CAM being
notable examples. However, the transition from ordinary scalar code to packed
64-bit floating-point SSE2 code is complex and may require major design
changes. A more conservative approach is to move to scalar SSE2 code,
i.e. to use the scalar instructions (easily identified by the S that replaces
the P before the final D, as in ADDSD versus ADDPD), which work on a single
64-bit datum. The main benefit of this strategy is that, because no
parallelism is exploited, it fits existing scalar code naturally and does
not require 16-byte alignment of memory operands; the major drawback is that
it forgoes a potential 2x speedup. The tradeoff between development time and
expected performance determines which strategy is more sensible. It should be
noted that hand-coding algorithms with SSE2 should be faster than with x87,
as SSE2 offers directly addressable registers instead of the unwieldy x87
register stack.
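The two strategies can be compared on a simple array sum. The following is only a sketch, assuming a C compiler that provides the SSE2 intrinsics header `<emmintrin.h>`; the function names `sum_scalar` and `sum_packed` are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar SSE2: one double per addition (ADDSD). */
double sum_scalar(const double *x, size_t n) {
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; ++i)
        acc = _mm_add_sd(acc, _mm_load_sd(&x[i]));
    return _mm_cvtsd_f64(acc);          /* read the low quadword */
}

/* Packed SSE2: two doubles per addition (ADDPD). */
double sum_packed(const double *x, size_t n) {
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&x[i]));
    /* Fold the high lane into the low lane. */
    acc = _mm_add_sd(acc, _mm_unpackhi_pd(acc, acc));
    /* Handle a trailing odd element with a scalar add. */
    for (; i < n; ++i)
        acc = _mm_add_sd(acc, _mm_load_sd(&x[i]));
    return _mm_cvtsd_f64(acc);
}
```

The scalar version maps one-to-one onto an existing loop, while the packed version needs lane folding and a remainder loop, and it adds the elements in a different order, so results can differ in the last bits for general inputs.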
Pentium 4 processors show
very poor x87 performance, far below that of the current champion, the AMD
Athlon; it is therefore clear
that the route to fast floating-point computation passes through SSE2.
If Intel can get both Microsoft
and Borland to work on a vectorizing
compiler, the Pentium 4 may prove to be a winner, but if compiler support
is lackluster (like the current support for MMX
and SSE), the Pentium 4 is likely to suffer
from a lack of optimized software.
Here is the list of SSE2 instructions that
extend SSE (adapted from Intel’s
preliminary documentation):

DATA MOVEMENT INSTRUCTIONS
MOVAPD (move aligned packed double-precision
floating-point) transfers a 128-bit double-precision floating-point operand
from memory to an XMM register and vice versa, or between XMM registers.
The memory address must be aligned to a 16-byte boundary, otherwise a
general-protection exception (#GP) is generated.
MOVUPD (move unaligned packed double-precision
floating-point) transfers a 128-bit double-precision floating-point operand
from memory to an XMM register and vice versa, or between XMM registers,
without any alignment requirement on the memory address.
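As a rough illustration of the difference between the two 128-bit moves, the sketch below copies a pair of doubles both ways. It assumes the `<emmintrin.h>` intrinsics, where `_mm_load_pd`/`_mm_store_pd` typically compile to MOVAPD and `_mm_loadu_pd`/`_mm_storeu_pd` to MOVUPD; the function names are hypothetical.

```c
#include <emmintrin.h>

/* Aligned copy: both pointers must be 16-byte aligned,
   otherwise the access faults. */
void copy2_aligned(double *dst, const double *src) {
    _mm_store_pd(dst, _mm_load_pd(src));
}

/* Unaligned copy: works for any address. */
void copy2_unaligned(double *dst, const double *src) {
    _mm_storeu_pd(dst, _mm_loadu_pd(src));
}
```

An array of doubles is normally only 8-byte aligned, so a pair starting at an odd index is a typical case that calls for the unaligned form.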
MOVSD (move scalar double-precision
floating-point) transfers a 64-bit double-precision floating-point operand
from memory to the low 64 bits of an XMM register and vice versa, or between
XMM registers. Alignment of the memory address is not required.
MOVHPD (move high packed double-precision
floating-point) transfers a 64-bit double-precision floating-point operand
from memory to the high 64 bits of an XMM register and vice versa. The
low quadword of the register is left unchanged. Alignment of the memory
address is not required.
MOVLPD (move low packed double-precision
floating-point) transfers a 64-bit double-precision floating-point operand
from memory to the low quadword of an XMM register and vice versa. The
high quadword of the register is left unchanged. Alignment of the memory
address is not required.
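The two half-register moves are handy when the two doubles of a vector do not sit next to each other in memory. A minimal sketch, assuming the `<emmintrin.h>` intrinsics `_mm_loadl_pd`, `_mm_loadh_pd`, `_mm_storel_pd` and `_mm_storeh_pd` (which correspond to MOVLPD and MOVHPD); `gather2` and `scatter2` are illustrative names only.

```c
#include <emmintrin.h>

/* Build the vector {*lo, *hi} from two unrelated addresses. */
__m128d gather2(const double *lo, const double *hi) {
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, lo);   /* fills the low quadword, high unchanged */
    v = _mm_loadh_pd(v, hi);   /* fills the high quadword, low unchanged */
    return v;
}

/* Write the two halves of v to two unrelated addresses. */
void scatter2(__m128d v, double *lo, double *hi) {
    _mm_storel_pd(lo, v);
    _mm_storeh_pd(hi, v);
}
```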
MOVMSKPD (move packed double-precision
floating-point mask) extracts the sign bit of each of the two packed
double-precision floating-point numbers in an XMM register and saves them in
the two low-order bits of a general-purpose register. This 2-bit value can
then be used as a condition for branching.
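A short sketch of how the 2-bit mask might be used, assuming the `_mm_movemask_pd` intrinsic from `<emmintrin.h>`; the helper names are hypothetical.

```c
#include <emmintrin.h>

/* Bit 0 holds the sign of x[0], bit 1 the sign of x[1]. */
int sign_mask2(const double *x) {
    return _mm_movemask_pd(_mm_loadu_pd(x));
}

/* Branch on the mask: nonzero if either value has its sign bit set. */
int any_negative2(const double *x) {
    return sign_mask2(x) != 0;
}
```

Because only the sign bit is examined, a negative zero also counts as "negative" here.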
ARITHMETIC INSTRUCTIONS
ADDPD (add packed double-precision
floating-point) and SUBPD (subtract packed double-precision floating-point)
add and subtract, respectively, two packed double-precision floating-point
operands.
ADDSD (add scalar double-precision
floating-point) and SUBSD (subtract scalar double-precision floating-point)
add and subtract, respectively, the low quadwords of two double-precision
floating-point operands; the high quadword of the destination operand is
left unchanged.
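The difference between the packed and scalar forms shows up in the high quadword of the result. The sketch below assumes the `_mm_add_pd` and `_mm_add_sd` intrinsics from `<emmintrin.h>`; `add_demo` is an illustrative name.

```c
#include <emmintrin.h>

void add_demo(const double a[2], const double b[2],
              double packed[2], double scalar[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(packed, _mm_add_pd(va, vb));  /* {a0 + b0, a1 + b1} */
    _mm_storeu_pd(scalar, _mm_add_sd(va, vb));  /* {a0 + b0, a1}      */
}
```

With the intrinsics, the upper half of the scalar result comes from the first argument, which plays the role of the destination register.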
MULPD (multiply packed double-precision
floating-point) multiplies two packed double-precision floating-point operands.
MULSD (multiply scalar double-precision
floating-point) multiplies the low quadwords of two packed double-precision
floating-point operands; the high quadword of the destination operand is
left unchanged.
DIVPD (divide packed double-precision
floating-point) divides two packed double-precision floating-point operands.
DIVSD (divide scalar double-precision
floating-point) divides the low quadwords of two packed double-precision
floating-point operands; the high quadword of the destination operand is
left unchanged.
SQRTPD (square root packed double-precision
floating-point) returns the packed square roots of a packed double-precision
floating-point operand to the destination operand.
SQRTSD (square root scalar double-precision
floating-point) returns the square root of the low quadword of the packed
double-precision floating-point source operand to the low quadword of the
destination operand; the high quadword of the destination operand is
left unchanged.
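As an illustration of the packed arithmetic instructions working together, the sketch below computes the lengths of two 2D vectors at once using multiply, add and square root. It assumes the `<emmintrin.h>` intrinsics; `lengths2` and its data layout (x coordinates in one array, y coordinates in another) are assumptions made for the example.

```c
#include <emmintrin.h>

/* out[i] = sqrt(xs[i]*xs[i] + ys[i]*ys[i]) for i = 0, 1 */
void lengths2(const double *xs, const double *ys, double *out) {
    __m128d x = _mm_loadu_pd(xs);
    __m128d y = _mm_loadu_pd(ys);
    __m128d sq = _mm_add_pd(_mm_mul_pd(x, x), _mm_mul_pd(y, y));
    _mm_storeu_pd(out, _mm_sqrt_pd(sq));
}
```

Storing the x and y coordinates in separate arrays is what lets both vectors be processed in parallel without any shuffling.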
MAXPD (maximum packed double-precision
floating-point) compares the corresponding double-precision floating-point
values from two packed double-precision floating-point operands and returns
the numerically higher value from each comparison to the destination operand.
MAXSD (maximum scalar double-precision
floating-point) compares the low double-precision floating-point values
from two packed double-precision floating-point operands and returns the
numerically higher value from the comparison to the low quadword of the
destination operand; the high quadword of the destination operand is
left unchanged.
MINPD (minimum packed double-precision
floating-point) compares the corresponding double-precision floating-point
values from two packed double-precision floating-point operands and returns
the numerically lower value from each comparison to the destination
operand.
MINSD (minimum scalar double-precision
floating-point) compares the low double-precision floating-point values
from two packed double-precision floating-point operands and returns the
numerically lower value from the comparison to the low quadword of the
destination operand; the high quadword of the destination operand is
left unchanged.
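A common use of the maximum and minimum instructions is clamping values to a range without branches. A minimal sketch, assuming the `_mm_max_pd` and `_mm_min_pd` intrinsics from `<emmintrin.h>`; `clamp2` is a hypothetical helper.

```c
#include <emmintrin.h>

/* Clamp x[0] and x[1] to the range [lo, hi] in place. */
void clamp2(double *x, double lo, double hi) {
    __m128d v = _mm_loadu_pd(x);
    v = _mm_max_pd(v, _mm_set1_pd(lo));  /* raise values below lo */
    v = _mm_min_pd(v, _mm_set1_pd(hi));  /* lower values above hi */
    _mm_storeu_pd(x, v);
}
```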
LOGICAL INSTRUCTIONS
ANDPD (AND of packed double-precision
floating-point) returns a bitwise logical AND of two packed double-precision
floating-point operands.
ANDNPD (AND NOT of packed double-precision
floating-point) returns a bitwise logical AND NOT of two packed double-precision
floating-point operands: the destination operand is inverted and then ANDed
with the source operand.
ORPD (OR of packed double-precision
floating-point) returns a bitwise logical OR of two packed double-precision
floating-point operands.
XORPD (XOR of packed double-precision
floating-point) returns a bitwise logical XOR of two packed double-precision
floating-point operands.
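Since SSE2 has no dedicated absolute-value or negate instruction, the logical instructions are commonly used to manipulate the sign bit directly. The sketch below assumes the `_mm_andnot_pd` and `_mm_xor_pd` intrinsics from `<emmintrin.h>`; `abs_pd` and `neg_pd` are illustrative names, not library functions.

```c
#include <emmintrin.h>

/* Absolute value: clear bit 63 of each double. */
__m128d abs_pd(__m128d v) {
    const __m128d sign = _mm_set1_pd(-0.0);  /* only the sign bit is set */
    return _mm_andnot_pd(sign, v);           /* (~sign) & v */
}

/* Negation: flip bit 63 of each double. */
__m128d neg_pd(__m128d v) {
    return _mm_xor_pd(v, _mm_set1_pd(-0.0));
}
```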
COMPARE INSTRUCTIONS
These instructions compare packed and scalar
double-precision floating-point values and return the results of the comparison
either to the destination operand or to the EFLAGS register.
CMPPD (compare packed double-precision
floating-point) compares the corresponding double-precision floating-point
values from two packed double-precision floating-point operands, using
an immediate operand as a predicate, and returns a 64-bit mask result of
all 1s or all 0s for each comparison to the destination operand. The value
of the immediate operand selects one of eight predicates: equal, less than,
less than or equal, unordered, not equal, not less than, not less than or
equal, ordered. The remaining four conditions (greater than, greater than or
equal, not greater than, not greater than or equal) are obtained by swapping
the two operands, giving 12 compare conditions in all.
CMPSD (compare scalar double-precision
floating-point) compares the low double-precision floating-point values
from two packed double-precision floating-point operands, using an immediate
operand as a predicate, and returns a 64-bit mask result of all 1s or all
0s for the comparison to the low quadword of the destination operand; the
high quadword of the destination operand is left unchanged. The immediate
operand selects the compare condition as with the CMPPD instruction.
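The all-1s/all-0s masks produced by the compare instructions are typically combined with the logical instructions to select values without branching. A sketch under the assumption that the `<emmintrin.h>` intrinsics are available (`_mm_cmplt_pd` corresponds to CMPPD with the less-than predicate); `select_lt` is a hypothetical helper.

```c
#include <emmintrin.h>

/* Per lane: result = (a < b) ? if_true : if_false */
__m128d select_lt(__m128d a, __m128d b,
                  __m128d if_true, __m128d if_false) {
    __m128d mask = _mm_cmplt_pd(a, b);            /* all 1s where a < b */
    return _mm_or_pd(_mm_and_pd(mask, if_true),   /* keep if_true there */
                     _mm_andnot_pd(mask, if_false)); /* if_false elsewhere */
}
```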
COMISD (compare scalar double-precision
floating-point and set EFLAGS) and UCOMISD (unordered compare scalar
double-precision floating-point and set EFLAGS) compare the
low quadwords of two packed double-precision floating-point operands and
set the ZF, PF, and CF flags in the EFLAGS register to show the result
(greater than, less than, equal, or unordered). These two instructions
differ as follows: the COMISD instruction signals a floating-point
invalid-operation (#I) exception when a source operand is either a QNaN or
an SNaN; the UCOMISD instruction only signals an invalid-operation exception
when a source operand is an SNaN.
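From C, the flag-setting compares are reachable through intrinsics that return the outcome as an integer. The sketch below assumes `_mm_comilt_sd` and `_mm_ucomieq_sd` from `<emmintrin.h>`; the exact instruction the compiler emits for them may vary, and the wrapper names are made up for illustration.

```c
#include <emmintrin.h>

/* 1 if a < b, 0 otherwise (COMISD-style compare). */
int less_than_sd(double a, double b) {
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}

/* 1 if a == b, 0 otherwise (UCOMISD-style compare). */
int equal_sd(double a, double b) {
    return _mm_ucomieq_sd(_mm_set_sd(a), _mm_set_sd(b));
}
```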
SHUFFLE AND UNPACK INSTRUCTIONS
SHUFPD (shuffle packed double-precision
floating-point) places either of the two packed double-precision floating-point
values from the first source operand in the low quadword of the destination
operand, and places either of the two packed double-precision floating-point
values from the second source operand in the high quadword of the destination
operand. An immediate operand selects which value is taken from each source.
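One simple use of the shuffle is swapping the two halves of a register. A minimal sketch, assuming the `_mm_shuffle_pd` intrinsic from `<emmintrin.h>`; `swap_halves` is an illustrative name.

```c
#include <emmintrin.h>

/* {v0, v1} -> {v1, v0} */
__m128d swap_halves(__m128d v) {
    /* Selector bit 0 = 1: low result takes the high value of the first source.
       Selector bit 1 = 0: high result takes the low value of the second source. */
    return _mm_shuffle_pd(v, v, 1);
}
```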
UNPCKHPD (unpack high packed
double-precision floating-point) performs an interleaved unpack of the
high double-precision floating-point values of the two source operands.
It ignores the low quadwords of the sources.
UNPCKLPD (unpack low packed double-precision
floating-point) performs an interleaved unpack of the low double-precision
floating-point values of the two source operands. It ignores the high quadwords
of the sources.
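Taken together, the two unpack instructions are enough to transpose a 2x2 matrix of doubles. The sketch below assumes the `_mm_unpacklo_pd` and `_mm_unpackhi_pd` intrinsics from `<emmintrin.h>` and a row-major layout; `transpose2x2` is a hypothetical helper.

```c
#include <emmintrin.h>

/* m and t each hold a row-major 2x2 matrix: {m00, m01, m10, m11}. */
void transpose2x2(const double *m, double *t) {
    __m128d r0 = _mm_loadu_pd(m);      /* {m00, m01} */
    __m128d r1 = _mm_loadu_pd(m + 2);  /* {m10, m11} */
    _mm_storeu_pd(t,     _mm_unpacklo_pd(r0, r1));  /* {m00, m10} */
    _mm_storeu_pd(t + 2, _mm_unpackhi_pd(r0, r1));  /* {m01, m11} */
}
```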

