Intel Pentium 4 MMX performance
The recent arrival of the Intel Pentium 4 processor has generated the usual
flurry of benchmarks and comments, most of them emphasizing that current
software does not fully exploit the power of this new architecture (click
here for an overview of the SSE2 instruction set).
However, until the Pentium 4 gains a significant
share of the market, most applications will not be tuned for it, so it
is interesting to analyze how the P4 runs current MMX/SSE code.
The following table summarizes the latencies
and throughputs of MMX/SSE instructions on the P4:
Instruction | Latency | Throughput | Execution Unit
MOVD mm,r32 | 2 | 1 | MMX_ALU
MOVD r32,mm | 5 | 1 | FP_MISC
MOVQ mm,mm | 6 | 1 | FP_MOV
PACKSSWB / PACKSSDW / PACKUSWB mm,mm | 2 | 1 | MMX_SHFT
PADDB / PADDW / PADDD mm,mm | 2 | 1 | MMX_ALU
PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm | 2 | 1 | MMX_ALU
PAND / PANDN / POR / PXOR mm,mm | 2 | 1 | MMX_ALU
PCMPEQB / PCMPEQW / PCMPEQD mm,mm | 2 | 1 | MMX_ALU
PCMPGTB / PCMPGTW / PCMPGTD mm,mm | 2 | 1 | MMX_ALU
PMADDWD mm,mm | 8 | 1 | FP_MUL
PMULHW / PMULLW / PMULHUW mm,mm | 8 | 1 | FP_MUL
PSLLW / PSLLD / PSLLQ mm,mm/imm8 | 2 | 1 | MMX_SHFT
PSRAW / PSRAD mm,mm/imm8 | 2 | 1 | MMX_SHFT
PSUBB / PSUBW / PSUBD mm,mm | 2 | 1 | MMX_ALU
PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm | 2 | 1 | MMX_ALU
PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm | 2 | 1 | MMX_SHFT
PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm | 2 | 1 | MMX_SHFT
EMMS | 12 | 12 |
PAVGB / PAVGW mm,mm | 2 | 1 | MMX_ALU
PEXTRW r32,mm,imm8 | 7 | 2 | MMX_SHFT,FP_MISC
PINSRW mm,r32,imm8 | 4 | 1 | MMX_SHFT,MMX_MISC
PMAX / PMIN mm,mm | 2 | 1 | MMX_ALU
PMOVMSKB r32,mm | 7 | 2 | FP_MISC
PSADBW mm,mm | 4 | 1 | MMX_ALU
PSHUFW mm,mm,imm8 | 2 | 1 | MMX_SHFT
Latency | the number of clock cycles that are required to complete the execution of all of the µops that form an instruction.
Throughput | the number of clock cycles required to wait before the issue ports are free to accept the same instruction again.
Execution Unit | the names of the execution units in the execution core that are utilized to execute the µops for each instruction.
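
To see how latency and throughput interact, here is a minimal sketch in Python of a deliberately simplified timing model. It considers one instruction kind at a time and ignores decode, memory and port contention. The function names are hypothetical, and this is not a simulator of the P4 pipeline. The figures are taken from the table above.

```python
def dependent_chain_cycles(count, latency):
    """Cycles for `count` instructions where each needs the previous result.

    Simplified model: nothing overlaps, so every instruction pays its
    full latency.
    """
    return count * latency


def independent_stream_cycles(count, latency, throughput):
    """Cycles for `count` independent instructions of the same kind.

    Simplified model: a new instruction issues every `throughput` cycles
    and the last one still needs `latency` cycles to complete.
    """
    if count == 0:
        return 0
    return (count - 1) * throughput + latency


# P4 figures from the table: PADDW is 2/1, PMADDWD is 8/1.
PADDW = {"latency": 2, "throughput": 1}
PMADDWD = {"latency": 8, "throughput": 1}

if __name__ == "__main__":
    for name, t in (("PADDW", PADDW), ("PMADDWD", PMADDWD)):
        chain = dependent_chain_cycles(4, t["latency"])
        stream = independent_stream_cycles(4, t["latency"], t["throughput"])
        print(f"{name}: 4 dependent = {chain} cycles, "
              f"4 independent = {stream} cycles")
```

Under this model, long latencies hurt most when each instruction waits for the previous result. With enough independent work in flight, the throughput figure dominates instead.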
Assuming that we should have P4s running
at 2 GHz and more pretty soon, I would not worry about the doubling in
latency of most MMX instructions. But the latency of the multiply instructions
(PMADDWD / PMULHW / PMULLW) jumped from 3 cycles in the P6 core to 8 cycles
in the Pentium 4! This will affect all the convolution kernels that
are widely used, for example, in audio applications. Another troublesome
latency is MOVQ's 6 cycles versus only 1 cycle on the P6 core, given that
it is widely used to move memory blocks and copy results.
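
To show why the multiply latency matters for convolution, the following Python sketch emulates what PMADDWD computes: four signed 16-bit products, summed pairwise into two signed 32-bit results. It then uses that to produce one output sample of a FIR filter. The helper names `pmaddwd` and `fir_output` are hypothetical, and the code models only the arithmetic, not the timing.

```python
def _wrap_int32(value):
    """Wrap an integer to the signed 32-bit range, as the hardware does."""
    return (value + 2**31) % 2**32 - 2**31


def pmaddwd(a, b):
    """Emulate PMADDWD mm,mm on two lists of four signed 16-bit words."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("PMADDWD operates on four 16-bit words per operand")
    for w in list(a) + list(b):
        if not -32768 <= w <= 32767:
            raise ValueError("operand does not fit in a signed 16-bit word")
    low = _wrap_int32(a[0] * b[0] + a[1] * b[1])
    high = _wrap_int32(a[2] * b[2] + a[3] * b[3])
    return [low, high]


def fir_output(samples, coeffs):
    """One FIR output sample; the tap count must be a multiple of four."""
    if len(samples) != len(coeffs) or len(samples) % 4 != 0:
        raise ValueError("need equal lengths that are a multiple of four")
    acc = [0, 0]  # plays the role of the PADDD accumulator register
    for i in range(0, len(samples), 4):
        prod = pmaddwd(samples[i:i + 4], coeffs[i:i + 4])
        acc = [_wrap_int32(acc[0] + prod[0]), _wrap_int32(acc[1] + prod[1])]
    return _wrap_int32(acc[0] + acc[1])  # final horizontal add
```

An 8-tap filter needs two PMADDWD operations per output sample. The critical path of every output therefore includes at least one multiply latency: 8 cycles on the P4 against 3 on the P6 core.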

But troubles do not stop here. The image
above outlines how instructions are addressed to specific ports in the
P4 execution engine. All MMX instructions are queued in Port 1! This
is a major drawback compared to the P6 core,
in which most MMX instructions could be issued to Port 0 or Port 1.
Intel's P4 Optimization Guide also reveals
that: "Floating-point, MMX technology, Streaming SIMD Extensions and
Streaming SIMD Extension 2 instructions with load operations require 6
more clocks in latency than the register-only version of the instructions",
i.e. twice the clocks required by the P6 core.
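
As a quick illustration of the quoted rule, this Python sketch adds the 6-clock load penalty to a few register-only latencies from the table. The dictionary and function names are hypothetical, and the result is only the arithmetic implied by the quote, not a measurement.

```python
P4_LOAD_PENALTY = 6  # extra clocks quoted from Intel's P4 Optimization Guide

# Register-only latencies on the P4, copied from the table in this article.
P4_REG_LATENCY = {"PADDW": 2, "PMADDWD": 8, "PSADBW": 4}


def latency_with_load(instruction):
    """Latency of the memory-operand form: register latency plus load penalty."""
    return P4_REG_LATENCY[instruction] + P4_LOAD_PENALTY


if __name__ == "__main__":
    for name in P4_REG_LATENCY:
        print(f"{name} with a memory operand: {latency_with_load(name)} clocks")
```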
Summing up, the P4 can issue only one
MMX instruction per cycle, and the latency is at best twice that on the
older Pentium III processor. In pathological conditions, this adds up to
bring the P4's SIMD performance down to about one third of the P-III's. Until the
P4 ramps up into the 2+ GHz frequency range, its integer SIMD execution
speed will simply lag behind the venerable P6 core.
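
The break-even argument can be made concrete with a small Python sketch. It assumes that execution time is simply cycles divided by clock frequency. The 1 GHz Pentium III clock and the cycle ratios of 2 and 3 are illustrative inputs taken from the reasoning above, not benchmark results, and the function name is hypothetical.

```python
def break_even_clock_ghz(p3_clock_ghz, cycle_ratio):
    """P4 clock needed to match a Pentium III on the same MMX code.

    cycle_ratio is how many P4 cycles the code needs per P-III cycle.
    Time = cycles / frequency, so equal time requires
    p4_clock = p3_clock * cycle_ratio.
    """
    if p3_clock_ghz <= 0 or cycle_ratio <= 0:
        raise ValueError("clock and cycle ratio must be positive")
    return p3_clock_ghz * cycle_ratio


if __name__ == "__main__":
    for ratio in (2, 3):
        needed = break_even_clock_ghz(1.0, ratio)
        print(f"cycle ratio {ratio}: P4 must run at {needed:.1f} GHz "
              f"to match a 1.0 GHz P-III")
```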
Click here
for the latency of MMX/iSSE instructions on the Intel Pentium III and Pentium
4 and AMD Athlon processors.