
      Intel Pentium 4 MMX performance

      The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture (click here for an overview of the SSE2 instruction set).
      However, until the Pentium 4 gains a significant share of the market, most applications will not be tuned for it, so it is interesting to analyze how the P4 runs current MMX/SSE code.
      The following table summarizes the latencies and throughputs of MMX/SSE instructions on the P4:
       
       
      Instruction                                Latency  Throughput  Execution Unit
      MOVD mm,r32                                2        1           MMX_ALU
      MOVD r32,mm                                5        1           FP_MISC
      MOVQ mm,mm                                 6        1           FP_MOV
      PACKSSWB / PACKSSDW / PACKUSWB mm,mm       2        1           MMX_SHFT
      PADDB / PADDW / PADDD mm,mm                2        1           MMX_ALU
      PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm  2        1           MMX_ALU
      PAND / PANDN / POR / PXOR mm,mm            2        1           MMX_ALU
      PCMPEQB / PCMPEQW / PCMPEQD mm,mm          2        1           MMX_ALU
      PCMPGTB / PCMPGTW / PCMPGTD mm,mm          2        1           MMX_ALU
      PMADDWD mm,mm                              8        1           FP_MUL
      PMULHW / PMULLW / PMULHUW mm,mm            8        1           FP_MUL
      PSLLW / PSLLD / PSLLQ mm,mm/imm8           2        1           MMX_SHFT
      PSRAW / PSRAD mm,mm/imm8                   2        1           MMX_SHFT
      PSUBB / PSUBW / PSUBD mm,mm                2        1           MMX_ALU
      PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm  2        1           MMX_ALU
      PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm    2        1           MMX_SHFT
      PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm    2        1           MMX_SHFT
      EMMS                                       12       12
      PAVGB / PAVGW mm,mm                        2        1           MMX_ALU
      PEXTRW r32,mm,imm8                         7        2           MMX_SHFT,FP_MISC
      PINSRW mm,r32,imm8                         4        1           MMX_SHFT,MMX_MISC
      PMAX / PMIN mm,mm                          2        1           MMX_ALU
      PMOVMSKB r32,mm                            7        2           FP_MISC
      PSADBW mm,mm                               4        1           MMX_ALU
      PSHUFW mm,mm,imm8                          2        1           MMX_SHFT

       
      Latency: the number of clock cycles required to complete the execution of all of the µops that form an instruction.
      Throughput: the number of clock cycles to wait before the issue ports are free to accept the same instruction again.
      Execution Unit: the names of the execution units in the execution core that are used to execute the µops of each instruction.
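
      To see how these two numbers interact, here is a rough cycle model in Python. It is only a sketch under simplifying assumptions of mine (a single instruction type, perfect scheduling, no memory or decoder stalls), and the function names are made up for illustration. It compares a chain of instructions where each one needs the previous result with a stream of independent instructions, using the PMADDWD figures from the table.

```python
# Rough cycle model based on the latency/throughput definitions above.
# Assumptions (not from Intel's documentation): one instruction type,
# perfect scheduling, no memory, decoder or port-sharing stalls.

def dependent_chain_cycles(n, latency):
    """Each instruction waits for the previous result, so latencies add up."""
    return n * latency

def independent_stream_cycles(n, latency, throughput):
    """A new instruction can issue every `throughput` cycles; the last one
    issued still needs `latency` cycles before its result is available."""
    if n == 0:
        return 0
    return (n - 1) * throughput + latency

# PMADDWD on the P4, values from the table: latency 8, throughput 1.
PMADDWD_LATENCY = 8
PMADDWD_THROUGHPUT = 1

chain = dependent_chain_cycles(16, PMADDWD_LATENCY)
stream = independent_stream_cycles(16, PMADDWD_LATENCY, PMADDWD_THROUGHPUT)
print("16 dependent PMADDWD:  ", chain, "cycles")   # 128 cycles
print("16 independent PMADDWD:", stream, "cycles")  # 23 cycles
```

      In this model the long latency only hurts when results are needed immediately; code with enough independent work in flight is limited by throughput instead.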

      Assuming that we should have P4s running at 2 GHz and more pretty soon, I would not worry about the doubling in latency of most MMX instructions. But the latency of the multiply instructions (PMADDWD / PMULHW / PMULLW) jumped from 3 cycles in the P6 core to 8 cycles in the Pentium 4! This will affect all the convolution kernels that are widely used, for example, in audio applications. Another troublesome latency is MOVQ's 6 cycles versus only 1 cycle on the P6 core, given that it is widely used to move memory blocks and copy results.
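
      The cycle counts above can be turned into wall-clock figures with a little arithmetic. The sketch below uses the latencies quoted in this article; the 1-cycle P6 figure for the simple ALU instructions is inferred from the "doubling" remark, and the clock speeds (a 1.0 GHz Pentium III and a 1.5 GHz Pentium 4) are assumptions picked for illustration, not measurements.

```python
# Latencies in clock cycles, as quoted in the text.
# The "simple ALU" P6 value of 1 is inferred from the "doubling" remark.
LATENCY = {
    "simple ALU (PADDx, PAND, ...)": {"P6": 1, "P4": 2},
    "PMADDWD / PMULHW / PMULLW": {"P6": 3, "P4": 8},
    "MOVQ mm,mm": {"P6": 1, "P4": 6},
}

# Assumed clock speeds, for illustration only.
P3_CLOCK_GHZ = 1.0
P4_CLOCK_GHZ = 1.5

def latency_ns(cycles, clock_ghz):
    """Latency in nanoseconds: cycles divided by cycles-per-nanosecond."""
    return cycles / clock_ghz

def break_even_ghz(p6_cycles, p4_cycles, p6_clock_ghz):
    """P4 clock at which the latency in nanoseconds equals the P6 figure."""
    return p6_clock_ghz * p4_cycles / p6_cycles

for name, lat in LATENCY.items():
    p3_ns = latency_ns(lat["P6"], P3_CLOCK_GHZ)
    p4_ns = latency_ns(lat["P4"], P4_CLOCK_GHZ)
    even = break_even_ghz(lat["P6"], lat["P4"], P3_CLOCK_GHZ)
    print(f"{name}: P-III {p3_ns:.2f} ns, P4 {p4_ns:.2f} ns, "
          f"break-even P4 clock {even:.2f} GHz")
```

      Under these assumptions the simple instructions break even at 2 GHz, which is why the doubling does not worry me, while the multiplies would need roughly 2.7 GHz and a register-to-register MOVQ about 6 GHz to match a 1 GHz Pentium III on latency alone.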

      But the troubles do not stop here. The image above outlines how instructions are dispatched to specific ports in the P4 execution engine. All MMX instructions are queued in Port 1! This is a major drawback compared to the P6 core, in which most MMX instructions could be issued to either Port 0 or Port 1.
      Intel's P4 Optimization Guide also reveals that: "Floating-point, MMX technology, Streaming SIMD Extensions and Streaming SIMD Extension 2 instructions with load operations require 6 more clocks in latency than the register-only version of the instructions", i.e. twice the clocks required by the P6 core.
      Summing up, the P4 can issue only one MMX instruction per cycle, and the latency is at best twice that of the older Pentium III processor. In pathological conditions, these effects add up to bring the P4's SIMD performance down to about one third of the Pentium III's. Until the P4 ramps up into the 2+ GHz frequency range, its integer SIMD execution speed will simply lag behind the venerable P6 core.
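
      The issue-rate side of this argument can be sketched the same way. The model below assumes a block of independent MMX instructions limited only by how many can be issued per cycle: one on the P4, and two on the P6 in the best case where every instruction can go to either port (the text above only says "most"). The clock speeds and function names are again assumptions for illustration.

```python
import math

# Issue ports available to MMX instructions, as described in the text.
# The P6 value is a best case: it assumes every instruction in the block
# can go to either Port 0 or Port 1.
P6_MMX_PORTS = 2
P4_MMX_PORTS = 1

def issue_cycles(n_instructions, ports):
    """Cycles needed just to issue n independent instructions."""
    return math.ceil(n_instructions / ports)

def issue_time_ns(n_instructions, ports, clock_ghz):
    """Issue time in nanoseconds at the given clock speed."""
    return issue_cycles(n_instructions, ports) / clock_ghz

n = 1000
p6_cycles = issue_cycles(n, P6_MMX_PORTS)  # 500
p4_cycles = issue_cycles(n, P4_MMX_PORTS)  # 1000

# Assumed clocks: a 1.0 GHz Pentium III against a 2.0 GHz Pentium 4.
p3_ns = issue_time_ns(n, P6_MMX_PORTS, 1.0)
p4_ns = issue_time_ns(n, P4_MMX_PORTS, 2.0)
print(p6_cycles, p4_cycles, p3_ns, p4_ns)
```

      In this simplified picture a P4 needs twice the clock of a Pentium III just to issue the same MMX work in the same time, which is consistent with the 2+ GHz threshold mentioned above.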

      Click here for the latency of MMX/iSSE instructions on the Intel Pentium III and Pentium 4 and AMD Athlon processors.
       
       

Copyright Stefano Tommesani 2000/03 - All trademarks belong to their respective holders