Intel Pentium III microarchitecture
The P6 core, introduced with the Pentium Pro processor and used in all current
Intel processors, features a RISC-like microarchitecture and an out-of-order
execution unit, representing a radical shift from previous designs.
The P6's new dynamic execution microarchitecture
removes the constraint of linear instruction sequencing between the traditional
fetch and execute phases. An instruction buffer opens a wide window on
the instructions that are not executed yet, allowing the execute phase
of the processor to have much more visibility into the instruction stream
so that a better scheduling policy may be adopted. Optimal scheduling requires
the execute phase to be replaced by decoupled dispatch/execute and retire
phases, so that instructions can start in any order that satisfies dependency
constraints, but must complete, and therefore retire, in the original program order.
This approach greatly increases performance as it more fully utilizes the
resources of the processor core.
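The dispatch/retire decoupling described above can be sketched with a toy scheduling model. This is a deliberate simplification with unlimited issue width, not a cycle-accurate P6 simulation; the `schedule` function and its dependency/latency tables are invented for illustration only:

```c
/* Toy model (not cycle-accurate) of out-of-order dispatch with in-order
   retirement.  dep[i] is the index of the earlier instruction whose result
   instruction i consumes (-1 if none); lat[i] is its latency in cycles.
   Issue width is unlimited here: only data dependencies delay execution,
   while retirement still proceeds strictly in program order.  n <= 64. */
void schedule(const int *dep, const int *lat, int n, int *retire_at)
{
    int done[64];                        /* cycle when each result is ready */
    int frontier = 0;                    /* latest completion seen so far   */
    for (int i = 0; i < n; i++) {
        int start = (dep[i] < 0) ? 0 : done[dep[i]];
        done[i] = start + lat[i];        /* may complete out of order */
    }
    for (int i = 0; i < n; i++) {
        if (done[i] > frontier)
            frontier = done[i];
        retire_at[i] = frontier;         /* never before an older instruction */
    }
}
```

For example, with dep = {-1, 0, -1, 2} and lat = {3, 1, 1, 1}, instruction 2 completes in cycle 1, long before the slow chain ahead of it, yet it retires only at cycle 4, once instructions 0 and 1 have completed.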
The P6 core executes x86 instructions by
breaking them into simpler micro-instructions called micro-ops. This task
is performed by three parallel decoders in the D1 stage of the pipeline:
the first decoder is capable of decoding one x86 instruction of four or
fewer µops in each clock cycle, while the other two decoders can
each decode an x86 instruction of one µop in each clock cycle. Once
the µops are decoded, they will be issued from the in-order front-end
into the Reservation Station (RS), which is the beginning stage of the
out-of-order core. In the RS, the µops wait until their data operands
are available; once a µop has all data sources available, it will
be dispatched from the RS to an execution unit. Once the µop has
been executed it returns to the Reorder Buffer (ROB) and waits for retirement.
In this stage, all results are committed to the architectural state and all µops
are retired in order, three at a time. The P6 core can schedule at a peak
rate of 5 micro-ops per clock, one to each resource port, but a sustained
rate of 3 micro-ops per clock is more typical.
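The decoder arrangement can be sketched with a toy throughput model. This is a simplification, not a cycle-accurate account of the D1 stage; the `decode_cycles` function and the sample µop counts are invented for illustration (instructions longer than 4 µops are crudely modeled as one per cycle):

```c
/* Toy model of the P6 decoder stage: decoder 0 handles instructions of up
   to 4 micro-ops, decoders 1 and 2 only single-micro-op instructions, and
   decoding in a cycle stops at the first instruction the next decoder
   cannot handle.  uops[i] is the micro-op count of instruction i. */
int decode_cycles(const int *uops, int n)
{
    int cycles = 0, i = 0;
    while (i < n) {
        cycles++;
        if (uops[i] > 4) {      /* microcoded: crude one-per-cycle model */
            i++;
            continue;
        }
        i++;                    /* decoder 0 takes the first instruction */
        for (int d = 1; d < 3 && i < n && uops[i] == 1; d++)
            i++;                /* decoders 1 and 2 take 1-µop instrs    */
    }
    return cycles;
}
```

Under this model, µop counts {4, 1, 1, 4, 1, 1} decode in 2 cycles, while the same instructions reordered as {1, 1, 4, 1, 1, 4} need 3, because each 4-µop instruction must wait a cycle for decoder 0.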
Using the execution units effectively
is even more troublesome. There are five execution units on the P6 core,
and each performs a well-defined set of operations: scheduling a large
batch of instructions of the same kind overloads the corresponding execution
unit, imposing long latencies while all the other execution units sit
idle. The key to fast performance is obtaining from the decoders a balanced
stream of micro-ops that evenly exploits all execution units, and this
often means that loops must be rearranged, as most of them group operations
of the same kind together (loads from memory at the beginning, computations
in the middle and stores to memory at the end).
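Such a rearrangement amounts to software pipelining: overlapping the store of one iteration with the load and computation of the next, so that loads, computations and stores are interleaved rather than grouped. A plain C sketch of the idea (the MMX version would interleave the corresponding packed instructions; `scale` is a hypothetical example, not Quexal output):

```c
#include <stddef.h>

/* Software-pipelined loop: the store of iteration i-1 is issued between
   the load and the store of iteration i, mixing memory and compute
   micro-ops instead of clustering them.  Multiplies each src[i] by k. */
void scale(const int *src, int *dst, int k, size_t n)
{
    if (n == 0)
        return;
    int t = src[0] * k;              /* prologue: first load + compute  */
    for (size_t i = 1; i < n; i++) {
        int next = src[i] * k;       /* load + compute for iteration i  */
        dst[i - 1] = t;              /* store for iteration i-1         */
        t = next;
    }
    dst[n - 1] = t;                  /* epilogue: final store           */
}
```

Note that a modern compiler may reorder a C loop like this on its own; the point here is only to show the shape of the instruction stream the P6 scheduler favors.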
Optimizing code for the P6 core is strikingly
different from doing so on previous processors, such as the Pentium, which
featured in-order execution. The developer has no control over the sequence
of execution; the goal is instead to maximize the efficiency of both the
decoders and the execution units.
Pushing the decoding bandwidth to the
limit means scheduling instructions with a 4-1-1 pattern, where these numbers
refer to the count of micro-ops generated by each instruction. When working
with MMX instructions, all opcodes require
only 1 micro-op, except computations that take a memory reference as a source
operand, and writes to memory. The MMX register set contains only 8 registers,
so many instructions end up using a memory reference as a source
operand, and since this kind of instruction can only be translated
by decoder 0, it leads to stalls in this stage of the pipeline. The only way
to relieve this problem is a smart register allocation strategy
that minimizes the number of memory references.
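In C terms, the idea corresponds to hoisting frequently used operands into locals once, outside the loop, so the inner-loop operations never need a memory source operand. `filter3` below is a hypothetical example of this register-allocation strategy, not Quexal output:

```c
#include <stddef.h>

/* Hoist the three filter taps into locals before the loop, so the
   inner-loop multiplies use register operands only: on the P6, the
   memory-operand forms could be decoded by decoder 0 alone.
   Computes y[i] = c0*x[i] + c1*x[i+1] + c2*x[i+2] for i+2 < n. */
void filter3(const int *x, const int *c, size_t n, int *y)
{
    int c0 = c[0], c1 = c[1], c2 = c[2];   /* taps stay in registers */
    for (size_t i = 0; i + 2 < n; i++)
        y[i] = c0 * x[i] + c1 * x[i + 1] + c2 * x[i + 2];
}
```

With only 8 MMX registers, the real trade-off is deciding which values deserve to stay resident; here the three taps are reused every iteration, so they are the obvious candidates.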
Another key technique is minimizing dependency
constraints among micro-ops, so that they do not often stall waiting for data
operands: the easiest way to maximize Instruction Level Parallelism
(ILP) is to unroll loops and schedule two or more independent computation
chains together. While this is hardly a novel technique, actually implementing
it is genuinely complex due to the limited number of MMX registers available,
and a clever register allocation strategy is mandatory.
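In plain C, the unroll-and-interleave idea looks like the sketch below: two independent accumulators form two dependency chains that the out-of-order core can execute in parallel. The real MMX code would keep two packed accumulators in MMX registers; `sum_unrolled` is a hypothetical example:

```c
#include <stddef.h>

/* Unrolled sum with two independent accumulators: s0 and s1 form two
   separate dependency chains, so consecutive adds need not wait on
   each other and can be dispatched in overlapping cycles. */
long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {   /* two interleaved computation chains */
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)                    /* odd-length tail element */
        s0 += a[i];
    return s0 + s1;
}
```

Each extra accumulator costs a register, which is exactly why unrolling and register allocation must be planned together on an 8-register machine.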
It is therefore evident that writing high-performance
MMX code requires much more than knowledge of the instruction set:
the developer needs a solid background both in traditional compiler
design, to devise an effective register allocation strategy, and in the
microarchitecture of current processors, to avoid pitfalls in the hand-scheduled
code.
Quexal implements an optimizing
compiler that exploits all these techniques. The source listing is
re-arranged to maximize the Instruction Level Parallelism (ILP), then the
instructions are scheduled so that:
1. they satisfy the 4-1-1 pattern to fully
use all decoders;
2. the resulting stream of micro-ops is
balanced and makes effective usage of available hardware resources;
3. the number of required registers does
not exceed the number of available MMX registers.
The compiler outputs high-quality code that matches the speed of hand-optimized
samples. Performance benchmarks show that the produced code usually makes optimal
usage of the decoders and achieves a typical rate of 3 micro-ops per cycle,
without introducing excessive register spilling to memory.