Introduction to Pipelining

What is pipelining?

Pipelining is a method of implementing and arranging data processing elements to allow the execution of multiple instructions concurrently by overlapping their execution. This requires careful design of the hardware components and their interconnection to enable pipelining efficiently.
If implemented successfully, this improves efficiency and performance and reduces overall processing time by breaking tasks down into smaller, independent stages.

Why is pipelining useful?

To motivate pipelining, imagine a routine like doing laundry. Processing one load of dirty laundry can be abstracted into 4 steps.

  1. Place dirty laundry in washer
  2. When washer is finished, place laundry in dryer
  3. When dryer is finished, place laundry on table to fold
  4. When folding is finished, put clothes away

A simple approach would be to run through all steps, then repeat with a new load of clothes. This routine takes a fixed amount of time, because each stage can only proceed once the one before it has completed.

However, if you have multiple loads of laundry, this can take a long time, and there is room for improvement in the strategy above. Observe that each resource (washer, dryer, table) is used in only one of the four steps and sits idle during the other three.
If we put a new load of clothes into the washer immediately after unloading it and move every load one step forward, we fill the idle time at every step and speed up the overall routine. And this without needing more machines or people!
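This overlap can be sketched with a small calculation. The 30-minute stage time below is an illustrative assumption, not taken from the text:

```python
def sequential_time(loads, stages, stage_time):
    # Each load runs through all stages before the next load starts.
    return loads * stages * stage_time

def pipelined_time(loads, stages, stage_time):
    # The first load needs `stages` steps; every further load adds
    # only one step, since it enters the washer as soon as it is free.
    return (stages + loads - 1) * stage_time

# 4 stages (wash, dry, fold, put away), assumed 30 minutes each
print(sequential_time(3, 4, 30))  # 360 minutes
print(pipelined_time(3, 4, 30))   # 180 minutes
```

For three loads, pipelining halves the total time even though each individual load still passes through all four stages.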

What this scheduling achieves is improved throughput: the routine is started again as soon as its first step can be initiated again.
However, there are some aspects of our pipeline to be considered:

  1. The total time to complete the routine once does not change.
  2. An improvement can only be achieved if there are separate resources for each stage.
The effective speed-up of doing laundry using pipelining approaches the number of stages in the pipeline, provided all stages take about the same time and there is enough work to keep the pipeline full.
E.g., 20 pipelined loads of laundry take about five times as long as 1 load, while 20 sequential loads take 20 times as long as 1 load.
This means pipelined laundry is nearly four times faster than non-pipelined!
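These numbers can be checked directly. The sketch below measures time in stage-slots, where one slot is the (assumed equal) duration of one stage:

```python
STAGES = 4  # wash, dry, fold, put away

def pipelined_slots(loads):
    # First load fills the pipeline; each later load adds one slot.
    return STAGES + loads - 1

one_load = pipelined_slots(1)           # 4 slots
twenty_pipelined = pipelined_slots(20)  # 23 slots, ~5.75x one load
twenty_sequential = 20 * STAGES         # 80 slots, 20x one load
print(twenty_sequential / twenty_pipelined)  # ~3.48, approaching 4
```

The ratio 80/23 ≈ 3.5 explains the "nearly four times faster" claim: the speed-up tends towards the stage count of 4 as the number of loads grows.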

Pipelining applied to processors

Pipelining in processors means pipelining instruction execution. Analogously to laundry, we can divide the execution of a single instruction into steps, called stages.
The RISC-V instruction set is designed in a way that allows splitting every instruction execution into the same stages:
  1. Fetch instruction from memory
  2. Read registers and decode the instruction
  3. Execute an operation or calculate an address
  4. Access an operand in data memory (if necessary)
  5. Write the result into a register (if necessary)
Similar to the laundry approach, we assume each of these steps uses a separate resource, so no conflicts occur when different stage types execute at the same time. The diagram illustrates the 5 stages of an instruction, read from left (instruction fetch from instruction memory) to right (write to register).
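Using the common abbreviations IF, ID, EX, MEM and WB for the five stages (standard names, not defined in this text), the overlap can be rendered as a simple schedule, with each column representing one clock cycle:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    # Instruction i occupies stage s during clock cycle i + s,
    # so each row is shifted one cycle to the right.
    rows = []
    for i in range(n_instructions):
        cells = ["    "] * i + STAGES
        rows.append("".join(f"{c:>4}" for c in cells))
    return rows

for row in schedule(3):
    print(row)
```

Reading any column of the printed schedule top to bottom shows that no two instructions ever occupy the same stage in the same cycle.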

We go into detail on what is happening in each of the stages in a later chapter.

With these 5 stages in place, we can now take a closer look at how the pipeline works and show that this method speeds up instruction execution just like it speeds up doing laundry.

Difference between sequential and pipelined performance

A sequential processor design corresponds to the simple laundry approach: every instruction is executed one after another. The stages can take different amounts of time since there is no need to overlap them; strictly speaking, stages are not even needed. Let us assume that reading from and writing to a register take less time than the other steps (100 ps vs. 200 ps). Then the total time to finish one instruction is 200 ps + 100 ps + 200 ps + 200 ps + 100 ps = 800 ps.
For 3 instructions, this means the time between the start of the first and the end of the third instruction is 3 × 800 ps = 2400 ps.
(Diagram: three sequential instructions, each taking 800 ps.)
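The 800 ps and 2400 ps figures follow directly from the assumed stage times:

```python
# Assumed latencies in picoseconds (from the text above):
# fetch, register read, execute, memory access, register write
stage_ps = [200, 100, 200, 200, 100]

one_instruction = sum(stage_ps)         # 800 ps
three_sequential = 3 * one_instruction  # 2400 ps
print(one_instruction, three_sequential)
```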

Now we take a closer look at the same 3 instructions in a pipelined processor design. We overlap instruction execution by starting the next instruction as soon as the previous one has finished the first stage, i.e., as soon as it leaves the first resource, which would otherwise sit idle in a sequential execution.
Any split into stages must allow the slowest part of an instruction to finish. The most convenient way is to define a uniform stage execution time corresponding to the longest time a stage can take. This is called a clock cycle (CC) and serves as the unit of time after which we can safely assume one stage to be finished and its resource freed for the next instruction. In our case, the slowest stage takes 200 ps, so a clock cycle takes 200 ps. One instruction takes 5 clock cycles, or 1000 ps, to finish, since it is composed of 5 stages. Note that this is longer than the sequential instruction execution time of 800 ps.

With these assumptions made, we can now take a closer look at the pipeline performance. Here, the time between the start of the first and the end of the third instruction is just 7 × 200 ps = 1400 ps. This is nearly half of the time needed by the sequential approach (2400 ps), despite the longer execution time of a single instruction!
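The 7-cycle figure is the general pattern "stages + instructions − 1" applied to our numbers, a sketch under the same assumptions as above:

```python
clock_cycle_ps = 200  # the longest stage determines the clock cycle
n_stages = 5
n_instructions = 3

# The first instruction needs n_stages cycles; each further
# instruction completes one cycle after the previous one.
cycles = n_stages + n_instructions - 1  # 7 cycles
total_ps = cycles * clock_cycle_ps      # 1400 ps
print(cycles, total_ps)
```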
For general estimates, the pipelining speed-up can be expressed as a formula. If all stages take the same amount of time, it is roughly:

time between instructions (pipelined) = time between instructions (non-pipelined) / number of pipeline stages

This implies that, when pipelining a large number of instructions, we can achieve a speed-up of nearly 4x, as an instruction completes every 200 ps instead of every 800 ps!
(Diagram: three pipelined instructions, one 200 ps clock cycle per stage, 7 cycles in total.)
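For a large number of instructions, the speed-up indeed approaches the factor of 4 (800 ps / 200 ps); a quick check under the assumptions above:

```python
def pipelined_total_ps(n, n_stages=5, cc_ps=200):
    # Total time for n pipelined instructions: fill the pipeline,
    # then complete one instruction per clock cycle.
    return (n_stages + n - 1) * cc_ps

seq_per_instruction_ps = 800  # non-pipelined time per instruction

n = 1_000_000
speedup = n * seq_per_instruction_ps / pipelined_total_ps(n)
print(round(speedup, 3))  # very close to 4
```

The speed-up never quite reaches 4 because of the initial pipeline fill, but the gap vanishes as n grows.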

Overall, pipelining enhances performance by increasing instruction throughput, rather than by decreasing the execution time of a single instruction. As we have shown, increasing instruction throughput is more important in the long run because of the large number of instructions programs execute.
However, there are often constraints, e.g. dependencies between instructions, that prevent perfect "stacking" of instructions. Pipeline design decisions and strategies to handle these constraints are described in the following chapters.