Microprocessor instruction pipelining is a hardware technique that allows multiple instructions to be processed simultaneously within the instruction cycle. It is enabled by the instruction cycle itself, which divides the operations performed on each instruction into standalone phases (e.g. fetch, decode, execute). In the context of the pipeline, we call these pipeline stages. These stages operate in parallel, and each stage processes a separate instruction on each clock cycle.
Pipelining is not limited to microprocessor design. As a technique for running multiple stages of a process in parallel, it is also used in assembly lines, fast-food restaurants, etc.
Principle of Operation
One of the most popular ways of explaining how a pipeline works is the laundry analogy, in which doing the laundry is divided into three operations: washing, drying and folding.
The washing is performed using a washing machine, the drying using a dryer machine, and the folding is performed manually by the person doing the laundry. For the purpose of the example, let's say the washing takes 60 minutes, the drying takes 30 minutes and the folding takes 30 minutes. As each operation starts only after the previous one has finished, one load of laundry is completed in 2 hours.
In Fig. 1 we can see the timeline of these operations. If we look closely at the process, one thing becomes obvious – once an operation is finished for the current load of clothes, the hardware used stays idle and waits for the next load. For example, the washing machine is idle while the drying and folding operations are performed. This is certainly not the most efficient way of doing things, and one way to improve it is shown in Fig. 2. There we can see the same process of doing laundry, but this time using a pipeline technique. As soon as the washing is completed, we can put in the clothes of the 2nd load, so the washing machine keeps working while the drying and folding of the 1st load are being executed.
Based on the examples above, we can conclude that using the pipeline technique does not have an impact on the time needed for completing a single load of laundry (it takes 2 hours). The improvement is visible when doing multiple loads. Without the pipeline, we can do two loads for a total of 4 hours. Using the pipeline as shown in Fig. 2 we can do 3 loads in the same time frame. The speedup would be even greater if all of the operations took the same time to complete, thus allowing better overlapping in the pipeline. This is applied in microprocessor pipeline implementations.
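The timing arithmetic above can be sketched in a few lines of Python. This is a minimal model, assuming the stage durations from the example (wash 60 min, dry 30 min, fold 30 min) and that, in the pipelined case, the washer (the slowest stage) sets the rate at which new loads can start:

```python
# Durations (in minutes) from the laundry example.
WASH, DRY, FOLD = 60, 30, 30

def sequential_time(loads):
    """Total minutes when each load waits for the previous one to finish."""
    return loads * (WASH + DRY + FOLD)

def pipelined_time(loads):
    """Total minutes when a new load starts washing as soon as the washer is free.
    The washer (60 min) is the slowest stage, so it sets the pipeline rate:
    the first load takes the full 120 minutes, and every later load finishes
    one washer slot (60 minutes) after the previous one."""
    return (WASH + DRY + FOLD) + (loads - 1) * WASH

print(sequential_time(2))  # 240 minutes: two loads take 4 hours sequentially
print(pipelined_time(3))   # 240 minutes: three loads fit in the same 4 hours
```

The model also shows why equal stage times would help: the `(loads - 1) * WASH` term is dominated by the longest stage, so shortening the washer slot (or balancing all stages) directly raises throughput.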
Microprocessor Pipeline Example
Now let’s see how the pipeline technique is applied to a microprocessor.
It should be noted that there are many different pipeline implementations, and each can have a different number of pipeline stages. For this conceptual example, we will keep things simple and use a RISC load/store CPU architecture with a 5-stage pipeline. The stages are:
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Execute (EX)
- Memory Access (MEM)
- Writeback (WB)
Each stage is executed by its own dedicated CPU functional unit and each takes one clock cycle to execute.
In Fig. 3 we can see how instructions are overlapped using the pipelining technique. In the first clock cycle, the first instruction is fetched. In the following clock cycle, that same instruction is decoded while, at the same time, the second instruction is being fetched. During the third clock cycle, the first instruction is executed, the second instruction is decoded and the third instruction is being fetched. On the fifth clock cycle, the first instruction completes. From then on, an instruction completes on every clock cycle. The time needed for the first instruction to pass through the pipeline and complete is called the "time to fill". It depends on the number of stages. In this example, it takes 5 cycles to fill the pipeline.
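The overlap described above can be reproduced with a small sketch. This is an idealized model of the 5-stage schedule in Fig. 3 (no stalls or hazards): instruction i enters the pipeline on cycle i + 1 and advances one stage per cycle.

```python
# Idealized 5-stage pipeline schedule: which stage each in-flight
# instruction occupies on each clock cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr, cycle):
    """Stage occupied by instruction `instr` (0-based) during clock `cycle`
    (1-based), or None if it has not yet entered / has already left."""
    idx = cycle - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

# Print the schedule for the first three instructions over seven cycles.
for cycle in range(1, 8):
    row = "  ".join(
        f"I{i + 1}:{stage_of(i, cycle) or '--'}" for i in range(3)
    )
    print(f"cycle {cycle}: {row}")
```

Running it shows the pattern from Fig. 3: on cycle 5 the first instruction is in WB (completing), the second in MEM and the third in EX, and from cycle 5 onward one instruction finishes per cycle.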
In order to properly identify the performance improvement that can be achieved using the pipelining technique, we need to use two terms – latency and throughput.
Latency is the time it takes for an operation to complete.
Throughput is the number of operations that are completed in a certain time frame.
Looking at the example shown in Fig. 3 we can make the following conclusions:
- The pipeline does not affect the latency. The instruction cycle consisting of 5 phases is the same whether pipelining is used or not. Each instruction takes 5 clock cycles to complete.
- The pipeline improves the throughput, with potential speedup equal to the number of pipeline stages.
Once the pipeline is filled and continuously fed, we can have an instruction completed on every clock cycle (Cycles Per Instruction (CPI) = 1). This is, of course, the ideal case not taking into account some of the pipeline limitations mentioned in the next chapter.
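These two conclusions can be checked with some back-of-the-envelope arithmetic. The sketch below assumes the ideal 5-stage case with no hazards: without pipelining each instruction occupies all 5 stages before the next one starts, while with pipelining the first instruction pays the time to fill and every later one completes a cycle apart.

```python
STAGES = 5  # 5-stage pipeline from the example

def cycles_without_pipeline(n):
    """Each instruction runs all stages before the next one starts."""
    return n * STAGES

def cycles_with_pipeline(n):
    """Time to fill (STAGES cycles for the first instruction),
    then one completion per cycle for the remaining n - 1."""
    return STAGES + (n - 1)

n = 1000
speedup = cycles_without_pipeline(n) / cycles_with_pipeline(n)
print(f"speedup for {n} instructions: {speedup:.2f}")  # approaches 5 as n grows
```

Latency is unchanged in both cases (5 cycles per instruction), but the speedup ratio tends toward the number of stages as the instruction count grows, which is the "potential speedup equal to the number of pipeline stages" stated above.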
There are some known limitations to pipelining, commonly called hazards:
- Data Hazard – Can occur when an instruction depends on the result of a previous instruction still being processed in the pipeline.
- Structural Hazard – Can occur when the available hardware does not support some instruction combinations.
- Control Hazard – Caused by jump and branch instructions. These usually require flushing (emptying) the pipeline and refilling it with instructions from the address targeted by the branch or jump.
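As a small illustration of the first case, a read-after-write (RAW) data hazard can be detected by checking whether a later instruction reads a register that an earlier, still-in-flight instruction writes. This is a hypothetical sketch, with instructions represented simply as a destination register plus source registers:

```python
# Hypothetical RAW data-hazard check: an instruction is modeled as a
# (dest, sources) tuple of register names.
def raw_hazard(earlier, later):
    """True if `later` reads a register that `earlier` writes."""
    dest, _ = earlier
    _, sources = later
    return dest in sources

# e.g. ADD r1, r2, r3 followed by SUB r4, r1, r5 depends on r1.
add = ("r1", ("r2", "r3"))
sub = ("r4", ("r1", "r5"))
print(raw_hazard(add, sub))  # True: SUB reads r1 before ADD has written it back
```

In a real pipeline, such a hazard is resolved by stalling the later instruction or by forwarding the result between stages, at the cost of the ideal CPI = 1.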