## Embedded Intelligent System and

## Novel Computer Architecture,

## Lecture 03(a) - Understanding Modern Processor :

 ILP and Optimization CodePengju Ren<br>Institute of Artificial Intelligence and Robotics<br>Xi'an Jiaotong University<br>http://gr.xjtu.edu.cn/web/pengjuren

## Outline

■ Instruction level parallel

- Pipeline
$\square$ Data hazards
$\square$ Control hazards
$\square$ Structure hazards
- Out-of-Order Execution
$\square$ Dataflow
- Optimization based on ILP
case study
$\square$ Throughput bound
$\square$ Latency bound
$\square$ Performance Optimization


## Many kinds of processors



## Why so many? What differentiates these processors?

## Why so many kinds of processors?

Each processor is designed for different kinds of programs

- CPUs
- "Sequential" code - i.e., single / few threads
- GPUs
- Programs with lots of independent work $\rightarrow$ "Embarrassingly parallel"
- Many others: Deep neural networks, Digital signal processing, Etc.


## Parallelism pervades architecture

- Speeding up programs is all about parallelism
- Find independent work
- Execute it in parallel
- Profit
- Key questions:
- Where is the parallelism?
- Whose job is it to find parallelism?


## Where is the parallelism?

Different processors take radically different approaches

- CPUs: Instruction-level parallélism (ILP)
- Implicit
- Fine-grain
- GPUs: Thread- \& data-level parallelism (TLP, DLP)
- Explicit
- Coarse-grain


## Whose job to find parallelism?

Different processors take radically different approaches

- CPUs: Hardware dynamically schedules instructions
- Expensive, complex hardware $\rightarrow$ Few cores (tens)
- (Relatively) Easy to write fastsoftware
- GPUs: Software makes parallelism explicit
- Simple, cheap hardware $\rightarrow$ Many cores (thousands)
- (Often) Hard to write fast software


## Visualizing these differences

## - Pentium 4 "Northwood" (2002)



## Visualizing these differences

- Pentium 4 "Northwood" (2002)
- Highlighted areas actually execute instructions
$\rightarrow$ Most area spent on Caches and Scheduling (not on executing the program)



## Visualizing these differences

- AMD Fiji (GPU@2015)




## Visualizing these differences

- AMD Fiji (GPU@2015)
- Highlighted areas actually execute instructions
$\rightarrow$ Most area spent executing the program
- (Rest is mostly I/O \& memory, not scheduling)



## Today you will review (or learn) ...

How CPUs exploit ILP to speed up sequential code

- Key ideas:
- Pipelining \& Superscalar: Work on multiple instructions at once
- Out-of-order execution: Dynamically schedule instructions whenever they are "ready"
- Speculation: Guess what the program will do next to discover more independent work, "rolling back" incorrect guesses
- CPUs must do all of this while preserving the illusion that instructions execute in-order, one-at-a-time


## In other words... Today is about:

Cortex-A77:
Microarchitecture overview


Independent ST-Data,
deeper MLP

Higher bandwidth
dispatch

## Example: Polynomial evaluation

```
\[
\text { value }=\sum_{j=0}^{\text {terms }} \operatorname{coef}[j] x^{j}
\]
int poly(int *coef,
    int terms, int x) {
    int power = 1;
    int value = 0;
    for (int j = 0; j<terms; j++) {
        value += coef[j] * power;
        power *= x;
    }
    return value;
}
```


## Example: Polynomial evaluation

$$
\text { value }=\sum_{j=0}^{\text {terms }} \operatorname{coef}[j] x^{j}
$$

- Compiling on ARM int poly(int *coef, int terms, int $x$ ) \{
int power = 1;
int value $=0$;
for (int $\mathbf{j}=\mathbf{0 ;} \mathbf{j}<$ terms; $\mathbf{j + +}$ ) \{
value += coef[j] * power;
power $\%=\mathbf{x}$;
\}
return value;
\}
add
movs
movs
.L3:
mu 7
bne
pop
bx
.L4:
r0: value
r1: \&coef[terms]
r2: x
r3: \&coef[0]
r4: power
r5: coef[j]
poly:
cmp
ble push $\{r 4, r 5\}$
1dr r5, [r3], \#4
cmp r1, r3
m7a r0, r4, r5, r0
movs r0, \#0
bx 1r


## Compilers Manage Memory and Registers

Compilers for languages like C/C++:
■Check that program is legal
-Translate into assembly code
■Optimizes the generated code
Compiler performs "register allocation" to decide when to load/store and when to reuse

## Example: Polynomial evaluation

```
r0: value
r1: &coef[terms]
r2: x
r3: &coef[j]
r4: power
r5: coef[j]
```

- Compiling on ARM

```
int poly(int *coef,
    int terms, int x) {
    int power = 1;
    int value = 0;
    for (int j = 0; jlUterms; j++) {
        value += coef[j] * power;
        power *= x;
    }
    return value;
}
```



## Example: Polynomial evaluation

r0: value
r1: \&coef[terms]
r2: $x$
r3: \&coef[j]
r4: power
r5: coef[j]

- Compiling on ARM

Iteration
for (int $\mathbf{j}=0 ; j<$ terms; j++) value += coef[j] * power;
power *= X;
$\}$
. L3:
1dr
cmp

m7a
mu7
r5, [r3], \#4
// r5 <- coef[j]; j++
(two operations)
// compare: j < terms?
bne
r0, r4, r5, r0
// value += r5 * power
(mul + add)
r4, r2, r4 // power $*=x$
// repeat?

## Example: Polynomial evaluation

- Executing poly(A, 3, x)

```
cmp r1, #0
ble .L4
push {r4, r5}
mov r3, r0
add r1, r0, r1, rsp#2
movs r4, #1
movs r0, #0
1dr r5, [r3], #4
cmp r1, r3
m7a
mul
    r0, r4, r5, r0
bne .L3
```

...

## Example: Polynomial evaluation

- Executing poly(A, 3, x)



## Example: Polynomial evaluation

- Executing poly (A, 3, x)



## Example: Polynomial evaluation

- Executing poly (A, 3, x)



## The software-hardware boundary

- The instruction set architecture (ISA) is a functional contract between hardware and software
- It says what each instruction does, but not how
- Example: Ordered sequence of $\times 86$ instructions
- A processor's microarchitecture is how the ISA is implemented

Arch : $\mu$ Arch :: Interface : Implementation

## Simple CPU model

- Execute instructions in program order
- Divide instruction execution into stages, e.g.:
- 1. Fetch - get the next instruction from memory
- 2. Decode - figureout what to do \& read inputs
- 3. Execute - perform the necessary operations
- 4. Commit - write the results back to registers / memory
- (Real processors have many more stages)


## Evaluating polynomial on the simple CPU model


r0, r4, r5, r0

$$
r 4, r 2, r 4
$$

.L3
r5, [r3], \#4 $r 1, r^{3}$ r0, r4, r5, r0
r4, r2, r4
.L3

## Evaluating polynomial on the simple CPU model



## Evaluating polynomial on the simple CPU model

ro, r4, r5, ro

## Evaluating polynomial on the simple CPU model

r1, r3
r0, r4, r5, r0 r4, r2, r4
.L3
r5, [r3], \#4
$r 1, r 3$
r0, r4, r5, r0
r4, r2, r4
.L3



Commit
3. Load memory at r3 and compute r3 + 4

## Evaluating polynomial on the simple CPU model

$$
\mathrm{ro}, \mathrm{r} 4, \mathrm{r} 5, \mathrm{r} 0
$$

$$
r 4, r 2, r 4
$$

## Evaluating polynomial on the simple CPU model



## Evaluating polynomial on the simple CPU model



## Evaluating polynomial on the simple CPU model



## Evaluating polynomial on the simple CPU model



## Evaluating polynomial on the simple CPU model



## Evaluating polynomial on the simple CPU model

How fast is this processor? Latency? Throughput?


## Simple CPU is very wasteful




## Pipelining keeps CPU busy through instruction-level parallelism

- Idea: Start on the next instr'n immediately

1 dr cmp m7a
mul bne

1dr cmp
m7a
mul bne
r5, [r3], \#4
r1, r3
r0, r4, r5, r0 r4, r2, r4
.L3
r5, [r3], \#4
$r 1, r^{3}$
r0, r4, r5, r0
r4, r2, r4
.L3


## Pipelining keeps CPU busy through instruction-level parallelism

- Idea: Start on the next instr'n immediately
$1 d r$
r5, [r3], \#4
cmp
m7a
mul
bne
1 dr
cmp
m7a
mul
bne
r1, r3
r0, r4, r5, r0
r4, r2, r4
.L3
r5, [r3], \#4
$r 1, r^{3}$
r0, r4, r5, r0
r4, r2, r4
.L3



## Pipelining keeps CPU busy through instruction-level parallelism

- Idea: Start on the next instr'n immediately

1 dr
r5, [r3], \#4
cmp
m7a
mul
bne
1 dr
cmp
m7a
mul
bne
r1, r3
r0, r4, r5, r0 r4, r2, r4
.L3
r5, [r3],\#4
$r 1, r^{3}$
r0, r4, r5, r0
r4, r2, r4
.L3


## Pipelining keeps CPU busy through instruction-level parallelism

- Idea: Start on the next instr'n immediately

1 dr
r5, [r3], \#4
cmp
m7a
mul
bne
1 dr
cmp
m7a
mul
bne
r1, r3
r0, r4, r5, r0
r4, r2, r4
.L3
r5, [r3] \#\#
r1, ${ }^{3} 3$
ro, r4, r5, r0
r4, r2, r4
.L3


## Pipelining keeps CPU busy through instruction-level parallelism

- Idea: Start on the next instr'n immediately



## Pipelining keeps CPU busy through instruction-level parallelism

- Idea: Start on the next instr'n immediately
1dr
r5, [r3], \#4
cmp
m7a
mul
bne
1 dr
cmp
m7a
mul
bne
r1, r3
r0, r4, r5, r0
r4, r2, r4
.L3
r5, [r3] ${ }^{2} \# 4$
r1, $r^{3}$
ro, r4, r5, r0
r4, r2, r4
.L3



## Evaluating polynomial on the pipelined CPU

How fast is this processor? Latency? Throughput?

|  |  |  |  |  |  | Latency? Throughput? |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fetch | 1dr | cmp | m7a | mu1 | bne | 7d | cmp | m7a | mu 1 | bne |
| Decode |  | 1dr | cmp | $\mathrm{m}^{\mathrm{Ta}}$ | mu 1 | bne | 1dr | cmp | m7a | mu 7 |
| Execute |  | $9$ | $1 \mathrm{dr}$ | cmp | m7a | mu 1 | bne | 1dr | cmp | m7a |
| Commit |  |  |  | 1 dr | cmp | m7a | mu7 | bne | 1dr | cmp |
| Throughput $=1 \mathrm{instr} / \mathrm{ns}$ 4X speedup! |  |  |  |  |  |  |  |  |  |  |

## Speedup achieved through pipeline parallelism

| $\sum$ TIME ${ }^{\text {a }}$ instructions atro |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fetch | 1dr | cmp | m7a | mul | bne | 7dr | ${ }_{c m p}$ | m 7 a | mu 1 | bne |
| Decode |  | 1dr | $\mathrm{cmp}$ | $m \times a$ | oful | bne | 1dr | cmp | m7a | mu 7 |
| Execute | $p$ | $n g$ | $1 \mathrm{dr}$ | cmp | m7a | mu1 | bne | 1dr | cmp | m7a |
| Commit |  |  |  | 1dr | cmp | m7a | mu7 | bne | 1dr | cmp |

## Limitations of pipelining

- Parallelism requires independent work
- Q: Are instructions independent ?
- A: No! Many possible hazards limit parallelism...
$\square$ Data hazards
$\square$ Structure hazards
-Control hazard


## Data hazards

1dr rx, [rm], \#4 // rx $\leftarrow$ Memory[rm]; rmఈ rm + 4 cmp ry, rn // compare ry and rn

Q: When can the CPU pipeline the cmp behind 1dr?

| Fetch | 1 dr | cmp | $\ldots$ | $\ldots$ | $\cap .$. | $\ldots$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Decode |  | 1 dr | cmp | $\ldots$ | $\ldots$ | $\ldots$ |
| Execute |  |  | 1 dr | cmp | $\ldots$ | $\ldots$ |
| Commit |  |  |  | 1 dr | cmp | $\ldots$ |

- A: When they use different registers
- Specifically, when cmp does not read any data written by 1dr
- E.g.,
- rx != ry
- $r x!=r n$
- $r m!=r n$
- rm!=ry


## Dealing with data hazards: Stalling the pipeline

- Cannot pipeline cmp (1dr writes r3)

1 dr cmp m7a mul bne

1 dr cmp
m7a
mul bne
r5, [r3], \#4 r1, r3
r0, r4, r5, r0 r4, r2, r4 .L3
r5, [r3], \#4 $r 1, r^{3}$
r0, r4, r5, r0
r4, r2, r4
.L3

## Dealing with data hazards: Stalling the pipeline

- Cannot pipeline cmp (1dr writes r3)

1 dr r5, [r3], \#4 cmp m7a mu 1 bne

1dr cmp
m7a
mul bne
r1, r3
r0, r4, r5, r0 r4, r2, r4 .L3
r5, [r3],\#4 $r 1, r^{3}$
r0, r4, r5, r0
r4, r2, r4
.L3

## Dealing with data hazards: Stalling the pipeline

- Cannot pipeline cmp (1dr writes r3)

1 dr cmp m7a mul bne

1 dr cmp
m7a
mul bne
r5, [r3], \#4 r1, r3
r0, r4, r5, r0 r4, r2, r4
.L3
r5, [r3],\#4 $r 1, r^{3}$
r0, r4, r5, r0
r4, r2, r4
.L3

## Dealing with data hazards: Stalling the pipeline

- Cannot pipeline cmp (1dr writes r3)

1 dr cmp m7a mul bne

1dr cmp
m7a
mul bne
r5, [r3], \#4 r1, r3
r0, r4, r5, r0 r4, r2, r4 .L3
r5, [r3],\#4 $r 1, r 3$ r0, r4, r5, r0
r4, r2, r4
.L3


Inject a "bubble" (NOP) into the pipeline

## Dealing with data hazards: Stalling the pipeline

- Cannot pipeline cmp (1dr writes r3)

| 1dr | $r 5,[r 3], \# 4$ |
| :--- | :--- |
| cmp | r1, r 3 |
| m1a | r0, r4, r5, r0 |
| mu1 | r4, r2, r4 |
| bne | .$L 3$ |
| 1dr | r5, [r3], \#4 |
| cmp | r1, r3 |
| m7a | r0, r4, r5, r0 |
| mu1 | $r 4, r 2, r 4$ |
| bne | .$L 3$ |



## Stalling degrades performance



- But stalling is sometimes unavoidable
- E.g., long-latency instructions (divide, cache miss)


## Dealing with data hazards: Forwarding data

- Wait a second... data is available after Execute!

- Forwarding eliminates many (not all) pipeline stalls


## Speedup achieved through pipeline parallelism

| $\rangle$ TIME |  |  |  |  |  | Processor works on 4 instructions at af fime |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fetch | 1dr | cmp | m7a | mul | bne | 7dr | $\underset{c m p}{2}$ | m7a | mu 7 | bne |
| Decode |  | 1dr | cmp | $\mathrm{m}^{2}$ | mul | bne | 1dr | cmp | m7a | mu 1 |
| Execute | $p \theta$ |  | $1 \mathrm{dr}$ | cmp | m7a | mu 7 | bne | 1dr | cmp | m7a |
| Commit |  |  |  | 7dr | cmp | m7a | mu7 | bne | 1dr | cmp |

## Pipelining is not free!

- Q: How well does forwarding scale?
- A: Not well... many forwarding paths indeep \& complex pipelines



## Control hazards + Speculation

- Programs must appear to execute in program order $\rightarrow$ All instructions depend on earlier ones
- Most instructions implicitly continue at the next...
- But branches redirect execution to new location


## Dealing with control hazards: Flushing the pipeline

- What if we always fetch the next instruction?



## Dealing with control hazards: Flushing the pipeline

- What if we always fetch the next instruction?



## Dealing with control hazards: Flushing the pipeline

- What if we always fetch the next instruction?



## Dealing with control hazards: Flushing the pipeline



## Pipeline flushes destroy performance



- Penalty increases with deeper pipelines


## Dealing with control hazards: Speculation!

- Processors do not wait for branches to execute
- Instead, they speculate (i.e., guess) where to go next + start fetching
- Modern processors use very sophisticated mechanisms
- E.g., speculate in Fetch stage-before processor even knows instrs is a branch! (Branch Instrs can be detected by PC)
- >95\% prediction accuracy
- Still, branch mis-speculation is major problem (The wider and deeper the pipeline, the more serious the problem)


## Pipelining Summary

- Pipelining is a simple, effective way to improve throughput
- $N$-stage pipeline gives up to $N \times$ speedup
- Pipelining has limits
- Hard to keep pipeline busy because of hazards
- Forwarding is expensive in deep pipelines(critical path)
- Pipelineflushes are expensive in deep pipelines
$\rightarrow$ Pipelining is ubiquitous, but tops out at $N \approx 15$


## Software Takeaways

- Processors with a simple "in-order" pipeline are very sensitive to running "good code"
- Compiler should target a specific model of CPU
- Low-level assembly hacking
- ...But very few CPUs are in-order these days
- E.g., embedded, ultra-low-power applications
- Instead, चall modern CPUs are "out-of-order"
- Even in classic "low-power domains" (like mobile)


## Out-of-Order Execution

## Instruction Classes (as convention)

- Arithmetic and logical operations
- compute a result as a function of the operands
- update PC to the next sequential instruction
- Data "movement" operations (no compute)
- fetch operands from specifiedlocations
- store operand values to specifíed locations
- update PC to the next sequential instruction
- Control flow operations (affects only PC Sequential
- compute a branch condition and a target a
- if "branch condition is true" then PC <- target a else PC <- next seq. instruction

In-order
Atomic

## Superpipelined and SuperScalar Execution

$$
\begin{array}{llll}
\text { Code 1: } & r 1 \leftarrow r 2+1 \\
& r 3 \leftarrow r 1 * 2 & \text { Code } 2: & r 1 \leftarrow r 2+1 \\
& r 4 \leftarrow r 0-r 3 & & r 3 \leftarrow r 9 * 2 \\
& r 4 \leftarrow r 0-r 10
\end{array}
$$

Code1: ILP=1 i.e., must execute serially
Code2: ILP=3 i.e., can execute at the same time


Accessing ILP=2 requires:
(1) larger scheduling window and (2) out-of-order execution

## Superpipelined and SuperScalar Execution



Achieving full performance requires finding N "independent" instructions on every cycle

## Increasing parallelism via dataflow

- Parallelism limited by many false dependencies, particularly sequential program order
- Dataflow tracks how instructions actually depend on each other
- True dependence: read-after-write
- False dependence: write-after-write, write-after-read


## Dataflow increases parallelism by eliminating unnecessary dependences

## Example: Dataflow in polynomial evaluation

$7 d r$
cmp
m 7 a
mu 7
bne
1 dr cmp
m7a
mul bne
r5, [r3], \#4 r1, r3
r0, r4, r5, r0
r4, r2, r4
.L3
r5, [r3], \#4
r1, ${ }^{3} 3$
r0, r4, r5, r0
r4, r2, r4
.L3

## Example: Dataflow in po bne nial evaluation

| 1 dr | r5, [r3], \#4 |
| :---: | :---: |
| cmp | r1. r3 |
| m7a | r0, r4, r5, r0 |
| mul | r4, r2, r4 |
| bne | . 43 |
| 1 dr | , |
| cmp | $\mathrm{r} 1, \mathrm{r} 3$ |
| m7a | ro, r4, r5, |
| mul | r4, r2, r4 |
| bne | .L3 |



## Example: Dataflow polynomial execution

- Execution only, with perfect scheduling \& unlimited execution units
- 1dr, mul execute in 2 cycles
- cmp, bne execute in 1 cycle
- m1a executes in $\mathbf{3}$ cycles
- Q: Does dataflow speedup execution? By how much?
- Q: What is the performance bottleneck?

1dr r5, [r3], \#4
cmp r1, r3
m7a r0, r4, r5, r0
mul r4, r2, r4
bne






## Example: Dataflow polynomial execution

- Q: Does dataflow speedup execution? By how much?
- Yes! 3 cycles / loop iteration
- Instructions per cycle (IPC) $=5 / 3 \approx 1.67$ (vs. 1 for perfect pipelining)
- Q: What is the performance bottleneck?
- m1a: Each $m$ º depends on previous m1a \& takes $\mathbf{3}$ cycles
$\bullet$ This program is latency-bound


## Latency Bound

- What is the "critical path" of the computation?
- Longest path across iterations in dataflow graph
- E.g., m1 a in last slide (but could be multiple ops)
- Critical path limits maximum performance
- Real CPUs maynot achieve latency bound, but useful mental model + tool for program analysis


## Out-of-order ( OOO ) execution uses dataflow to increase parallelism

- Idea: Execute programs in dataflow order, but give the illusion of sequential execution
- This is a "restricted dataflow" model
- Restricted to instructions near those currently committing
- (Pure dataflow processors also exist that expose dataflow to software)


## High-level OoO microarchitecture



## OoO is hidden behind in-order frontend $\&$ commit



- Instructions only enter instruction queue(IQ) and leave reorder buffer(ROB) in program order; all bets are off in between!


## OoO is hidden behind in-order frontend $\&$ commit



- Instructions only enter instruction queue(IQ) and leave reorder buffer(ROB) in program order; all bets are off in between!


## OoO is hidden behind in-order frontend $\&$ commit



- Instructions only enter instruction queue(IQ) and leave reorder buffer(ROB) in program order; all bets are off in between!


## Example: OoO polynomial evaluation

- Q: Does OoO speedup execution? By how much?
- Q: What is the performance bottleneck?
- Assume perfect forwarding \& branch prediction


## Example: OoO polynomial evaluation

 pipeline diagramTIME
1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles


## Example: OoO polynomial evaluation

 pipeline diagram1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles


## Example: OoO polynomial evaluation

 pipeline diagram1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles


## Example: OoO polynomial evaluation

 pipeline diagram1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles


## Example: OoO polynomial evaluation pipeline diagram <br> 1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles



## Example: OoO polynomial evaluation pipeline diagram <br> 1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles



## Example: OoO polynomial evaluation

 pipeline diagram1 dr , mu 1 execute in 2 cycles cmp, bne execute in 1 cycle m 7 a executes in 3 cycles


- This isn't 000... or even faster than a simple pipeline!
- Q: What went wrong?
- A: We're throughput-limited: can only exec 1 instrn


## High-level Superscalar OoO microarchitecture

- Must increase pipeline width to increase ILP > 1 (2-way 3-issue)



## Focus on Execution, not Fetch \& Commit

- Goal of OoO design is to only be limited by dataflow execution
- Fetch and commit are over-provisioned so that they (usually) do not limit performance
$\rightarrow$ Programmers can (usually) ignore fetch/commit
- NOTEs: Programs with inherently unpredictable control flow will often be limited by fetch stalls (branch misprediction)
- E.g., branching based on random data


## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation




## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Example: Superscalar OoO polynomial evaluation



## Structural hazards: Other throughput limitations

- However, execution units are specialized
- Floating-point (add/multiply)
- Integer (add/multiply/compare)
- Memory (load/store)
- Processor designers must choose which execution units to include and how many
- Structural hazard: Data is ready, but instr'n cannot issue because no hardware is available


## Example: Structural hazards can severely limit performance

| Fetch \& | 1dr | m7a | bne | cmp | mu1 | 1dr | m7a | bne | cmp | mul | 1dr |  | bne | cmp | mu 1 | 7dr |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Decode | cmp | mu 7 | 1dr | m7a | bne | cmp | mu1 | 1dr | $\mathrm{mla}$ | bne | cmp | mu 1 | 1dr | m7a | bne | cmp |
| Mem Execute |  | 1dr |  | 1dr |  |  |  | O | $1 d r$ |  |  | $1 d r$ |  | 1 dr |  |  |
| Int <br> Execute |  |  |  | cmp | bne | cmp | bne |  | cmp | bne | cmp | bne |  | cmp | bne | cmp |
| Mult Execute |  |  |  | m7a |  | mu 1 |  | m7a |  |  | mu1 |  |  | m7a |  | mu1 |
| Commit |  |  |  | 1 dr | cmp | m7a |  | mul | 1dr |  | m7a |  | mu7 | 1 dr |  | m7a |
|  |  |  |  |  |  |  |  | bne | - |  |  |  | bne | cmp |  |  |

## Throughput Bound

- Ingredients:
- Number of operations to perform (of each type)
- Number \& issue rate of "execution ports"/"functional units" (of each type)
- Throughput bound =ops / issue rate
- E.g., (1 mla +1 mul) / (2 + 3 cycles)
- Again, a real CPU might not exactly meet this bound


## Software Takeaway

- OoO is much less sensitive to "good code"
- Better performance portability
- Of course, compiler still matters
- OoO makes performance analysis much simpler
- Throughput bouind: Availability of execution ports
- Latencybound: "Critical path" latency
- Slowest gives good approximation of program perf


## Out-of-Order Execútion: Under the Hood

## Register Renaming

- "False dependences" can severely limit parallelism
- Write-after-read (WAR)
- Write-after-write(WAW)
- Read-after-read (RAR)
- OoO processors eliminate false dependences by transparently renaming registers
- CPU has many more "physical" than "architectural" registers
- Each time register is written, it is allocated to a new physical register
- Physical registers freed when instructions commit


## Register Renaming

| Original: | $r 1 \leftarrow r 2 / r 3$ | Renamed: $\mathrm{p} 1 \leftarrow \mathrm{p} 2 / \mathrm{p} 3$ |
| :---: | :---: | :---: |
|  | $\mathrm{r} 4 \leftarrow \mathrm{r} 1 * \mathrm{r} 5$ | $\mathrm{p} 4 \leftarrow \mathrm{p}]^{*} \mathrm{p} 5$ |
|  | $r 1 \leftarrow r 3+r 6$ | $p 8<p 3+p 6$ |
|  | $\mathrm{r} 3 \leftarrow \mathrm{r} 1-\mathrm{r} 5$ | $\mathrm{p} 9<\mathrm{p} 8-\mathrm{p} 5$ |

Architectural Registers
(ISA defined: r0...r31)


■ Maintain mapping from ISA reg. names to physical registers

- When decoding an instruction that updates ' $r x$ ':
- allocate unused physical register 'py' to hold inst result
- set new mapping from ' $r x$ ' to ' $p y$ '
- younger instructions using ' $r x^{\prime}$ as input finds ' $p y^{\prime}$

■ De-allocate a physical register for reuse
■ Need a place to hold free physical registers (Free list)

## Memory Disambiguation

- CPU must respect store $\rightarrow$ load ordering
- E.g., a later instruction reads a value from memory written by an earlier instruction, but the address might be implicit.

$$
\begin{aligned}
& \text { st X3 \#4 } \\
& \text { 1d X2 \#16 }
\end{aligned}
$$

- But what if the OoO CPU executes the load first?
- Must "rollback" + execute the load again (next slide)
- Corollary: OoO CPU must track the order of all loads \& stores, and only write memory when a store commits


## Store Buffer

allow younger LD to execute (out-of-order), must ensure ST target block not evicted

Memory dependence and forwarding

- younger LD must check against pending ST addresses in store buffer (CAM) for RAW dependence



## Rollback \& Recovery

- OoO CPUs speculate constantly to improve performance
- E.g., even guessing the results of a computation ("value prediction")
- Need mechanisms to "rollback" to an earlier point in execution when speculation goes wrong
- Complex: Need to recover old register names, flush pending memory operations, etc (Using Checkpoint, support fewer branch instructions on-the-fly, the \# of Checkpoint is limited)
- Very expensive: Up to hundreds of instrns of work lost! (width*depth + size_of_ROB)


## SuperScalar Speculative 000 All Together

For an example:



## Outline

■ Instruction level parallel

- Pipeline
$\square$ Data hazards
$\square$ Control hazards
$\square$ Structure hazards
- Out-of-Order Exeeution
- Datafow

■ Optimization based on ILP

- Case study
$\square$ Throughput bound
$\square$ Latency bound
$\square$ Performance Optimization


## Optimization Code

Why optimize code is this programmers' problem?

- In theory, compilers and hardware "understand" all this and can optimize your program; in practice they don't.
■ Understanding the capabilities and limitations of optimizing compliers
■ They won't know about a different algorithm that might be a muchbetter "match" to the processor


## Example : Limitation of Optimizing Compiler(1)

Compilers must be careful to apply only SAFE optimization to a program. Instead, the compiler assumes the worst case and programmers must put more effort into writing programs to assist compiler to generate efficient code.


The compiler knows nothing about how twiddle1 will be called, it must assume that arguments $x p$ and yp can be equal (memory aliasing).

## Example : Limitation of Optimizing Compiler(2)

Compilers must be careful to apply only SAFE optimization to a program. Instead, the compiler assumes the worst case and programmers must put more effort into writing programs to assist compiler to generate efficient code.

f() modifies some part of the global program state (counter). Changing the number of times it gets called changes the program behavior.

## Example Program

Compute $\sin (\mathrm{x})$ using Taylor Expansion: $\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots$
For each element of an array of $\mathbf{N}$ floating-point numbers

```
void sinx(int N, int terms, float * x,
    float *result) {
    for (int i=0; i<N; i++) {
        float value = x[i];
        float numer = x[i]*x[i]*x[i];
        int denom = 6; // 3!
        int sign = -1;
    for (int j=1; j<=terms; j++) {
        value += sign* numer / denom;
        numer *= x[i] * x[i];
        denom*= (2*j+2)*(2*j+3);
        sign *= -1;
    }
    result[i] = value;
    }
}
    }
}
```


## Taylor expansion of $\sin (x)$

$$
\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots
$$

```
void sinx(int N, int terms, float * x,
            float *result) {
    for (int i=0; i<N; i++) {
        float value = x[i];
        float numer = x[i]*x[i]*x[i];
        int denom = 6; // 3!
        int sign = -1;
        for (int j=1; j<=terms; j++) {
        value += sign* numer / denom;
        numer *=x[i] % x[i];
        denom *= (2*j+2) * (2*j+3);
        sign *= -1;
    }
    result[i] = value;
    }
}
}
```

$\qquad$


3 How fast is this code?

- Where should we focus optimization efforts?
- What is the bottleneck?


## Taylor expansion of $\sin (x)$

$$
\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots
$$

```
void sinx(int N, int terms, float * x,
```

for (int $j=1 ; j<=$ terms; $j++$ ) \{
value += sign* numer / denom; numer $*=x[i] * x[i]$; denom $*=(2 * j+2) *(2 * j+3)$; $\operatorname{sign}=-1$;
\}
result[i] = value;
\}

```
for (int \(i=0 ; i<N ; i++\) ) \{
float value \(=x[i]\);
float numer \(=x[i] * x[i] * x[i]\); int denom \(=6 ; / / 3\) ! int sign = -1 ;
    result[i] = value;
    }
```

- Where should we focus optimization efforts?
- A: Where most of the time is spent


## Taylor expansion of $\sin (x)$

$$
\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots
$$

```
void sinx(int N, int terms, float * x,
            float *result) {
    for (int i=0; i<N; i++) {
        float value = x[i];
        float numer = x[i]*x[i]*x[i];
        int denom = 6; // 3!
        int sign = -1;
            for (int j=1; j<=terms; j++) {
        value += sign* numer / denom;
        numer *=x[i]*x[i];
        denom*=(2*j+2)*(2*j+3);
        sign "= -1;
    }
    result[i] = value;
    }
}
```

    - What isthe bottleneck?
    
## Dataflow for a single iteration

```
for (int j=1; j<=terms; j++) {
value += sign * numer / denom;
numer *= x[i] * x[i];
    denom * = (2*j+2) * (2*j+3);
    sign *= -1;
```



OK, but how does this perform on a real machine?

## Superscalar OOO Processor

- What in microarchitecture should we worry



## OOO Processor Microarchitecture

- What in microarchitecture should we worry about?
- Fetch \& Decode?

NO. Any reasonable machine will have sufficient frontend throughput to keep execution busy + all branches in this code are easy to predict (not always the case!).

YES. This is where dataflow + most structural hazards willlimit our performance.

- Commit?

NO. Again, any reasonable machine will have sufficient commit throughput to keep execution busy.


## Intel Skylake Execution Microarchitecture

|  | Latency | Integer Pipelined? | Number | Latency | loating <br> Pipelined? | Number |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Add | 1 | $\checkmark$ | 4 | 4* | $\checkmark$ | 2 |
| Multiply | 3 | $\checkmark$ |  | 4 | $\checkmark$ | 2 |
| Divide | 21-83 | $x$ |  | 3-15 | $x * *$ | 1 |
| Load | 2 |  | 2 |  |  |  |
|  |  | $y$ | * 3 cycles if using x87 instructions ** Can issue another operation after 4 cycles |  |  |  |

> Source: Search for "Skylake" in
> https://www.agner.org/optimize/microarchitecture.pdf https://www.agner.org/optimize/instruction tables.pdf

## What is our throughput bound?

```
value += sign * numer / denom;
numer *= x[i] * x[i];
denom *= (2*j+2) * (2*j+3);
    sign *= -1;
```



Throughput bound: Ignore data hazards, think only about max issue rate due to structural hazards


## What is our latency bound?

- Latency bound: Ignore structural hazards, think only about the critical path through data hazards



## Takeaways

- Observe performance of 23 cycles / element
- Latency bound dominates throughput bound $\rightarrow$ We are latency bound!
- Notes
- This analysis can often be "eyeballed" w/out full dataflow
- Actual execution is more complicated, but latency/throughput bounds are good approximation
- (Also, avoid division!!!)


## for (int $j=1 ;$ j $<=$ terms; $j++$ ) \{

$$
\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots
$$

## value $+=$ sign * numer / denom; numer ${ }^{*}=x[i]{ }^{*} x[i]$; denom ${ }^{*}=\left(2^{\star} j+2\right)^{*}\left(2^{\star} j+3\right)$; <br> <br> ``` ``` sign *= - ;; ``` ``` <br> <br> ``` ``` sign *= - ;; ``` ``` <br> Speeding up $\sin (x)$ : Attempt \#1

- What if we eliminate unnecessary work?
float *result) \{
for (int $i=0 ; i<N ; i++$ ) \{
float value $=x[i]$;
float $x 2=x[i] * x[i]$;
float numer $=x 2 * x[i]$; int denom $=6$; // 3! int sign = -1;
for (int $j=1 ; j<=$ terms; $j++$ ) \{ value $+=$ sign * numer / denom; numer $*=x 2$; denom *= $(2 * j+2) *(2 * j+3)$; sign = -sign;
\}
result[i] = value;
\}
\}


A: Small improvement.

6 ns / element $\approx$ 18 cycles / element

Why not better?

## What is our latency bound?

- Find the critical path in the dataflow graph



## Attempt \#1 Takeaways

- First attempt didn't change latency bound
- To get real speedup, we need to focus on the performance bottleneck
- Q: Why did we get any speedup at all?
- A: Actual dynamic scheduling is complicated; would need to simulate execution in more detail (minus the usage of multiplier, therefore reduce the \% of structure harzard)


## Speeding up $\sin (x)$ : Attempt \#2

```
for (int j=1; j==terms; j++) {
```

$\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots$
value $+=$ sign * numer / denom;
numer ${ }^{*}=x 2$;
denom ${ }^{*}=\left(2^{*} \mathrm{j}+2\right)^{*}\left(2^{*} \mathrm{j}+3\right)$;
sign $=-$ sign;

- Let's focus on that pesky division...

```
void sinx_predenom(int N, int terms, float * x, float *result) {
```

    float rdenom[MAXTERMS];
    int denom \(=6\);
    for (int \(j=1\); \(j<=\) terms; j++)
        rdenom[j] = 1.0/denom;
        denom \(*=(2 * j+2) *(2 * j+3) ;\)
    A: Big improvement!
for (int $i=0 ; i<N ; i++$ )
float value $=\times[i]$;
float $x 2=$ value * value;
float numer $=\times 2 *$ value;
int sign $=-1$;
for (int $j=1 ; j<=$ terms ; $j++$ ) \{
value $+=$ sign * numer * rdenom[j];
numer *= x2;
sign = -sign;
\}
result[i] = value;
\}

## What is our latency bound?

- Find the critical path in the dataflow graph



## Attempt \#2 Takeaways

- Attacking the bottleneck got nearly $3 \times$ !
- ...But performance is still near the latency bound, can we do better?


## Speeding up $\sin (x)$ : Attempt \#3

## - Don't need sign in inner-loop either

```
void sinx_predenoms(int N, int terms, float * x, float *result) {
    float rdenom[MAXTERMS];
    int denom = 6;
    float sign = -1.0;
    for (int j = 1; j <= terms; j++)
        rdenom[j] = sign/denom;
        denom *= (2*j+2) * (2*j+3);
        sign = -sign;
    }
    for (int i=0; i<N;i++) {
        float value = x[i];
            float x2 = value * value;
        float numer = x2 * value;
            for (int j=1; j<=terms; j++) {
            for (int j=1; j<=terms; j++) {
            numer *= x2;
        }
        result[i] = value;
    }
}
```

1.1 ns / element $\approx$
3.5 cycles / element

## What is our latency bound?

- Find the critical path in the dataflow graph



## Attempt \#3 Takeaways

- We're down to the latency of a single, fast operation per iteration
-     + Observed performance is very close to this latency bound, so throughput isn't limiting
$-\rightarrow$ We're done optimizing individual iterations
- How to optimize multiple iterations?
- Eliminate dependence chains across iterations
- A) Loop unrolling (ILP)
- B) Explicit parallelism (SIMD, threading)


#  <br> $$
\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots
$$ 

- Compute multiple elements per iteratid Rutili value;

```
void sinx_unrollx2(int N, int terms, float * x, float *result) {
    // same predom stuff as before...
    for (int i=0; i<N; i++) {
        float value = x[i];
        float x2 = value * vaTue;
        float x4 = x2 * x2;
        float numer = x2 * value;
        for (int j=1; j<=terms; j+=2) {
            vałue += numer * rdenom[j];
            value += numer * x2 * redom[j+1];
            numer *= x4;
        }
        result[i] = value;
    }
}
```


## Speeding up $\sin (x)$ : Loop unrolling

$$
\sin (x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+\cdots
$$

## - Compute multiple elements per iteration

```
void sinx_unrol1x2(int N, int terms, float * x, float*result) {
    // same predom stuff as before...
    for (int i=0; i<N; i++) {
        float value = x[i];
        float x2 = value * value;
        float x4 = x2 * x2;
        float numer = x2 * value;
        int j;
        for (j=1; j<=terms-1; j+=2) {
        value += numer * rdenom[j];
        value += numer * x2 * rdenom[j+1];
        numer *}=\times4
        }
        for (; j<=terms; j++) {
        value += numer * rdenom[j];
        numer *= x2;
    }
        result[i] = value;
    }
}
```


## What is our latency bound?

- Find the critical path in the dataflow graph



## Speeding up $\sin (x)$ : Loop unrolling \#2

```
for (j=1; j<=terms-1; j+=2) {
```

                                    value += numer * rdenom[j];
                                    value += numer * x2 * rdenom[j+1];
    
## - What if floating point associated + distributed?x4;

```
void sinx_unrollx2(int N, int terms, float * x, float
*result)}
    // same predom stuff as before...
```

    for (int \(i=0 ; i<N ; i++\) ) \{
        float value \(=x[i]\);
        float \(x 2=\) value * value;
        float x4 = x2 * x2;
        float numer \(=x 2\) * value;
        int j;
                                    0.69 ns / element \(\approx\)
                                    2.2 cycles / element
        for ( \(j=1 ; j<=\) terms \(-1 ; j++\) ) \(\{\)
        value += numer * (rdenom[j] + x2 * redom[j+1]);
        numer *= x4;
        \}
        for (; j<=terms; j++) \{
            value \(+=\) numer * rdenom[j];
            numer \(*=x 2\);
        \}
        result[i] = value;
    \}
    \}

## What is our latency bound?

- Find the critical path in the dataflow graph



## Loads do not limit $\sin (x)$

- Consider just the slice of the program that generates the subexpression: (rdenom[j] $+\mathrm{x} 2 \times \operatorname{rednom}[\mathrm{j}+1])$
- What is this program's latency + throughput bound?

- Latency bound: 1 cycle / iteration!
- Through j' computation, not the subexpression computation there is no cross-iteration dependence in the subexpression!)
- Throughput bound: also 1 cycle / iteration
- 1 add / 4 adders; 2 LDs / 2 LD units; 1 FP FMA / 1 FP unit
- (This will change to $\mathbf{2}$ cycles if we add the value FMA)


## Loads do not limit $\sin (x)$ : Visualization

- Consider just the slice of the program that generates the subexpression: (rdenom[j] + $\mathrm{x} 2 \times$ rednom[j +1 ])
- Subexpressions are off the critical path + we have enough throughput to produce next subexpression each cycle (excluding value FMA)



## Loads do not limit $\sin (x)$ : Example execution



FP
MUL

## Loop unrolling takeaways

- Need to break dependencies across iterations to get speedup
- Unrolling by itself doesn't help
- We are now seeing throughput effects
- Latency bound = 1.5 vs. observed = 2.2
- Can unroll loop $3 x, 4 x$ to improve further, but...
- ...Diminishing returns ( 1.65 cycles / element at 4 x )


## What if? \#1 Impact of structural hazards

- Q: What would happen to $\sin (x)$ if weronly had a single, unpipelined floating-point multiplier?
- A1: Performance will be much worse
- A2: We will hit throughput bound much earlier
- A3: Loop unrolling will help by reducing multiplies


## What if? \#2 Impact of structural hazards

- Q: What would happen to $\sin (x)$ if LDs (cache hits) took 2 cycles instead of $\mathbf{1}$ cycle?
- A: Nothing. This program is latency bound, and LDs are not on the critical path.


## SIMD (Single Instruction Multiple Data)



Instantiate k copies of the hardware unit foo to process $k$ iterations of the loop in parallel

## Speeding up $\sin (x)$ :Going parallel (explicitly)

- Use ISPC to vectorize the code

```
export void sinx_reference
            (uniform int N,
                uniform int terms,
                uniform float x[],
                uniform float result[]) {
    foreach (i=0
            N) {
        float value = x[i];
        float numer = x[i]*x[i]*x[i];
        uniform int denom = 6; // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++) {
            value += sign * numer / denom;
            numer *= x[i]*x[i];
            denom *= (2*j+2)*(2*j+3);
            sign *= -1;
        }
        result[i] = value; result[i] = value;
    }
}
```

```
void sinx
```

void sinx
(int N,
(int N,
int terms,
int terms,
float *
float *
float *result) {
float *result) {
for (int i=0; i<N; i++) {
for (int i=0; i<N; i++) {
float value = x[i];
float value = x[i];
float numer = x[i]*x[i]*x[i];
float numer = x[i]*x[i]*x[i];
int denom = 6; // 3!
int denom = 6; // 3!
int sign = -1;
int sign = -1;
for (int j=1; j<=terms; j++) {
for (int j=1; j<=terms; j++) {
value += sign * numer / denom;
value += sign * numer / denom;
numer *= x[i] * x[i];
numer *= x[i] * x[i];
denom *= (2*j+2) * (2*j+3);
denom *= (2*j+2) * (2*j+3);
sign *= -1;
sign *= -1;
}
}
}
}
}

```
}
```


## Speeding up $\sin (x)$ : Going parallel (explicitly) + optimize

```
export void sinx_unrol1x2a(uniform int N, uniform int terms,
                                    uniform float x[],
                                    uniform float result[]) {
    uniform float rdenom[MAXTERMS];
    uniform int denom = 6;
    uniform float sign = -1;
    for (uniform int j = 1; j <= terms; j++) {
        rdenom[j] = sign/denom;
        denom *= (2*j+2) * (2*j+3);
        sign = -sign;
    }
    foreach (i=0 ... N) {
    float value = x[i];
    float x2 = value * value;
    float x4 = x2 * x2;
    0 . 1 4 \text { ns / element } \approx
    float numer = x2 * value;
    uniform int j;
0.45 cycles / element
    for (j=1; j<=terms-1; j+=2) {
        value += numer * (rdenom[j] + x2 * rdenom[j+1]);
        numer *= x4;
    }
    for (; j <= terms; j++) {
        value += numer * rdenom[j];
        numer *= x2;
    }
    result[i] = value;
    }
}
```


## SIMD takeaways

- Well, that was easy!
- Cycles per element:

|  |  |  |
| :---: | :---: | :---: |
|  | Scalar | Vector |
| Unoptimized | 23 | 3.2 |
| Unorolied | 2.2 | 0.45 |

- Speedup requires hand parallin + explicit paralle lism!



## Scaling Instruction-Level Parallelism

## Recall from last time: ILP \& pipelining tapped out... why?



## Superscalar scheduling is complex \& hard to scale

- Q: When is it safe to issue two instructions?
- A: When they are independent
- Must compare all pairs of input and output registers
- Scalability: $O\left(W^{2}\right)$ comparisons where $W$ is "issue width" of processor
- Not great!


## Limitations of ILP

- 4-wide superscalar $\times 20$-stage pipeline $=80$ instrns in flight
- High-performance OoO buffers hundreds of instructions
- Programs have limited ILP
- Even with perfect scheduling, $>8$-wide issue doesn't help
- Pipelines can only go so deep
- Branch misprediction penalty grows
- Frequency (GHz) limited by power
- Dynamic scheduling overheads are significant
- Out-of-order scheduling is expensive


## Limitations of ILP $\rightarrow$ SIMD\Multithread $\backslash$ Multicore

- ILP works great! ...But is complex + hard to scale
- From hardware perspective, multicore is much more efficient, but needs programmer's effort based on the knowledge about underlying architecture.
- Parallel software is hard!
- Industry resisted multicore for as long as possible
- When multicore finally happened, CPU $\mu$ arch simplified $\rightarrow$ more cores
- Many program(mer)s still struggle to use multicore effectively

Next Lecture: Understanding Modern Processor: DLP and TLP

