#### Preliminaries 1

Architecture: timing independent func- 3.0.1 Ratio of Means tion of computer MicroArchitecture: Implementation techniques to improve perf. Implicit Parallelism: increase ILP, pipelining, caching Explicit Parallelism: Data parallelism, TLP ISA: SW/HW Interface

### 2 Performance Metrics 2.1 Latency

Latency (execution time) is the time required to finish some fixed task. Processor A is X times faster than B:

Lat(P, A) = Lat(P, B)/XProcessor A is X% faster than B:

$$\operatorname{Lat}(P,A) = \operatorname{Lat}(P,B) / \left(1 + \frac{X}{100}\right)$$

Lat
$$(P_1 + P_2, A) = \text{Lat}(P_1, A) + \text{Lat}(P_2, A)$$

### 2.2 Throughput (Bandwidth)

Proc. A has X times TP relative to B: TP(P, A) = TP(P, B)/X

Proc. A has X% the TP relative to B: Т

$$\operatorname{TP}(P,A) = \operatorname{TP}(P,B) / \left(1 + \frac{100}{100}\right)$$

### 2.2.1 Adding Throughputs

$$TP(P_1 + P_2, A) = \frac{2}{\frac{1}{TP(P_1, A)} + \frac{1}{TP(P_2, A)}}$$

### 2.3 Speedup

Speedup =  $\frac{T_{R,i}}{T_i} = \frac{T_{old}}{T_{new}}$ 

- $T_{R,i}$  execution time on reference machine
- $T_i$  execution time on evaluated machine

### 2.4 Slowdown

Slowdown =  $F \cdot (R_{exe}) + (1 - F) \cdot 1$ 

- F fraction of instructions that experience slowdown
- $R_{exe}$  factor of how much slower the F instructions are

2.5 Execution Time  $T_{exe} = \text{Lat}(P) = IC \times CPI \times T_c$ 

- IC dynamic instruction count
- CPI is # of cycles per instr
- $T_c$  is the seconds per cycle (clock period)

### 2.5.1 CPI (Cycles per Instr) & IPC

$$CPI = \frac{T_{exe}}{IC \times T_c} \qquad IPC = \frac{1}{CP}$$

For 5-stage processor,

$$CPI = 1 + \sum f_{stall} \cdot c_{stall}$$

- *fstall* is the frequency of instructions that stall
- c<sub>stall</sub> is the # of stall cycles corresponding to instr with  $f_{stall}$  frequency

# 3 Analyzing Performance

 $\text{RoM} = \frac{\sum_{i=1,\dots,N} T_{R,i}}{\sum_{i=1,\dots,N} T_{R,i}}$ 

- *s<sub>i</sub>* is the speedup
- $T_{R,i}$  exec time on reference
- T<sub>i</sub> exec time on evaluated

### 3.1 Arithmetic Average

For units proportional to time (e.g. latency)

$$\overline{L} = \frac{1}{N} \times \sum_{i=1}^{N} T_i = \frac{1}{N} \times \sum_{P=1,\dots,N}^{N} \operatorname{Lat}(P)$$

Where  $\overline{L}$  is the average latency of all programs  $T_i, ..., T_N$ 

### 3.2 Harmonic Average

For units inversely proportional to time

$$\Gamma = \frac{1}{\sum_{P=1,\dots,N} \frac{1}{\text{TP}(P)}}$$

Where *T* is the average throughput

### 3.3 Geometric Average

For unitless quantities (e.g. speedup)

$$\sqrt[N]{\Pi_{P=1,...,N}}$$
SpdUp(P)

### $\sqrt[N]{\text{SpdUp}(P_1) * \text{SpdUp}(P_2) * ... * \text{SpdUp}(P_N)}$ 8 Dynamic Branch Prediction

F

S

o bits

4 Amdahl's Law Assume an enhancement *E*, which speeds up fraction *F* of computation by factor S:

$$T_{exe}(w/E) = T_{exe}(w/oE) \times \left[ (1-F) + \right]$$

$$\operatorname{SpdUp}(E) = \frac{T_{exe}(w/o E)}{T_{exe}(with E)} = \frac{1}{(1-F)+1}$$

#### Parallel Case 4.1

Let *P* be number of cores, and *F* fraction of code that can be parallelized on *P*:

$$S_p = \frac{I_1}{T_p} = \frac{1}{1 - F + \frac{F}{P}} = \frac{P}{F + P(1 - F)} < \frac{1}{1 - F}$$
5 ISA

### 5.1 ISA Condition Codes

conditional execution (e.g. for branches) is set by condition codes, which differ for various ISA's. A typical condition code register: Z: Zero, C: Carry, V: Overflow, X: Extend, N: Negative

### 6 Pipelining

The classic 5 stage pipeline has

- Fetch: Fetch instruction from PC
- Decode: Read reg, find instr type
- eXecute: Execute instr (ALU)
- Memory: Handle memory instr
- Writeback: write completed instr result to register file



# 8.2 History-Based Methods (BHR)

GA = Global BHR, PA = Private BHR 8.2.1 GAg



#### prediction (s.globul



Branch PC

store return address on Return Address Stack (RAS)

### 9 Exceptions

Interrupts, exceptions, page faults, illegal op

### 9.1 Handling Exceptions

Save processor state, restart execution. Instr in flight become NOP

### 10 Dynamic Scheduling

aka out-of-order execution. benefits:

- 1. reduce RAW stalls
- 2. increase pipeline, FU utilization
- 3. increase ILP

Both sides of the sheet may be used; must be printed on 8.5" x 11" paper. **Faculty of Applied Science** Examination Aid Sheet Qo Engineering

| Candidate's signature: | Candidate's name: | Subject: |  |
|------------------------|-------------------|----------|--|
|------------------------|-------------------|----------|--|

### 11 Tomasulo's Algorithm

Features: register renaming using tag's (avoid WAW, WAR). Reservation Station to buffer instructions (instr q). Common Data Bus (CDB) to broadcast completed instr to RS's.

### 11.1 Processor Structure in Tomasulo

- Fetch: Fetch instruction from PC
- Dispatch: Check for structural hazard (RS full), rename output reg to allocated RS, check input registers ready
- Issue: Waits for RAW and Struct. hazards. If reg ready, send to X
- eXecute: Execute instr (ALU)
- Memory: Handle memory instr
- Writeback: Broadcast on CDB (wait for structural hazard), clear RS entry and tag on tag match

### 11.2 Register Renaming

Storage locations referred to by RS# tags Tag == 0 - > val in reg table

 $Tag! = 0 \rightarrow val not rdy$  (being computed)

### 12 Precise State

Speculation requires ability to abort and restart. Tomasulo has ooo completion, hard to restore precise instr state.

### 12.1 Re-Order Buffer (ROB)

Register writes executed in dispatch order. ROB stores completion flag of instr, new and old register mapping. Enables in-order dispatch, ooo execution, inorder completion.

### 12.2 Load/Store Queue (LSQ)

Completed stores write to LSQ. LSQ writes to memory when store retires. Loads access LSQ and data cache in parallel if  $\exists$ older store with matching addr. Forward LSO value if exists.

13 MIPS R10K

### 13.1 Physical Register File

MIPS R10K has Physical Registers (PR) instead of named architectural registers. Conceptually, big bank of physical registers which can be associated via the ROB.

- 13.2 R10K Structures
  - ROB: *T<sub>old</sub>* PR prev. mapped to this instr, T PR corresponding to this instr's logical output
  - RS: T PR, S1, S2 PR tags corresponding to instr inputs, rdy ready bit
  - Map Table: T PR, rdy ready bit
  - Free List: PR#

### 13.3 Processor Structure

• Fetch: Fetch instruction from PC • Dispatch: In-order, Check for structural hazard (RS, ROB, PR#), Allocate RS + ROB entries, new PR#, Read PR tags for input regs (store in RS S1, S2)

- · Issue: Waits for RAW and struct. hazards. If reg ready, send to X
- eXecute: Execute instr (ALU). Can free RS entry at end of X since RS# is not a tag
- Complete: Write destination PR. Set inst output reg ready in map table and RS
- Retire: If instr at ROB head not complete, stall. Handle exceptions. If store, right value from LSQ into data. Free  $T_{old}$ , ROB and LSQ entries.

### 14 Recovering from Misspeculation Two ways to restore precise state

# **14.1** Serial Rollback using *T*, *T*<sub>old</sub>

Slow (serial), but simple and cheap (hardware)

### 14.2 Single Cycle Restore Checkpoint

Fast (single cycle), but very expensive 14.3 Hybrid

Checkpoint low confidence branches. Serial recovery for page faults.

### 15 Caches

15.1 Cache Organization

 $b_{addr} = b_{tag} + b_{index} + b_{offset}$ Blocks = C/BS #Sets = #Blocks/#Ways where C = capacity, BS = block size $b_{tag} = b_{addr} - \log_2(\#Sets) - \log_2(BS)$ 

### 15.1.1 Tag Overhead

C = Data + OH = Data +Where N = number of entries

# 15.2 Split *I*\$/*D*\$ Reasoning

Avoid structural hazard (read ports), store additional metadata for pred, exploit data locality for prefetch, *I*\$ can be RO. 15.3 Cache Performance Metrics

# # Misses

 $\%_{miss} = \frac{1}{\# \text{Accesses}}$ 

 $\%_{hit} = \frac{\# \text{ Hits}}{\# \text{ Accesses}} = 1 - \%_{miss}$ 

*t<sub>hit</sub>*: time to access cache.

- $t_{miss}$ : time to bring data into cache. 15.4 Cache Performance Equation

# $t_{avg} = t_{hit} + \%_{miss} \cdot t_{miss}$

### 15.5 Cache Misses: 3(4)C Hill Model 15.5.1 Cold Misses

Independent of the cache, equal to number of blocks in the trace

 $M_{cold} =$ #blocks used assuming cache initialized to 0

### 15.5.2 Capacity Misses

Independent of cache organization or replacement policy.

### $M_{cap} = M_{FA,LRU} - M_{cold}$

where  $M_{FA,LRU}$  is the number of misses in a FA LRU cache

### 15.5.3 Conflict Misses

Dependent on cache organization and replacement policy.

# $M_{conflict} = M_{total} - (M_{cap} + M_{cold})$

### 15.5.4 Coherence Misses

Miss due to external invalidation (only in shared memory multiprocessors)

- 15.6 Replacement Policies
  - 1. Random Replacement
  - 2. FIFO/FILO
  - 3. LRU (Least Recently Used): 2way=1 bit per set. N > 2way=counter per way, OR  $\log_2 N$ bits per set
  - 4. NMRU (Not Most Recent Use): 1 bit MRU set per line, random select NOT MRU to replace
  - 5. Belady's: Furthest used in future replaced first

## 15.7 Prefetching

### 15.7.1 Stride Prefetcher



*i*: initial. *s*: stable. *t*: trans., *n*: incorrect

## 15.8 Write Propagation

- write-through (WT): propagate value immediately to \$
- write-back (WB): write when block replaced (req. dirty bit)

### 15.9 Allocate

- Write-allocate: read from lower level, write value. Used with WB
- Write-non-allocate: write blk to next level. Used with WT

#### Multiprocessors 16

### 16.1 Coherence

## 16.1.1 VI (MI) Protocol

| 0 | Currer     | nt Proc    | Other Proc |       |  |  |  |  |  |  |
|---|------------|------------|------------|-------|--|--|--|--|--|--|
| 5 | Load       | Store      | Load       | Store |  |  |  |  |  |  |
| Ι | miss<br>/V | miss<br>/V | -          | -     |  |  |  |  |  |  |
| Μ | hit        | hit        | SD         | SD    |  |  |  |  |  |  |

### 16.1.2 MSI Protocol

SW r2 -> [&lock] Current Proc Other Proc Store Wasted BW (other P spins on lock) Load <u>Store</u> Load miss /S miss /M Ι -Invalid. LL [r1] -> r2 upgrade S hit SC r3 -> [r1]miss /M /Τ SĎ Μ hit SC returns 0 in r3 if value in r1 modified hit

# 16.1.3 MSI - Directory

Tracks the following per cache block:

- Owner
  - Sharers (bit vector)
  - · Home directory
  - State



16.3 Consistency

form view of all mem locs.

16.3.2 Ordering Rules

 $W \rightarrow R$ 

sections

banked instr blocks.

17.2.3  $N^2$  Bypassing

17.2.4 Clustering

necessary N > 4

2

13

Superscalar

**18 Multithreading** 

CGMT

lay).

17 Superscalar

16.3.1 Sequential Consistency

Coherence: globally uniform view of sin-

gle mem loc. Consistency: globally uni-

Proc's see own LD/ST in prog order, see

other Proc's LD/ST in prog order. All

 $R \to W, R \to R, W \to R, W \to W$ 

Total Store Ordering (TSO): aka

• Weak Ordering (WO). All rela-

 $CPI_{ideal} = \frac{1}{N}$   $IPC_{ideal} = \frac{N * c}{c} = \frac{instrs}{c}$ 

Multiple instr/cycle, but could need to

predict multiple branches/cycle. Ban-

ked IS: DRAM banked, simultaneous

read. Combining Network: Combine

Register R/W Ports: Nominal 2N read,

1N write. In reality, lower (not all in-

str have 2 src, values bypassed), stores/-

Full bypassing requires  $N^2$  dependence

checks. N + 1 input muxes at each ALU

Mitigates  $N^2$  bypassing. Group FU's into

K clusters. Limited bypass (1 cycle de-

 $\left(\frac{N}{K}+1\right)$  inputs/mux  $\left(\frac{N}{K}\right)^2$  by pass/cluster

17.2.5 Wide Execute/Memory Access

N ALU's ok, N FP expensive. Wide mem

acc depends on instr mix, probably only

ma

222

EGMT

N =number of issues/retires per c

17.2 Superscalar Challenges

17.2.1 Wide Instruction Fetch

17.2.2 Wide Instruction Decode

branch (25-35%) don't write regs

input. Routing can be expensive.

xed, acquire-release define critical

Processor Consistency, relaxes

proc's see same global LD/St order.

### BR (from P)=)SD sharers = EP3 WB=7-M Shares = Fi invalidate W, D BR= BR = ) Fetch data fro £P3



|                                    | Load                                               | Store                                            | Replacement                               | Fwd-GetS                                  | Fwd-GetM                                              | Inv                                          | Put-Ack       | Data from<br>Dir (ack=0) | Data from<br>Dir (ack>0)            | Data from<br>Owner | Inv-Ack           | 17  | 17 Supe<br>17.1 CPI           |
|------------------------------------|----------------------------------------------------|--------------------------------------------------|-------------------------------------------|-------------------------------------------|-------------------------------------------------------|----------------------------------------------|---------------|--------------------------|-------------------------------------|--------------------|-------------------|-----|-------------------------------|
| 1                                  | Send GetS<br>to Dir/IS <sup>D</sup>                | Send<br>GetM to<br>Dir/IM <sup>AD</sup>          |                                           |                                           |                                                       |                                              |               |                          |                                     |                    |                   |     | CPI <sub>ideal</sub> =        |
| IS <sup>D</sup><br>IM <sup>A</sup> | Stall                                              | Stall                                            | Stall                                     |                                           |                                                       | Stall                                        |               | -/S                      |                                     | -/S                |                   |     | iucui                         |
| IM <sup>A</sup>                    | D Stall<br>Stall                                   | Stall<br>Stall                                   | Stall                                     | Stall<br>Stall                            | Stall<br>Stall                                        |                                              |               | -/M                      | -/IM <sup>A</sup>                   | -/M                | ack<br>ack        | 0   | N.T. 1                        |
| S                                  | Hit                                                | Send<br>GetM to<br>Dir/SM <sup>AD</sup>          | Send PutS<br>to Dir/SIA                   | Stati                                     | Stati                                                 | Send Inv-<br>Ack to<br>Reg/I                 |               |                          |                                     |                    | ack               | -78 | N = numt<br>17.2 Sup          |
| SMA                                |                                                    | Stall                                            | Stall                                     | Stall                                     | Stall                                                 | Send Inv-<br>Ack to<br>Req/IM <sup>AD</sup>  |               | -/M                      | -/SM <sup>A</sup>                   |                    | ack               |     | 17.2.1 W                      |
| SM <sup>A</sup><br>M               | Hit<br>Hit                                         | Stall<br>Hit                                     | Stall<br>Send PutN<br>+data to<br>Dir/MIA | Stall<br>Send data<br>to Req and<br>Dir/S | Stall<br>Send data<br>to Req/I                        |                                              |               |                          |                                     |                    | ack               | -/N | Multiple i<br>predict m       |
| MIA                                | Stall                                              | Stall                                            | Stall                                     | Send data                                 | Send data<br>to Req/II <sup>A</sup>                   |                                              | -/1           |                          |                                     |                    |                   |     | ked I\$: D                    |
| SIA                                | Stall                                              | Stall                                            | Stall                                     | Dirot                                     |                                                       | Send Inv-<br>Ack to<br>Req/II <sup>A</sup>   | -/1           |                          |                                     |                    |                   |     | read. <b>Cor</b><br>banked in |
| Пv                                 | Stall                                              | Stall                                            | Stall                                     |                                           |                                                       |                                              | -/1           |                          |                                     |                    |                   |     | 17.2.2 W                      |
|                                    | GetS                                               | GetM                                             | Put:<br>Not                               |                                           | PutS-Last                                             | Put M-<br>data fre<br>Owner                  | m             |                          | M+data<br>n Non-<br>ner             |                    | ata               |     | Register F                    |
| I                                  | Send data to<br>Req, add Req<br>to Sharers/S       | Send data<br>Req, set O<br>to Req/M              |                                           |                                           | Send Put-Acl<br>to Req                                | c .                                          |               | Sent<br>to R             | l Put-Ac<br>eq                      | k                  |                   |     | str have 2                    |
| s                                  | Send data to<br>Req, add Req<br>to Sharers         | Send data<br>Req, send I<br>to Sharers,          | Inv from<br>sent                          | Sharers,<br>Put-Ack                       | Remove Req<br>from Sharers,<br>send Put-Ack           |                                              |               | from                     | ove Req<br>sharers,<br>Put-Ac       |                    |                   |     | branch (25                    |
|                                    |                                                    | clear Share<br>set Owner<br>Req/M                | to                                        |                                           | to Req/I                                              |                                              |               | to R                     | <u> </u>                            |                    |                   |     | <b>17.2.3</b> N<br>Full bypas |
| м                                  | Send Fwd-<br>GetS to<br>Owner, add<br>Req and      | Send Fwd-<br>GetM to<br>Owner, set<br>Owner to I | to Re                                     |                                           | Send Put-Acl<br>to Req                                | c Copy da<br>to memo<br>clear Ov<br>send Pur | ory,<br>vner, | Send<br>to R             | l Put-Ac<br>eq                      | k                  |                   |     | checks. N                     |
|                                    | Owner to<br>Sharers, clear<br>Owner/S <sup>D</sup> |                                                  |                                           |                                           |                                                       | Ack to I                                     |               |                          |                                     |                    |                   |     | input. Rot<br><b>17.2.4 C</b> |
| SD                                 | Stall                                              | Stall                                            | from                                      | Sharers,<br>Put-Ack                       | Remove Req<br>from Sharers,<br>send Put-Ack<br>to Req |                                              |               | from                     | ove Req<br>Sharers<br>Put-Acl<br>eq | , m                | opy dat<br>emory/ |     | Mitigates<br>K clusters       |
|                                    |                                                    |                                                  |                                           |                                           |                                                       |                                              | _             |                          |                                     | _                  |                   |     | ∧ crusters                    |

## 16.2 Synchronization 16.2.1 Exchange

16.2.3 Test and Set

LW [&lock] -> r1

ADDI r1, 1->r1

EXCH r1, [&lock]

BNEZ r1, Á0

A0

A1

EXCH r1, 0(&lock) MOV r1 -> r2 LW [&lock] -> r1

LL[r1] -> r2

BNEZ r2, A0

SC  $r_2 \rightarrow [r_1]$ 

ADDI r2, 1->r2

16.2.2 Load Locked/Store Conditional

EXCH r1, [&lock]

BNEZ r1, A0

16.2.4 Test and Test and Set