## Multiprocessors and Coherent Memory

## Erik Hagersten Uppsala University



2013

## **Goal for this course**

- Understand <u>how and why</u> modern computer systems are designed the way the are:
  - pipelines
  - memory organization
  - virtual/physical memory ...
- Understand <u>how and why</u> multiprocessors are built
  - Cache coherence
  - Memory models
  - Synchronization...
- Understand <u>how and why</u> parallelism is created
  - Instruction-level parallelism
  - Memory-level parallelism
  - Thread-level parallelism...
- Understand <u>how and why</u> multiprocessors of combined SIMD/MIMD type are built
  - GPU
  - Vector processing...
- Understand <u>how</u> computer systems are adopted to different usage areas
  - General-purpose processors
  - Embedded/network processors...
- Understand the physical limitation of modern computers
  - Bandwidth
  - Energy
  - Cooling...

Dept of Information Technology www.it.uu.se

#### > This batch of lectures

© Erik Hagersten user.it.uu.se/~eh



#### The era of the "Rocket Science Supercomputers" 1980-1995

The one with the most blinking lights wins
The one with the niftiest language wins
The more different the better!



AVDARK 2013



## The server market 1995

| Server<br>Size                                  | High-Perf.<br>Computing | Commercial<br>Computing |
|-------------------------------------------------|-------------------------|-------------------------|
| <\$10k                                          | 1%                      | 19%                     |
| <\$50k                                          | 5%                      | L2内K                    |
| <\$250k                                         | 5%                      | shared-<br>mem          |
| <\$1M                                           | 2%                      | servers                 |
| >\$1M                                           | 3%                      | 8%                      |
| The target of the rocket science supercomputers |                         |                         |

AVDARK 2013



## Multicore: Who has <u>not</u> got one?



AVDARK 2013

Dept of Information Technology www.it.uu.se



UNIVERSITE

#### UPPSALA UNIVERSITET

## Models of parallelism

- Processes (fork or & in UNIX)
  - ★ A parallel execution, where each process has its own process state, e.g., its own VA→PA mapping
- Threads (thread\_create in POSIX)
  - Parallel threads of control inside a process
  - There are some thread-shared state, e.g., VA→PA mapping.
- More: OpenMP, OpenACC, OpenCL, SILC, ...
- Common property: Each thread has its own PC (i.e. executes its own code independently)

More during lab lecture...

## What is: Coherent Shared Memory?

## Erik Hagersten Uppsala University



#### **Programming Model: Coherent shared memory**



AVDARK 2013



## Adding Caches. Gives the illusion of:

- Shorter Memory Latency
- Higher Memory Bandwidth



AVDARK 2013



2013

## **Our Generic Shared Memory Arch.**



Dept of Information Technology www.it.uu.se



## **Automatic Replication of Data**



AVDARK 2013



2013

### The Cache Coherent Memory System Coherent Write (Here: Write invalidate)





2013

### The Cache Coherent Memory System Coherent Read & Write-back

(Here: Cache to Cache Transfer)



© Erik Hagersten | user.it.uu.se/~eh



2013

## The Cache Coherent Memory System Coherent Read & Write-back





© Erik Hagersten | user.it.uu.se/~eh



2013

## The Cache Coherent Memory System Coherent Read & Write-back

(Here: Cache-to-Cache Transfer and Write Back)



© Erik Hagersten user.it.uu.se/~eh



# Symming up Coherence

*Good trong delighted too strong Chere can be many copies of a datum, but only one value* 

There is a single global order of value changes to each datum

After the computer stops, all copies should have the same value

Thread1 =  $\{1, 2, 3, 4, 5, 6, 7...\}$  Thread2 =  $\{1, 4, 7...\}$ 



Dept of Information Technology www.it.uu.se



## Where does coherence matter?





## **Summary Coherence**

- Coherent shared-memory programming model requires coherence
- (There is also non-coherent shared memory, e.g. single-sided MPI, PGAS)
- All threads can read and write shared data.
- Coherent view of the value of <u>a</u> datum
   Often: Coherence is kept per cache line.

## **Snooping Coherence**

## Erik Hagersten Uppsala University



## **Implementation options for coherence**

Two coherence options

- Snoop-based ("broadcast protocol")
- Directory-based ("point to point protocol")
- Different scalability
- Different latency

AVDARK 2013



2013

### "Upgrade" in snoop-based





### "Upgrade" in dir-based



AVDARK 2013

Dept of Information Technology www.it.uu.se









2013

## **Example: MOSI Bus Snoop**

#### STATES:

**M** – **Modified**: My dirty\* copy is the only cached copy

S – Shared: I have a clean copy, others may also have a copy

**O** – **Owner:** I have a dirty copy, others may also have a copy

I – Invalid: I have no valid copy in my cache (including cache miss)

#### **BUS TRANSACTIONS FROM OTHERS:**

**BUSrts** ReadtoShare. Reading the data

**BUSrtw:** ReadToWrite. Reading the data with the intention to modify it right away

**BUSinv:** Invalidating other caches copies

**BUSwb**: Writing data back to memory

#### \*Dirty: my value differs from the old value in mem MP 27

Dept of Information Technology www.it.uu.se

#### Why is there no BUSinv arrow from M?

No other can have a cached copy BUSinv is only used for writes



Input-signal/Reply-signal Meaning: If you are in state M and see BUSrts, goto state O and reply with Data





2013

## **Example: CPU access MOSI**



Dept of Information Technology www.it.uu.se



### "Upgrade" in snoop-based



**AVDARK** 

2013



## Summary

- Snooping was first suggested by Jim Goodman in ISCA, Stockholm 1984
- Effectively implements coherence through broadcast of "read and write misses"
- Best suited for on-chip coherence between a small number of caches

AVDARK 2013

## **MOSI Snooping Coherence Protocol Implementation**

## Erik Hagersten Uppsala University





#### **Upgrade: Readable →Writable**



AVDARK 2013

Dept of Information Technology www.it.uu.se



2013

## Upgrade – the requesting CPU

#### **Defines action to CPU events:**





BUSinv: Invalidating other caches copies of the cacheline Dept of Information Technology www.it.uu.se

© Erik Hagersten | user.it.uu.se/~eh

### UPPSALA UNIVERSITET

AVD/ 20

## More Cache Lingo

- Capacity miss too small cache
- Conflict miss limited associativity
- Compulsory miss accessing data the first time
- Coherence miss The cache would have had the data unless it had been invalidated by someone else
- Upgrade miss: (only for writes) The cache would have had a writable copy, but answered a read request and "downgraded" itself to read-only state
- False sharing: Coherence/downgrade is caused by a shared <u>cacheline</u> and not by shared data:

∼eh

| ARK<br>13 | False sharing<br>example:                   | Read A<br><br>Write A<br><br>Read A |       | <br>Read D<br><br>Write D | cacheline:<br>A, B, C, D          |
|-----------|---------------------------------------------|-------------------------------------|-------|---------------------------|-----------------------------------|
|           | Dept of Information Technology www.it.uu.se |                                     | MP 37 |                           | © Erik Hagersten  user.it.uu.se/^ |

## Implementing Snooping. One example (Sun E6000, ≈Intel P6,)

## Erik Hagersten Uppsala University





AVDARK

2013

### The Cache Coherent Memory System Coherent Write (Here: Write invalidate)







**AVDARK** 

2013

### **The Cache Coherent Cache-to-cache**



Dept of Information Technology www.it.uu.se



### Cache2cache – the requesting CPU





Dept of Information Technology www.it.uu.se







## Summary

- Dual tags enable bus snoops and CPU lookups in parallel
- A datum may actually have several values "at the same wall-clock time"
- but not in "logic time": No software can detect that there are different values

AVDARK The value-change order maintained

## Other Coherence Alternatives

## Erik Hagersten Uppsala University



## **Common Cache States**

- M Modified
  - My dirty (i.e. modified) copy is the only cached copy
- E Exclusive
  - My clean copy is the only cached copy
- O Owner
  - I have a dirty copy, others may also have a copy
- S Shared
  - I have a clean copy, others may also have a copy
- I Invalid
  - I have no valid copy in my cache



## **Some Coherence Alternatives**

- Our first target
- Leave one dirty copy in a cache on a cache2cache transfer
- MSI

MOSI

- Writeback to memory on a cache2cache.
- MOESI
  - The first reader will go to E and can later become a writer cheaply



## Example A = A + 1 Initially A is only in mem

| MOSI: |        |              | <b>MOESI:</b> |        |              |
|-------|--------|--------------|---------------|--------|--------------|
| CPU   | BUS    | <u>State</u> | CPU           | BUS    | <u>State</u> |
| LD A  | RTS(A) | ) S          | LD A          | RTS(A) | E            |
| ADD 1 | -      |              | ADD 1         | _      |              |
| ST A  | INV(A) | M            | ST A          | -      | Μ            |
| LD B  | RTS(B) | ) S          | LD B          | RTS(B) | E            |
| ADD 1 | -      |              | ADD 1         | -      |              |
| ST B  | INV(B) | M            | ST B          | -      | Μ            |

AVDARK 2013

. . .

. . .



**AVDARK** 

2013

### **Update-based MOSI protocol**



© Erik Hagersten user.it.uu.se/~eh



**AVDARK** 

2013

### **Update-based MOSI protocol: Next write**



© Erik Hagersten user.it.uu.se/~eh



## **Update-based Coherence**

- Write the new value to the other caches holding a shared copy (instead of invalidating...)
- Can avoid coherence misses
- May consume a large amount of snoop bandwidth
- Hard to implement some "memory models"
- Few commercial implementations: SPARCCenter2000, Xerox Dragon

## **Preparing for IRL**



### **Example during next IRL Class:**

All the three RISC CPUs in a **MOSI** shared-memory (sequentially consistent) multiprocessor executes the following code almost at the same time:

Initially, CPU1 has its local variable my\_id=1, CPU has my\_id=2 and CPU3 has my\_id=3 and the globally shared variables A is equal to 1 and B is equal to 0.

Assume that CPU3, 2 and 1 first make one memory reference (i.e, a load or a store) each and then repeats that interleaving.

The following four bus transaction types can be seen on the snooping bus connecting the CPUs:

- RTS: ReadtoShare (reading the data with the intention to read it)
- RTW, ReadToWrite (reading the data with the intention to modify it)
- WB: Writing data back to memory
- INV: Invalidating other caches copies

Show every state <u>change</u> and/or value <u>change</u> of A and B in each CPU's cache according to one possible interleaving of the memory accesses. After the parallel execution is done for all of the CPUs, the cache lines still in the caches will be replaced. These actions should also be shown. For each line, also state what bus transaction occurs on the bus (if any) as well as which device is providing the corresponding data (if any).



#### **Example of a state transition sheet:**

| CPU action | Bus<br>Transactio<br>n (if any) | State/value after the CPU action |             |     |             |     |          | Data is provided by<br>[Cache 1, 2, 3 or<br>Mem]<br>(if any) |
|------------|---------------------------------|----------------------------------|-------------|-----|-------------|-----|----------|--------------------------------------------------------------|
|            |                                 |                                  | CPU1<br>A B |     | CPU2<br>A B |     | PU3<br>B | (II ally)                                                    |
| Initially  |                                 | Ι                                | Ι           | Ι   | Ι           | Ι   | Ι        |                                                              |
| CPU3: LD A | RTS(A)                          |                                  |             |     |             | S/1 |          | Mem                                                          |
| CPU2: LD A | RTS(A)                          |                                  |             | S/1 |             |     |          | Mem                                                          |
| CPU1: LD A | RTS(A)                          | S/1                              |             |     |             |     |          | Mem                                                          |
| CPU3: LDA  |                                 |                                  |             |     |             |     |          |                                                              |
|            |                                 |                                  |             |     |             |     |          |                                                              |

#### AVDARK 2013

Dept of Information Technology www.it.uu.se

## What are Memory Models?

Erik Hagersten Uppsala University Sweden





## **Dekker's Algorithm** (mutual exclusion)





## **Memory Ordering**

- Coherence defines a per-datum valuechange order
- Memory model defines the valuechange order for all the data.

AVDARK 2013



## **Memory Ordering**

- Defines the guaranteed memory ordering
- Is a "contract" between the HW and SW guys
- Without it, you may not be able to say much about the result of a parallel execution



## **Observing order in SW** In which order did A and B change value?

### Value order

LD newA; LD oldB  $\rightarrow$  newA before newB ST newA; LD oldB  $\rightarrow$  newA before newB ST newA; ST newB  $\rightarrow$  newA before newB

### Program order

The order of program statements of each thread



# In which global order were these statements executed?

(A' denotes a modified value to the data at addr A)

Thread 1

Thread 2





## **One possible Another possible** observed order observed order

Thread 1 Thread 2





### "The intuitive memory order" Sequential Consistency (Lamport)



- Global order achieved by *interleaving* <u>all</u> memory accesses from different threads
- SW should not be able to detect contradictive orders
- "Programmer's intuition is maintained"
- Unnecessarily restrictive ==> performance penalty





## **Dekker's Algorithm** (mutual exclusion)



AVDARK 2013

Dept of Information Technology www.it.uu.se

Undefined



### **Proving Dekker's Algorithm under SC**





2013

### **Can the case "both win" happen under SC?**

<u>Acess graph</u>





## **One thread wins under SC**

Only Right wins  $\rightarrow$  SC is OK





## No thread wins under SC

No thread wins  $\rightarrow$  SC is OK





# Summary

- Gives the "illusion" of one global order between all memory accesses
- If two threads can observe two contratictive [partial] orders, SC is broken.
- Maintains human intuition

at the cost of performance (or complexity)

# **Other Memory Models**

### Erik Hagersten Uppsala University

### "Almost intuitive memory model" Total Store Ordering [TSO] (P. Sindhu)



- Global *interleaving* [order] for <u>all</u> stores from different threads (own stores excepted)
- "Programmer's intuition is almost maintained"
  - Flag synchronization? Yes
  - Store causality? Yes
  - Does Dekker work? No
- Unnecessarily restrictive ==> performance penalty

Dept of Information Technology www.it.uu.se



# **TSO HW Model**



AVDARK 2013

Stores are moved off the critical path Coherence implementation can be the same as for SC

Dept of Information Technology www.it.uu.se

MP 76





## Dekker's Algorithm (mutual exclusion)





# **Dekker's Algorithm for TSO**



Memory barrier: Tells the HW to not start the LD until all previous stores have been "globaly ordered"
→ behaves like SC
→ Dekker's algorithm works!



### Weak/release Consistency (M. Dubois, K. Gharachorloo)



- Most accesses are unordered
- "Programmer's intuition is not maintained"
  - Flag synchronization? No
  - Causal correctness? No
  - Dekker? No
- Global order <u>only</u> established when the programmer explicitly inserts <u>memory barrier</u> instructions

AVDARK 2013

- ++ Better performance!!
- -- Interesting bugs!!

Dept of Information Technology www.it.uu.se



2013

# Weak/Release consistency

- New flag synchronization needed
  - while (flag != 1) {}; A := data;membarrier; membarrier; flag := 1; X := A;
- Dekker's: same as TSO
- Causal correctness provided for this code







### Processor Consistency [PC] (J. Goodman)



- PC: The stores from a processor appears to others in program order.
  - Flag synchronization? Yes
  - Causal correctness? Not clearly defined by Goodman. (yes, for PC "with causal correctness")
  - Dekker? No

# Learning more about memory models

Shared Memory Consistency Models: A Tutorial by Sarita Adve, Kouroush Gharachorloo in IEEE Computer 1996 (in the "Papers" directory)

RTFM: Read the F\*\*\*\*n Manual of the system you are working on! (Different microprocessors and systems supports different memory models.)

### **Issue to think about:**

What code reordering may compilers really do? What does "volatile" declarations in C mean?



# X86's new memory model

- Processor consistency with causual correctness for non-atomic memory ops
- TSO for atomic memory ops
- Academia says the x86 mem model is TSO
- Link to the Intel manual (Section 8.2)

http://download.intel.com/products/processor/manual/325462.pdf

Video presentation:

AVDARK 2013 http://www.youtube.com/watch?v=WUfvvFD5tAA&hI=sv

# Synchronization

Erik Hagersten Uppsala University Sweden





AVDARK 2013

### Need to introduce synchronization

Locking primitives are needed to ensure that only one process can be in the critical section:

```
LOCK(lock_variable) /* wait for your turn */
if (sum > threshold) {
    sum := my_sum + sum
}
UNLOCK(lock_variable) /* release the lock*/
Critical Section
```

```
if (sum > threshold) {
   LOCK(lock_variable) /* wait for your turn */
   sum := my_sum + sum
   UNLOCK(lock_variable) /* release the lock*/
```

#### **Critical Section**

Dept of Information Technology www.it.uu.se



### **Components of a Synchronization Event**

- Acquire method
  - Acquire right to the synchronization (enter critical section, go past sync event)
- Waiting algorithm
  - Wait for synch to become available when it isn't
- Release method
  - Enable other processors to acquire right to the synch

AVDARK 2013



### **Atomic Instruction to Acquire** Atomic example: test&set "TAS R1, (location)"

The value at Mem(location) is loaded into R1, and the constant "1" atomically stored into Mem(location) (Other constant could be implemented, e.g., SPARC: "FF")

# Looks like a "store" to the coherence protocol Implementation:

- 1. Get a writable exclusive copy of the cache line (state M in MOSI)
- 2. Make the atomic modification to that cached copy

### **Examples of other atomic primitives:**

**SWAP R1, (location):** atomically swap the values of R1 with Mem(location) **AVDARK** 2013

Dept of Information Technology www.it.uu.se



# Waiting Algorithms

### Blocking

- Waiting processes/threads are de-scheduled
- High overhead
- Allows processor to do "other things"

### **Busy-waiting**

- Waiting processes repeatedly test a lock\_variable until it changes value
- Lower overhead, but consumes processor resources
- Can cause coherence network traffic

### Hybrid methods: busy-wait a while, then block



# **Release Algorithm**

Typically just a store "0"
 More complicated locks may require a conditional store or a "wake-up".



| A Bad Example: "POUNDING"                                                                                                                                                                      |                                                                                                               |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| <ul> <li>How is TAS treated by the coherence protocol?</li> <li>Like a CPU read operation</li> <li>Like a CPU write opreration</li> <li>By performing the "SWAP" atomically in DRAM</li> </ul> |                                                                                                               |
| <u>proc</u> lock(lock_variable) {<br>while (TAS[lock_variable]==1) {} /* pound on the lock until free */                                                                                       |                                                                                                               |
| }                                                                                                                                                                                              | If two threads are waiting for the lock                                                                       |
| <u>proc</u> unlock(lock_variable) {<br>lock_variable := 0                                                                                                                                      | <ul> <li>They will both spin locally in their cache</li> <li>They will create coherence traffic by</li> </ul> |
| <pre>}</pre>                                                                                                                                                                                   | <ul><li>invalidating each other</li><li>They will both block and need to be</li></ul>                         |
|                                                                                                                                                                                                | woken up by the OS                                                                                            |
| Assume: The function TAS[addr] returns the current memory value at                                                                                                                             |                                                                                                               |
| addr and <i>atomically</i> writes the busy pattern "1" to the memory                                                                                                                           |                                                                                                               |

AVDARK 2013

### **Spinning threads produce traffic!**

Dept of Information Technology www.it.uu.se



### **Optimistic Test&Set Lock "spinlock"**

```
proc lock(lock_variable) {
   while true {
    if (TAS[lock_variable] ==0) break; /* pound on the lock once, done if TAS==0 *,
                                  /* spin locally in your cache until "0" observed*/
    while(lock_variable != 0) { }
 }
}
                              If two threads are waiting for the lock
                              □ They will mostly spin locally in their cache
proc unlock(lock_variable) {
                              □ They will create coherence traffic all the time
   lock variable := 0
                                 by invalidating each other
}
                              They will both block and need to be woken
                                 up by the OS
                     Much less coherence traffic!!
```

Much less coherence traffic!! -- still lots of traffic at lock handover! More on this during Scalable Synchronization



**AVDARK** 

2013

### **Pesimistic Test&Set Lock "spinlock"**

```
proc lock(lock_variable) {
    while true {
        while(lock_variable != 0) {} /* spin locally in your cache until "0" observed*/
        if (TAS[lock_variable] == 0) break; /* pound on the lock once, done if TAS==0
    }
}
proc unlock(lock_variable) {
    lock_variable := 0
}
```

#### Slightly less traffic than Optimistic for contended locks -- still lots of traffic at lock handover! More solutions during Scalable Shared Memory





2013

# ...messy (part 2)



```
potentially: \sim N*N/2 reads :-(
```

Problem1: Contention on the interconnect slows down the CS execution Problem2: The lock hand-over time is N\*read\_throughput Fix1: Some back-off strategy, bad news for hand-over latency AVDARK Fix2: Queue-based locks

Dept of Information Technology www.it.uu.se

# **Barrier Synchronization**

### Erik Hagersten Uppsala University



2013

# **Barrier Synchronization**





# **Barriers:** Make the first threads wait for the last thread to reach a point in the program

- Software algorithms implemented using locks, flags, counters
- 2. Hardware barriers
  - "Wired-AND" line separate from address/data bus
  - Set input high when arrive, wait for output to be high to leave
  - (In practice, multiple wires to allow reuse)
  - Difficult to support arbitrary subset of processors

AVDARK 2013



### **A Naiive Centralized Barrier**

```
BARRIER (bar_name, p) {
  LOCK(bar_name.lock) {
    if (bar_name.counter == p) bar_name.counter = 0; /* init count*/
    bar_name.counter++; /* globally increment the barrier count *,
    }
  UNLOCK(bar_name.lock)
  while (bar_name.counter < p) {}; /* wait for the last thread */</pre>
```

```
AVDARK
2013
```

}



**AVDARK** 

2013

### **A More Complicated Centralized Barrier**

```
BARRIER (bar_name, p) {
int loops;
loops = 0;
local sense = !(local sense) ;
                                       /* toggle private sense variable
                                       each time the barrier is used */
LOCK(bar_name.lock);
  bar_name.counter++;
                                      /* globally increment the barrier count *,
  if (bar_name.counter == p) { /* everybody here yet ? */
        bar_name.flag = local_sense; /* release waiters*/
       UNLOCK(bar_name.lock)
     }
  else
    { UNLOCK(bar_name.lock);
       while (bar_name.flag != local_sense) { /* wait for the last guy */
         if (loops++ > UNREASONABLE) report_warning(pid)}
     }
```