# Multiprocessors and Coherent Memory

## Erik Hagersten Uppsala University



# Symming up Coherence

*Good trong delighted too strong Chere can be many copies of a datum, but only one value* 

There is a single global order of value changes to each datum

After the computer stops, all copies should have the same value

AVDARK 2013

Thread1 =  $\{1, 2, 3, 4, 5, 6, 7...\}$  Thread2 =  $\{1, 4, 7...\}$ 



Dept of Information Technology www.it.uu.se



## Where does coherence matter?





# **Summary Coherence**

- Coherent shared-memory programming model requires coherence
- (There is also non-coherent shared memory, e.g. single-sided MPI, PGAS)
- All threads can read and write shared data.
- Coherent view of the value of <u>a</u> datum
   Often: Coherence is kept per cache line.

# **Snooping Coherence**

## Erik Hagersten Uppsala University





Dept of Information Technology www.it.uu.se

**IRL-Coherence** 7

© Erik Hagersten | user.it.uu.se/~eh



Dept of Information Technology www.it.uu.se





## Example: MOSI Bus Snoop

STATES: M – Modified:

S – Shared:

O – Owner:

I – Invalid:

### **BUS TRANSACTIONS FROM OTHERS:**

**BUSrts** 

**BUSrtw:** 

BUSinv:

BUSwb

AVDARK 2013

\*Dirty: my value differs from the old value in mem. It is my job to update mem. Dept of Information Technology www.it.uu.se IRL-Coherence 10



2013

# Example: MOSI Bus Snoop

#### **STATES:**

M – Modified: My dirty\* copy is the only cached copy
S – Shared: I have a clean copy, others may also have a copy
O – Owner: I have a dirty copy, others may also have a copy
I – Invalid: I have no valid copy in my cache (including cache miss)

#### **BUS TRANSACTIONS FROM OTHERS:**

BUSrts ReadtoShare. Reading the data

**BUSrtw:** ReadToWrite. Reading the data with the intention to modify it right away

BUSinv: Invalidating other caches copies

**BUSwb**: Writing data back dirty data to memory

\*Dirty: my value differs from the old value in mem. It is my job to update mem and Dept of Information Technology www.it.uu.se IRL-Coherence 11







Input-signal/Reply-signal Meaning: If you are in state M and see BUSrts, goto state O and reply with Data



## **Example: CPU access MOSI**

#### FROM MY CPU:

CPUread Caused by a Load instruction

**CPUwrite**: Caused by a Store or Atomic instruction

**CPUrepI**: Caused by a replacement of this cachline (caused by murphy ③)

M CpuT/BusT O

Input-signal/Reply-signal Meaning: If you are in state M and see BUSrts, goto state O and reply with Data

AVDARK 2013

**IRL-Coherence 12** 

S



## The MOSI Bus Snoop

#### **STATES:**

M – Modified: My dirty\* copy is the only cached copy
S – Shared: I have a clean copy, others may also have a copy
O – Owner: I have a dirty copy, others may also have a copy
I – Invalid: I have no valid copy in my cache (including cache miss)

### **BUS TRANSACTIONS FROM OTHERS:**

BUSrts ReadtoShare. Reading the data

**BUSrtw:** ReadToWrite. Reading the data with the intention to modify it right away

**BUSinv**: Invalidating other caches copies

BUSwb: Writing data back to memory



\*Dirty: my value differs from the old value in mem Dept of Information Technology www.it.uu.se





2013

## The MOSI CPU access



#### FROM MY CPU:

CPUread Caused by a Load instruction

**CPUwrite**: Caused by a Store or Atomic instruction

**CPUrepI**: Caused by a replacement of this cachline (caused by murphy ③)

Dept of Information Technology www.it.uu.se

# **Coherence exercises**



#### **Example of a state transition sheet:**

| CPU action | Bus<br>Transactio<br>n (if any) | State/value after the CPU action |         |         |          |         |          | Data is provided by<br>[Cache 1, 2, 3 or<br>Mem]<br>(if any) |  |
|------------|---------------------------------|----------------------------------|---------|---------|----------|---------|----------|--------------------------------------------------------------|--|
|            |                                 | CP<br>A                          | U1<br>B | CP<br>A | PU2<br>B | CF<br>A | PU3<br>B |                                                              |  |
| Initially  |                                 | Ι                                | Ι       | Ι       | Ι        | Ι       | Ι        |                                                              |  |
| CPU3: LD A | RTS(A)                          |                                  |         |         |          | S/1     |          | Mem                                                          |  |
| CPU2: LD A | RTS(A)                          |                                  |         | S/1     |          |         |          | Mem                                                          |  |
| CPU1: LD A | RTS(A)                          | S/1                              |         |         |          |         |          | Mem                                                          |  |
| CPU3: LDA  |                                 |                                  |         |         |          |         |          |                                                              |  |
|            |                                 |                                  |         |         |          |         |          |                                                              |  |

AVDARK 2013

Dept of Information Technology www.it.uu.se



### Example during IRL Class:

All the three RISC CPUs (e.g. cores in a multicore) in a MOSI shared-memory sequentially consistent multiprocessor execute the following code almost at the same time:

CPU1: A = 1; CPU2:

while (A == 0) { }; B = 1;

CPU3A = 0:

```
A = 0;
while (B == 0){};
printf ("%d", A);
```

CPU3 will start slightly ahead of the others and will execute its first memory instruction (LD/ST) before CPU2 starts its first memory instruction, while CPU1 will start last. After this, the threads will take turns performing one memory instruction each in that order.

Initially A and B reside in memory only and both have the value 0.

After a long time, show the effects of replacement.



| UPPSALA    | - |
|------------|---|
| UNIVERSITE | 1 |

| Mem Instr           | Bus    | 1:A | 1:B | 2:A | 2:B | 3:A | 3:B | Data |
|---------------------|--------|-----|-----|-----|-----|-----|-----|------|
| 3:ST A              | RTW(A) |     |     |     |     | M/0 |     | Mem  |
| 2:LD A              | RTS(A) |     |     | S/0 |     | O/0 |     | \$3  |
| 1:ST A              | RTW(A) | M/1 |     | 1   |     | 1   |     | \$3  |
| 3:LD B              | RTS(B) |     |     |     |     |     | S/0 | Mem  |
| 2:LD A              | RTS(A) | O/1 |     | S/1 |     |     |     | \$1  |
| 3:LD B              |        |     |     |     |     |     |     |      |
| 2:ST B              | RTW(B) |     |     |     | M/1 |     | 1   |      |
| 3:LD B              | RTS(B) |     |     |     | O/1 |     | S/1 | \$2  |
| 3:LD A              | RTS(A) |     |     |     |     | S/1 |     | \$1  |
| LONG TIME<br>ELAPSE |        |     |     |     |     |     |     |      |
| 1:REPL A            | WB(A)  | 1   |     |     |     |     |     | \$1  |
| 2:REPL A            |        |     |     |     |     |     |     |      |
| 2:REPL B            | WB(B)  |     |     |     | 1   |     |     | \$2  |
| 3:REPL A            |        |     |     |     |     |     |     |      |
| 3:REPL B            |        |     |     |     |     |     |     |      |
|                     |        |     |     |     |     |     |     |      |

# **IRL Memory Models**



## Dekker's Algorithm (mutual exclusion)



#### AVDARK 2013



## Sequential Consistency (Lamport)





AVDARK 2013



2013

### Can the case "both win" happen under SC?

<u>Acess graph</u>







Loads(each thread can have at most one outstanding load)

 Global *interleaving* [order] for <u>all</u> stores from different threads (own stores excepted)

**AVDARK** 2013

Dept of Information Technology www.it.uu.se



2013

#### Q: What happens if the SB gets full?

# **TSO HW Model**





|                        | A Bad Examp                                                                                                                                                                                    | ble: "POUNDING"                                                                                                 |  |  |  |  |  |  |  |  |
|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|--|--|
| UPPSALA<br>UNIVERSITET | <ul> <li>How is TAS treated by the coherence protocol?</li> <li>Like a CPU read operation</li> <li>Like a CPU write opreration</li> <li>By performing the "SWAP" atomically in DRAM</li> </ul> |                                                                                                                 |  |  |  |  |  |  |  |  |
|                        | <pre>proc lock(lock_variable) {     while (TAS[lock_variable]==1) {} /* pound on the lock until free */</pre>                                                                                  |                                                                                                                 |  |  |  |  |  |  |  |  |
|                        | } If two threads are waiting for the lock                                                                                                                                                      |                                                                                                                 |  |  |  |  |  |  |  |  |
|                        | <pre>proc unlock(lock_variable) {     lock_variable := 0</pre>                                                                                                                                 | <ul> <li>They will both spin locally in their cache</li> <li>They will create coherence traffic by</li> </ul>   |  |  |  |  |  |  |  |  |
|                        | }                                                                                                                                                                                              | <ul> <li>invalidating each other</li> <li>They will both block and need to be<br/>woken up by the OS</li> </ul> |  |  |  |  |  |  |  |  |
|                        | Assume: The function TAS[addr] returns the current memory value at addr and <u>atomically</u> writes the busy pattern "1" to the memory                                                        |                                                                                                                 |  |  |  |  |  |  |  |  |
|                        |                                                                                                                                                                                                |                                                                                                                 |  |  |  |  |  |  |  |  |
|                        | Bus<br>CPU:                                                                                                                                                                                    |                                                                                                                 |  |  |  |  |  |  |  |  |





Dept of Information Technology www.it.uu.se



## **Optimistic Test&Set Lock "spinlock"**

```
proc lock(lock_variable) {
    while true {
        if (TAS[lock_variable] == 0) break; /* pound on the lock once, done if TAS==0 *
        while(lock_variable != 0) {} /* spin locally in your cache until "0" observed*/
    }
}
If two threads are waiting for the lock
    They will mostly spin locally in their cache
```

```
proc unlock(lock_variable) {
    lock_variable := 0
```

 They will mostly spin locally in their cache
 They will create coherence traffic all the time by invalidating each other

They will both block and need to be woken up by the OS

Much less coherence traffic!! -- still lots of traffic at lock handover! More on this during Scalable Synchronization

AVDARK 2013 }

# IRL: More Problem solving



### Extra Example not used in IRL Class:

All the three RISC CPUs in a **MOSI** shared-memory (sequentially consistent) multiprocessor executes the following code almost at the same time:

Initially, CPU1 has its local variable my\_id=1, CPU has my\_id=2 and CPU3 has my\_id=3 and the globally shared variables A is equal to 1 and B is equal to 0.

Assume that CPU3, 2 and then 1 first make one memory reference (i.e, a load or a store) each, after which they repeats that memory access interleaving.

The following four bus transaction types can be seen on the snooping bus connecting the CPUs:

- RTS: ReadtoShare (reading the data with the intention to read it)
- RTW, ReadToWrite (reading the data with the intention to modify it)
- WB: Writing data back to memory
- INV: Invalidating other caches copies

AVDARK 2013 Show every state <u>change</u> and/or value <u>change</u> of A and B in each CPU's cache according to one possible interleaving of the memory accesses. After the parallel execution is done for all of the CPUs, the cache lines still in the caches will be replaced. These actions should also be shown. For each line, also state what bus transaction occurs on the bus (if any) as well as which device is providing the corresponding data (if any).

| UPPSALA<br>UNIVERSITET |
|------------------------|
|                        |

| Mem Instr | Bus    | 1:A | 1:B | 2:A | 2:B | 3:A | 3:B | Data |
|-----------|--------|-----|-----|-----|-----|-----|-----|------|
| 3:LD A    | RTS(A) |     |     |     |     | S/1 |     | MEM  |
| 2:LD A    | RTS(A) |     |     | S/1 |     |     |     | MEM  |
| 1:LD A    | RTS(A) | S/1 |     |     |     |     |     | MEM  |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:LD B    | RTS(B) |     | S/0 |     |     |     |     | MEM  |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:ST B    | INV(B) |     | M/1 |     |     |     |     |      |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:ST A    | INV(A) | M/2 |     | 1   |     | I   |     |      |
| 3:LD A    | RTS(A) | 0/2 |     |     |     | S/2 |     | \$1  |
| 2:LD A    | RTS(A) |     |     | S/2 |     |     |     | \$1  |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:LD B    | RTS(B) |     | 0/1 |     | S/1 |     |     | \$1  |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:ST B    | INV(B) |     | I   |     | M/2 |     |     |      |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:ST A    | INV(A) | 1   |     | M/3 |     | I   |     |      |

| UPPSALA<br>UNIVERSITET |
|------------------------|
| OINTERDITET            |

| Mem Instr  | Bus    | 1:A | 1:B | 2:A | 2:B | 3:A | 3:B | Data |
|------------|--------|-----|-----|-----|-----|-----|-----|------|
| TRANSPORT  |        |     |     | M/3 | M/2 |     |     |      |
| 1:LD A     | RTS(A) | S/3 |     | O/3 |     |     |     | \$2  |
| 3:LD A     | RTS(A) |     |     |     |     | S/3 |     | \$2  |
| 2,1:LD A   |        |     |     |     |     |     |     |      |
| 3:LD B     | RTS(B) |     |     |     | 0/2 |     | S/2 | \$2  |
| 2,1:LD A   |        |     |     |     |     |     |     |      |
| 3:ST B     | INV(B) |     |     |     | 1   |     | M/5 |      |
| 2,1:LD A   |        |     |     |     |     |     |     |      |
| 3:ST A     | INV(A) | I   |     | I   |     | M/4 |     |      |
| 2:LD A     | RTS(A) |     |     | S/4 |     | O/4 |     | \$3  |
| 1:LD A     | RTS(A) | S/4 |     |     |     |     |     | \$3  |
| 3:LD A     |        |     |     |     |     |     |     |      |
| LONG TIME  |        |     |     |     |     |     |     |      |
| 1,2:REPL A |        | I   |     | 1   |     |     |     |      |
| 3:REPL A   | WB(A)  |     |     |     |     | 1   |     | \$3  |
| 3:REPL B   | WB(B)  |     |     |     |     |     | 1   | \$3  |
|            |        |     |     |     |     |     |     |      |

### **Using MSI**

UPPSALA JNIVERSITET

## All the three RISC CPUs in a **MSI** shared-memory (sequentially consistent) multiprocessor executes the following code almost at the same time:

Initially, CPU1 has its local variable my\_id=1, CPU has my\_id=2 and CPU3 has my\_id=3 and the globally shared variables A is equal to 1 and B is equal to 0.

Assume that CPU3, 2 and then 1 first make one memory reference (i.e, a load or a store) each, after which they repeats that memory access interleaving.

The following four bus transaction types can be seen on the snooping bus connecting the CPUs:

- RTS: ReadtoShare (reading the data with the intention to read it)
- RTW, ReadToWrite (reading the data with the intention to modify it)
- WB: Writing data back to memory
- INV: Invalidating other caches copies

AVDARK 2013 Show every state <u>change</u> and/or value <u>change</u> of A and B in each CPU's cache according to one possible interleaving of the memory accesses. After the parallel execution is done for all of the CPUs, the cache lines still in the caches will be replaced. These actions should also be shown. For each line, also state what bus transaction occurs on the bus (if any) as well as which device is providing the corresponding data (if any).

| UPPSALA     |  |
|-------------|--|
| UNIVERSITET |  |

| Mem Instr | Bus    | 1:A | 1:B | 2:A | 2:B | 3:A | 3:B | Data |
|-----------|--------|-----|-----|-----|-----|-----|-----|------|
| 3:LD A    | RTS(A) |     |     |     |     | S/1 |     | MEM  |
| 2:LD A    | RTS(A) |     |     | S/1 |     |     |     | MEM  |
| 1:LD A    | RTS(A) | S/1 |     |     |     |     |     | MEM  |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:LD B    | RTS(B) |     | S/0 |     |     |     |     | MEM  |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:ST B    | INV(B) |     | M/1 |     |     |     |     |      |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:ST A    | INV(A) | M/2 |     | I   |     | I   |     |      |
| 3:LD A    | RTS(A) | S/2 |     |     |     | S/2 |     | \$1  |
| 2:LD A    | RTS(A) |     |     | S/2 |     |     |     | MEM  |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:LD B    | RTS(B) |     | S/1 |     | S/1 |     |     | \$1  |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:ST B    | INV(B) |     | I   |     | M/2 |     |     |      |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:ST A    | INV(A) | I   |     | M/3 |     | I   |     |      |

| UPPSALA     |
|-------------|
| UNIVERSITET |

| Mem Instr  | Bus    | 1:A | 1:B | 2:A | 2:B | 3:A | 3:B | Data |
|------------|--------|-----|-----|-----|-----|-----|-----|------|
| TRANSPORT  |        |     |     | M/3 | M/2 |     |     |      |
| 1:LD A     | RTS(A) | S/3 |     | S/3 |     |     |     | \$2  |
| 3:LD A     | RTS(A) |     |     |     |     | S/3 |     | MEM  |
| 2,1:LD A   |        |     |     |     |     |     |     |      |
| 3:LD B     | RTS(B) |     |     |     | S/2 |     | S/2 | \$2  |
| 2,1:LD A   |        |     |     |     |     |     |     |      |
| 3:ST B     | INV(B) |     |     |     | 1   |     | M/5 |      |
| 2,1:LD A   |        |     |     |     |     |     |     |      |
| 3:ST A     | INV(A) | 1   |     | 1   |     | M/4 |     |      |
| 2:LD A     | RTS(A) |     |     | S/4 |     | S/4 |     | \$3  |
| 1:LD A     | RTS(A) | S/4 |     |     |     |     |     | MEM  |
| 3:LD A     |        |     |     |     |     |     |     |      |
| LONG TIME  |        |     |     |     |     |     |     |      |
| 1,2:REPL A |        | 1   |     | 1   |     |     |     |      |
| 3:REPL A   |        |     |     |     |     | I   |     |      |
| 3:REPL B   | WB(B)  |     |     |     |     |     | 1   | \$3  |
|            |        |     |     |     |     |     |     |      |

h

### **Using MOESI**

UPPSALA JNIVERSITET

### All the three RISC CPUs in a **MOESI** shared-memory (sequentially consistent) multiprocessor executes the following code almost at the same time:

Initially, CPU1 has its local variable my\_id=1, CPU has my\_id=2 and CPU3 has my\_id=3 and the globally shared variables A is equal to 1 and B is equal to 0.

Assume that CPU3, 2 and then 1 first make one memory reference (i.e, a load or a store) each, after which they repeats that memory access interleaving.

The following four bus transaction types can be seen on the snooping bus connecting the CPUs:

- RTS: ReadtoShare (reading the data with the intention to read it)
- RTW, ReadToWrite (reading the data with the intention to modify it)
- WB: Writing data back to memory
- INV: Invalidating other caches copies

AVDARK 2013 Show every state <u>change</u> and/or value <u>change</u> of A and B in each CPU's cache according to one possible interleaving of the memory accesses. After the parallel execution is done for all of the CPUs, the cache lines still in the caches will be replaced. These actions should also be shown. For each line, also state what bus transaction occurs on the bus (if any) as well as which device is providing the corresponding data (if any).

| Mem Instr | Bus    | 1:A | 1:B | 2:A | 2:B | 3:A | 3:B | Data |
|-----------|--------|-----|-----|-----|-----|-----|-----|------|
| 3:LD A    | RTS(A) |     |     |     |     | E/1 |     | MEM  |
| 2:LD A    | RTS(A) |     |     | S/1 |     | S/1 |     | MEM  |
| 1:LD A    | RTS(A) | S/1 |     |     |     |     |     | MEM  |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:LD B    | RTS(B) |     | E/0 |     |     |     |     | MEM  |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:ST B    | INW(B) |     | M/1 |     |     |     |     |      |
| 3,2:LD A  |        |     |     |     |     |     |     |      |
| 1:ST A    | INV(A) | M/2 |     | T   |     | 1   |     |      |
| 3:LD A    | RTS(A) | 0/2 |     |     |     | S/2 |     | \$1  |
| 2:LD A    | RTS(A) |     |     | S/2 |     |     |     | \$1  |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:LD B    | RTS(B) |     | 0/1 |     | S/1 |     |     | \$1  |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
| 2:ST B    | INV(B) |     | 1   |     | M/2 |     |     |      |
| 1,3:LD A  |        |     |     |     |     |     |     |      |
|           |        |     |     |     |     |     |     |      |

UPPSALA UNIVERSITET

h



#### Sync example

All the three RISC CPUs in a **MOSI** shared-memory (sequentially consistent) multiprocessor executes the following code almost at the same time:

#### barrier(B, 3);

Assume that CPU3, 2 and then 1 first make one memory reference (i.e, a load, store or atomic) each, after which they repeats that memory access interleaving. Fill out the state transition sheet for the entire execution including cache evictions at the and.



## **Barrier Synchronization**



IRL-Coherence 38



2013

#### Sync example

All the three RISC CPUs in a **MOSI** shared-memory (sequentially consistent) multiprocessor executes the following code almost at the same time:

```
barrier(B, 3);
```

Assume that CPU3, 2 and then 1 first make one memory reference (i.e, a load, store or atomic) each, after which they repeats that memory access interleaving. Fill out the state transition sheet for the entire execution including cache evictions at the and. Initially B.ctr = 0 and B.L = 0. Both reside in memory only.

```
DEF barrier (bar name, p) \{ /* we know that this barrier is way too simple,
   right? */
  lock(bar name.L);
                                       /* globally increment the barrier count
  bar name.ctr++ ;
   */
  unlock(bar name.L);
  while (bar name.ctr < p) {}; /* wait for the last thread */
DEF lock(lock variable) {
                                                   /* This is an optimistic
   spinlock */
  while true {
       if (TAS[lock variable] == 0) break; /* Pound once. Done if TAS returns 0
   */
                                                  /* Spin locally */
       while(lock variable != 0) {}
   }
}
```

DEF unlock (lock variable) Dept of unformation Technology www.it.uu.se

IRL-Coherence 39

| UPPSALA     |   |
|-------------|---|
| UNIVERSITET | 1 |

| Mem Instr     | Bus      | 1:ctr | 1:L  | 2:ctr | 2:L  | 3:ctr | 3:L  | Data |
|---------------|----------|-------|------|-------|------|-------|------|------|
| 3:TAS L in cs | RTW(L)   |       |      |       |      |       | M/FF | Mem  |
| 2:TAS L       | RTW(L)   |       |      |       | M/FF |       | 1    | 3    |
| 1:TAS L       | RTW(L)   |       | M/FF |       | 1    |       |      | 2    |
| 3:LD CTR      | RTS(ctr) |       |      |       |      | S/0   |      | Mem  |
| 2:LD L spin   | RTS(L)   |       | O/FF |       | S/FF |       |      | 1    |
| 1:LD L spin   |          |       |      |       |      |       |      |      |
| 3: ST CTR     | INV(ctr) |       |      |       |      | M/1   |      |      |
| 2:LD L spin   |          |       |      |       |      |       |      |      |
| 1:LD L spin   |          |       |      |       |      |       |      |      |
| 3:ST L unlock | RTW(L)   |       | 1    |       | 1    |       | M/0  | 1    |
| 2:LD L spin   | RTS(L)   |       |      |       | S/0  |       | 0/0  | 3    |
| 1:LD L spin   | RTS(L)   |       | S/0  |       |      |       |      | 3    |
| 3:LD CTR      |          |       |      |       |      |       |      |      |
| 2:TAS L in cs | INV(L)   |       | I    |       | M/FF |       | I    |      |
| 1:TAS L       | RTW(L)   |       | M/FF |       | 1    |       |      | 2    |
| 3:LD CTR      |          |       |      |       |      |       |      |      |
| 2: LD CTR     | RTS(ctr) |       |      | S/1   |      | O/1   |      | 3    |

h



| Mem Instr     | Bus      | 1:ctr | 1:L  | 2:ctr | 2:L | 3:ctr | 3:L | Data |
|---------------|----------|-------|------|-------|-----|-------|-----|------|
| Transport     |          |       | M/FF | S/1   | 1   | O/1   | I   |      |
| 1:LD L spin   |          |       |      |       |     |       |     |      |
| 3:LD CTR      |          |       |      |       |     |       |     |      |
| 2:ST CTR      | INV(ctr) |       |      | M/2   |     | I     |     |      |
| 1:LD L spin   |          |       |      |       |     |       |     |      |
| 3:LD CTR      | RTS(ctr) |       |      | 0/2   |     | S/2   |     | 2    |
| 2:ST L unlock | RTW(L)   |       | I    |       | M/0 |       |     | 1    |
| 1:LD L spin   | RTS(L)   |       | S/0  |       | O/0 |       |     | 2    |
| 3+2:LD CTR    |          |       |      |       |     |       |     |      |
| 1:TAS L in cs | INV(L)   |       | M/FF |       | I   |       |     |      |
| 3+2:LD CTR    |          |       |      |       |     |       |     |      |
| 1:LD CTR      | RTS(ctr) | S/2   |      |       |     |       |     | 2    |
| 3+2:LD CTR    |          |       |      |       |     |       |     |      |
| 1:ST CTR      | INV(ctr) | M/3   |      | I     |     | I     |     |      |
| 3:LD CTR done | RTS(ctr) | 0/3   |      |       |     | S/3   |     | 1    |
| 2:LD CTR done | RTS(ctr) |       |      | S/3   |     |       |     | 1    |
|               |          |       |      |       |     |       |     |      |



#### **A Naiive Centralized Barrier**

```
BARRIER (bar_name, p) {
  LOCK(bar_name.lock) {
    if (bar_name.ctr == p) bar_name.ctr = 0; /* init count*/
    bar_name.ctr++; /* globally increment the barrier count */
    }
  UNLOCK(bar_name.lock)
  while (bar_name.ctr < p) {}; /* wait for the last thread */
}</pre>
```

```
AVDARK
2013
```



2013

#### **A More Complicated Centralized Barrier**

```
BARRIER (bar_name, p) {
int loops;
loops = 0;
local sense = !(local sense) :
                                       /* toggle private sense variable
                                        each time the barrier is used */
LOCK(bar_name.lock);
   bar_name.counter++;
                                       /* globally increment the barrier count *,
   if (bar_name.counter == p) { /* everybody here yet ? */
        bar_name.flag = local_sense; /* release waiters*/
        UNLOCK(bar_name.lock)
     }
   else
    { UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) { /* wait for the last guy */
          if (loops++ > UNREASONABLE) report_warning(pid)}
     }
```

## Microbenchmarks



## Stepping through the array



Vector size [Bytes]







### Micro Benchmark Signature

for (times = 0; times < Max; times++) /\* many times\*/</pre>

for (i=0; i < ArraySize; i = i + Stride)</pre> dummy = A[i]; /\* touch an item in the array \*/



**AVDARK** 

2013







## **Micro Benchmark Signature**







AVDARK 2013 CL

#### **Micro Benchmark Signature**

for (times = 0; times < Max; time++) /\* many times\*/</pre>

for (i=0; i < ArraySize; i = i + Stride)
 dummy = A[i]; /\* touch an item in the array \*/</pre>





#### Twice as large L2 cache ???

for (times = 0; times < Max; time++) /\* many times\*/</pre>

for (i=0; i < ArraySize; i = i + Stride)
 dummy = A[i]; /\* touch an item in the array \*/</pre>





### Twice as large TLB...

for (times = 0; times < Max; time++) /\* many times\*/</pre>

for (i=0; i < ArraySize; i = i + Stride)
 dummy = A[i]; /\* touch an item in the array \*/</pre>



# Survey Coh & Mem mods





## My take-home message

- The correct answers were sometime wrong
- Sound quality
- Even shorter videos
- More questions during videos
- This part was harder to follow, use more examples
- More "hover-over hints"
- Question where the screen goes blank is no hit
- Didn't learn much from the first part of the lab
- Problem-solving in IRL is better the PPT
- More meat during lab lectures
- Please share some questions/stats with us

2013

1

Try to make the **sound** as clear as possible. Questions are good, more questions. Maybe even shorter videos?

2 I think this course is perfect. It is nice to get the information from different channels like videos, labs and classes. The only thing that I can come up with is that it can be **more questions** in the videos. They help you stay focused through the whole video, and also it's better when the video is around 20 minutes rather than 60. If it is 20 minutes you can watch it while eating breakfast for example.

The questions from the last batch had some problems. In one question there was not a **correct** 3 answer and on another one the correct one was not right. The last lecture from the previous batch had really low volume.

Frågorna där skärmen blankades till vitt så att all hela sammanhanget försvann var ingen hit. 4 More hands on examples we can all work through in class like today when the power went off. 5 It's bringing in the active learning aspect as opposed to running through the examples on PowerPoint. I for one, get more out of it, I'm sure there are others that feel the same way.

I think that it was better in the first videos where it was given some explanation to the answers 6 when hovering the mouse.

More questions in the videos 7

The IRL classes are very helpful when Erik holds them. It really sums up the important parts of 8 the videos and gives a deeper understanding of the concepts in the course. The one lecture we had with the lab assistants did not do much for understanding the course or understanding the lab assignment. I would really appreciate a good lab preparation lecture. The labs would be so much more helpful for understanding the course when you actually understand the assignment.

Can't say that learned anything new during the first lab. It was more bothersome since the terms 9 during the lectures were different from the terms used in the skeleton code. So it took awhile to understand what was what. However, the bonus hand-in was a great way to gain a better understanding of the course material.

I would like more questions during the web lectures. That way we need to use our heads more 10 frequently.

The first part of the lab was mostly just understanding and getting into the code written in 11 the file. I didn't feel like I gained any particular insights into caching or somesuch. Also the hints were not really hints when we mostly only had to change one function (access), they just confused us as to where the changes needed to be. The video lectures can be a chore to get through but they explain things well. I prefer regular lectures though.

Can't come up with anything for you to improve 12

- 13 I don't know. I think it is good!
- 14 This time it was **not clear.** : (

Dept of Information Technology www.it.uu.se IRL Coherence 55



15

#### Can't think of anything, except another Lab soon would be nice!

16 With the exception of some technical problems (i.e. **bad sound** in one of videos and the "order the consistency models" question didn't work correctly), I have no complaints.

17 Would be nice to have a **cheatsheet** or something for all the terms being thrown around so that if we what something means we can easily go back and check without going through a bunch of videos. All in all very good and understandable! The cache lab really made me get it!

Maybe I've misunderstood some recaps to the slides to be answered question, or people aren't asking anything during video lectures. If not any of these, I would like to hear what people are asking from the videos during the IRL classes. **A few words about answering statistics** could also be in place and spending a little more time where a lot of people got them wrong. Also: please hint in the videos that a quiz is coming. I've watched a few videos directly through Youtube and swapped to scalable-learning when a question comes (in one of those red boxes). However, some times a question is not hinted for - it just pops up. This may make me miss some quiz.

19 Compared to last time, the length of the video was perfect.

#### 20 More focused videos

21 **This lecture was a bit confusing**, consider giving more examples to concepts rather than glossing over them OR explain that you are glossing over them before you begin to explain the concept itself. There were several times where I would rewind to understand something that I later found out was not as necessary.

22 **Shorter videos** were definitely better, it is however difficult to judge how much the labs help as we have only done one lab.

The videos are a good idea, but sometimes they are too long, I think that if we could read some of the things said in the video, some videos wouldn't be that necessary. I also think that we should try more excercises in class (or maybe more hours of class per week)





Dept of Information Technology www.it.uu.se IRL Coherence 57