# **Brief Introduction to Multiprocessing**

CSE 240A

Dean Tullsen

# **Multiprocessors**

- why would you want a multiprocessor? what things can it do well?
- What things can't it do well?
- Multicore vs. big uniprocessor? more is better?

# **What's wrong with the uniprocessor?**

- Complexity
- Power
- Lack of Instruction Level Parallelism
- Marginal gains of incremental logic

# **Uniprocessor Complexity**

- The complexity/size of many functional blocks scales quadratically with issue width.
- When IW = 2 or 4, no big deal. Starts to hurt at 8+.
- Rename table has *O* x *W* ports
  - O = # operands
  - W = fetch width
- Issue queue must do *Q* x *O* x *W* comparisons.
  - Q = size of IQ (typically grows as W grows)
- Bypass logic is a W x W interconnect.
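As a rough illustration of the scaling claims above, the sketch below tabulates the three quantities for a few widths. The values O = 3 operands per instruction and Q = 8 x W issue-queue entries are assumptions chosen only for illustration; they do not come from the slides.

```python
# Illustrative only: O (operands per instruction) and the rule Q = 8 * W
# are assumed values, not figures from the lecture.
O = 3


def structure_costs(w, o=O, q_per_w=8):
    """Return (rename ports, issue-queue comparisons, bypass paths) for width w."""
    q = q_per_w * w              # IQ size assumed to grow linearly with W
    rename_ports = o * w         # O x W
    iq_comparisons = q * o * w   # Q x O x W
    bypass_paths = w * w         # W x W interconnect
    return rename_ports, iq_comparisons, bypass_paths


for w in (2, 4, 8, 16):
    ports, cmps, paths = structure_costs(w)
    print(f"W={w:2d}  rename ports={ports:3d}  "
          f"IQ comparisons={cmps:5d}  bypass paths={paths:3d}")
```

Under these assumptions, doubling W doubles the rename ports but quadruples both the issue-queue comparisons and the bypass paths.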






#### **Interconnection Network**

- Bus
- Network
- pros/cons?





### **Memory Topology**

- UMA (Uniform Memory Access)
- NUMA (Non-uniform Memory Access)
- pros/cons?



# **Programming Model**

- Shared Memory -- every processor can name every address location
- Message Passing -- each processor can name only its local memory. Communication is through explicit messages (multicomputer).
- pros/cons?



- Example: find the max of 100,000 integers on 10 processors.
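One way to picture this exercise is a split-then-reduce structure: each worker finds the max of its own slice, and a final step combines the partial results. The sketch below is an assumption-laden stand-in, not the lecture's code. Python threads play the role of the 10 processors (they will not give a real speedup under the interpreter lock), and `parallel_max` is a hypothetical helper name.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_VALUES = 100_000
N_WORKERS = 10


def parallel_max(values, n_workers=N_WORKERS):
    """Split values into contiguous chunks, take each chunk's max, then reduce."""
    if not values:
        raise ValueError("parallel_max needs at least one value")
    chunk = (len(values) + n_workers - 1) // n_workers   # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        local_maxes = list(pool.map(max, chunks))        # one partial result per chunk
    return max(local_maxes)                              # final reduction step


random.seed(0)
data = [random.randrange(1_000_000) for _ in range(N_VALUES)]
print(parallel_max(data) == max(data))
```

The partial results are returned to the caller rather than written into a common variable, so this version needs no locks; the cost is the explicit final reduction.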


# **Parallel Programming**



- Shared-memory programming requires synchronization to provide mutual exclusion and prevent race conditions
  - locks (semaphores)
  - barriers
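A minimal sketch of both primitives, assuming Python threads as stand-ins for processors: every thread merges its local max into one shared variable under a lock, then waits at a barrier before reading the final value. The function name `shared_max` and the strided partitioning are illustrative choices, not taken from the slides.

```python
import threading


def shared_max(values, n_threads=10):
    """Each thread scans a slice, then merges its local max into shared state."""
    result = {"max": values[0]}               # shared state visible to every thread
    lock = threading.Lock()                   # mutual exclusion for the merge
    barrier = threading.Barrier(n_threads)    # all threads meet here after merging
    seen_by_threads = [None] * n_threads

    def worker(tid):
        # Strided slice; default covers the case of more threads than values.
        local = max(values[tid::n_threads], default=values[0])
        with lock:                            # critical section: read-modify-write
            if local > result["max"]:
                result["max"] = local
        barrier.wait()                        # past this point every merge is done
        seen_by_threads[tid] = result["max"]  # safe to read: no writers remain

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["max"], seen_by_threads


overall, seen = shared_max(list(range(1000)))
print(overall, all(s == overall for s in seen))
```

Without the lock, two threads could both read the old maximum and the smaller update could overwrite the larger one. Without the barrier, a thread could read the shared value before the others have merged.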

# Multiprocessor Caches (Shared Memory)

- the problem -- cache coherency
- the solution?



### What Does Coherence Mean?

- Informally:
  - Any read must return the most recent write
  - Too strict and very difficult to implement
- Better:
  - A processor sees its own writes to a location in the correct order.
  - Any write must eventually be seen by a read
  - All writes are seen in order ("serialization"). Writes to the same location are seen in the same order by all processors.
- Without these guarantees, synchronization doesn't work.


# **Cache Coherency**

- *write-update*
  - on each write, each cache holding that location updates its value
- *write-invalidate* (most common)
  - on each write, each cache holding that location invalidates the cache line.



- both schemes MUCH easier on a bus-based multiprocessor
- potentially requires a LOT of messages, but...
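To make the message-count concern concrete, here is a toy model of the two policies. It assumes four caches, a single address, and write-through to memory to keep the bookkeeping simple; `Cache`, `write`, and `run` are hypothetical names, and the model ignores the bus itself.

```python
class Cache:
    def __init__(self):
        self.lines = {}   # address -> value; a missing key means "not cached"


def write(caches, memory, writer, addr, value, policy):
    """Apply one write; return how many other caches had to be sent a message."""
    messages = 0
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache.lines:
            continue                      # only caches holding the line are contacted
        messages += 1
        if policy == "update":
            cache.lines[addr] = value     # other copies receive the new value
        else:
            del cache.lines[addr]         # other copies are dropped
    caches[writer].lines[addr] = value
    memory[addr] = value                  # write-through assumed for simplicity
    return messages


def run(policy, n_writes=5):
    """Cache 0 writes the same address n_writes times; all caches start with a copy."""
    memory = {0x40: 0}
    caches = [Cache() for _ in range(4)]
    for c in caches:
        c.lines[0x40] = 0
    return sum(write(caches, memory, 0, 0x40, v, policy)
               for v in range(1, n_writes + 1))


print("update:", run("update"), "invalidate:", run("invalidate"))
```

In this scenario, write-update contacts the three other caches on every write, while write-invalidate contacts them only on the first write, after which no other copies exist.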

#### **Cache Coherency**

- A good cache coherency protocol can avoid sending unnecessary (and expensive) invalidate or update messages.
- Allows each cache line to be in one of several states.
- MESI (Illinois)
  - modified
  - exclusive
  - shared
  - invalid


### **Cache Coherency**

- How do you know when an external read/write occurs?
- Snooping protocols
- Directory protocols






### **Potential Solutions**

- Snooping Solution (Snoopy Bus):
  - Send all requests for unknown data to all processors
  - Processors snoop to see if they have a copy and respond accordingly
  - Requires "broadcast", since caching information is at processors
  - Works well with bus (natural broadcast medium)
  - Dominates for small scale machines
- Directory-Based Schemes
  - Keep track of what is being shared in one centralized place
  - Distributed memory => distributed directory (avoids bottlenecks)
  - Send point-to-point requests to processors
  - Scales better than Snoop
  - Actually existed BEFORE Snoop-based schemes
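The difference in who gets contacted on a write can be sketched as follows. This is a simplified model under stated assumptions: the directory keeps only a set of sharer ids per block, and `Directory` and `snoop_write_targets` are hypothetical names rather than part of any real protocol specification.

```python
class Directory:
    """Tracks, per block, which processors hold a copy (a set of sharer ids)."""

    def __init__(self):
        self.sharers = {}   # block address -> set of processor ids

    def read(self, proc, addr):
        self.sharers.setdefault(addr, set()).add(proc)

    def write(self, proc, addr):
        """Return the processors that must receive a point-to-point invalidate."""
        targets = self.sharers.get(addr, set()) - {proc}
        self.sharers[addr] = {proc}   # the writer becomes the only holder
        return sorted(targets)


def snoop_write_targets(proc, n_procs):
    """A snooped write is broadcast: every other processor must check its cache."""
    return [p for p in range(n_procs) if p != proc]


directory = Directory()
directory.read(1, 0x80)
directory.read(5, 0x80)
print(directory.write(1, 0x80))            # only processor 5 holds another copy
print(len(snoop_write_targets(1, 64)))     # broadcast reaches all 63 others
```

The directory sends one message where the broadcast reaches 63 processors, which is the intuition behind the better scaling; the price is the extra state and the lookup at the directory.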


# An Example Snoopy Protocol – MESI (or Illinois) protocol

- Invalidation protocol, assumes write-back cache
- Each block of memory is in one state:
  - Clean in all caches and up-to-date in memory
  - Dirty in exactly one cache
  - Not in any caches
- Each cache block is in one state:
  - (M)odified: cache has only copy, it's writable, and dirty
  - (E)xclusive: cache has only copy, but it's clean
  - (S)hared: block can be read
  - (I)nvalid: block contains no data
- Read (and write) misses: cause all caches to snoop
- Writes to shared line are treated as misses

#### **MESI Protocol**



### **Other protocols**

- MESI protocol
  - Big advantage over 3-state protocol (no shared private state) because it doesn't require synch messages for private data.
- MOESI = Modified, Owned, Exclusive, Shared, Invalid
  - Owned (dirty in multiple caches, owned in one) => owner responsible for writing back shared, dirty line.
- What traffic does MOESI avoid?

### **Multicore Architectures**

- What is unique/different about multicore architectures?
- Bus or network?
- Shared memory or message passing?
- Need coherence?



Low latency communication. Cores close, memory far away.


### A case study – Intel Nehalem (Core i7)



### Single-ISA Heterogeneous Multicore Architectures

- *Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction*, Rakesh Kumar, Keith Farkas, Norm P. Jouppi, Partha Ranganathan, Dean M. Tullsen, in *36th International Symposium on Microarchitecture*, December 2003.
- If you are putting a bunch of cores on a single processor, why make them all the same?
- Having heterogeneous cores greatly increases the chance that a thread running on the processor finds a core well suited to its execution needs.


# **Multiprocessors – Key Points**

- Network vs. Bus
- Message-passing vs. Shared Memory
- Shared Memory is more intuitive, but creates problems for both the programmer (memory consistency, requiring synchronization) and the architect (cache coherency).
