

# SCIMA: Software Controlled Integrated Memory Architecture for HPC

## Background

- Memory wall problem
- Conventional caches perform poorly for HPC workloads
  - unwanted line conflicts
  - fixed transfer size for Off-Chip Memory accesses

## Solution: SCIMA (Software Controlled Integrated Memory Architecture)

- Strategy: software controllability
- Addressable On-Chip Memory in addition to conventional cache
  - On-Chip Memory and cache are reconfigurable
- Explicit data transfer between On-Chip Memory and Off-Chip Memory via page-load/page-store instructions
  - Burst transfer and stride transfer are supported
  - SMP configuration is supported
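To make the burst and stride transfers listed above concrete, here is a minimal software model in C. It is only a sketch: the function names `page_load_stride` and `page_store_stride` and their parameters are hypothetical, and the real page-load/page-store operations are hardware instructions whose encoding and operands are not described here. A burst transfer corresponds to the special case `stride == block`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a strided page-load: gather `count` blocks of
 * `block` bytes, spaced `stride` bytes apart in off-chip memory, into a
 * contiguous on-chip buffer. */
void page_load_stride(unsigned char *on_chip, const unsigned char *off_chip,
                      size_t block, size_t stride, size_t count)
{
    assert(stride >= block); /* blocks must not overlap in off-chip memory */
    for (size_t i = 0; i < count; i++)
        memcpy(on_chip + i * block, off_chip + i * stride, block);
}

/* Hypothetical model of a strided page-store: scatter the contiguous
 * on-chip buffer back to the same strided off-chip locations. */
void page_store_stride(unsigned char *off_chip, const unsigned char *on_chip,
                       size_t block, size_t stride, size_t count)
{
    assert(stride >= block);
    for (size_t i = 0; i < count; i++)
        memcpy(off_chip + i * stride, on_chip + i * block, block);
}
```

In this model, a strided load packs only the useful elements into On-Chip Memory (for example, one column of a row-major matrix), so no on-chip space or off-chip bandwidth is spent on the unused bytes between them.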



### **Advantages of SCIMA**

Tb: CPU busy time, Tl: Latency stall time, Tt: Throughput stall time

| On-Chip Memory Features             | Tb | Tl | Tt |
|-------------------------------------|----|----|----|
| software controllability            | -  | -  | ↓  |
| page-load/page-store (burst)        | ↑  | ↓  | -  |
| page-load/page-store (stride)       | ↑  | ↓  | ↓  |
| scheduling for page-load/page-store | -  | ↓  | -  |

| Latency Tolerating Techniques of Cache | Tb | Tl | Tt |
|----------------------------------------|----|----|----|
| larger cache line                      | -  | ↓  | ↑  |
| lock-up free cache                     | -  | ↓  | ↑  |
| cache prefetching                      | ↑  | ↓  | ↑  |
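The three columns can be read as components of total execution time. As a sketch, and assuming the components do not overlap so that they simply add up (the symbol $T$ for the total is introduced here for illustration):

$$
T = T_b + T_l + T_t
$$

Under this reading, a technique is a net win only if the reduction in one term outweighs any increase it causes in the others. The cache techniques in the table reduce $T_l$ at the cost of a larger $T_t$, while the On-Chip Memory features reduce $T_l$ and $T_t$ without increasing $T_t$.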

### **Evaluation Results**

[Figure: QCD benchmark, panels for 1CPU and 4CPUs, bars labelled 32B and 128B, vertical axis from 0 to 5.0E+07]

- Throughput ratio = 2:1 (ratio between On-Chip and Off-Chip memory throughput)
- Latency = 40 cycles (memory access latency for Off-Chip memory, i.e. latency for the first data)

The performance of SCIMA scales much better than that of a conventional cache in an SMP configuration because bus stalls are reduced.
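The effect of transfer granularity under the 40-cycle Off-Chip latency quoted above can be illustrated with a rough cost model. This is a sketch under stated assumptions, not the simulator used for the evaluation: the function `transfer_cycles` is hypothetical, transfers are assumed not to overlap, and the off-chip bandwidth (`bytes_per_cycle`) is a free parameter because only the 2:1 throughput ratio is given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost model: cycles needed to move `bytes` from off-chip
 * memory in chunks of `chunk` bytes. Each chunk pays the first-data
 * latency once; the data itself streams at `bytes_per_cycle`.
 * Assumes transfers are issued one after another with no overlap. */
unsigned long transfer_cycles(size_t bytes, size_t chunk,
                              unsigned long latency, size_t bytes_per_cycle)
{
    assert(chunk > 0 && bytes_per_cycle > 0);
    size_t chunks = (bytes + chunk - 1) / chunk;                     /* round up */
    size_t stream = (bytes + bytes_per_cycle - 1) / bytes_per_cycle; /* round up */
    return (unsigned long)chunks * latency + (unsigned long)stream;
}
```

With an assumed 4 KB region and an assumed 4 bytes per cycle, `transfer_cycles(4096, 32, 40, 4)` gives 6144 cycles, `transfer_cycles(4096, 128, 40, 4)` gives 2304, and a single burst, `transfer_cycles(4096, 4096, 40, 4)`, gives 1064. The latency term shrinks as the transfer unit grows, which is the intuition behind issuing one large explicit transfer instead of many line-sized ones.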