

Center for Computational Sciences, University of Tsukuba www.ccs.tsukuba.ac.jp

# **PACS-CS Performance Evaluation**



## Linpack Performance

| P×Q    | N      | Rmax<br>(Tflops) | Efficiency<br>(%) | 3D routing |
|--------|--------|------------------|-------------------|------------|
| 16×160 | 706560 | 10.33            | 72.05             | No         |
| 32×80  | 722944 | 10.35            | 72.20             | Yes        |

- 10.35 TFLOPS with 2560 nodes: #34 on the June 2006 TOP500 list (#2 among Japanese-manufactured machines)
- HPL performance differs with the 2-D process array configuration (P×Q), including the routing overhead on the 3-D HXB
- 6.7 hours of running time
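The figures in the table can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not part of the original evaluation. It assumes that efficiency is defined as Rmax/Rpeak and that the HPL operation count is dominated by the leading term 2/3·N³. The names `runs`, `peak_tflops`, and `hpl_hours` are illustrative.

```python
# Values copied from the Linpack table above.
runs = [
    {"grid": "16x160", "n": 706560, "rmax_tflops": 10.33, "eff_pct": 72.05},
    {"grid": "32x80", "n": 722944, "rmax_tflops": 10.35, "eff_pct": 72.20},
]
NODES = 2560  # node count quoted in the text


def peak_tflops(rmax_tflops, eff_pct):
    """Theoretical peak implied by Rmax and efficiency (assumes eff = Rmax / Rpeak)."""
    return rmax_tflops / (eff_pct / 100.0)


def hpl_hours(n, rmax_tflops):
    """Estimated run time from the leading-order HPL operation count 2/3 * N^3."""
    flops = (2.0 / 3.0) * n ** 3  # lower-order terms are negligible at this N
    return flops / (rmax_tflops * 1e12) / 3600.0


summary = []
for run in runs:
    rpeak = peak_tflops(run["rmax_tflops"], run["eff_pct"])
    per_node_gflops = rpeak * 1000.0 / NODES
    hours = hpl_hours(run["n"], run["rmax_tflops"])
    summary.append((run["grid"], rpeak, per_node_gflops, hours))
    print(f"{run['grid']}: Rpeak={rpeak:.2f} TFLOPS, "
          f"{per_node_gflops:.2f} GFLOPS/node, ~{hours:.2f} h")
```

Under these assumptions both rows imply the same system peak (about 5.6 GFLOPS per node), and the estimated run time for the 32×80 case lands close to the 6.7 hours quoted above.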

## Performance of PM/Ethernet-HXB network layer



- HXB(n) = sustained bandwidth with n GbE links in the PM/Ethernet-HXB communication layer
- Aggregated bandwidth with 6 GbE links is much higher than MyrinetXP and comparable with InfinibandSX
- High performance with low-latency software technology by PM/Ethernet-HXB
- Trunk network technology is applicable to a wide variety of networks, including 10GbE and Infiniband

#### **3-D Simultaneous Burst Transfer Performance**

| Bandwidth/node [MB/s] | 256 node      | 512 node      |
|-----------------------|---------------|---------------|
| average               | 586.8 (78.2%) | 582.0 (77.6%) |
| max.                  | 619.3 (82.6%) | 629.6 (84.0%) |
| min.                  | 559.2 (74.6%) | 434.0 (57.9%) |
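The percentages in parentheses appear to be fractions of a 750 MB/s per-node peak. The sketch below checks that reading. It assumes six GbE links per node at a raw 125 MB/s each, which is an inference from the "6 GbE links" mentioned earlier, not something the table states. The names `PEAK_MB_S` and `measured` are illustrative.

```python
GBE_LINK_MB_S = 125.0  # assumption: 1 Gbit/s raw line rate = 125 MB/s
LINKS_PER_NODE = 6     # assumption: six GbE links driven simultaneously
PEAK_MB_S = GBE_LINK_MB_S * LINKS_PER_NODE  # 750 MB/s

# (bandwidth in MB/s, quoted percentage) copied from the table above.
measured = {
    "256 node": {"average": (586.8, 78.2), "max": (619.3, 82.6), "min": (559.2, 74.6)},
    "512 node": {"average": (582.0, 77.6), "max": (629.6, 84.0), "min": (434.0, 57.9)},
}

deviations = []
for config, rows in measured.items():
    for label, (bandwidth, quoted_pct) in rows.items():
        computed_pct = 100.0 * bandwidth / PEAK_MB_S
        deviations.append(abs(computed_pct - quoted_pct))
        print(f"{config:9s} {label:8s} {bandwidth:6.1f} MB/s "
              f"computed {computed_pct:5.2f}%  quoted {quoted_pct:4.1f}%")

# Largest gap between the computed and quoted percentages, in percentage points.
max_deviation = max(deviations)
```

With this assumed peak, every computed percentage agrees with the quoted one to within 0.1 percentage point.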

### Bisection bandwidth on a single dimension



- Single-sided simultaneous burst transfer between two groups of nodes separated at the half point of one dimension
- MPICH user-level sustained bandwidth
- Bisection bandwidth scales with the number of node pairs and is never degraded
- The same performance on any dimension of the 3-D Hyper-Crossbar network
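The pairing used in this kind of measurement can be sketched as follows. This is a hypothetical illustration of how nodes might be split at the half point of one dimension and matched one-to-one; it is not the benchmark code. The grid shape `(4, 4, 2)`, the per-pair bandwidth, and the function names are all assumptions.

```python
from itertools import product


def bisection_pairs(dims, axis):
    """Pair each node in the lower half along `axis` with the node that has
    the same other coordinates in the upper half (sender, receiver)."""
    half = dims[axis] // 2
    pairs = []
    for coord in product(*(range(d) for d in dims)):
        if coord[axis] < half:
            partner = list(coord)
            partner[axis] += half
            pairs.append((coord, tuple(partner)))
    return pairs


def ideal_bisection_bandwidth(num_pairs, per_pair_mb_s):
    """Ideal aggregate bandwidth if every pair sustains the same rate."""
    return num_pairs * per_pair_mb_s


dims = (4, 4, 2)          # hypothetical 3-D grid, not the actual machine shape
per_pair_mb_s = 100.0     # hypothetical per-pair sustained bandwidth
for axis in range(3):
    pairs = bisection_pairs(dims, axis)
    total = ideal_bisection_bandwidth(len(pairs), per_pair_mb_s)
    print(f"axis {axis}: {len(pairs)} pairs, ideal aggregate {total:.0f} MB/s")
```

In this model every even-sized dimension yields the same number of pairs (half the nodes), so "scales with the number of node pairs and is never degraded" corresponds to the measured aggregate staying on the ideal line for each axis.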

## CPU performance on QCD kernel

| Data<br>buffering | SSE3<br>assembler | SSE3<br>intrinsic | Fortran90 (SSE3 auto-vectorization) | Computation efficiency |
|-------------------|-------------------|-------------------|-------------------------------------|------------------------|
| NO                | 1872.94           | 1910.04           | 1451.25                             | 34%                    |
| YES               | 1619.53           | 1645.64           | 1291.08                             | 29%                    |

- About 30% of peak performance with SSE3 instructions (data buffering for MPI communication included)
- High sustained performance even on an application with heavy memory access
- The best performance comes from SSE3 intrinsics rather than hand-written SSE3 assembler (intrinsics help make the best use of the 16 EM64T XMM registers)
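The efficiency column can be reproduced from the table under two assumptions that the source does not state explicitly: the kernel figures are in MFLOPS, and the per-node peak is 5.6 GFLOPS (the value implied by the Linpack figures, 10.35 TFLOPS at 72.20% over 2560 nodes). The sketch below is a rough check under those assumptions; all names are illustrative.

```python
PEAK_MFLOPS = 5600.0  # assumption: 5.6 GFLOPS/node inferred from the Linpack results

# Kernel performance copied from the table above (units assumed to be MFLOPS).
kernel = {
    "no_buffering": {"assembler": 1872.94, "intrinsic": 1910.04, "fortran90": 1451.25},
    "buffering": {"assembler": 1619.53, "intrinsic": 1645.64, "fortran90": 1291.08},
}


def efficiency_pct(mflops):
    """Fraction of the assumed per-node peak, in percent."""
    return 100.0 * mflops / PEAK_MFLOPS


for mode, row in kernel.items():
    best = max(row, key=row.get)
    print(f"{mode}: best = {best}, {efficiency_pct(row[best]):.1f}% of assumed peak")

# Relative gain of intrinsics over hand-written assembler (no buffering).
intrinsic_gain = kernel["no_buffering"]["intrinsic"] / kernel["no_buffering"]["assembler"] - 1.0
# Relative cost of data buffering for the intrinsic version.
buffering_cost = 1.0 - kernel["buffering"]["intrinsic"] / kernel["no_buffering"]["intrinsic"]
print(f"intrinsic vs assembler: +{100 * intrinsic_gain:.1f}%")
print(f"data buffering cost:    -{100 * buffering_cost:.1f}%")
```

Under these assumptions the intrinsic column rounds to the 34% and 29% shown in the table, which suggests the efficiency column refers to the best (intrinsic) variant in each row.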