## HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication

## Overview of Tightly Coupled Accelerators (TCA) Architecture and HA-PACS/TCA

GPGPU is now widely used to accelerate scientific and engineering computing, improving performance significantly while consuming less power. However, the I/O bandwidth bottleneck causes serious performance degradation in GPGPU computing. In particular, the latency of inter-node GPU communication increases significantly because of the multiple memory copies involved. To solve this problem, TCA (Tightly Coupled Accelerators) enables direct communication among multiple GPUs across computation nodes using PCI Express.

The PEACH2 (PCI Express Adaptive Communication Hub ver. 2) chip was developed and implemented on an FPGA (Field Programmable Gate Array) for flexible control and prototyping. The PEACH2 board was also developed as a PCI Express extension board.

HA-PACS/TCA, an extension of the HA-PACS base cluster, was installed in Oct. 2013 with a PEACH2 board in each node. HA-PACS/TCA is operated together with the HA-PACS base cluster, and the entire HA-PACS system forms a GPU cluster of over 1.1 PFLOPS.

Currently we are developing programming APIs for CUDA programmers and evaluating several benchmarks and applications, such as QUDA (a lattice QCD library), the Himeno benchmark (a 3D Poisson solver), and HPCG. We also have a poster in the technical session, which shows block-stride transfer performance and Himeno benchmark results.

Figure: Block diagram of a computation node of HA-PACS/TCA

Figure: Block diagram of the PEACH2 chip

Figure: Entire HA-PACS system including HA-PACS/TCA (5 racks x 2 rows)

Figure: TCA communication board (PCIe CEM Spec., double height)

## HA-PACS/TCA Specification

| Item | Specification |
|------|---------------|
| Node | CRAY 3623G4-SM |
| Motherboard | SuperMicro X9DRG-QF |
| CPU | Intel Xeon E5-2680 v2<br>(Ivy Bridge 2.8 GHz, 10 core) x 2 socket |
| Main memory | DDR3-1866MHz 4ch. 128 GB (119.4 GB/s) |
| CPU peak performance | 448 GFLOPS/node |
| GPU | NVIDIA Tesla K20X x 4 GPU |
| GPU memory | GDDR5 2600MHz, 6 GB/GPU (250 GB/s/GPU) |
| GPU peak performance | 5.24 TFLOPS/node |
| InfiniBand interconnect | IB QDR x 2 rails (Mellanox ConnectX-3) |
| TCA interconnect | PEACH2 (FPGA: Altera Stratix IV 530GX) |
| Number of nodes | 64 |
| System peak performance | 364 TFLOPS (CPU: 28.7 TF, GPU: 335.3 TF) |
| Linpack performance | 277 TFLOPS (Efficiency: 76%)<br>3.52 GFLOPS/W (3rd on the Nov. 2013 Green500) |

## Basic Performance of TCA Communication

## Acknowledgment

The HA-PACS Project is partially supported by the JST/CREST program entitled "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" in the research area of "Development of System Software Technologies for post-Peta Scale High Performance Computing."
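As a side note on the specification table, its derived figures (per-node CPU peak, memory bandwidth, system peak, and Linpack efficiency) can be cross-checked from the component counts. The Python sketch below is a sanity check, not part of the original poster. It assumes 8 double-precision floating-point operations per cycle per core (AVX on Ivy Bridge), 8 bytes per transfer per DDR3 channel, and a per-GPU peak obtained by dividing the table's per-node GPU figure by four. All function and constant names are hypothetical.

```python
# Cross-check of the derived figures in the HA-PACS/TCA specification table.
# Values marked "assumption" are not stated in the table itself.

NODES = 64
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 10
CLOCK_GHZ = 2.8
FLOPS_PER_CYCLE = 8           # assumption: AVX double precision on Ivy Bridge
DDR3_MTS = 1866               # mega-transfers per second per channel
BYTES_PER_TRANSFER = 8        # assumption: 64-bit DDR3 channel
CHANNELS_PER_SOCKET = 4
GPU_TFLOPS_PER_NODE = 5.24    # from the table (4 x Tesla K20X)
LINPACK_TFLOPS = 277.0        # from the table


def cpu_gflops_per_node() -> float:
    """Peak double-precision CPU performance of one node in GFLOPS."""
    return SOCKETS_PER_NODE * CORES_PER_SOCKET * CLOCK_GHZ * FLOPS_PER_CYCLE


def memory_bandwidth_gbs_per_node() -> float:
    """Aggregate main-memory bandwidth of one node in GB/s."""
    per_socket = DDR3_MTS * BYTES_PER_TRANSFER * CHANNELS_PER_SOCKET / 1000.0
    return per_socket * SOCKETS_PER_NODE


def system_peak_tflops() -> tuple[float, float, float]:
    """Return (CPU, GPU, total) system peak performance in TFLOPS."""
    cpu = NODES * cpu_gflops_per_node() / 1000.0
    gpu = NODES * GPU_TFLOPS_PER_NODE
    return cpu, gpu, cpu + gpu


def linpack_efficiency() -> float:
    """Ratio of measured Linpack performance to system peak."""
    _, _, total = system_peak_tflops()
    return LINPACK_TFLOPS / total


if __name__ == "__main__":
    cpu, gpu, total = system_peak_tflops()
    print(f"CPU peak per node : {cpu_gflops_per_node():.0f} GFLOPS")
    print(f"Memory bandwidth  : {memory_bandwidth_gbs_per_node():.1f} GB/s")
    print(f"System peak       : {total:.0f} TFLOPS (CPU {cpu:.1f}, GPU {gpu:.1f})")
    print(f"Linpack efficiency: {linpack_efficiency():.0%}")
```

Under these assumptions the computed values agree with the table to within rounding, which also confirms that the bare "64" entry is the node count.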