PACS-CS computer

Table 1 Design specifications of PACS-CS

Theoretical peak speed:  14.3 TFLOPS(*) (64-bit double precision, 2560 nodes)
Total main memory:       5.0 TByte
Total distributed disk:  0.41 PByte

Node
  Processor:   Intel Low Voltage Xeon, 2.8 GHz
  Memory:      2 GByte DDR2/400
  FSB:         800 MHz
  Local disk:  160 GByte x 2 (RAID-1 mirror)

Network
  Topology:                 3-dimensional Hyper-Crossbar (16x16x10)
  Data transfer bandwidth:  250 MByte/sec per direction, 750 MByte/sec node aggregate over the three dimensions
  Method:                   specially developed driver

Shared disk storage:  10 TByte RAID-5 disk
External connection:  trunked Gigabit Ethernet, theoretical peak bandwidth 250 MByte/sec

System size:        59 racks / 100 m^2 footprint
Power dissipation:  545 kW maximum

(*) TFLOPS: a unit of one trillion floating-point operations per second.


Design considerations of PACS-CS

In Table 1 we list the current design specifications of the PACS-CS computer. For a variety of reasons, including cost and the need to keep the development period as short as possible, we have decided to pursue a commodity approach in building the next computer system for our applications. Standard clusters, which have become widely available over the last several years, are nevertheless unsatisfactory in the following respects:

(i) with the standard n-way SMP configuration, in which each node is equipped with two or more processors, the memory-to-processor bandwidth is so low that only a fraction of the peak performance is actually realized; and

(ii) the tree-type interconnect built from a small number of large switches is problematic in that the bandwidth is too low if inexpensive Gigabit Ethernet is used, while the cost, particularly that of the switches, increases rapidly if a faster SAN interconnect such as Myrinet or InfiniBand is adopted.

Our choice for resolving these issues has led to the current PACS-CS design, which is essentially an MPP system built out of commodity components. We equip each node with a single processor and connect it to the main memory with the fastest available bus at a matched I/O speed. Specifically, we choose FSB800 for the bus and two banks of PC3200 DDR2/400 memory, which gives a memory-to-processor bandwidth of 6.4 GByte/s. For the processor we use the Intel Low Voltage Xeon rather than the Pentium 4 series, because it offers better error checking and correction as well as a wider variety of chipsets with sufficient I/O bandwidth. The CPU clock frequency is 2.8 GHz (5.6 GFLOPS peak), which is not the highest available today, since higher clock rates would not translate into higher effective performance; the low-voltage version is chosen to suppress power consumption.
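As a quick consistency check of these figures, the following minimal C sketch reproduces the per-node peak, the system peak, and the memory-to-processor bandwidth from the quoted clock and bus parameters; the factor of two floating-point operations per cycle is inferred from the stated 5.6 GFLOPS peak at 2.8 GHz.

#include <stdio.h>

int main(void)
{
    const double clock_ghz     = 2.8;    /* CPU clock                              */
    const double flops_per_clk = 2.0;    /* inferred from 5.6 GFLOPS at 2.8 GHz    */
    const int    nodes         = 2560;

    const double fsb_mhz        = 800.0; /* FSB800                                 */
    const double bus_width_byte = 8.0;   /* 64-bit front-side bus                  */

    double node_gflops = clock_ghz * flops_per_clk;         /*  5.6 GFLOPS         */
    double peak_tflops = node_gflops * nodes / 1000.0;      /* 14.3 TFLOPS         */
    double mem_bw_gbs  = fsb_mhz * bus_width_byte / 1000.0; /*  6.4 GByte/s        */

    printf("per-node peak : %.1f GFLOPS\n", node_gflops);
    printf("system peak   : %.1f TFLOPS\n", peak_tflops);
    printf("memory bw     : %.1f GByte/s\n", mem_bw_gbs);
    return 0;
}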

For the interconnection network we adopt the three-dimensional Hyper-Crossbar Network used for the CP-PACS computer. This network provides a versatile three-dimensional connection, much more flexible than a mesh, using medium-sized switches which, albeit large in number, help reduce the switch cost. The topology suits problems whose physical models live in an orthogonal space of any dimension and require nearest-neighbor communication. It also provides moderate bandwidth even for random traffic.

Fig.1 3-dimensional Hyper-Crossbar Network of PACS-CS

In Fig.1 we show a schematic view of the Hyper-Crossbar Network of the PACS-CS system. The actual system has 2560 nodes arranged in a 16x16x10 orthogonal space. The network link in each direction consists of dual Gigabit Ethernet. With the trunking of the two Ethernet links, the peak bandwidth equals 250 MByte/s per direction for each of the three dimensions, and the aggregate bandwidth of each node over the three dimensions sums to 750 MByte/s. This is competitive with high-throughput interconnects such as InfiniBand, which provides a bandwidth of 1 GByte/s.
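The bandwidth figures follow from simple arithmetic on the trunked Gigabit Ethernet links, and the crossbar structure can be pictured by counting, for each dimension, the nodes that share a switch. The short C sketch below spells this out; the per-dimension crossbar counts assume the usual hyper-crossbar organization in which a node shares its X-direction crossbar with all nodes having the same (y,z) coordinates, and similarly for Y and Z. This indexing is our own illustration, not the actual PACS-CS wiring.

#include <stdio.h>

enum { NX = 16, NY = 16, NZ = 10 };          /* 2560-node array               */

int main(void)
{
    const double gbe_mbyte_s = 125.0;        /* 1 Gbit/s = 125 MByte/s        */
    double per_dir  = 2 * gbe_mbyte_s;       /* dual trunked links: 250       */
    double per_node = 3 * per_dir;           /* three dimensions:   750       */

    printf("per direction : %.0f MByte/s\n", per_dir);
    printf("node aggregate: %.0f MByte/s\n", per_node);
    printf("X-crossbars   : %d (one per (y,z) pair, %d nodes each)\n", NY * NZ, NX);
    printf("Y-crossbars   : %d (one per (x,z) pair, %d nodes each)\n", NX * NZ, NY);
    printf("Z-crossbars   : %d (one per (x,y) pair, %d nodes each)\n", NX * NY, NZ);
    return 0;
}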

In order to keep the packaging density equal to that of the standard 2-way SMP node, a typical configuration of today's HPC clusters, we put two nodes on a single 1U board. Each node must have at least 6 Gigabit Ethernet ports (two trunked links for each of the three network dimensions). In fact we provide two more ports, one for generic I/O and the other for system diagnostics and control. In addition, the chipset and supporting I/O bridge chips are carefully arranged so that sufficient bandwidth is ensured between memory and each of the eight Gigabit Ethernet interfaces. A schematic view of our motherboard is given in Fig.2.

Fig.2 Schematic view of board design


For temporary data storage, each node is equipped with 160 GByte of disk space, duplicated on a second drive operating in RAID-1 mirroring mode. External I/O goes through a three-stage Gigabit Ethernet tree, where the inter-stage connection bandwidth is enhanced by link aggregation (standard LACP). The file server at the top of the tree, which provides the shared disk storage, is connected to a 10 TByte RAID-5 disk array.

The operating system of PACS-CS is Linux, and SCore is used as the cluster middleware to manage the 2560 nodes. We need to develop a driver for inter-node data communication over the Hyper-Crossbar Network; work is in progress based on the PMv2 driver available in SCore.

The programming languages are Fortran90, C, and C++. Large-scale parallel processing is programmed explicitly with MPI, specially tuned for the Hyper-Crossbar Network described above.
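As an illustration of how an application might express nearest-neighbor communication on this machine, the following minimal MPI sketch in C builds a three-dimensional Cartesian communicator matching the 16x16x10 node array and performs one exchange per dimension. It is a generic MPI example, not the tuned PACS-CS implementation; the periodic boundaries and one-element buffers are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[3] = {16, 16, 10};          /* the full 2560-node machine        */
    if (size != 16 * 16 * 10) {          /* allow testing at smaller scale    */
        dims[0] = dims[1] = dims[2] = 0;
        MPI_Dims_create(size, 3, dims);
    }
    int periods[3] = {1, 1, 1};          /* periodic boundaries (assumption)  */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    double send = 1.0, recv = 0.0;
    for (int d = 0; d < 3; ++d) {        /* x, y and z directions             */
        int lo, hi;
        MPI_Cart_shift(cart, d, 1, &lo, &hi);
        /* exchange with the "+1" neighbour; the "-1" side is symmetric       */
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, hi, 0,
                     &recv, 1, MPI_DOUBLE, lo, 0,
                     cart, MPI_STATUS_IGNORE);
    }

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}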

The system is housed in standard 42U 19-inch racks. Two processor nodes are placed in a 1U chassis, and 32 such chassis are placed in a rack, so 40 racks hold the planned 2560-node system. A 48-port Layer-2 Gigabit Ethernet switch is used both for the data network (Hyper-Crossbar Network) and for generic I/O (tree) to realize a high-density implementation; 18 racks are assigned to the collection of these switches. An additional rack houses a front-end server, 3 SCore management servers, and a file server with 10 TByte of RAID-5 disk. In total the full system will consist of 59 racks spread over an area of about 100 square meters.
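The 59-rack figure follows from the packaging arithmetic in the paragraph above, which the small C sketch below spells out; all numbers are taken directly from that paragraph.

#include <stdio.h>

int main(void)
{
    const int nodes             = 2560;
    const int nodes_per_chassis = 2;     /* two nodes per 1U chassis          */
    const int chassis_per_rack  = 32;    /* 32 x 1U chassis per 42U rack      */
    const int switch_racks      = 18;
    const int server_racks      = 1;     /* front-end, SCore, file servers    */

    int nodes_per_rack = nodes_per_chassis * chassis_per_rack;       /* 64    */
    int compute_racks  = nodes / nodes_per_rack;                     /* 40    */
    int total_racks    = compute_racks + switch_racks + server_racks;/* 59    */

    printf("compute racks: %d, total racks: %d\n", compute_racks, total_racks);
    return 0;
}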

The estimated power dissipation is 545 kW when the system is in full operation.

Fig.3 PACS-CS floor layout