PACS-CS computer
Table 1  Design specifications of PACS-CS

Theoretical peak speed      14.3 TFLOPS(*) (64-bit double precision, 2560 nodes)
Total main memory           5 TByte (2 GByte x 2560 nodes)
Total distributed disk      0.41 PByte
Node
  Processor                 Intel Low Voltage Xeon 2.8 GHz
  Memory                    2 GByte DDR2/400
  FSB                       800 MHz
  Local disk                160 GByte x 2 (RAID-1 mirror)
Network
  Topology                  3-dimensional Hyper-Crossbar (16 x 16 x 10)
  Data transfer bandwidth   250 MByte/sec (per direction), 750 MByte/sec (node aggregate on 3-D)
  Method                    Specially developed driver
Shared disk storage         10 TByte RAID-5 disk
External connection         Trunked Gigabit Ethernet, theoretical peak bandwidth 250 MByte/sec
System size                 59 racks / 100 m^2 footprint
Power dissipation           545 kW maximum

(*) TFLOPS: a unit of one trillion floating point operations per second.
Design considerations of PACS-CS
In Table 1 we list the current design specifications of the PACS-CS
computer. For a variety of reasons, including cost and the need to keep
the development period as short as possible, we have decided to pursue a
commodity approach in building the next computer system for our applications.
Standard clusters, which have become widely available over the
last several years, are nevertheless unsatisfactory on the following
points:
(i) with the standard n-way SMP configuration, in which each node is
equipped with two or more processors, the memory-to-processor bandwidth
is very low, so that only a fraction of the peak performance is actually
realized; and similarly
(ii) the tree-type interconnect with a small number of large switches is
problematic in that the bandwidth is too low if inexpensive Gigabit
Ethernet is used, while the cost, particularly that of the switches,
increases rapidly if a faster SAN interconnect such as Myrinet or
InfiniBand is adopted.
Our choice to resolve these issues has led to the current PACS-CS
design, which is essentially an MPP system built out of commodity
components. We equip each node with a single processor and connect it
to the main memory with the fastest bus available, matched to the I/O
speed. Specifically, we choose FSB800 for the bus and two banks of
PC3200 DDR2/400 memory, which yields a memory-to-processor bandwidth of
6.4 GByte/s. For the processor we use the Intel Low Voltage Xeon rather
than the Pentium 4 series, since it offers better error checking and
correction as well as a wider choice of chipsets able to carry enough
I/O bandwidth. The CPU clock frequency is 2.8 GHz (5.6 Gflops peak),
which is not the highest available today, since higher clock rates would
not deliver higher effective performance; the low-voltage version is
chosen to suppress power consumption.
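As a cross-check of the figures quoted above and in Table 1 (the factor of
two flops per cycle is implied by the 2.8 GHz clock yielding 5.6 Gflops per node):

\[
2.8\,\mathrm{GHz} \times 2\ \mathrm{flop/cycle} = 5.6\,\mathrm{Gflops/node}, \qquad
5.6\,\mathrm{Gflops} \times 2560\ \mathrm{nodes} = 14.336\,\mathrm{Tflops} \approx 14.3\,\mathrm{TFLOPS},
\]
\[
2\ \mathrm{banks} \times 3.2\,\mathrm{GByte/s}\ (\mathrm{PC3200\ DDR2/400}) = 6.4\,\mathrm{GByte/s}.
\]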
For the interconnection network, we adopt the three-dimensional
Hyper-Crossbar Network used for the CP-PACS computer. This network
provides a versatile 3-dimensional connection, much more flexible than a
mesh, using medium-sized switches, albeit large in number, and so helps
reduce the switch cost. This topology is suitable for problems whose
physical models are defined on an orthogonal space of any dimension and
require nearest-neighbor communication. In addition, it provides a
moderate bandwidth even for random traffic.
Fig.1  3-dimensional Hyper-Crossbar Network of PACS-CS
In Fig.1 we show a schematic view of the Hyper-Crossbar Network of the
PACS-CS system. The actual system has 2560 nodes arranged in a 16 x 16 x 10
orthogonal space. The network link in each direction consists of dual
Gigabit Ethernet. With two Ethernet links trunked together, the peak
bandwidth is 250 MByte/s per direction in each of the three dimensions,
and the aggregate bandwidth of each node sums to 750 MByte/s per direction.
This is competitive with high-throughput interconnects such as
InfiniBand, which provides a bandwidth of 1 GByte/s.
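To make the switch arrangement concrete, the sketch below (plain C with
illustrative names; it is not part of the PACS-CS software) shows how a node
at coordinates (x, y, z) of the 16 x 16 x 10 array would attach to one
crossbar switch per dimension, and how the per-node bandwidth figures add up.

```c
#include <stdio.h>

/* Machine dimensions of the 2560-node PACS-CS array (16 x 16 x 10). */
#define NX 16
#define NY 16
#define NZ 10

/* Peak bandwidth of one trunked link:
 * 2 x Gigabit Ethernet = 2 x 125 MByte/s = 250 MByte/s per direction. */
#define LINK_MBYTE_PER_S 250

/* A node is identified by its coordinates in the 3-D array. */
struct node { int x, y, z; };

/* In a hyper-crossbar network, all nodes sharing the same (y, z) sit on
 * one X-direction crossbar, and similarly for the Y and Z directions.
 * These functions return the index of the switch a node attaches to in
 * each dimension (the indexing scheme itself is illustrative).          */
int x_switch(struct node n) { return n.y * NZ + n.z; }  /* NY*NZ switches, each spanning NX nodes */
int y_switch(struct node n) { return n.x * NZ + n.z; }  /* NX*NZ switches, each spanning NY nodes */
int z_switch(struct node n) { return n.x * NY + n.y; }  /* NX*NY switches, each spanning NZ nodes */

int main(void)
{
    struct node n = { 3, 7, 5 };   /* an arbitrary node */
    printf("node (%d,%d,%d): X-switch %d, Y-switch %d, Z-switch %d\n",
           n.x, n.y, n.z, x_switch(n), y_switch(n), z_switch(n));
    printf("per-node aggregate bandwidth: %d MByte/s per direction\n",
           3 * LINK_MBYTE_PER_S);
    return 0;
}
```

Two nodes that differ in only one coordinate are reachable through a single
crossbar; an arbitrary pair is reachable in at most three hops, one per dimension.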
In order to keep the packaging density equal to that of the standard 2-way
SMP node, which is a typical configuration of today's HPC clusters, we put
two nodes on a single 1U board. Each node therefore has to have at least six
Gigabit Ethernet ports. In fact we provide two more ports, one for generic
I/O and the other for system diagnostics and control. In addition, the
chipset and supporting I/O bridge chips are carefully arranged so that
sufficient bandwidth is ensured between memory and each of the eight
Gigabit Ethernet interfaces. A schematic view of our mother board is
given in Fig.2.
Fig.2 Schematic view of board design
For temporary data storage, each node is equipped with 160 GByte of
disk space, duplicated in RAID-1 mirroring mode.
External I/O is carried over a three-stage Gigabit Ethernet tree in which
the inter-stage connection bandwidth is enhanced by link aggregation
(standard LACP). The file server at the top of the tree network, which
provides the shared disk storage, is connected to a 10 TByte RAID-5 disk.
The operating system of PACS-CS is Linux, and SCore is used as the cluster
middleware to manage the 2560 nodes. We need to develop a driver for
data communication between nodes over the Hyper-Crossbar Network; this
work is in progress, based on the PMv2 driver available from SCore.
The programming languages are Fortran90, C, and C++. Programming
for large-scale parallel processing is done explicitly with MPI, specially
tuned for the Hyper-Crossbar Network explained above.
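As an illustration only (a minimal sketch using standard MPI calls, not the
tuned PACS-CS library itself; the process grid, tags, and exchanged data are
assumptions), this is how an application could express nearest-neighbor
communication on the 16 x 16 x 10 machine shape:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Map the processes onto the 16 x 16 x 10 machine shape.
     * Run with 2560 processes, or shrink dims[] for a small test. */
    int dims[3] = { 16, 16, 10 };
    int periods[3] = { 1, 1, 1 };        /* periodic boundaries, as in lattice QCD */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int rank, coords[3];
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);
    printf("rank %d sits at (%d,%d,%d)\n", rank, coords[0], coords[1], coords[2]);

    /* Nearest-neighbor exchange in each of the three dimensions;
     * every such hop stays within a single crossbar of the network. */
    double send = (double)rank, recv[6];
    for (int dim = 0; dim < 3; dim++) {
        int lo, hi;                       /* neighbors in the -/+ direction */
        MPI_Cart_shift(cart, dim, 1, &lo, &hi);
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, hi, 0,
                     &recv[2*dim],   1, MPI_DOUBLE, lo, 0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, lo, 1,
                     &recv[2*dim+1], 1, MPI_DOUBLE, hi, 1, cart, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```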
The system is housed in standard 42U 19-inch racks.
Two processor nodes are placed in a 1U chassis, and 32 such chassis are mounted in a rack,
so 40 racks are needed for the planned 2560-node system.
48-port Layer-2 Gigabit Ethernet switches are used for the data network
(Hyper-Crossbar Network) and generic I/O (tree) to realize the
high-density implementation; 18 racks will be assigned to the collection of
these switches. An additional rack is required to house a front-end
server, three SCore management servers, and a file server with 10 TByte of
RAID-5 disk.
In total the full system will consist of 59 racks
spread over an area of about 100 square meters.
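The rack count follows directly from the packaging described above:

\[
2\ \text{nodes/chassis} \times 32\ \text{chassis/rack} = 64\ \text{nodes/rack}, \qquad
2560 / 64 = 40\ \text{node racks},
\]
\[
40 + 18\ \text{(switch racks)} + 1\ \text{(server rack)} = 59\ \text{racks}.
\]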
The estimated power dissipation is 545 kW when the system is in full
operation.
Fig.3 PACS-CS floor layout