Overview of the Blue Gene/L system architecture
IBM Journal of Research and Development, Mar-May 2005, by A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, et al.
The Blue Gene®/L computer is a massively parallel supercomputer based on IBM system-on-a-chip technology. It is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 teraflops. This paper describes the project objectives and provides an overview of the system architecture that resulted. We discuss our application-based approach and rationale for a low-power, highly integrated design. The key architectural features of Blue Gene/L are introduced in this paper: the link chip component and five Blue Gene/L networks, the PowerPC® 440 core and floating-point enhancements, the on-chip and off-chip distributed memory system, the node- and system-level design for high reliability, and the comprehensive approach to fault isolation.
Introduction
A great gap has existed between the cost/performance ratios of existing supercomputers and those of dedicated application-specific machines. The Blue Gene/L (BG/L) supercomputer was designed to address that gap. The objective was to retain the exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications. The goal of excellent cost/performance meshes nicely with the additional goals of achieving exceptional performance/power and performance/volume ratios.
Rack density can be understood by factoring rack performance as performance/rack = performance/watt × watt/rack. The last term in this expression, watt/rack, is determined by thermal cooling capabilities and can be considered a constant of order 20 kW for an air-cooled rack. Therefore, it is the performance/watt term that determines the rack performance. This clearly illustrates one of the areas in which electrical power is critical to achieving rack density.
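As a rough numerical check, taking the expression to be performance/rack = performance/watt × watt/rack, the short Python sketch below holds watt/rack at the 20 kW figure quoted above and varies only the power efficiency. The function name and the two efficiency values are hypothetical, chosen only to show that a fourfold gain in performance/watt yields a fourfold gain in rack performance; they are not measurements of any real processor.

```python
# Sketch of the rack-density relationship:
#   performance/rack = performance/watt * watt/rack
# The 20 kW rack budget comes from the text; everything else is assumed.

RACK_POWER_WATTS = 20_000.0  # order-of-magnitude limit for an air-cooled rack


def rack_performance_gflops(gflops_per_watt: float,
                            rack_power_watts: float = RACK_POWER_WATTS) -> float:
    """Peak Gflops per rack for a given power efficiency in Gflops/W."""
    return gflops_per_watt * rack_power_watts


# Hypothetical efficiencies (illustrative only, not measured values).
HIGH_POWER_CHIP_GFLOPS_PER_WATT = 0.05
LOW_POWER_CHIP_GFLOPS_PER_WATT = 0.20

if __name__ == "__main__":
    for name, efficiency in [
        ("high-power chip", HIGH_POWER_CHIP_GFLOPS_PER_WATT),
        ("low-power chip", LOW_POWER_CHIP_GFLOPS_PER_WATT),
    ]:
        print(f"{name}: {rack_performance_gflops(efficiency):.0f} Gflops/rack")
```

Because the rack power budget is fixed by cooling, the only way to raise the result in this model is to raise the efficiency argument.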
We have found that in terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10. This is one of the main reasons we chose the low-power design point for BG/L. Figure 1 illustrates the power efficiency of some recent supercomputers. The data is based on total peak floating-point operations per second divided by total system power, when that data is available. If the data is not available, we approximate it using Gflops/chip power.
Using low-power, low-frequency chips succeeds only if the user can achieve more performance by scaling up to a higher number of nodes (processors). Our goal was to address applications that have good scaling behavior because their overall performance is enhanced far more through parallelism than by the marginal gains that can be obtained from much-higher-power, higher-frequency processors.
The importance of low power can be seen in a number of ways. The total power of a 360-Tflops computer based on conventional high-performance processors would exceed 10 megawatts, possibly approaching 20 megawatts. For reference, 10 megawatts is approximately equal to the amount of power used by 11,000 U.S. households [3]. Clearly, this is a fundamental problem that must be addressed. This power problem, while easy to illustrate for a 360-Tflops system, is also of great concern to customers who require high-performance computing at almost all scales. Electrical infrastructure improves slowly and at high cost compared with the computing performance enhancements that have been achieved over the last four decades. Across the industry, technology is leading to further density improvements, but scaling improvements in the power efficiency of computing are slowing dramatically. This portends a difficult future, in which performance gains will have to be made through enhancements in architecture rather than technology. BG/L is an example of one approach to achieving higher performance with an improved power/performance ratio.
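The figures in this paragraph can be turned into two quick back-of-the-envelope numbers. The Python sketch below uses only values quoted in the text (360 Tflops, 10 to 20 megawatts, 11,000 households); the function names are hypothetical and the results are simple divisions, not data from the paper.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the text.

TARGET_PEAK_FLOPS = 360e12        # 360 Tflops
LOW_POWER_ESTIMATE_W = 10e6       # 10 megawatts
HIGH_POWER_ESTIMATE_W = 20e6      # 20 megawatts
HOUSEHOLDS_PER_10_MW = 11_000     # household equivalence cited in the text


def average_household_watts(total_watts: float, households: int) -> float:
    """Average continuous power per household implied by the comparison."""
    return total_watts / households


def implied_mflops_per_watt(peak_flops: float, power_watts: float) -> float:
    """Power efficiency (Mflops/W) of a system with this peak rate and power."""
    return peak_flops / power_watts / 1e6


if __name__ == "__main__":
    per_home = average_household_watts(LOW_POWER_ESTIMATE_W, HOUSEHOLDS_PER_10_MW)
    print(f"Implied average household draw: {per_home:.0f} W")
    for power in (LOW_POWER_ESTIMATE_W, HIGH_POWER_ESTIMATE_W):
        eff = implied_mflops_per_watt(TARGET_PEAK_FLOPS, power)
        print(f"At {power / 1e6:.0f} MW: {eff:.0f} Mflops/W")
```

Under these figures, a conventional 360-Tflops design would deliver on the order of a few tens of Mflops per watt, which is the baseline that a low-power design has to beat.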
A number of challenges had to be overcome to realize good performance using many processors of moderate frequency. These were addressed by assessing the impact on application performance for a representative set of applications. The BG/L networks were designed with extreme scaling in mind. Therefore, we chose networks that scale efficiently in terms of both performance and packaging. The networks support very small messages (as small as 32 bytes) and include hardware support for collective operations (broadcast, reduction, scan, etc.), which will dominate some applications at the scaling limit.
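For readers unfamiliar with the collective operations named above, the following Python sketch shows what broadcast, reduction, and scan compute, using a plain list to stand in for one value per node. It illustrates the semantics only: it is not BG/L code, does not model the hardware networks, and the function names are hypothetical rather than taken from any messaging library.

```python
# Semantics of three collective operations, with a list standing in for
# "one value held by each node". Purely illustrative; no real network is used.
import operator
from functools import reduce
from itertools import accumulate


def broadcast(root_value, num_nodes):
    """Every node ends up with a copy of the root node's value."""
    return [root_value] * num_nodes


def all_reduce(values, op=operator.add):
    """Combine all nodes' values with op; every node receives the result."""
    combined = reduce(op, values)
    return [combined] * len(values)


def scan(values, op=operator.add):
    """Inclusive prefix: node i receives op applied over nodes 0..i."""
    return list(accumulate(values, op))


if __name__ == "__main__":
    per_node = [1, 2, 3, 4]
    print(broadcast(7, len(per_node)))
    print(all_reduce(per_node))
    print(all_reduce(per_node, max))
    print(scan(per_node))
```

Each of these operations involves every node, which is why their cost becomes significant as the node count grows and why hardware support for them matters at the scaling limit.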
The other critical issue for achieving an unprecedented level of scaling is the reliability, availability, and serviceability (RAS) architecture and support. A great deal of focus was placed on RAS support for BG/L so that it would be a reliable and usable machine, even at extreme scaling limits. Dealing with the sheer scale of supercomputers, whether based on clusters or on custom solutions, has long been one of the most difficult challenges for the entire industry and is likely to become more difficult as the scale of these systems grows. Since we were developing BG/L at the application-specific integrated circuit (ASIC) level, we were able to integrate many features typically found only on high-performance servers. This is an area in which BG/L can be clearly differentiated from commodity cluster solutions based on nodes that were not designed to reach the levels of scalability of supercomputers and therefore do not have the necessary RAS support for extreme scaling.