Last week’s Top500 list crowned the Riken K system the world’s fastest supercomputer, with over 8 Petaflops of performance. Quite a few articles were written about K, and Fujitsu had one of the K racks in their booth on the ISC show floor. The full system is housed in 672 computer racks equipped with a current total of 68,544 CPU chips. K is a wonderful achievement, albeit one built on a very specialized architecture, from its custom SPARC CPUs to its custom interconnect to its custom water-cooled system boards. It started me thinking about what a 10 PF HP supercomputer would look like if built out of our ProLiant SL390s 4U server and Nvidia’s M2090 GPU.
Before doing the math, however, let’s take a look at what sort of real science is being done today by GPU-equipped SL390s servers on Tokyo Tech’s TSUBAME2.0 system, currently the world’s 5th fastest supercomputer.
Weather forecasting has always been a big consumer of supercomputer technology. Today’s weather forecasting systems face ever-increasing demands, driven both by growing interest in global climate change research and by an increasingly instant-on weather consumer who expects ever more accurate weather alerts delivered to their smartphone in real time. On TSUBAME2.0, researchers have used 3990 Tesla GPUs to run the ASUCA weather model and achieve an 80x performance boost.
Back on the ground, TSUBAME2.0 researchers have also been using 1024 Tesla GPUs to solve the Navier-Stokes equations for computational fluid dynamics (CFD) simulations like the one pictured below.
Of course, if you are pedaling that fast, your heart rate is bound to be elevated, and that is going to change your blood flow. TSUBAME2.0 can use GPUs to model that too; in fact, 4000 Tesla GPUs were used to produce this blood flow simulation.
Of course, at the same time that these scientific breakthroughs were being enabled by TSUBAME2.0, the system also provided invaluable real-world feedback to HP and Nvidia on how to improve future GPU-accelerated systems. Nvidia customer surveys of CUDA users showed that users modifying only 1-5% of their code were able to achieve a 2x application speedup with 3 man-months of effort. With 6 man-months of effort, the speedup increased to over 8x. CUDA 4.0, released earlier this year, added many ease-of-use enhancements, making it simpler than ever to program GPUs.
CUDA 4.0 hasn’t stood still on performance either. New functionality simplifies overlapping data transfers and compute, and GPUDirect gets data directly into the GPU from the InfiniBand interface without having to first pass through the host CPU. These features have been used extensively by developers in the oil and gas industry, among others, to maximize performance of the new HP ProLiant SL390s 4U with eight Tesla GPUs.
Now with a 10 Petaflop system, researchers could not only do all of the above, but also start to run even more complicated models to answer ever more challenging problems. So let’s take a look at what a 10 Petaflop system would look like.
The HP ProLiant SL390s 4U with two Intel X5675 CPUs and eight Nvidia M2090 GPUs provides a peak performance of 5.466 TF, packing 109.32 TF (20 servers) into 40RU of industry-standard rack space. That means a 10 PF peak system would need 1830 SL390s 4U servers, about 92 racks of computer gear, using technology you can purchase today. If we assume the same 41% Linpack efficiency achieved on TSUBAME2.0, a system sustaining 10 PF on Linpack would require 4464 servers with 35,712 Nvidia M2090 GPUs, still only 223 racks. A dual-rail, fully non-blocking QDR InfiniBand fabric for this cluster would require sixteen 648-port InfiniBand switches, which would take up 16 racks, for a grand total of 239 racks. Throw in another rack for management servers and call it 240 racks even. Don’t have room for 240 high-density racks in your data center? No problem: six HP EcoPODs will house your 10 PF system and still have room to spare for 24 racks of storage.
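For readers who want to rerun the sizing math with their own targets, here is a rough Python sketch of the arithmetic above. The per-server peak, the 41% efficiency, and the switch and management rack counts come straight from the paragraph. The 20 servers per rack is implied by 109.32 TF / 5.466 TF, and the 44 racks per EcoPOD is only inferred from six pods holding 240 racks plus 24 spare, so treat both as assumptions. The function and constant names are made up for illustration.

```python
import math

# Figures quoted in the article
PEAK_TF_PER_SERVER = 5.466    # SL390s 4U: two CPUs + eight M2090 GPUs
GPUS_PER_SERVER = 8
LINPACK_EFFICIENCY = 0.41     # efficiency the article assumes from TSUBAME2.0
SWITCH_RACKS = 16             # sixteen 648-port InfiniBand switches
MGMT_RACKS = 1

# Assumptions derived from the article's numbers, not stated outright
SERVERS_PER_RACK = 20         # 109.32 TF per rack / 5.466 TF per server
RACKS_PER_ECOPOD = 44         # six pods = 240 racks + 24 spare


def size_cluster(target_pf, efficiency=1.0):
    """Round up servers and compute racks needed to reach target_pf petaflops."""
    target_tf = target_pf * 1000
    servers = math.ceil(target_tf / (PEAK_TF_PER_SERVER * efficiency))
    return {
        "servers": servers,
        "gpus": servers * GPUS_PER_SERVER,
        "compute_racks": math.ceil(servers / SERVERS_PER_RACK),
    }


peak = size_cluster(10)                           # 10 PF peak
sustained = size_cluster(10, LINPACK_EFFICIENCY)  # 10 PF Linpack sustained

total_racks = sustained["compute_racks"] + SWITCH_RACKS + MGMT_RACKS
ecopods = math.ceil(total_racks / RACKS_PER_ECOPOD)
spare_racks = ecopods * RACKS_PER_ECOPOD - total_racks

print("peak:", peak)
print("sustained:", sustained)
print("total racks:", total_racks, "EcoPODs:", ecopods, "spare:", spare_racks)
```

Rounding up strictly at every step, the sketch reproduces the 1830 servers and 92 racks for the peak case, and lands within one server and one rack of the sustained figures quoted above, which appear to round slightly differently. Either way, the answer is still six EcoPODs.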
As for cost, let’s just say it would be a fraction of the reported $1B spent to build the K system. What is perhaps more important, if you ordered a system today, you could very well have the fastest supercomputer in the world come the unveiling of the November 2011 Top500 list. There have, of course, been some other well-publicized contenders aiming to hit the 10 PF mark, using other vendors’ proprietary CPU and/or interconnect technologies. Like the Riken K system, I’m sure all the systems that make it to the top of the Top500 list provide some scientific value. The question is, at what cost, and how does the science those systems hope to accomplish compare with the breadth of real-life scientific work being done today by HP ProLiant SL390s and other GPU-accelerated systems? It comes as no surprise that Nvidia GPUs already power the fastest supercomputers in China, India, Italy, Russia, and Spain.
Don’t have the power to run a 10 PF system? Don’t worry: HP can deliver a system exceeding TSUBAME2.0’s 5th place Top500 ranking using a single EcoPOD and less than 1.5 MW of power. That is likely to still be good for a Top10 ranking come SC11 in Seattle this November.