The end of the year is always a good time to reflect and think about some of the key events that shaped the HPC industry. Here is my take:
#1 Intel Xeon Phi Co-Processor Launch
At the SC12 show in November, Intel officially introduced the Xeon Phi Co-Processor. This of course came as no surprise, as Intel has been talking up this chip since at least the ISC’11 show in June 2011, when it was referred to as MIC, or Many Integrated Core. The promise of the Xeon Phi co-processor is to deliver better performance/watt/$ for HPC applications than traditional x86 processors while staying close enough in architecture to Xeon that you do not have to rewrite your applications. For HPC customers, perhaps the best thing about Xeon Phi is captured by the old adage, “a rising tide lifts all boats”. With chip giant Intel bringing a co-processor to market, mainstream HPC developers will now start to think seriously about how to structure their code to express the parallelism in their algorithms. Ultimately the performance of plain old x86, Xeon Phi, Nvidia GPU, AMD Fusion, or other multi-core processors will improve both as developers write more parallel code and as compiler writers, including the very good bunch at Intel, improve the capabilities of their compilers. Early deployments of Xeon Phi, from the TACC Stampede system to HP workstations running animation software at DreamWorks, have received glowing praise from their users, and no doubt many more customers will want to try out the Xeon Phi co-processor as it becomes more widely available in 2013.
#2 Nvidia Kepler
Given my choice of #1, my #2 should come as no surprise. But I list Kepler for different reasons. Nvidia is not new to this space: their Tesla GPU family has been shipping for about five years, and the CUDA programming environment has been around for the same amount of time. In addition, the OpenACC standard means multiple OpenACC compiler writers are now helping you get performance/watt/$ out of Nvidia GPUs without having to re-write parallel sections of your code in CUDA. The reason Kepler rises to #2 on my list is the advanced architectural features of Nvidia’s latest GPU family, which significantly improve the efficiency and performance/watt of Kepler GPUs compared to their Fermi predecessors. At the top of the list of key Kepler features is the updated SMX building block. SMX is the Lego-like building block of Kepler GPUs, and in part by reducing clock frequency (which is somewhat counter-intuitive for an HPC chip) Nvidia was able to pack 192 cores into an SMX block vs 32 in earlier Fermi chips, leading to a 3x double precision FLOPS performance increase compared to Fermi (1.22 TF DGEMM on K20X vs 0.40 TF on M2090).
#3 Nvidia Kepler Hyper-Q
Hyper-Q is not a product but a key architectural feature of Nvidia’s Kepler GPUs. Earlier Fermi GPUs could handle only a single work queue, for instance an MPI task, at once. Hyper-Q lets Kepler GPUs handle 32 simultaneous work queues. This doesn’t speed up a single work queue but lets the GPU get a lot more work done overall in most real-world applications.
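For the coders in the audience, here is a minimal sketch of what feeding a GPU several independent work queues can look like. It uses CUDA streams inside a single process as a stand-in for the work queues described above, which is an assumption made for illustration (the MPI case involves multiple processes). The kernel `add_one` and the helper `run_streams_demo` are hypothetical names, and whether the kernels actually overlap depends on the GPU and on how many resources each kernel uses.

```cuda
// Sketch: submit independent work on many CUDA streams.
// Build (assumed): nvcc streams.cu -o streams
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element of a buffer.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: gives each stream its own buffer and kernel launch,
// then checks the results. Returns true if every element ended up as 1.
bool run_streams_demo(int num_streams, int n) {
    std::vector<cudaStream_t> streams(num_streams);
    std::vector<int *> bufs(num_streams, nullptr);
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(int));
        // Work queued on different streams has no ordering dependency,
        // so the hardware is free to run it concurrently if it can.
        cudaMemsetAsync(bufs[s], 0, n * sizeof(int), streams[s]);
        add_one<<<blocks, threads, 0, streams[s]>>>(bufs[s], n);
    }

    // Wait for all streams to drain before reading results.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> host(n);
    for (int s = 0; s < num_streams; ++s) {
        cudaMemcpy(host.data(), bufs[s], n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) {
            if (host[i] != 1) ok = false;
        }
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return ok;
}

int main() {
    bool ok = run_streams_demo(32, 1024);
    printf("%s\n", ok ? "all streams produced correct results" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The same source runs on Fermi and Kepler; the difference is in how many of those streams the hardware can keep in flight at once, which is why no single queue gets faster but the total work done can go up.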
#4 Nvidia Kepler Dynamic Parallelism
OK, I’m sure I’ll get some complaints for giving Nvidia 3 spots on my top 10 list, but the 3rd major new feature of Kepler, which is just too cool not to mention, is Dynamic Parallelism. Simply put, Dynamic Parallelism lets the GPU make work for itself without going back to the host CPU. To a coder, the best way to think about this is that a CUDA kernel already running on the GPU can launch another kernel without having to go back to the host CPU. For recursive applications like Quicksort, this can lead to a 2x performance increase with 50% less code. So in summary, SMX, Hyper-Q, and Dynamic Parallelism are all architectural features of Nvidia’s Kepler GPUs that help you get more performance/watt/$ out of a fixed number of transistors.
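To make the idea concrete, here is a minimal sketch of a GPU kernel launching more GPU work on its own. It is not the Quicksort case mentioned above, just a toy in which each parent thread launches a child grid to double one segment of an array. The names `parent`, `scale_segment`, and `run_dynamic_parallelism_demo` are hypothetical, and the build line assumes a GPU of compute capability 3.5 or newer with relocatable device code enabled.

```cuda
// Sketch: a parent kernel launches child kernels from the GPU.
// Build (assumed): nvcc -arch=sm_35 -rdc=true dp.cu -o dp -lcudadevrt
// (use a newer -arch value if your toolkit no longer supports sm_35)
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Child kernel: doubles every element of one segment.
__global__ void scale_segment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel: each thread owns one segment and launches a child grid
// for it directly from the device, with no round trip to the host.
__global__ void parent(float *data, int segment_len, int num_segments) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < num_segments) {
        int threads = 128;
        int blocks = (segment_len + threads - 1) / threads;
        scale_segment<<<blocks, threads>>>(data + s * segment_len, segment_len);
    }
}

// Hypothetical host helper: returns true if every element was doubled.
bool run_dynamic_parallelism_demo(int num_segments, int segment_len) {
    const int n = num_segments * segment_len;
    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 32;
    const int blocks = (num_segments + threads - 1) / threads;
    parent<<<blocks, threads>>>(dev, segment_len, num_segments);

    // The parent grid only counts as finished once its child grids finish,
    // so a single host-side synchronize covers all of the work.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f) ok = false;
    }
    return ok;
}

int main() {
    bool ok = run_dynamic_parallelism_demo(8, 256);
    printf("%s\n", ok ? "child kernels doubled every element" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

The host issues exactly one launch here; every other launch happens on the GPU. A recursive algorithm follows the same pattern, with the child kernel launching children of its own.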
#5 NREL Energy Systems Integration Facility
The US Department of Energy’s National Renewable Energy Lab is visited each year by thousands of scientists and engineers from across the country and around the world, who come to learn about everything from advanced photovoltaic materials to biofuel burning engines. That research takes a lot of HPC compute power to carry out, so it seemed only natural to NREL HPC director Steve Hammond to build an energy efficient data center. As the name implies, the Energy Systems Integration Facility, or ESIF for short, isn’t just an energy efficient data center; it is an integration facility for showcasing energy efficiency in every component of the data center, from the high efficiency 480V power distribution to the novel warm water cooling system to the heat recapture and reuse. In 2012, NREL made a $10M award to HP to build a petaflop supercomputer using Intel Xeon x86 processors as well as Xeon Phi co-processors. NREL, HP, and Intel are gaining valuable experience from the new system, which has already started deployment and is scheduled to be completed in 2013. The technologies being perfected at the ESIF promise to save countless megawatts of power, mega-gallons of water, and mega-tons of carbon as they are deployed in the future, not only at high end HPC research centers but across data centers of all shapes and sizes.
#6 Lustre Revival
The 2010 Lustre User Group was held in a beautiful setting at the Seascape Resort in Monterey Bay, California, but most attendees left talking not about the resort but about the changes unfolding with Lustre’s new owner. As Oracle mostly abandoned Lustre, many long-time users wondered what would happen to the open source parallel file system originally developed by Peter Braam and team at Cluster File Systems, which was later acquired by Sun Microsystems and ultimately Oracle. Oracle’s lack of interest in Lustre proved to be one of the best things that ever happened to it. From new companies like Whamcloud formed to offer commercial support (Intel acquired Whamcloud and formed their new High Performance Data Division in 2012) to established storage companies like Xyratex introducing new products like ClusterStor based on Lustre, the Lustre ecosystem is more dynamic and vibrant than ever. Because Lustre is used both in commercial storage solutions from Xyratex, DDN, and others and as open source across an untold number of storage platforms, it is hard to get good statistics on Lustre use, but it is hard to believe that the numbers will go anywhere but up in 2013, especially with its new backing from Intel’s High Performance Data Division.
#7 PCIeGen3
While a few pre-production PCIeGen3 based systems, like the Purdue Carter cluster based on HP SL230s servers and Mellanox FDR InfiniBand, did ship in 2011, the official launch of PCIeGen3 systems supported by Intel’s Sandy Bridge processor came in 2012. PCIeGen3 represents not just a speed bump from earlier Gen2 systems but a fundamental architectural change. Earlier x86 systems connected all I/O through an I/O hub and then on to the processor. Starting with Gen3, PCIe is handled directly by the CPU and the I/O hub is relegated to handling a few legacy low speed devices. Certain types of HPC applications can now be optimized on a two-socket server if the server architecture supports the right type of independent PCIe lanes going to each of the CPUs. For instance, some applications benefit from independent network interfaces going to each CPU. That works great as long as your server supports it. It has been a steep learning curve for the entire PCIeGen3 ecosystem, so look for improvements and additional PCIeGen3 devices in 2013.
#8 Intel Networking Acquisitions
No doubt spurred on in part by lessons learned in PCIeGen3-land (see #7, above), Intel in 2012 acquired networking assets from both QLogic and Cray. The march of networking closer and closer to the CPU, as witnessed by the PCIe networking on-ramp moving into the CPU in Gen3, continues. Intel has not said much in public about their networking plans, but you don’t need to be an HPC expert to make guesses. Already some ARM processor vendors like Calxeda are shipping SoCs (systems on chip) with built-in networking, and that trend is likely to continue climbing up the processor ladder.
#9 HPC in the Cloud
No top 10 list can be complete this year without some mention of cloud. While Amazon Web Services has offered Cluster and GPU compute instances for some time, Windows Azure arguably took the lead at SC12 with their new HPC instances, sporting up to 16 cores and 120 GB RAM and, unlike AWS, even supporting InfiniBand. There are not a lot of Top500 systems running Windows, so I’m excited to see Microsoft making such an effort in the HPC space. Just like Intel with Xeon Phi, when Microsoft gets serious and puts their weight behind a technology, good things usually follow. I don’t expect we will see HPC centers flocking to replace their beloved Linux with Windows, but with its scale, Azure HPC promises to be a growing and driving force on the HPC landscape in 2013.
#10 US Reawakens To The Need for HPC
I saved the most controversial topic for last. Simply measured by the Top500 list, the US retook the supercomputer crown with DOE’s Titan system at Oak Ridge National Laboratory. Equally interesting was the fact that NCSA decided not to list their new Blue Waters supercomputer on the Top500 list, and this has been much discussed in the press. While Titan for now holds the Top500 crown, the Blue Waters Lustre file system is rumored to be significantly faster than Titan’s file system, and in fact ORNL is in the middle of a procurement as the year ends for their own 1 TB/sec Lustre file system. The growth of HPC in the US this year, however, went far beyond national research labs. From animating features at DreamWorks to looking for oil in the Gulf of Mexico to improving the quality and time to market of new cars at Ford, industry statistics show an increase in HPC spending, despite the less than certain economic times. From researching global climate change to helping produce new life-saving drugs to improving product quality and time to market, organizations are finding they simply can’t afford not to invest in HPC. That is making a difference in the world we live in.
Happy New Year!