ARM Processors for Scientific Computing

Two years ago the European HPC community surprised many in the HPC world by announcing the Mont-Blanc Project, an approach to energy-efficient high performance computing using ARM processors. Since then, many major server vendors have announced plans for ARM-based servers, including HP’s Moonshot System, typically with energy efficiency as one of the major selling points. However, many in the HPC world have been waiting for the upcoming availability of 64-bit ARM processors before starting to experiment with ARM for HPC applications. That makes the results reported earlier this month by a team at the Large Hadron Collider (LHC) at CERN in Geneva all the more exciting.

According to this article, a team of LHC researchers ported the entire CMS software stack, including 125 external support packages, to an ARMv7 (32-bit) based system. The results: an amazing 4x the events/minute/watt compared to two reference Xeon x86 systems. While many smaller experiments and tests on ARM have been completed, few if any ports of software systems as large as CMS have previously been reported. In fact, the only software the CERN team reported not being able to run on ARM was some Oracle libraries, although they noted that “no standard Grid-capable CMS applications depend on Oracle”.

A number of vendors have announced plans to ship 64-bit ARM processors over the next 12 months, and the availability of those processors should spur ever more HPC work on ARM. At the same time, Intel is not standing still and continues to improve the energy efficiency of the Xeon processor. But ultimately, due to the very laws of physics that engineers at CERN study, the two-socket server is headed to the Computer History Museum. Today’s modern processors use on the order of 20 picojoules (pJ) of energy for a 64-bit floating-point operation. A 256-bit on-die SRAM access uses about 50 pJ. But an off-die link, even an efficient one like you might use to connect the processors in a two-socket server, consumes on the order of 500 pJ. Increasingly, HPC architectures, whose design was for decades dominated by optimizing floating-point performance, will need to focus on minimizing data movement. Future HPC systems are likely to be at the forefront of single-socket server adoption, be they ARM or x86 based, in the years ahead.
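
As a rough back-of-the-envelope illustration of why data movement dominates, here is a tiny C program that plugs in the order-of-magnitude figures above (these are ballpark numbers from the discussion, not measurements of any particular chip):

    #include <stdio.h>

    /* Rough, order-of-magnitude energy costs quoted above, in picojoules. */
    #define PJ_FLOP_64      20.0   /* one 64-bit floating-point operation     */
    #define PJ_SRAM_256     50.0   /* one 256-bit on-die SRAM access          */
    #define PJ_OFFDIE_LINK 500.0   /* moving an operand over an off-die link  */

    int main(void)
    {
        /* Energy of one remote operand fetch, expressed as equivalent
         * floating-point operations and on-die SRAM accesses. */
        printf("one off-die fetch costs ~%.0f FLOPs of energy\n",
               PJ_OFFDIE_LINK / PJ_FLOP_64);
        printf("one off-die fetch costs ~%.0f on-die SRAM accesses\n",
               PJ_OFFDIE_LINK / PJ_SRAM_256);
        return 0;
    }

In other words, fetching a single operand from the other socket costs roughly the energy of 25 floating-point operations, which is why keeping data on-die, and ideally on a single socket, pays off.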


About Marc Hamilton

Marc Hamilton – Vice President, Solutions Architecture and Engineering, NVIDIA. At NVIDIA, the Visual Computing Company, Marc leads the worldwide Solutions Architecture and Engineering team, responsible for working with NVIDIA’s customers and partners to deliver the world’s best end to end solutions for professional visualization and design, high performance computing, and big data analytics. Prior to NVIDIA, Marc worked in the Hyperscale Business Unit within HP’s Enterprise Group where he led the HPC team for the Americas region. Marc spent 16 years at Sun Microsystems in HPC and other sales and marketing executive management roles. Marc also worked at TRW developing HPC applications for the US aerospace and defense industry. He has published a number of technical articles and is the author of the book, “Software Development, Building Reliable Systems”. Marc holds a BS degree in Math and Computer Science from UCLA, an MS degree in Electrical Engineering from USC, and is a graduate of the UCLA Executive Management program.

2 Responses to ARM Processors for Scientific Computing

  1. Glenn K. says:

    Data locality has been a core issue in HPC for a long time now, and I don’t think it will ultimately drive everything to a single-socket solution. Following that line of logic suggests that interconnects and MPI will go away because moving data over a fabric is also too energy expensive, but I don’t think anyone actually believes that will happen. Rather, the nature of high-performing software is evolving to match the architecture, and awareness of data locality is being built into algorithms. Data locality is at the heart of all scalable software and is critical for massively parallel architectures like GPGPUs and MICs.
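
    A minimal sketch of what locality-aware code looks like in practice is a blocked (tiled) matrix multiply, which reuses each small tile while it is still resident in cache instead of streaming every operand from far away (a generic illustration, not code from CMS or any particular HPC package):

        #include <stddef.h>

        #define TILE 64  /* tile edge chosen so a few TILE x TILE blocks fit in cache */

        /* C += A * B for n x n row-major matrices, blocked for data locality:
         * each small tile of A, B and C is reused many times while it is still
         * in cache, so far fewer operands have to travel from main memory. */
        void matmul_blocked(size_t n, const double *A, const double *B, double *C)
        {
            for (size_t ii = 0; ii < n; ii += TILE)
                for (size_t kk = 0; kk < n; kk += TILE)
                    for (size_t jj = 0; jj < n; jj += TILE)
                        for (size_t i = ii; i < ii + TILE && i < n; i++)
                            for (size_t k = kk; k < kk + TILE && k < n; k++) {
                                double a = A[i * n + k];
                                for (size_t j = jj; j < jj + TILE && j < n; j++)
                                    C[i * n + j] += a * B[k * n + j];
                            }
        }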

    I don’t think 64-bit ARM is the only thing holding back ARM for HPC, either. If you look at all high-performance processors, the commonality is the presence of a wide vector unit. AVX2 has 256-bit wide units (with FMA), MICs are 512-bit wide (+FMA), and even CUDA schedules in 32-thread warps that behave like a vector unit. All someone has to do is strap a similarly large vector unit onto an ARM core and it’ll be (arguably) suitable for HPC. Anything less than that, though, and I don’t think it will be able to post the numbers required to be taken seriously.
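
    To make the width argument concrete, here is the kind of fused multiply-add loop those vector units exist to run; the comment block notes roughly how much work each unit retires per vector instruction, assuming the compiler vectorizes the loop (a generic sketch, not tied to any one vendor’s intrinsics):

        #include <stddef.h>

        /* y[i] += a * x[i]: the kind of inner loop a compiler maps onto the
         * FMA/vector unit.  Per vector FMA instruction, roughly:
         *   256-bit AVX2 + FMA: 4 doubles (8 floats)
         *   512-bit MIC:        8 doubles (16 floats)
         *   128-bit NEON:       2 doubles on 64-bit ARM; 32-bit ARMv7 NEON
         *                       has no double-precision SIMD at all.
         * All else being equal, the narrower unit needs several times as many
         * instructions (and cycles) for the same arithmetic. */
        void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
        {
            for (size_t i = 0; i < n; i++)
                y[i] += a * x[i];
        }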

  2. Glenn, I agree with you 100% that serious HPC work will probably not get done by ARM instruction-set processors alone. Your idea of strapping a large vector unit like a GPU onto an ARM chip is exactly what many vendors are pursuing. The ARM chip used by CERN actually contained a quad-core ARM Mali-400 GPU, although the article states it wasn’t used in the work described. Nvidia has discussed Project Denver, which will bring together ARM and GPU cores, and in fact Nvidia’s Tegra 4 mobile processor already contains ARM and GPU cores on the same die. But ultimately, since many HPC apps are 64-bit and 64-bit ARM is on the horizon, most commercial software vendors are simply not willing to port to 32-bit ARM today and then have to port again to 64-bit ARM in a few months.
    As far as interconnects and MPI go, no, I don’t think they will go away, and single-socket doesn’t imply they will. What will happen is that by the end of the decade interconnects will transition from electrical to optical, precisely because of the need for energy savings. This will require further advances in silicon photonics, which a host of companies, from Intel to HP to many startups, are working on. One can argue about when silicon photonics will become common in HPC, but few argue that it won’t happen.
