The Top500 list has for years focused on peak FLOPS (floating-point operations per second), and companies like HP take great pride in pointing out our 159 systems on the list, so what happens when FLOPS become free? That was one of many intriguing topics discussed by speakers such as Nvidia’s Chief Scientist (and Stanford University professor) Bill Dally at this week’s Salishan Conference on High Speed Computing.
Bill’s talk this morning focused on what he called the new challenge in HPC system design: memory locality. Memory locality is directly tied to power consumption. Accessing nearby on-chip memory, say a processor core’s L1 cache, costs an average of 2 pJ (picojoules). Accessing memory on the far side of the chip, say the L2 cache of a different core, can cost up to 150 pJ, while accessing off-chip memory such as DRAM can cost up to 2 nJ (nanojoules), roughly 1000x the energy of a local on-chip access.
Bill went on to estimate that many HPC systems today spend 1-2 nJ per FLOP. Reaching exascale by the end of the decade will require improvements in both system design and software design to bring that down to about 20 pJ/FLOP. Simple Moore’s-law extrapolation and the anticipated advance to 10 nm process technology will not, by themselves, yield usable exascale systems in this decade. Speaker after speaker echoed the theme that software, as well as the underlying algorithm design, will need to focus more on work done per unit of data movement, i.e. on data locality, than on FLOPS.
So there you have it: FLOPS become free and memory locality becomes king. Today you can still expect to pay more for a server with a 2.5 GHz processor than for one with the same processor running at 2.0 GHz, but who knows, by the end of the decade we may be pricing servers on new memory-locality metrics rather than on FLOPS. For the software developer, though, the message is already clear: consider FLOPS to be free and focus on optimizing for memory locality.