Today marks the long-anticipated deadline to submit proposals for the DOE CORAL procurement. The CORAL systems (DOE plans to purchase three systems from two vendors) mark a major milestone on the path to Exascale, both on the technology front and from a business model perspective. Planned for deployment in 2017 and likely to achieve over 100 petaFLOP/s (100 x 10^15 double-precision floating point operations per second) of peak performance at a peak power of 20 megawatts, the CORAL systems are expected to be precursors of future DOE Exascale systems.
The CORAL Statement of Work (SOW) gives some clues as to what to expect from these systems. First off, the performance requirements for the systems are stated in terms of a series of scalable science benchmarks and another set of throughput benchmarks. Scalable science benchmarks are full applications expected to scale to a large fraction of the CORAL system. Throughput benchmarks represent particular subsets of applications that are expected to be part of the everyday workload of science applications. The minimum requirements of some of these codes are such that it would be nearly impossible to deliver the required performance within the 20 MW power limit with expected “Moore’s Law” scaling of any existing multi-core processor (e.g., Power, x86, ARM). Some sort of accelerated computing component, such as GPUs, is likely to be part of every vendor’s submission. Using the 100 PF peak performance number, 20 MW translates into 200 picojoules per floating point operation. While NVIDIA’s current top-of-the-line K40 GPU uses slightly less than 200 picojoules per peak FLOP today, that figure doesn’t take into account the other system components or the application overhead that must be factored into the CORAL 20 MW requirement. But rest assured, we do have performance/watt improvements in our roadmap!
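To make the 200 picojoule figure concrete, here is a minimal sketch of the arithmetic in Python. It simply divides the 20 MW peak power by the 100 PF peak rate quoted above; the function and variable names are my own, chosen for illustration, and the result is a peak-rate budget for the whole system, not a measured number for any particular processor.

```python
def energy_per_flop_pj(power_watts: float, flops_per_second: float) -> float:
    """Return the energy available per floating point operation, in picojoules."""
    joules_per_flop = power_watts / flops_per_second  # watts are joules per second
    return joules_per_flop * 1e12                     # 1 joule = 10^12 picojoules


# Figures quoted in the text: 20 MW peak power, 100 PF peak performance.
coral_power_watts = 20e6
coral_peak_flops = 100e15

coral_budget_pj = energy_per_flop_pj(coral_power_watts, coral_peak_flops)
print(f"{coral_budget_pj:.0f} pJ per peak FLOP")
```

Because this budget has to cover memory, interconnect, storage and cooling overheads as well as the arithmetic units, the energy left for the processors themselves is smaller than the number printed here.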
Another key requirement of CORAL is 4 PB of total memory, along with an additional requirement for a minimum of 1 GB per MPI task. In fact, optimizing the CORAL system architectures around memory and memory movement is likely to be an even more critical challenge than the peak FLOP power requirements. Dynamic Parallelism is an example of how NVIDIA GPUs today help minimize data movement within memory. Dynamic Parallelism enables GPU threads to automatically spawn new threads. By adapting to the data without going back to the CPU, this greatly simplifies parallel programming and can help minimize power-intensive and unnecessary data movement. At the software level, Unified Memory, which I discussed in my last post, opens up the path for future hardware features that conserve power by minimizing data movement, some of which will no doubt be implemented in time for CORAL deployments.
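As a rough illustration of how the two memory figures relate, the sketch below computes how many MPI tasks 4 PB could hold if every task received exactly the 1 GB minimum. This is only my back-of-the-envelope reading, not something the SOW states: it assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and ignores memory reserved for the operating system or other uses, and the names are hypothetical.

```python
# Assumption: decimal units; the text does not say whether PB and GB are decimal or binary.
PETABYTE = 10**15
GIGABYTE = 10**9


def max_mpi_tasks(total_memory_bytes: int, min_bytes_per_task: int) -> int:
    """Upper bound on MPI tasks if each task gets exactly the minimum memory."""
    return total_memory_bytes // min_bytes_per_task


coral_task_bound = max_mpi_tasks(4 * PETABYTE, 1 * GIGABYTE)
print(f"At most {coral_task_bound:,} MPI tasks at 1 GB each")
```

Under these assumptions the bound works out to four million tasks, which gives a feel for the scale at which data movement between tasks starts to dominate the design.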
As much as the CORAL proposals are likely to represent vendors’ best future thinking on the technology front, they are likely to have a few surprises on the business model front as well. In addition to technologies not visible on the Top500 list today, one is likely to see new HPC players included in the CORAL proposals, as well as existing players taking on new roles. Evaluating the CORAL proposals promises to be no easy task for the DOE. But I couldn’t be prouder of NVIDIA’s participation in CORAL. Just as we harnessed the power of visual computing to help TITAN usher in a new era of accelerated computing, the technologies we have put forth for CORAL will continue to do the same.