Big Data Meets Big Compute … Again

A recent CIO poll indicated that over half of CIOs were not quite sure what “big data” meant, even though they considered it important to their organizations. Welcome to the club. Five years ago, “big data” at HPC centers meant breaking the 1PB boundary with a parallel file system like Lustre. These days, given the advances in Lustre solutions, especially the new Lustre sysadmin tools rapidly coming to market, a 1+PB parallel file system is relatively straightforward to build and manage. Talk of “big data”, even in HPC centers, increasingly conjures up the notion of Hadoop, NoSQL databases like HP Vertica, and other technologies coming out of the web world. Parallel file systems like Lustre remain important in HPC architectures for handling the ever-increasing data generated by HPC simulations and advanced scientific instruments. The promise of Hadoop is in providing a long-term place not only to store all that data but also to continue manipulating and using it once it no longer needs to reside in your parallel file system.
Just as the number of sites that can afford a 1PB Lustre parallel file system has quickly grown, so too has the number of sites that can quickly fill a 1PB file system. A few dozen NVIDIA GPUs running simulations of thin-film solar arrays can generate as much data today as hundreds or even thousands of servers did a few years ago, and so can today’s scientific instruments, such as the latest advanced genome sequencers. The challenge today is not just generating or capturing massive amounts of data but deciding what to do with it afterwards. Many large HPC sites no longer automatically back up their parallel file systems, relying instead on users to save the data they need, often in legacy tape silos with their own obfuscated indexing and management software, never to be made useful again. Herein lies the opportunity for “big data” solutions for HPC.
When it took your system a year to generate 1PB of data, you could get by with forcing users to manually back up to a 3PB tape library that might hold four or five years’ worth of data. With the same HPC centers now generating 1PB in a month, or perhaps in as little as a few days, new approaches are needed. Today’s HPC users aren’t content to file away their data on a tape drive where it may never be seen or used again; they routinely want to access, search, combine, and continue processing their data. While Hadoop is relatively simple to install and manage on a small scale, running multi-PB Hadoop clusters is at least as difficult as running a 1PB Lustre parallel file system was a few years ago, and people with Hadoop skills and people with HPC skills are not often found working in the same centers.
While early Hadoop implementations were assembled from low-cost x86 servers with a dozen or so disk drives hanging off the back, that approach has started to reach its limits. On the web side, operators want to manage thousands of servers where they once managed tens, and that simply doesn’t scale by adding more racks of shared-nothing Hadoop servers. One has to start thinking about integrated approaches to server design, sharing not only storage but also networking, power, cooling, and management. HP’s Project Moonshot is a great example of that approach.
For simple Hadoop file storage or processing, even entry-level x86 CPUs today are more powerful than needed and too power-hungry to scale with storage demands. At the same time, other classes of data will require much more powerful processing engines located close to the data than even the fastest x86 CPU can provide. I expect before too long we will see Hadoop clusters that integrate NVIDIA GPUs, AMD APUs, or Intel’s MIC technology into the Hadoop node. Extreme energy efficiency isn’t just about the low absolute power of the processor; it is about high performance/watt on the workloads that are important to a particular user. Thus it is likely that we will see continued evolution of processors along multiple axes, with ARM and other low-power processors at one end and ever broader types of GPUs and accelerators along the other. Both will provide high performance/watt on their respective workloads, and when integrated into solutions like HP’s Project Moonshot that provide shared storage, networking, power, cooling, and management, both will form the basis for the most scalable “big data” solutions to come.
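To make the performance/watt point concrete, here is a minimal sketch of the arithmetic. All of the processor names and throughput/power figures below are hypothetical placeholders chosen purely for illustration, not measurements of any real part; the point is only that the ranking flips depending on which workload you divide by the power draw.

```python
# Hypothetical figures for illustration only: throughput (GFLOPS) on two
# workload types, and power draw (watts). Not real benchmark data.
processors = {
    "low-power ARM SoC": {"scalar_gflops": 8,  "parallel_gflops": 16,   "watts": 5},
    "midrange x86 CPU":  {"scalar_gflops": 50, "parallel_gflops": 200,  "watts": 95},
    "GPU accelerator":   {"scalar_gflops": 10, "parallel_gflops": 1500, "watts": 235},
}

def perf_per_watt(spec, workload):
    """GFLOPS per watt for the given workload column of a processor spec."""
    return spec[workload] / spec["watts"]

for name, spec in processors.items():
    print(f"{name}: "
          f"{perf_per_watt(spec, 'scalar_gflops'):.2f} GFLOPS/W scalar, "
          f"{perf_per_watt(spec, 'parallel_gflops'):.2f} GFLOPS/W parallel")
```

With these made-up numbers, the low-power part leads on the scalar workload while the accelerator leads on the parallel one, which is exactly why “efficiency” only makes sense relative to a particular user’s workload mix.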


About Marc Hamilton

Marc Hamilton – Vice President, Solutions Architecture and Engineering, NVIDIA. At NVIDIA, the Visual Computing Company, Marc leads the worldwide Solutions Architecture and Engineering team, responsible for working with NVIDIA’s customers and partners to deliver the world’s best end-to-end solutions for professional visualization and design, high performance computing, and big data analytics. Prior to NVIDIA, Marc worked in the Hyperscale Business Unit within HP’s Enterprise Group, where he led the HPC team for the Americas region. Marc spent 16 years at Sun Microsystems in HPC and other sales and marketing executive management roles. Marc also worked at TRW developing HPC applications for the US aerospace and defense industry. He has published a number of technical articles and is the author of the book “Software Development: Building Reliable Systems”. Marc holds a BS degree in Math and Computer Science from UCLA, an MS degree in Electrical Engineering from USC, and is a graduate of the UCLA Executive Management program.