HPC Friday

Fridays are typically quite busy, so here are a few random HPC-related notes.

One of my favorite IT journalists, Ashlee Vance, has moved over to Bloomberg Businessweek and published his first BW story, titled The Cloud: Battle of the Tech Titans. Highly recommended reading. Interesting to note that Ashlee has about 4x as many comments on the article on his Facebook page as on the BW site. Sign of the cloud.

On the surface, HPC workloads would seem to be among the easiest to move to a cloud environment. In HPC you have a job, or many jobs, that typically get scheduled by a job scheduler to run across multiple compute nodes. This used to be called Grid. Cross out Grid, re-label it Cloud, and you're done. Well, almost. Let's look at some of the barriers. A great starting point is Ian Foster’s IEEE paper Cloud Computing and Grid Computing 360-Degree Compared. Despite being first published in 2008, it is still the #3 most popular IEEE download! Want someone else to build your HPC Cloud? Check out Univa, the company Ian Foster co-founded and whose advisory board he remains on.

One of the barriers to adoption of cloud computing for large-scale HPC is not necessarily technical but economic. Most cloud computing business models are based on some sort of oversubscription. If you operate a cloud with 10,000 two-socket quad-core servers (80,000 cores in total), you are probably fairly safe in selling more than 80,000 core-hours for every hour of operation to users with typical cloud workloads, as not every user is going to use 100% of the compute cycles they purchase 100% of the time. Of course Amazon, Google, Microsoft and other large cloud providers don’t tell you how much they are oversubscribing their servers, and you really shouldn’t care. At their scale, with hundreds of thousands of servers, the laws of statistics come into play and oversubscription is a safe bet.
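To make the arithmetic concrete, here is a minimal Python sketch of the oversubscription math above. The server counts come from the example in the text; the 50% average utilization is purely a hypothetical figure chosen for illustration, and the function names are my own.

```python
def total_cores(servers: int, sockets: int, cores_per_socket: int) -> int:
    """Physical cores in the cloud."""
    return servers * sockets * cores_per_socket


def sellable_core_hours(physical_cores: int, avg_utilization: float) -> float:
    """Core-hours that could be sold per hour of operation if customers,
    on average, keep only `avg_utilization` of what they buy busy."""
    if not 0 < avg_utilization <= 1:
        raise ValueError("avg_utilization must be in (0, 1]")
    return physical_cores / avg_utilization


cores = total_cores(servers=10_000, sockets=2, cores_per_socket=4)
print(cores)  # 80000

# Hypothetical: typical cloud customers keep half of their purchased cycles busy.
print(sellable_core_hours(cores, avg_utilization=0.5))  # 160000.0
```

The lower the average utilization, the more core-hours can be sold against the same hardware; at 100% utilization there is nothing extra to sell.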

Some of the big HPC sites, however, do publish their usage numbers, like these from TACC’s Ranger system, published after its first full year of operation. Combine the usage numbers with readily available information on Ranger’s size, and you get some very high utilization numbers, in the 90+ percent range. If I were a cloud provider with a 60,000-core cluster (the approximate size of Ranger) and I were offering HPC services, I wouldn’t want to oversubscribe the system much, if at all, knowing it had been 90%+ utilized over the last year. HPC jobs tend to consume nearly 100% of the cycles on their allocated cores while they run, offering little opportunity for a cloud provider to make money by oversubscribing its compute resources, at least for large-scale HPC jobs that use a majority of the cloud’s resources.
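The utilization calculation itself is just delivered core-hours divided by available core-hours. A rough sketch, assuming a 60,000-core system running for a full year; the delivered figure below is a made-up placeholder, not Ranger's published number.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def utilization(delivered_core_hours: float, cores: int,
                hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the available core-hours that were actually delivered to jobs."""
    return delivered_core_hours / (cores * hours)


cores = 60_000                      # approximate size used in the text
print(cores * HOURS_PER_YEAR)       # 525600000 core-hours available per year

# Placeholder value for illustration only, not a published usage number.
hypothetical_delivered = 480_000_000
print(round(utilization(hypothetical_delivered, cores), 3))
```

Anything above 0.9 leaves less than 10% of the machine idle, which is very little slack to sell twice.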

So why do Amazon’s HPC offerings make business sense for them? Again, simple statistics. Amazon’s largest standard HPC instance is 64 cores, not the 60,000 cores used by some of the largest jobs on Ranger. It is much simpler to statistically manage workloads and oversubscription at 64 cores than at 60,000 cores. But as HPC use on Amazon grows, I expect Amazon will increase the size of their HPC resources and offer larger and larger HPC instances. It is simple statistics.
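One way to see the "simple statistics" is to compare how much total demand fluctuates when a pool is shared by many small independent jobs versus one huge job. The sketch below uses a deliberately crude model that is my own assumption, not anything Amazon has published: each job slot is independently busy with probability p_busy, so total demand is binomial.

```python
import math


def relative_demand_spread(n_jobs: int, p_busy: float) -> float:
    """Standard deviation of total demand divided by its mean, for n_jobs
    independent, equal-size job slots that are each busy with probability p_busy."""
    mean = n_jobs * p_busy
    std = math.sqrt(n_jobs * p_busy * (1 - p_busy))
    return std / mean


pool_cores = 60_000
p_busy = 0.5  # hypothetical probability that a given slot is in use

for job_cores in (64, 60_000):
    n_jobs = pool_cores // job_cores  # how many such jobs fit in the pool
    spread = relative_demand_spread(n_jobs, p_busy)
    print(f"{job_cores:>6}-core jobs: {n_jobs:>4} slots, relative spread {spread:.3f}")
```

With hundreds of 64-core slots the total demand stays within a few percent of its average, so a provider can plan around that average. With a single 60,000-core job the demand is all or nothing, and there is no average to plan around.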

In the meantime, we see other special-purpose HPC cloud providers filling the niche for cloud HPC cycles for jobs larger than 64 cores. One such provider is R Systems. Never heard of R Systems? I’ll give you a hint: they are located in Champaign, Illinois, just down the road from NCSA. A smart bunch of folks.

Out of time for today, but I’ll be back next week.


About Marc Hamilton

Marc Hamilton – Vice President, Solutions Architecture and Engineering, NVIDIA. At NVIDIA, the Visual Computing Company, Marc leads the worldwide Solutions Architecture and Engineering team, responsible for working with NVIDIA’s customers and partners to deliver the world’s best end to end solutions for professional visualization and design, high performance computing, and big data analytics. Prior to NVIDIA, Marc worked in the Hyperscale Business Unit within HP’s Enterprise Group where he led the HPC team for the Americas region. Marc spent 16 years at Sun Microsystems in HPC and other sales and marketing executive management roles. Marc also worked at TRW developing HPC applications for the US aerospace and defense industry. He has published a number of technical articles and is the author of the book, “Software Development, Building Reliable Systems”. Marc holds a BS degree in Math and Computer Science from UCLA, an MS degree in Electrical Engineering from USC, and is a graduate of the UCLA Executive Management program.