Managing High Performance GPU Clusters

Seems like every vendor has multiple GPU offerings on their price list today, yet despite the rapidly growing use of GPUs in HPC clusters, knowledge of how to monitor and manage GPUs in a cluster to ensure optimal system performance is lagging.

Before I jump into management issues, it’s appropriate to say a few words about balanced design. Adding a GPU like Nvidia’s M2070 to a standard 2-socket x86 server is akin to dropping a 400 horsepower BMW M3 engine into an entry-level economy car: much of the performance can be wasted. HP packs up to three Nvidia M2070s into its 2-socket SL390 server, and then fits four of these SL390s into a super-compact 4RU SL6500 chassis. With HP’s balanced system design, the results speak for themselves.

The key to the SL390’s balanced system design is paying attention to one specific line in the Nvidia M2070’s spec sheet: “System Interface PCIe x16 Gen2”. Virtually every modern x86-based server supports PCIe. The challenge (or the problem for other vendors) is that the Intel Tylersburg IOH (I/O Hub) chip used by most vendors to integrate PCIe into modern Intel-based servers only supports 40 “lanes” of PCIe. Doing some quick math, if each M2070 requires 16 lanes of PCIe and a QDR IB interface requires 8 lanes, then three GPUs plus a single IB link already need 56 lanes, and you very quickly run out of PCIe lanes.

HP engineers worked with Intel and Nvidia to come up with a simple yet elegant solution. We designed the SL390 to support up to two IOHs, as shown in the block diagrams in this HP white paper. As a result, the SL390 can provide a full 16 lanes of dedicated PCIe to up to three Nvidia M2070 GPUs, along with a full 8 lanes of dedicated PCIe to up to two QDR IB connections.
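
The quick math above can be written out as a small lane budget. This is only a sketch using the lane counts quoted in this post (40 per IOH, 16 per M2070, 8 per QDR IB link). The function names are made up for illustration, and the calculation ignores how lanes are actually split between the two IOHs and any lanes used by other devices.

```python
# Lane counts as quoted in the post; nothing here is measured.
LANES_PER_IOH = 40   # PCIe lanes provided by one IOH
LANES_PER_GPU = 16   # M2070 spec sheet: "PCIe x16 Gen2"
LANES_PER_IB = 8     # one QDR IB interface


def lanes_required(gpus: int, ib_links: int) -> int:
    """Total PCIe lanes needed to give every device dedicated lanes."""
    return gpus * LANES_PER_GPU + ib_links * LANES_PER_IB


def lane_headroom(gpus: int, ib_links: int, iohs: int) -> int:
    """Lanes left over (negative means oversubscribed)."""
    return iohs * LANES_PER_IOH - lanes_required(gpus, ib_links)


if __name__ == "__main__":
    # Compare a single-IOH design with the dual-IOH design.
    for iohs, gpus, ib_links in [(1, 3, 1), (2, 3, 2)]:
        need = lanes_required(gpus, ib_links)
        spare = lane_headroom(gpus, ib_links, iohs)
        print(f"{iohs} IOH, {gpus} GPUs, {ib_links} IB: need {need}, headroom {spare}")
```

With one IOH the budget goes negative as soon as you add the third GPU, while two IOHs leave room for three GPUs and two IB links.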

So now that you’ve built your GPU cluster, how do you manage it? Traditional systems management tools have generally not caught up with the GPU trend and offer little help in managing GPU clusters. Nvidia recognizes this and provides a basic command line tool, nvidia-smi, to view GPU metrics such as GPU % busy, memory usage, temperature, and ECC counts. Command line tools work great if you have only 1 or 2 GPUs, but if you have 1 or 2 thousand GPUs, that is a lot of typing. Not to worry: HP has updated its Cluster Management Utility (CMU) to help you monitor GPU clusters.
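
To give a feel for what collecting these metrics looks like in a script rather than at the keyboard, here is a minimal Python sketch. It assumes a driver recent enough that nvidia-smi supports the `--query-gpu` and `--format=csv,noheader,nounits` options (older releases may only offer `-q` output). The helper names `parse_gpu_csv` and `query_local_gpus` are my own, not part of any Nvidia or HP tool.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; `nvidia-smi --help-query-gpu` lists others.
FIELDS = ["index", "utilization.gpu", "memory.used", "temperature.gpu"]


def _to_int(value):
    """Convert a CSV cell to int, or None for values like '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn headerless CSV output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append({f: _to_int(v.strip()) for f, v in zip(FIELDS, record)})
    return rows


def query_local_gpus():
    """Run nvidia-smi on this node and parse its output (needs an Nvidia driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_gpu_csv(out)
```

Keeping the parsing separate from the subprocess call means you can test the parser on captured output from a node without needing a GPU on your workstation.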

Twitter follower HPC_Guru recently asked what could be monitored besides GPU temperature. Two of the more important things that CMU is configured to automatically monitor are GPU and IOH temperature. Given that the M2070 spec sheet lists a power consumption of 225 watts, it is no surprise that the GPU temperature is something you want to monitor (most x86 CPUs, by comparison, consume between 95 and 130 watts). But the IOH doesn’t stand out as a big heat source. As it turns out, when you are driving two GPUs at full speed, along with a QDR IB link, the IOH runs consistently hot.

In addition to GPU temperature, CMU is also pre-configured to monitor GPU utilization and memory usage, along with ECC error counts. CMU actually uses the nvidia-smi command behind the scenes to collect this information. However, that’s a bit of an incomplete answer to HPC_Guru’s question. Since CMU supports custom monitoring scripts, you can configure your CMU installation to monitor any information collected by nvidia-smi. This is consistent with CMU’s overall approach of design simplicity, and in fact with the approach HP generally takes in designing high performance clusters. Complicated, layered management tools have their place in enterprise environments, but most HPC customers want simple, scalable tool sets they can extend and customize as necessary for their own environments. Many tools that do an excellent job of managing 10 or 100 servers in an enterprise environment simply don’t scale well to manage 1000s or 10,000s of servers in HPC and other hyperscale environments.
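
As an illustration of the kind of simple, extensible check such a custom script might perform, here is a short Python sketch that scans per-node GPU temperatures and reports only the GPUs over a limit. It does not use CMU’s actual script interface, which isn’t described in this post. The `hot_gpus` function, the node names, and the 85 °C limit are all hypothetical.

```python
def hot_gpus(readings, limit_c):
    """Return (node, gpu_index, temp_c) for every GPU above limit_c.

    readings maps a node name to a list of GPU temperatures in Celsius,
    ordered by GPU index. None marks a GPU whose temperature was unavailable.
    """
    flagged = []
    for node in sorted(readings):
        for gpu_index, temp in enumerate(readings[node]):
            if temp is not None and temp > limit_c:
                flagged.append((node, gpu_index, temp))
    return flagged


if __name__ == "__main__":
    # Made-up sample data standing in for values gathered from each node.
    sample = {
        "node001": [72, 91, 70],
        "node002": [68, None, 69],
    }
    for node, gpu_index, temp in hot_gpus(sample, limit_c=85):
        print(f"{node} GPU {gpu_index}: {temp} C")
```

Reporting only the exceptions is what keeps a check like this usable at thousands of GPUs, where nobody is going to read a full table of readings.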

So there you have it. Managing high performance GPU clusters starts with a balanced system design like the SL390 that lets your cluster scale in the first place, and then adds simple, scalable management tools like HP’s CMU. CMU also extends today to manage HPC storage such as HP’s X9000 storage system, and through our IB partners we also provide complete IB fabric management tools.


About Marc Hamilton

Marc Hamilton – Vice President, Solutions Architecture and Engineering, NVIDIA. At NVIDIA, the Visual Computing Company, Marc leads the worldwide Solutions Architecture and Engineering team, responsible for working with NVIDIA’s customers and partners to deliver the world’s best end to end solutions for professional visualization and design, high performance computing, and big data analytics. Prior to NVIDIA, Marc worked in the Hyperscale Business Unit within HP’s Enterprise Group where he led the HPC team for the Americas region. Marc spent 16 years at Sun Microsystems in HPC and other sales and marketing executive management roles. Marc also worked at TRW developing HPC applications for the US aerospace and defense industry. He has published a number of technical articles and is the author of the book, “Software Development, Building Reliable Systems”. Marc holds a BS degree in Math and Computer Science from UCLA, an MS degree in Electrical Engineering from USC, and is a graduate of the UCLA Executive Management program.