HPC Friday – Setting New PRs With Nvidia

This week I had the great pleasure of speaking to about 200 Nvidia sales and technical staff during a worldwide training session. I spoke just before one of HP’s product managers, who presented on the 14 HP ProLiant servers, like the SL390s G7, that we qualify with Nvidia GPUs, so I decided to keep my talk decidedly non-product-focused, speaking instead about awareness, setting personal records (PRs), and credibility. The discussions continued well into the evening at a wonderful dinner hosted by Nvidia’s CEO, Jen-Hsun Huang, whom I always enjoy talking with. All of that will make wonderful material for future blogs, but for HPC Friday, I thought I would share some of the discussions I had with Nvidia technical staff over dinner on tips for setting application PRs running on the SL390s.

Many run-of-the-mill GPU-enabled servers slap GPUs onto standard 2-socket x86 servers without adding I/O capacity. The single IOH used in most 2-socket x86 servers simply doesn’t provide enough PCI bandwidth (lanes) for 2-3 GPUs, 1 or more IB network connections, and other system I/O. So in our GPU-enabled SL390s configs, we support dual IOHs. That is the first step in enabling application PRs.

Now let’s get specific and talk about one particular SL390s config. The measurements detailed below used an SL390s G7 2U server configured with dual IOH controllers and three Nvidia M2070 GPUs. This config was running with a QPI bus speed of 6.4 GT/s and a memory bus speed of 1066 MHz. The CPU is mostly irrelevant to these performance tests and thus not listed. When transferring from DRAM through the local CPU and local IOH to the local GPU, we achieved 5.7 GB/s reads (DRAM to GPU) and 6.3 GB/s writes (GPU to DRAM).
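
If you want to probe numbers like these on your own system, the sketch below shows one way to time pinned host-to-device and device-to-host copies with CUDA events, in the spirit of the bandwidthTest sample that ships with the CUDA SDK. It is a minimal illustration, not the exact benchmark behind the figures above; error checking, warm-up, and averaging over repeated runs are omitted for brevity.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;          /* 64 MB transfer */
    void *h_buf, *d_buf;
    cudaEvent_t start, stop;
    float ms;

    cudaSetDevice(0);                       /* GPU0: local to CPU0 in this config */
    cudaMallocHost(&h_buf, bytes);          /* pinned host memory */
    cudaMalloc(&d_buf, bytes);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Host-to-device: the "DRAM to GPU" read direction. */
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.2f GB/s\n", bytes / (ms * 1.0e6));

    /* Device-to-host: the "GPU to DRAM" write direction. */
    cudaEventRecord(start, 0);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2H: %.2f GB/s\n", bytes / (ms * 1.0e6));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}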

A logical question to ask is what penalty you would pay if the transfer went to one of the remote GPUs. In this test we measured 4.9 GB/s reads and 4.34 GB/s writes. This was a simple test measuring data transfer only, and not necessarily representative of what every application will achieve. Nevertheless, the 14% read penalty (0.8 of 5.7 GB/s) and 31% write penalty (1.96 of 6.3 GB/s) are significant enough to impact the PR of your application if not managed.

So the question is: how do you ensure that GPU read/write accesses go, as much as possible, to the local CPU’s DRAM?

The near CPUs are (a quick way to verify this mapping on your own system follows the list):

GPU0 -> CPU0 (even-numbered cores)
GPU1 and GPU2 -> CPU1 (odd-numbered cores)
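
The Linux kernel exposes each PCI device’s local CPU list through sysfs, so you can confirm the mapping yourself. Here is a hedged sketch that looks it up for a given CUDA device; it assumes CUDA 4.0 or later for cudaDeviceGetPCIBusId and the usual /sys/bus/pci layout.

#include <ctype.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Print the CPU cores that are local to a given CUDA device by reading
 * /sys/bus/pci/devices/<bus-id>/local_cpulist. */
void print_local_cpus(int dev)
{
    char busid[32], path[128], cpulist[256];
    FILE *fp;
    int i;

    cudaDeviceGetPCIBusId(busid, (int)sizeof(busid), dev);
    for (i = 0; busid[i]; i++)               /* sysfs uses lower-case hex */
        busid[i] = (char)tolower((unsigned char)busid[i]);
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/local_cpulist", busid);

    fp = fopen(path, "r");
    if (fp && fgets(cpulist, sizeof(cpulist), fp))
        printf("GPU%d (%s) is local to cores %s", dev, busid, cpulist);
    if (fp)
        fclose(fp);
}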

For the application yourapp, you can pin it to a specific core with the Linux taskset command; the --device option in the examples below belongs to yourapp itself (a stand-in for however your application selects its GPU), not to taskset. For example:

taskset -c 2 yourapp --device=0

taskset -c 3 yourapp --device=2

This runs yourapp pinned to the specified processor, which by default allocates memory local to that processor, falling back to memory on the other CPU only when the local memory runs out.
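
For completeness, here is a hedged sketch of how a hypothetical yourapp might consume that --device flag and hand it to the CUDA runtime; your application’s own option handling will of course differ.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    int dev = 0;
    int i;

    /* Look for a --device=N argument (the hypothetical convention used in
     * the taskset and numactl examples). */
    for (i = 1; i < argc; i++)
        if (strncmp(argv[i], "--device=", 9) == 0)
            dev = atoi(argv[i] + 9);

    cudaSetDevice(dev);          /* all subsequent CUDA work goes to this GPU */

    printf("Using CUDA device %d\n", dev);
    /* ... the rest of yourapp ... */
    return 0;
}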

The Linux numactl command can also pin the processor, and it adds options for controlling which processors and memory the application may use. For example:

numactl --membind=0 yourapp --device=0

will force yourapp to allocate memory only on CPU0 and fail if there is insufficient memory on CPU0. Like taskset, numactl can also specify the processor to use:

numactl --physcpubind=2 yourapp --device=0

You can also set the affinity from your code using the system function sched_setaffinity. The following is a simple C function that sets the affinity based on the GPU device id:

#define _GNU_SOURCE                  /* needed for CPU_ZERO, CPU_SET, sched_setaffinity */
#include <sched.h>

void set_cpu_affinity(int id)
{
    cpu_set_t mask;
    /* Near core for each GPU: GPU0 -> core 2 (CPU0); GPU1 -> core 3 and GPU2 -> core 5 (CPU1). */
    int gpu_table[3] = {2, 3, 5};

    /* Set the affinity for the GPU specified by id. */
    CPU_ZERO(&mask);
    CPU_SET(gpu_table[id], &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

A sample usage of this is:
/* Select the CUDA GPU and set the CPU processor affinity. */
cudaSetDevice(dev);
set_cpu_affinity(dev);
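
Note that sched_setaffinity only pins the process to a core; allocations will then come from that core’s local node by default, but the kernel may still fall back to the far node when local memory runs out. If you want the strict binding of numactl --membind enforced from inside your code, libnuma offers that. The following is a minimal sketch, assuming libnuma version 2 (compile and link with -lnuma); it is one way to do it, not the only way.

#include <numa.h>

/* Restrict all future memory allocations of this process to NUMA node 0,
 * the in-code equivalent of "numactl --membind=0". Like --membind,
 * allocations fail once node 0 runs out of memory. */
void bind_memory_to_node0(void)
{
    struct bitmask *nodes;

    if (numa_available() < 0)
        return;                      /* kernel has no NUMA support */

    nodes = numa_parse_nodestring("0");
    numa_set_membind(nodes);
    numa_bitmask_free(nodes);

    /* For the interleaving approach discussed below, libnuma similarly
     * offers numa_set_interleave_mask(numa_all_nodes_ptr). */
}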

For some applications, it is not practical to restrict GPU-to-memory transfers to the memory of a single CPU socket. For example, suppose yourapp loads a large data set that spans the memory of both CPU sockets and operates on it with all three GPUs. In that case, it may be best to minimize the difference between using near memory and far memory. This can be done in two ways.

One is to enable the BIOS setting “Node Interleaving”. The other is to use the numactl command:
numactl --interleave=all yourapp

The results of these two alternatives are different, so if you think this approach is best for yourapp, you should investigate both. If you are a new GPU user, all of this may sound complicated, but don’t worry: even without any of these tips and tricks, HP is confident the SL390s G7 2U, with its dual IOHs, built-in 10G/IB networking, and up to 3 Nvidia GPUs, will out-perform any other GPU server in its class, all with the power-saving, PR-setting efficiency of the greenest production supercomputer in the world!

Refer to the Linux man pages for taskset, numactl, and sched_setaffinity for more details.

I hope this helps you set some new PRs with your application running on the SL390s G7 2U. While you are thinking about maximizing GPU performance, you should also consider using HP’s Cluster Management Utility (CMU). CMU includes a number of GPU monitoring functions to help you understand how your GPUs are operating.

Many thanks to Axel from Nvidia for performing the initial benchmark and to Glenn from HP for providing the PR hints. Just one great example of the credibility you get when working with HP and Nvidia!


About Marc Hamilton

Marc Hamilton – Vice President, Solutions Architecture and Engineering, NVIDIA. At NVIDIA, the Visual Computing Company, Marc leads the worldwide Solutions Architecture and Engineering team, responsible for working with NVIDIA’s customers and partners to deliver the world’s best end to end solutions for professional visualization and design, high performance computing, and big data analytics. Prior to NVIDIA, Marc worked in the Hyperscale Business Unit within HP’s Enterprise Group where he led the HPC team for the Americas region. Marc spent 16 years at Sun Microsystems in HPC and other sales and marketing executive management roles. Marc also worked at TRW developing HPC applications for the US aerospace and defense industry. He has published a number of technical articles and is the author of the book, “Software Development, Building Reliable Systems”. Marc holds a BS degree in Math and Computer Science from UCLA, an MS degree in Electrical Engineering from USC, and is a graduate of the UCLA Executive Management program.

3 Responses to HPC Friday – Setting New PRs With Nvidia

  1. Christopher D Maestas says:

    It would be interesting to see how hwloc will play in this space in the future as well. The deprecated PLPA project seems to have been replaced by it.

    http://www.open-mpi.org/projects/hwloc

  2. samuel says:

    Hi Marc,
    can you clear something up for me: can the GPU utilize the complete bandwidth of PCIe 2.0 x16?

    Thanks…..

  3. Hi Samuel,
    Yes, in the SL390s 2U server, all 3 GPUs have a dedicated PCIe 2.0 x16 connection back to the CPU.
