The Secret Weapon of NVIDIA’s Solution Architect Team

NVIDIA’s worldwide team of solution architects works with our largest customers to solve some of the toughest problems in high performance computing, deep learning, enterprise graphics virtualization, and advanced visualization. Often, as in the recent US Department of Energy CORAL award, the systems our customers purchase are many times larger than anything we have on the NVIDIA campus. While nowhere near the size of CORAL, one of the secret weapons of NVIDIA’s Solution Architect team is our benchmarking and customer test cluster, located behind locked doors deep within NVIDIA’s Santa Clara campus.

What the system lacks in size, it makes up for with the latest GPUs, servers, storage, and networking gear from NVIDIA and our partners. One recent addition seeing heavy use is a rack of Cray CS-Storm servers, each fully loaded with eight K80 GPUs. We also have a Cray XC30 system with GPUs.

We have many different types and brands of servers, equipped not only with the latest NVIDIA GPUs but also with high-end 16-core Haswell CPUs, plenty of memory (256 GB on many servers), and the latest networking technology, including Mellanox 56Gb FDR InfiniBand and Arista low-latency 10/40GbE switches. The systems are backed by multiple types of storage; one of our newest additions is a large-capacity Pure Storage all-flash array. The Pure Storage array sees dual use: it supports NVIDIA GRID vGPU instances running VMware ESX and Citrix Xen hypervisors, with a separate partition allocated for HPC applications.

Doug, our superstar system manager, is almost constantly adding new platforms to the system. Walking through the lab today, I spotted a pile of Dell C4130 servers waiting to be mounted in racks and outfitted with four K80 GPUs each before being put to use by our solution architects to benchmark customer applications.

Of course, we also have GPU servers with Power8 and ARM-64 CPUs, so solution architects and customers can test applications in a cross-platform environment. Sometimes more important than the mix of servers, however, is the full complement of NVIDIA and partner software installed on the systems, ranging from the latest CUDA 7 RC to powerful NVIDIA libraries like cuDNN, integrated with Caffe and ready to go for training deep learning networks. For our GRID enterprise graphics virtualization business, the system supported our recent vGPU VMware early access program. Now that VMware has officially launched support for vGPU, the system is being used for the Direct Access to NVIDIA GRID™ vGPU™ with VMware Horizon® and vSphere® program.

While HP is currently a bit under-represented on the server side, we are excited to be receiving a new HP BladeSystem to work with shortly. On solution architects’ desks, however, the HP Z840 is by far the favorite. Best features: support for multiple Quadro and Tesla GPUs, super-quiet operation, and a snap-in, tool-less design that makes swapping in new GPUs or other components a breeze. For walking between offices and the server room, the favorite solution architect laptop these days is the new 14″ HP Chromebook. Internally we run a technology preview of the next generation of the VMware Blast protocol, which delivers super-fast, workstation-class graphics to the Tegra TK1-powered HP Chromebook. Two monitors are pretty much the minimum on any solution architect’s desk, and some have quite a few more.

The systems all live on our cloud, and besides seeing use by NVIDIA solution architects, we also provide customers Cisco VPN-secured remote access, from anywhere in the world, to test our latest offerings. These days, many of the systems are busy preparing and testing demos for next month’s GPU Technology Conference. While the exact content of the demos is a secret I can’t share, let’s just say we are doing a lot of deep neural network training right now on many of those K80 GPUs.

It is a great resource, and we couldn’t do our job and serve our customers without it. And a big special thanks to all of our partners who contribute to the system’s success, including Arista, Cisco, Cray, Dell, HP, Pure Storage, and Supermicro.

Posted in Cloud Computing, HPC

New Architectures for Energy Efficiency in Deep Learning

According to Wikipedia, the history of deep learning, a class of machine learning, can be traced back to at least 1980. Despite its long history, until recently deep learning remained primarily a subject of academic interest, with few widespread commercial applications. Over the last year, almost out of nowhere, an explosion of commercial interest in deep learning has emerged, fueled by everyone from startups to the largest Internet companies. For instance, Startup.ML’s first post to Twitter was just over two months ago, and already it has over 1,500 followers. Facebook’s guru of deep learning, Yann LeCun, has over 4,500 Twitter followers, not to mention many more following him on Facebook. As deep learning moves from academic research to large-scale commercial big data deployments, new systems-level architectures will be required to increase energy efficiency.

Sidebar: here is a little big data challenge for you. What is the average length of time since Startup.ML’s and Yann LeCun’s Twitter followers first posted a tweet that included the words “deep learning”? If you have a Twitter analytics app for that, let me know.

Twitter analytics aside, back to deep learning and energy efficiency. Deep learning algorithms are typically trained using very large data sets, like millions or billions of images. If you have a billion images, quite likely you are using some sort of distributed file system like Hadoop to store them all. The distributed nature of Hadoop, in itself, can contribute to energy efficiency by reducing data movement. In a traditional storage environment, be it SAN or NAS, in order to perform an operation on a stored image, the image must be moved over a network from the storage to the server. This data movement often accounts for a large percentage of the total energy used in the operation. With Hadoop, each storage node typically has local compute processing power as well, so if you want to resize an image, you can do so in place, on the Hadoop node, without moving it to a central server. While the original motivation for Hadoop might have been driven more by horizontal scalability, the energy efficiency is a nice side effect.

So if deep learning has been around since 1980 and Hadoop since 2005, why did it take until 2015 for deep learning to take off? Back to energy efficiency. The main purpose of the processor(s) in a Hadoop server was originally just to handle file system operations, and perhaps a little MapReduce. Since neither the Hadoop file system nor MapReduce is computationally intensive, at least compared to traditional high performance computing applications, Hadoop servers were typically configured with processors from the lower end of the performance spectrum. Deep learning algorithms, however, rely heavily on complex convolutions such as FFTs. Not a good match for your average Hadoop server. So what happened next?

GPUs tend to be very good at FFTs, and with thousands of CUDA compute cores, a modern GPU can solve many FFTs in parallel. As luck would have it, many researchers, no doubt including some involved for years in deep learning, enjoyed a bit of computer gaming when they were not hard at work on their research, and discovered that the GPU is a great processor for deep learning. Of course, if one GPU is good, shouldn’t two be better? Given the size of the Chinese Internet market, it is no great surprise that one of the first open source multi-GPU versions of the popular Caffe deep learning framework came from the Chinese computer company Inspur. Just as avoiding the data movement of a traditional central storage system helps Hadoop gain energy efficiency, running Caffe across two GPUs in the same server rather than across two servers improves energy efficiency.
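To make the batched-FFT idea concrete, here is a minimal CPU-side sketch using NumPy. The batch size and signal length are illustrative; a GPU library such as cuFFT exposes batched transforms through its own API, parallelizing the batch across cores instead of looping.

```python
import numpy as np

# A batch of 1,000 signals, each 1,024 samples long -- the kind of
# workload a GPU's thousands of CUDA cores can chew through in parallel.
batch = np.random.rand(1000, 1024)

# One call transforms every signal in the batch along the last axis.
# A batched GPU FFT runs these transforms concurrently; here NumPy
# simply iterates, but the data layout and result are the same.
spectra = np.fft.rfft(batch, axis=-1)

print(spectra.shape)  # -> (1000, 513): one half-spectrum per signal
```

The real-to-complex transform returns n/2 + 1 frequency bins per 1,024-sample signal, which is why the second dimension is 513.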

Of course, the challenge in most internet data centers today is that GPUs, if they are available, are not yet embedded in the Hadoop storage nodes. Many deep learning systems are still trained by moving petabytes of data from Hadoop storage to GPU clusters to be processed. As the use of deep learning continues to proliferate, I am sure we will see new architectures evolve focused on, among other things, minimizing data movement and maximizing energy efficiency. Putting a GPU or two into your Hadoop node is certainly one possibility.

Just this month, Baidu’s Deep Image system achieved record-breaking results on the ImageNet image classification benchmark using a cluster of GPUs. While not all of the details of the Deep Image architecture are known, Baidu distinguished scientist Ren Wu describes the system as a “purpose-built supercomputer”. How many of the architectural innovations of Deep Image make it into Baidu’s production deep learning systems remains to be seen, but no doubt companies like Baidu are examining all sorts of new architectures for energy efficiency and high performance.

Within the GPU, NVIDIA continues to optimize deep learning for energy efficiency. The Maxwell GPU inside NVIDIA’s latest TX1 mobile superchip includes new FP16 instructions optimized for deep learning, allowing the superchip to process four 16-bit deep learning operations at a time in each of its 256 CUDA cores, delivering over 1 teraflops of deep learning performance using only about 10 watts of power. If a traditional PCI-card-size NVIDIA GPU doesn’t fit in your Hadoop server, maybe a TX1 superchip will?
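As a rough illustration of why FP16 helps, here is a NumPy sketch (the array sizes are made up for the example). Half precision halves the storage and memory traffic per value, while the rounding error stays small enough that deep learning workloads tolerate it well.

```python
import numpy as np

# Hypothetical layer activations stored at full (32-bit) precision.
acts32 = np.random.rand(256, 1024).astype(np.float32)

# The same activations in 16-bit half precision, as the TX1's FP16
# instructions operate on: half the bytes moved per value.
acts16 = acts32.astype(np.float16)

print(acts32.nbytes // acts16.nbytes)  # -> 2: fp16 halves the storage

# float16 keeps ~3 decimal digits of precision; for values in [0, 1)
# the worst-case rounding error here stays below 1e-3.
err = float(np.max(np.abs(acts32 - acts16.astype(np.float32))))
print(err < 1e-3)  # -> True
```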

Not only Internet companies, but all sorts of commercial businesses who have collected large amounts of big data are now starting to look at deep learning. It is going to be an exciting technology to watch over the coming few years.

Posted in Uncategorized

CUDA 7 Release Candidate Now Available

Be the first to get your hands on the official CUDA 7 Release Candidate now available for download.

NVIDIA will host a CUDA 7 Overview webinar tomorrow, January 14th, at 10 am PT to help you learn about all the new CUDA 7 features and enhancements. These include:

  • C++11 support makes it easier for C++ developers to accelerate their applications: write less code with ‘auto’ and ‘lambda’, especially when using the Thrust template library.
  • The new cuSOLVER library of dense and sparse direct solvers delivers significant acceleration for computer vision, CFD, computational chemistry, and linear optimization applications:
      • Key LAPACK dense solvers (Cholesky, LU, SVD, and QR) run 3-6x faster than MKL.
      • Sparse direct solvers and eigensolvers run 2-14x faster than CPU-only equivalents.
  • Runtime Compilation enables highly optimized kernels to be generated at runtime: improve performance by removing conditional logic and evaluating special cases only when necessary.
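For readers who want to see what those dense factorizations compute, here is a CPU-side sketch using NumPy’s LAPACK bindings as stand-ins (cuSOLVER’s actual C API differs, and LU is omitted because NumPy does not expose it directly):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
spd = A @ A.T + 100 * np.eye(100)   # symmetric positive definite

# The dense factorizations cuSOLVER accelerates, via NumPy/LAPACK:
L = np.linalg.cholesky(spd)          # Cholesky: spd = L @ L.T
q, r = np.linalg.qr(A)               # QR:       A   = q @ r
u, s, vt = np.linalg.svd(A)          # SVD:      A   = u @ diag(s) @ vt

# Each factorization reconstructs its input.
print(np.allclose(L @ L.T, spd))     # -> True
print(np.allclose(q @ r, A))         # -> True
print(np.allclose((u * s) @ vt, A))  # -> True
```

On a GPU these same factorizations are exactly where the 3-6x speedups over MKL come from, since they are dominated by dense matrix arithmetic.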

For more technical details on CUDA 7, read our Parallel Forall blog.

New to CUDA? Learn CUDA programming with qwikLabs CUDA hands-on online training labs.

After you give it a try, please post feedback to our Developer Forums.

Posted in Uncategorized

How Facebook Can Help Make Your Next Car Safer

Anyone who knows a teenage driver can’t help but worry about the inevitable near-miss (or worse) accident caused by a distracted driver checking their Facebook page. However, your next car just might be a whole lot safer because of deep neural network (DNN) technology being developed at Facebook and scores of other companies up and down Silicon Valley. You need look no further than Yann LeCun’s Facebook page (but not while driving, please) to see what Facebook’s Director of Artificial Intelligence is up to with DNNs. Besides his job at Facebook, Yann is also a professor in NYU’s Computer Science department, where he helped pioneer many current advances in the field. But there is more to the Facebook-NYU connection than your typical Silicon Valley university relationship: it stems from the core innovation driving DNNs as one of today’s leading machine learning approaches, the massive amounts of big data used to train them.

DNN algorithms are not particularly new. What is relatively new is the combination of DNNs with the massive amounts of unstructured big data, including voice, images, and video, stored by today’s top social networking and search sites, and the unparalleled performance of GPUs to crunch all of that data through DNNs in a cost- and power-efficient manner. One of the key mathematical algorithms used in DNNs is the Fast Fourier Transform, or FFT. GPUs are particularly well suited to processing FFTs. For DNNs, Facebook recently made this even more true when LeCun and his collaborators released the new fbFFT library.
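The mathematical property FFT-based convolution libraries like fbFFT build on is the convolution theorem: convolution in the spatial domain becomes pointwise multiplication in the frequency domain. A minimal NumPy sketch (signal length illustrative, using circular convolution for simplicity):

```python
import numpy as np

signal = np.random.rand(64)
kernel = np.random.rand(64)

# FFT route: transform, multiply pointwise, transform back.
fast = np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel)))

# Reference: circular convolution computed the slow, O(n^2) way.
n = len(signal)
slow = np.array([sum(signal[j] * kernel[(i - j) % n] for j in range(n))
                 for i in range(n)])

print(np.allclose(fast, slow))  # -> True
```

The FFT route costs O(n log n) instead of O(n^2), which is why it pays off for the large filter banks in DNN training.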

If the fbFFT paper is a bit too technical for you, Jeremy Howard’s recent TED Talk on machine learning explains the technology in simpler terms through lots of examples. As the founder of Silicon Valley startup Enlitic, Jeremy knows a thing or two about machine learning.

Now, Facebook may or may not have any interest in self-driving cars, but the DNN technology that can automatically identify your friend’s picture on a Facebook page is much the same as the technology that can already help an automobile identify pedestrians in a crosswalk or slowing traffic ahead. This week at the CES Consumer Electronics Show, NVIDIA introduced a host of new technologies, from the new Tegra X1 mobile superchip, capable of over 1 teraflops of DNN processing, to the new NVIDIA Drive PX auto-pilot car computer, which will make it easier than ever for automotive manufacturers to integrate advanced DNN technology into future vehicles.

While you can’t yet buy a car with the Drive PX auto-pilot computer, developers today can start writing software for it on any NVIDIA GPU platform, from the $192 Jetson TK1 developer kit to the GeForce GTX 980, the world’s most advanced GPU, which uses the same Maxwell technology found in the upcoming Tegra X1.

But for now, Facebook in cars should remain for passenger use only. For more information on the Tegra X1, Drive PX, and other new NVIDIA technologies, watch our CES press conference.


Posted in Uncategorized

WordPress 2014 in review

The stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 12,000 times in 2014. If it were a concert at the Sydney Opera House, it would take about 4 sold-out performances for that many people to see it.

Click here to see the complete report.

Posted in Uncategorized

Last Day Highlights from SC14

While SC14 is perhaps one of the best computer hardware shows on the planet, it isn’t just about hardware. Software, especially GPU-accelerated software, plays an increasingly important role in scientific discovery. One of the most broadly used HPC software packages has long been MATLAB, and it is great to see MathWorks highlighting GPU acceleration on at least two sides of their booth.

GPU-powered systems large and small were highlighted throughout the show floor, and with Google’s Nexus 9 tablet now shipping with a full 192-Kepler-core Tegra K1 SOC, it was only a matter of time before folks started building mini-clusters out of NVIDIA’s Jetson TK1 developer kit. French telecom company Orange commissioned this system, built out of several dozen Jetson TK1 dev kits, for a data analytics project. If you wonder what all the excitement is about, you can order your own 192-core Jetson TK1 dev kit for only $192.

One booth I missed earlier was SGI’s. Cloud hardware vendor Rackable kept the legendary SGI name and HPC legacy alive when it acquired the name and various assets in 2009, so it is no surprise that SGI prominently displayed their 8-way, high-density (2RU) NVIDIA Tesla GPU platform in their booth. As GPUs become increasingly popular for machine learning and other big data analytics, SGI has a great opportunity to sell their 8-way box to both HPC and big data customers. And they are giving away a great-looking T-shirt too!

If 8 GPUs in 2RU isn’t enough for you, One Stop Systems sells a chassis with 16 NVIDIA GPUs in 3RU. The One Stop chassis doesn’t include the server; it is intended for customers who want to cable up more GPUs to a server than it can physically hold. Several OEMs also showed off One Stop chassis connected to their servers in other booths.

As customers increasingly look to optimize the energy efficiency of their computing solutions, liquid cooling solutions continue to proliferate at SC14. And while they don’t build servers, the 3M booth had an interesting display of their fluid immersion technology, just one of the many liquid cooling approaches out on the show floor.

Posted in Uncategorized

Marc’s Best GPU Servers of SC14

This afternoon I spent a bit of time walking around the SC14 show floor, and here is a list of my favorite NVIDIA GPU-powered servers. While the list is very unofficial, I did follow a few guidelines. First, the NVIDIA partner had to have the server with NVIDIA GPUs displayed on the show floor. Second, there had to be someone in the booth who could talk to me intelligently about their GPU-powered solutions. Based on those simple guidelines, here are my favorites. It’s great to see so many new GPU-powered solutions out on the show floor. If I missed one of your favorites, let me know; I’ll be out on the show floor again tomorrow, happy to take a look and listen.

Best Water-Cooled GPU Solution

HP’s Apollo 8000 also wins extra bonus points as the tallest GPU server. This beast is not for the casual user. Not only does it pack 72 server nodes with 144 GPUs into a single rack, it also manages to include all the Mellanox InfiniBand leaf switches you need for a full fat-tree topology. Besides the efficiency of HP’s unique liquid cooling solution, the Apollo 8000 also saves power with its 480V power supply and HVDC internal power distribution. While some of the other solutions may physically fit more than 144 GPUs in a rack, this is likely the densest GPU solution you can actually operate, especially considering that it integrates all the InfiniBand leaf switches. Downside? Only two GPUs per node are offered.

Best 8-GPU Solution Proven in the Top500

Cray’s CS-Storm hits the other end of the GPUs-per-node range, supporting 8 GPUs in a compact 2RU form factor. With so many new GPU-powered servers now available, many of the systems out on the show floor have yet to be proven in large Top500 configurations. Not so the CS-Storm, which managed to be the only new server to break into the Top 10 of the Top500. While the CS-Storm is a standalone rack-mount server, it really is intended to be sold in complete rack configurations: Cray integrates not only the power and optional rear-door water cooling that most full-rack configurations are likely to require, but also an entire software stack, including OS and management tools, and does one of the best jobs of it. Downside? The CS-Storm requires a non-standard-width rack. Penalty points for only being displayed behind a plastic cover; SC14 is the last great hardware show on the planet, and we want to leave fingerprints on your servers.

Most Improved 4-GPU Solution

While Dell ships a lot of NVIDIA GPUs, they haven’t historically had category-leading products. Well, that all changed on Monday with the new C4130. Moving away from earlier multi-node GPU designs and their complications, the C4130 is a new single-node, 1RU, 4-GPU server that is even “EDR-ready” for the new Mellanox 100G InfiniBand, thanks to careful PCI slot layout. Dell also figured out how to support all 4 GPUs with a single x86 CPU, so customers whose applications don’t need the extra serial performance can skip paying for the extra CPU. Especially with the new NVIDIA K80 GPU module sporting 2 Kepler GK210 GPU chips in each module (8 GPU chips total), the C4130 promises to quickly become a workhorse GPU solution.

Best Non-x86 GPU Server

The IBM booth was happily displaying this unnamed future OpenPower-based server. Supporting two NVIDIA K80 GPUs in 2RU, with up to 1TB of RAM, this promises to be an interesting server for customers wanting to get started with Power + GPUs today, before the next-generation NVLink-connected Pascal + Power8+ systems start shipping.

Best 8-way PCI Design

Cirrascale has an interesting 8-way design that allows up to 8 NVIDIA GPUs to be configured on a single PCIe root complex, which is optimal for some applications with heavy GPU peer-to-peer communication. Most other 8-way designs split the GPUs between the separate PCIe root complexes of the two host CPUs. The same server also supports a more traditional split PCI design. At 5RU this isn’t the densest solution, but since denser solutions typically require water cooling, the 5RU design isn’t likely to be an issue and in fact makes air cooling a lot easier than some of the denser designs.

Best Dense 8-way Standard Rack-Mount Server

Penguin wins this one by managing to fit 8 K80s into a standard-width 2RU rack-mount server. More than a quarter rack of these and you had better start shopping for water-cooled rear doors. But if you are looking for a super-dense 8-way server, you should take a look at Penguin.

Posted in Uncategorized