Technology Driven Business Models (aka unlimited storage, FLOPS, & bandwidth)

Recently, Amazon announced unlimited storage on their Cloud Drive service for about $5 a month. Microsoft and Google also offer similar (although not unlimited) personal cloud storage services. Cloud Drive is great for individuals, and all the large public clouds have plenty of commercial, industrial-strength cloud storage offerings too. Some large organizations are even banding together to create their own cloud-like storage, which can be tailored more closely to their performance and other specific requirements. James Cuff, Harvard University Assistant Dean for Research Computing, points out on Twitter how the university, as part of the Massachusetts Green High Performance Computing Center (MGHPCC), makes it easy to use similar cloud storage.

Of course, as MGHPCC executive director John Goodhue points out, making it fast and easy isn’t just about having large amounts of storage; it also requires fast network connections and the right networking protocols and software.

A similar story is taking shape in computation. Sun Microsystems launched one of the first public clouds in 2005 with a basic $1 per CPU-hour and $1 per GB-month offering, a year before Amazon officially launched AWS in 2006. Today Amazon and SoftLayer already offer GPU instances in their public clouds. While not quite free, the latest NVIDIA Titan X GPU offers a whopping 7 TFLOPS of single-precision compute capability for $999. One might ask how much Amazon would have to charge for their Prime service to offer unlimited FLOPS along with their unlimited-storage Cloud Drive service.
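As a rough back-of-the-envelope check on those figures, the sketch below works out the hardware purchase cost per peak single-precision TFLOPS for the Titan X. Only the $999 price and the 7 TFLOPS figure come from the text above. The function name is illustrative, and the calculation deliberately ignores power, hosting, and real-world utilization.

```python
def dollars_per_tflops(price_usd: float, peak_tflops: float) -> float:
    """Purchase price divided by peak throughput (hardware cost only)."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return price_usd / peak_tflops


# Figures quoted in the post: $999 for 7 single-precision TFLOPS.
titan_x_cost = dollars_per_tflops(999, 7)
print(f"Titan X: roughly ${titan_x_cost:.0f} per peak TFLOPS")
```

This is a one-time hardware cost, so it is not directly comparable to the hourly and monthly cloud prices mentioned above.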

Just as Goodhue points out that storage needs networking and software to be useful, the same is true for FLOPS. You need fast connections (between the GPU and the rest of your server and data center) and you need the right software. Since last week’s announcement of Titan X, I’ve had several customers write to me praising the performance of the card, especially when combined with NVIDIA’s cuDNN deep neural network library. And new NVIDIA interconnects like NVLink will let next-generation GPUs accelerate applications even more.

Storage, FLOPS, and network switches all have one thing in common: they require power to move data. Already, on a modern processor, the average power used by the floating point unit itself is less than the power required to move the operands from memory to the floating point unit and then move the result back again. If Amazon charged the right amount for moving data around on a processor, it wouldn’t be too hard to offer the FLOPS themselves for free. Don’t worry, Amazon doesn’t charge consumers extra to move data into and out of Cloud Drive. On a larger scale, though, Amazon and all major public clouds charge for bandwidth into and out of their data centers for most commercial services.

Technology advances have allowed both the providers and the users of public clouds to develop innovative business models. But we are still only at the start of cloud adoption. New technology like NVIDIA shared virtual GPU (vGPU) allows not just storage and computation to be moved into the cloud, but a user’s entire desktop. Anyone who has used a Chromebook has had a taste of the productivity locked up in the hundreds of millions of desktop users who have been turned into unwilling system administrators. Being able to deliver a full designer or power-user desktop, with rich workstation-class 3D graphics, to any laptop, tablet, or smartphone will drive a new wave of enterprise and public cloud adoption.


GTC15 Keynote Live Blog

11:11 Have a great GTC everyone.

11:11 Jensen wraps up. Titan X, the world’s fastest GPU, DIGITS DevBox GPU deep learning platform, Pascal – 10x Maxwell for deep learning, Drive PX deep learning platform for self-driving cars.

11:04 Jensen – I get excited every time I get a [Tesla] OTA

10:59 10-50 MPH in urban environments is the most challenging part of autonomous driving, but we know how to solve it and will be there in a few years

10:58 Elon – Tegra will be a real enabler for autonomous driving

10:55 2B cars on the road, about 100M manufactured each year, 20Y replacement cycle, we won’t have 100% self-driving or 100% electric cars for a long time

10:52 Self driving cars will be like elevators (elevators used to have human operators)

10:50 Applause welcomes Elon Musk on stage

10:49 Drive PX DevKit, $10,000, May 2015

10:47 AlexNet on Drive PX, 184 frames per second

10:34 Jensen shifts discussion to ADAS, I think his special desk must be coming on stage soon

10:31 Key Pascal Features

10:28 Coming next year, Pascal, mixed precision, 3D memory, NVLink, 10X faster than Maxwell for deep learning, *very rough estimates


10:19 DIGITS Deep GPU Training System for Data Scientists. Process data, configure DNN, monitor progress, visualize layers.

10:15 Walking through some automated image captioning results. Thanks to all those Amazon Mechanical Turks who helped caption the original training data and to Julie B. for countless hours of reviewing results to pick out the most interesting examples

10:12 Andrej Karpathy describes how ConvNets and Recurrent Nets can be joined together for automated image captioning.

10:07 Deep learning revolutionizing medical research, predicting toxicity of new drugs, understanding gene mutation to prevent disease, detecting mitosis in breast cancer cells, groundbreaking work

10:06 Very glad to see EyeEm listed as one of the start-ups doing GPU-accelerated deep learning.

10:04 Volumetric rendering being done by Nvidia IndeX

9:58 Mike Houston joins Jensen to help explain how deep neural networks, specifically AlexNet, are processed on a GPU. No rabbits were harmed in the processing (have to watch the video).

Titan X with cuDNN2

9:40 #gtc15 on Twitter. You don’t have to be here to know that the press tables are full

9:39 Titan X, $999. Clapping.

9:36 Titan X for Deep Learning, training AlexNet, 43 days on 16-core Xeon, 2.5 days on Titan X + cuDNN.

9:31 Epic Games trailer – kite, need to watch the video, words don’t capture the beauty, 100 square miles (about the size of Silicon Valley), running in realtime on a single Titan X

9:30 Titan X. World’s fastest GPU. 8B transistors, 3072 CUDA cores, 7 TFLOPS SP, 12GB memory, based on Maxwell

9:28 Rolling Titan X video

9:27 Part of our promise is access to the platform. By putting CUDA in every GPU we make it easy for every developer in the world to access

9:25 Our promise is to accelerate your code, striking a balance between ease of programming and speed.

9:24 54,000 GPU teraflops around the world today, 3 million CUDA downloads, 60,000 academic papers

9:15 4 things to talk about 1) A new GPU and deep learning, 2) A very fast box and deep learning 3) Roadmap reveal and deep learning 4) Self driving cars and deep learning

9:13 Jen-Hsun comes on stage. GTC is about developers, about sharing ideas, about being inspired

9:11 Opening video starts, “do you remember the future … it’s here”. Going to be a lot of discussion of deep learning.

9:09 Tune in to the livestream to watch.

9:08 Packed house for the #gtc15 keynote. Announcer asking everyone to take their seats


Last Minute Guide to GTC15

This year’s GPU Technology Conference officially kicks off at 16:00 PT today with early badge pickup, followed by posters and a welcome reception at 17:00. Of course, that will just be a warm-up for Tuesday’s opening keynote, which includes a special, not-to-be-missed appearance by Elon Musk, joining Jen-Hsun Huang on stage for a conversation on self-driving cars and deep learning.

If you just have time to do one thing before heading off to GTC15, download the GTC Mobile app from the Google Play or iTunes store.

The amazing keynotes are reason enough for any visual computing fan to attend, but GTC is a GPU developer conference at its core. You don’t have to be a GPU developer to attend, but most of the 500+ sessions are technical in nature, covering a diverse range of topics from astronomy to web acceleration.

On Wednesday, the Emerging Companies Summit will award $100,000 to one lucky early stage startup. Of course, when you consider the previous companies who have participated, $100,000 seems like small change. Just ask Oculus Rift (acquired by Facebook for $2 billion), Gaikai (acquired by Sony for $380 million), Natural Motion (acquired by Zynga for $527 million), or Keyhole (the progenitor of Google Earth).

This year, the first OpenPOWER Summit is being hosted as part of GTC. This seems only fitting as OpenPOWER was the first processor architecture to adopt NVIDIA’s NVLink technology and also was at the core of the DOE’s $325 million CORAL award and related FastForward 2 projects.

The NVIDIA technology won’t be limited to inside the San Jose Convention Center. The GTC15 Ride and Drive program offers GTC attendees the chance to test-drive NVIDIA-powered cars from Audi, BMW, Mini, Tesla Motors, and Volkswagen.

Finally, one of my favorite events is going to be the first ever NVIDIA Iron Chef challenge, Tuesday from 5-7 pm, where the solution architect teams from NVIDIA and VMware battle it out to see who can install a complete NVIDIA Grid vGPU with VMware Horizon and vSphere environment, starting from scratch, in under one hour. If all goes according to plan, in the second hour, the losing team will have to relinquish their servers and watch the winning team double the capacity of their GPU-accelerated desktop virtualization environment.


The Secret Weapon of NVIDIA’s Solution Architect Team

NVIDIA’s worldwide team of solution architects works with our largest customers around the world to solve some of the toughest high performance computing, deep learning, enterprise graphics virtualization, and advanced visualization problems. Often, as in the recent US Department of Energy CORAL award, the systems our customers purchase are many times larger than anything we have on the NVIDIA campus. While nowhere near the size of CORAL, one of the secret weapons of NVIDIA’s Solution Architect team is our benchmarking and customer test cluster, located behind locked doors deep within NVIDIA’s Santa Clara campus.

What it lacks in size, the system makes up for with the latest GPUs, servers, storage, and networking gear from NVIDIA and our partners. One of our recent additions, which is seeing lots of use, is a rack of Cray CS-Storm servers, fully loaded with eight K80 GPUs each. We also have a Cray XC30 system with GPUs.

We have many different types and brands of servers, not only with the latest NVIDIA GPUs but also with high-end 16-core Haswell CPUs, plenty of memory (256 GB on many servers), and the latest networking technology, including Mellanox 56Gb FDR InfiniBand and Arista low-latency 10/40GbE switches. The systems are supported by multiple types of storage, and one of our newest additions is a large-capacity Pure Storage all-flash storage array. The Pure Storage array sees dual use, supporting NVIDIA Grid vGPU instances running VMware ESX and Citrix Xen hypervisors, with a separate partition allocated for HPC applications.

Doug, our superstar system manager, is almost constantly adding new platforms to the system. Walking through the lab today I spotted a pile of Dell C4130 servers waiting to be mounted in racks and outfitted with four K80 GPUs each before being put to use by our solution architects to benchmark customer applications.

Of course, we also have GPU servers with Power8 and ARM-64 CPUs, so solution architects and customers can test applications in a cross-platform environment. Sometimes more important than the mix of servers, however, is the full complement of NVIDIA and partner software we have installed on the systems. This ranges from the latest CUDA 7 RC to powerful NVIDIA libraries like cuDNN, integrated with Caffe and ready to go for training deep learning networks. For our Grid enterprise graphics virtualization business, the system supported our recent vGPU VMware early access program. Now that VMware has officially launched support for vGPU, the system is being used for the DIRECT ACCESS TO NVIDIA GRID™ vGPU™ WITH VMware Horizon® and vSphere® program.

While HP is a bit under-represented currently on the server side, we are excited to be getting in a new HP BladeSystem shortly to work with. But on solution architects’ desks, the HP Z840 is by far the favorite. Best features: support for multiple Quadro and Tesla GPUs, super-quiet operation, and a snap-in tool-less design that makes swapping in new GPUs or other components a breeze. For walking between offices and the server room, however, the favorite solution architect laptop these days is the new 14″ HP Chromebook. Internally we run a technology preview of the next generation of the VMware Blast protocol, which delivers super-fast workstation-class graphics to the Tegra K1-powered HP Chromebook. Two monitors is pretty much the minimum on any solution architect’s desk, and some have quite a few more.

The systems all live on our cloud, and besides seeing use by NVIDIA solution architects, we also provide customers Cisco VPN-secured remote access, from anywhere in the world, to test our latest offerings. These days, many of the systems are busy preparing and testing demos for next month’s GPU Technology Conference. While the exact content of the demos is a secret I can’t share, let’s just say we are doing a lot of deep neural network training right now on many of those K80 GPUs.

It is a great resource, and we couldn’t do our job and serve our customers without it. And a big special thanks to all of our partners who contribute to the system’s success, including Arista, Cisco, Cray, Dell, HP, Pure Storage, and Supermicro.

Posted in Cloud Computing, HPC

New Architectures for Energy Efficiency in Deep Learning

According to Wikipedia, the history of deep learning, a class of machine learning, can be traced back to at least 1980. Despite its long history, until recently deep learning remained primarily a subject of academic interest, with few widespread commercial applications. Over the last year, almost out of nowhere, an explosion of commercial interest in deep learning has emerged, fueled by everyone from startups to the largest Internet companies. For instance, Startup.ML’s first post to Twitter was just over 2 months ago and already it has over 1500 followers. Facebook’s guru of deep learning, Yann LeCun, has over 4500 Twitter followers, not to mention many more following him on Facebook. As deep learning moves from academic research to large-scale commercial big data deployments, new systems-level architectures will be needed to increase energy efficiency.

Sidebar: here is a little big data challenge for you. What is the average length of time since Startup.ML’s and Yann LeCun’s Twitter followers first posted a tweet that included the words “deep learning”? If you have a Twitter analytics app for that, let me know.

Twitter analytics aside, back to deep learning and energy efficiency. Deep learning algorithms typically are trained using very large data sets, like millions or billions of images. If you have a billion images, quite likely you are using some sort of distributed file system like Hadoop to store all those images. The distributed nature of Hadoop, in itself, can contribute to energy efficiency by reducing data movement. In a traditional storage environment, be it SAN or NAS, a stored image must be moved over a network from the storage to the server before an operation can be performed on it. This data movement often accounts for a large percentage of the total energy used in the operation. With Hadoop, each storage node typically has local compute processing power as well. So if you want to resize an image, you can do so in place, on the Hadoop node, without moving it to a central server. While the original motivation for Hadoop might have been driven more by horizontal scalability, the energy efficiency is a nice side effect.

So if deep learning has been around since 1980 and Hadoop has been around since 2005, why did it take until 2015 for deep learning to take off? Back to energy efficiency. The main purpose of the processor[s] in a Hadoop server was originally just to handle file system operations, and perhaps a little MapReduce. Since neither the Hadoop file system nor MapReduce is computationally intensive, at least compared to traditional high performance computing applications, Hadoop servers typically were configured with processors from the lower end of the performance spectrum. Deep learning algorithms, by contrast, rely heavily on computationally intensive convolutions, which are often computed using FFTs. Not a good match for your average Hadoop server. So what happened next?

GPUs tend to be very good at FFTs. And with thousands of CUDA compute cores, a modern GPU can solve many FFTs in parallel. Now as luck would have it, many researchers, including no doubt some of those involved for years in deep learning, enjoyed a bit of computer gaming when they were not hard at work on their research, and discovered that the GPU was a great processor for deep learning. Of course, if one GPU is good, then shouldn’t two be better? Given the size of the Chinese Internet market, it is no great surprise that one of the first open source multi-GPU versions of the popular Caffe deep learning framework came from the Chinese computer company Inspur. Just as avoiding the data movement of a traditional central storage system helps Hadoop gain energy efficiency, running Caffe over 2 GPUs in the same server rather than over 2 servers adds to the energy efficiency.
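To make the link between convolutions and FFTs concrete, here is a minimal sketch of the convolution theorem: a circular convolution equals the inverse Fourier transform of the element-wise product of the two transforms. It is illustrative only. It uses a naive O(n²) DFT in pure Python rather than a real FFT or any GPU library, and the function names and sample values are made up for this example.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; an FFT gives the same result faster."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(total / n if inverse else total)
    return out


def circular_convolve_direct(a, b):
    """Circular convolution computed straight from its definition."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_convolve_fourier(a, b):
    """Same convolution via the convolution theorem: transform, multiply, invert."""
    product = [x * y for x, y in zip(dft(a), dft(b))]
    return [value.real for value in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]   # made-up sample data
kernel = [0.5, 0.25, 0.0, 0.0]  # made-up filter, zero-padded to the signal length

direct = circular_convolve_direct(signal, kernel)
via_fourier = circular_convolve_fourier(signal, kernel)
print(direct)
print([round(v, 6) for v in via_fourier])
```

Both approaches agree to within floating-point rounding. The Fourier route pays off for large inputs, where a true FFT costs O(n log n) and many independent transforms can be run in parallel.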

Of course, the challenge in most internet data centers today is that GPUs, if they are available, are not yet embedded in the Hadoop storage nodes. Many deep learning systems are still trained by moving petabytes of data from Hadoop storage to GPU clusters to be processed. As the use of deep learning continues to proliferate, I am sure we will see new architectures evolve focused on, among other things, minimizing data movement and maximizing energy efficiency. Putting a GPU or two into your Hadoop node is certainly one possibility.

Just this month, Baidu’s Deep Image system achieved record breaking results on the ImageNet image classification benchmark using a cluster of GPUs. While not all of the details of the Deep Image architecture are known, Baidu distinguished scientist Ren Wu describes the system as a “purpose-built supercomputer”. How many of the architectural innovations of Deep Image make it into Baidu’s production deep learning systems remains to be seen, but no doubt companies like Baidu are examining all sorts of new architectures for energy efficiency and high performance.

Within the GPU, NVIDIA continues to optimize deep learning for energy efficiency. The Maxwell GPU inside NVIDIA’s latest Tegra X1 (TX1) mobile superchip includes new FP16 instructions optimized for deep learning, allowing the superchip to process four 16-bit deep learning instructions at a time in each of its 256 CUDA cores, delivering over 1 teraflop of performance for deep learning using only about 10 watts of power. If a traditional PCI-card size NVIDIA GPU doesn’t fit in your Hadoop server, maybe a TX1 superchip will?
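The teraflop figure can be sanity-checked with simple arithmetic. The sketch below multiplies the core count and the four FP16 operations per core from the paragraph above by a clock rate. The roughly 1 GHz clock is an assumption made for illustration and is not stated in the post, and the 10 watt figure is the approximate number quoted above.

```python
CUDA_CORES = 256               # from the text
FP16_OPS_PER_CORE_CLOCK = 4    # from the text: four 16-bit operations at a time
ASSUMED_CLOCK_HZ = 1.0e9       # assumption: about 1 GHz, not given in the post
POWER_WATTS = 10               # from the text: "about 10 watts"

# Peak throughput = cores x operations per core per clock x clocks per second.
peak_flops = CUDA_CORES * FP16_OPS_PER_CORE_CLOCK * ASSUMED_CLOCK_HZ
peak_tflops = peak_flops / 1e12
gflops_per_watt = (peak_flops / 1e9) / POWER_WATTS

print(f"Peak FP16 throughput: {peak_tflops:.3f} TFLOPS")
print(f"Efficiency: about {gflops_per_watt:.0f} GFLOPS per watt")
```

Under that clock assumption the result lands just over 1 TFLOPS, consistent with the claim above. Real sustained throughput and power draw would depend on the workload.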

Not only Internet companies, but all sorts of commercial businesses that have collected large amounts of data are now starting to look at deep learning. It is going to be an exciting technology to watch over the next few years.


CUDA 7 Release Candidate Now Available

Be the first to get your hands on the official CUDA 7 Release Candidate now available for download.

NVIDIA will host a CUDA 7 Overview webinar tomorrow, January 14th, at 10 am PT to help you learn about all the new CUDA 7 features and enhancements. These include:

  • C++11 support makes it easier for C++ developers to accelerate their applications.
      • Write less code with ‘auto’ and ‘lambda’, especially when using the Thrust template library.
  • New cuSOLVER library of dense and sparse direct solvers delivers significant acceleration for Computer Vision, CFD, Computational Chemistry, and Linear Optimization applications.
      • Key LAPACK dense solvers 3-6x faster than MKL.
      • Dense solvers include Cholesky, LU, SVD, and QR.
      • Sparse direct solvers 2-14x faster than CPU-only equivalents.
      • Sparse solvers include direct solvers and eigensolvers.
  • Runtime Compilation enables highly optimized kernels to be generated at runtime.
      • Improve performance by removing conditional logic and only evaluating special cases when necessary.

    For more technical details on CUDA 7 read our ParallelForAll Blog

    New to CUDA? Learn CUDA programming with qwikLabs CUDA hands-on online training labs.

    After you give it a try, please post feedback to our Developer Forums


    How Facebook Can Help Make Your Next Car Safer

    Anyone who knows a teenage driver can’t help but worry about the inevitable near-miss (or worse) accidents caused by a distracted driver checking their Facebook page. However, your next car just might be a whole lot safer because of Deep Neural Network (DNN) technology being developed at Facebook and scores of other companies up and down Silicon Valley. You need look no farther than Yann LeCun’s Facebook page, but not while driving please, to see what Facebook’s Director of Artificial Intelligence is up to with DNNs. Besides his job at Facebook, Yann is also a professor in NYU’s Computer Science department, where he helped pioneer many current advances in the field. But there is more to the Facebook-NYU connection than your typical Silicon Valley university relationship. It stems from the core of the innovation driving DNNs as one of today’s leading machine learning approaches: the massive amounts of big data used to train DNNs.

    DNN algorithms are not particularly new. What is relatively new is combining DNNs with the massive amounts of unstructured big data (voice, images, and video) stored by today’s top social networking and search sites, and with the unparalleled performance of GPUs, which can crunch all of that data through DNNs in a cost- and power-efficient manner. One of the key mathematical algorithms used in DNNs is the Fast Fourier Transform, or FFT. GPUs are particularly well suited to processing FFTs. For DNNs, Facebook recently made this even more true when LeCun and his collaborators released the new fbFFT library.

    If the fbFFT paper is a bit too technical for you, Jeremy Howard’s recent TED Talk on machine learning helps explain the technology in simpler terms through lots of examples. As the founder of Silicon Valley startup Enlitic, Jeremy knows a thing or two about machine learning.

    Now Facebook may or may not have any interest in self-driving cars, but the DNN technology that can automatically identify your friend’s picture on a Facebook page is much the same as the technology that can already help an automobile identify pedestrians in a crosswalk or slowing traffic ahead. This week at the Consumer Electronics Show (CES), NVIDIA introduced a host of new technologies, from the new Tegra X1 mobile super chip, capable of over 1 teraflop of DNN processing, to the new NVIDIA Drive PX auto-pilot car computer, which will make it easier than ever for automotive manufacturers to integrate advanced DNN technology into future vehicles.

    While you can’t yet buy a car with the Drive PX auto-pilot computer, developers today can start writing software for it on any NVIDIA GPU platform, from the $192 Jetson TK1 developer kit to the GeForce GTX 980, the world’s most advanced GPU utilizing the same Maxwell technology used in the upcoming Tegra X1.

    But for now, Facebook in cars should remain for passenger use only. For more info on the Tegra X1, Drive PX, and other new NVIDIA technologies, watch our CES press conference.

