According to Wikipedia, the history of deep learning, a class of machine learning, can be traced back to at least 1980. Despite its long history, until recently deep learning remained primarily a subject of academic interest, with few widespread commercial applications. Over the last year, almost out of nowhere, an explosion of commercial interest in deep learning has evolved, fueled by everyone from startups to the largest Internet companies. For instance, Startup.ML's first post to Twitter was just over two months ago, and already it has over 1,500 followers. Facebook's guru of deep learning, Yann LeCun, has over 4,500 Twitter followers, not to mention many more following him on Facebook. As deep learning moves from academic research to large-scale commercial big data deployments, new systems-level architectures will be required to increase energy efficiency.
Sidebar: here is a little big data challenge for you. How long ago, on average, did Startup.ML's and Yann LeCun's Twitter followers first post a tweet that included the words "deep learning"? If you have a Twitter analytics app for that, let me know.
Twitter analytics aside, back to deep learning and energy efficiency. Deep learning algorithms are typically trained using very large data sets, on the order of millions or billions of images. If you have a billion images, quite likely you are using some sort of distributed file system like Hadoop to store them all. The distributed nature of Hadoop can, in itself, contribute to energy efficiency by reducing data movement. In a traditional storage environment, be it SAN or NAS, in order to perform an operation on a stored image, the image must be moved over a network from the storage to the server. This data movement often accounts for a large percentage of the total energy used in the operation. With Hadoop, each storage node typically has local compute processing power as well. So if you want to resize an image, you can do so in place, on the Hadoop node, without moving it to a central server. While the original motivation for Hadoop may have been driven more by horizontal scalability, the energy efficiency is a nice side effect.
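To make the data-movement argument concrete, here is a back-of-envelope sketch. The image counts and sizes are my own illustrative assumptions, not figures from any real deployment:

```python
# Back-of-envelope sketch (hypothetical numbers): bytes that cross the
# network when resizing a billion images on a central server (SAN/NAS
# style) versus in place on the Hadoop storage nodes that hold them.

NUM_IMAGES = 1_000_000_000
AVG_IMAGE_BYTES = 200 * 1024      # assume ~200 KB per stored image
RESIZED_IMAGE_BYTES = 20 * 1024   # assume ~20 KB after resizing

def bytes_moved_central(n, full, resized):
    """Ship every image to a compute server, ship the result back."""
    return n * (full + resized)

def bytes_moved_in_place(n, full, resized):
    """Run the resize on the node that already holds the data."""
    return 0  # the image payload never crosses the network

central = bytes_moved_central(NUM_IMAGES, AVG_IMAGE_BYTES, RESIZED_IMAGE_BYTES)
local = bytes_moved_in_place(NUM_IMAGES, AVG_IMAGE_BYTES, RESIZED_IMAGE_BYTES)
print(f"central:  {central / 1e15:.2f} PB over the network")
print(f"in place: {local} bytes over the network")
```

Even with these modest per-image sizes, the centralized approach moves hundreds of terabytes across the network; moving the computation to the data eliminates that traffic entirely.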
So, since deep learning has been around since 1980 and Hadoop since 2005, why did it take until 2015 for deep learning to take off? Back to energy efficiency. The main purpose of the processor(s) in a Hadoop server was originally just to handle file system operations, and perhaps a little MapReduce. Since neither the Hadoop file system nor MapReduce is computationally intensive, at least compared to traditional high performance computing applications, Hadoop servers were typically configured with processors from the lower end of the performance spectrum. Deep learning algorithms, on the other hand, rely heavily on convolutions, which are often computed using FFTs. Not a good match for your average Hadoop server. So what happened next?
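Why do FFTs matter here? By the convolution theorem, a convolution can be computed as an elementwise product in the frequency domain, turning an O(n²) operation into O(n log n). A minimal numpy sketch of the equivalence:

```python
import numpy as np

# Sketch of the convolution theorem: a circular convolution computed
# directly matches one computed via FFT -> elementwise product -> inverse
# FFT, but the FFT route is O(n log n) instead of O(n^2).

n = 256
rng = np.random.default_rng(0)
signal = rng.standard_normal(n)
kernel = rng.standard_normal(n)

# Direct circular convolution, O(n^2)
direct = np.array([
    sum(signal[k] * kernel[(i - k) % n] for k in range(n))
    for i in range(n)
])

# FFT-based convolution, O(n log n)
fft_conv = np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel)))

print(np.allclose(direct, fft_conv))
```

This is exactly the kind of dense floating-point arithmetic that a low-end Hadoop file-serving CPU is poorly matched to.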
GPUs tend to be very good at FFTs. And with thousands of CUDA compute cores, a modern GPU can solve many FFTs in parallel. As luck would have it, many researchers, no doubt including some of those involved for years in deep learning, enjoyed a bit of computer gaming when they were not hard at work on their research, and discovered that the GPU was a great processor for deep learning. Of course, if one GPU is good, shouldn't two be better? Given the size of the Chinese Internet market, it is no great surprise that one of the first open source multi-GPU versions of the popular Caffe deep learning framework came from the Chinese computer company Inspur. Just as avoiding the data movement of a traditional central storage system helps Hadoop gain energy efficiency, running Caffe across two GPUs in the same server rather than across two servers adds to the energy efficiency.
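The core idea behind multi-GPU training is data parallelism: each GPU computes gradients on its share of the batch, and the gradients are then averaged. When both GPUs sit in one server, that exchange stays on the local bus instead of crossing a network. A toy sketch, with numpy arrays standing in for the two devices (the linear model and batch sizes are illustrative assumptions, not Caffe's actual internals):

```python
import numpy as np

def gradient(weights, x, y):
    """Gradient of mean squared error for a linear model y ~ x @ weights."""
    residual = x @ weights - y
    return 2 * x.T @ residual / len(y)

rng = np.random.default_rng(0)
w = np.zeros(3)
x = rng.standard_normal((8, 3))
y = rng.standard_normal(8)

# Data parallelism: split the batch across two "devices", compute local
# gradients, then average them (the all-reduce step).
g0 = gradient(w, x[:4], y[:4])
g1 = gradient(w, x[4:], y[4:])
avg = (g0 + g1) / 2

# With equal splits, the averaged gradient equals the full-batch gradient,
# so the parallel step trains the same model the serial step would.
print(np.allclose(avg, gradient(w, x, y)))
```

The only data that must move between devices each step is the gradient vector, which is typically far smaller than the training batch itself.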
Of course, the challenge in most internet data centers today is that GPUs, if they are available, are not yet embedded in the Hadoop storage nodes. Many deep learning systems are still trained by moving petabytes of data from Hadoop storage to GPU clusters to be processed. As the use of deep learning continues to proliferate, I am sure we will see new architectures evolve focused on, among other things, minimizing data movement and maximizing energy efficiency. Putting a GPU or two into your Hadoop node is certainly one possibility.
Just this month, Baidu’s Deep Image system achieved record breaking results on the ImageNet image classification benchmark using a cluster of GPUs. While not all of the details of the Deep Image architecture are known, Baidu distinguished scientist Ren Wu describes the system as a “purpose-built supercomputer”. How many of the architectural innovations of Deep Image make it into Baidu’s production deep learning systems remains to be seen, but no doubt companies like Baidu are examining all sorts of new architectures for energy efficiency and high performance.
Within the GPU, NVIDIA continues to optimize deep learning for energy efficiency. The Maxwell GPU inside NVIDIA's latest TX1 mobile superchip includes new FP16 instructions optimized for deep learning, allowing the superchip to process four 16-bit deep learning operations at a time in each of its 256 CUDA cores, delivering over 1 teraflop of performance for deep learning while using only about 10 watts of power. If a traditional PCI-card-size NVIDIA GPU doesn't fit in your Hadoop server, maybe a TX1 superchip will?
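Two quick sketches around those TX1 numbers. The first shows the arithmetic behind the ~1 teraflop figure (the ~1 GHz clock is my assumption for illustration, not a quoted spec); the second uses numpy's `float16` to show the storage-versus-precision trade that FP16 makes:

```python
import numpy as np

# 1) Arithmetic behind the ~1 TFLOPS FP16 figure: 256 CUDA cores, four
#    FP16 operations per core per clock, at an assumed ~1 GHz clock.
cores = 256
fp16_ops_per_core_per_clock = 4
clock_hz = 1.0e9  # assumed clock rate, for illustration only
flops = cores * fp16_ops_per_core_per_clock * clock_hz
print(f"{flops / 1e12:.2f} TFLOPS FP16")

# 2) FP16 halves storage and memory traffic relative to FP32, at the cost
#    of rounding error that inference workloads usually tolerate.
w32 = np.array([0.1234567, 1e-5, 3.14159], dtype=np.float32)
w16 = w32.astype(np.float16)
print(w16.nbytes, "bytes vs", w32.nbytes, "bytes")   # half the storage
print(np.abs(w32 - w16.astype(np.float32)))          # small rounding error
```

Halving the bits per operand is itself an energy win: less data moved per weight means fewer joules spent on memory traffic, the same theme as the Hadoop data-locality argument above.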
Not only Internet companies, but all sorts of commercial businesses that have collected large amounts of big data are now starting to look at deep learning. It is going to be an exciting technology to watch over the coming few years.