Being on the steering committee for the ISC Big Data’13 conference coming up in September, I’m spending more than my typical amount of time thinking about “big data” these days. I’ve even added InsideBigData.com to my daily reading. Coming at the problem from the HPC side of the house, however, I’m not too worried about your run-of-the-mill Hadoop cluster running batch clickstream analysis. With giant companies like Intel doing their own Hadoop distribution, as well as smaller Hadoop specialty players like Cloudera, MapR, and Hortonworks, there are plenty of choices out there for Hadoop software. Companies like HP will even sell you turnkey Hadoop appliances like the HP AppSystem for Hadoop. The interesting (and difficult) problems these days tend to focus more on realtime analytics. As my friend Deepak, ex-Sun HPC turned big data VC guru, pointed out in this podcast today, that is also the interesting space for startups.
Google does a pretty good job at making their analytics more realtime. Type “Hadoop” into the address bar of your Chrome browser and Google not only serves up some static choices, probably including hadoop.apache.org, but also some near-realtime choices based on web sites you have visited in the last few minutes. About an hour ago I was exchanging some notes with another ex-Sun co-worker, Tim Marsland, now engineering VP for machine learning company Skytree, and I visited several pages on their web site, one of which lists Hadoop as a data source for the Skytree Server. Experimenting for this blog post, I typed “hadoop” into the address bar of my Chrome browser and sure enough a skytree.net page showed up as one of the top five suggestions. Hint to the Skytree folks: having UC Berkeley professor David Patterson on your home page and technical advisory board is great marketing. Paying a few extra dollars to also get the skytree.com domain in addition to skytree.net would be a good next step. Next time I talk to Tim I have to ask him about that. There has got to be a back story.
Anyhow, Google does a great job of near-realtime analytics. I really don’t need up-to-the-second address suggestions based on my browser history; up to the last hour, or whenever Google had time to reprocess my history, is good enough. This is a great example of asynchronous processing vs synchronous processing. It is the same as when you upload a new photo to your favorite social networking web site. If your friend in Japan doesn’t see the photo for a few minutes when viewing the local mirror site, usually no one is the wiser for it. On the other hand, when I go to make a withdrawal from the ATM, my bank really wants the transaction to be synchronous and based on the current state of my bank balance, not my bank balance 1 hour or 1 minute ago. That holds even if my wife is traveling in Japan and just happens to have made a withdrawal a few seconds earlier. This is of course much easier to do with a relatively simple transactional database like a bank account balance than with the types of “big data” that Google uses to make my search recommendations.
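To make the synchronous vs asynchronous distinction a bit more concrete, here is a minimal Python sketch. The `Account` and `PhotoSite` classes, their methods, and the `beach.jpg` file name are all made up for illustration; this is not how any real bank or social network implements things, just the shape of the two approaches under the assumption of a single process and an in-memory "mirror".

```python
import threading
from collections import deque


class Account:
    """Synchronous path: every withdrawal checks the current balance under a lock."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # The check and the update happen together, so two ATMs can never
        # both spend the same money based on a stale balance.
        with self._lock:
            if amount > self.balance:
                return False
            self.balance -= amount
            return True


class PhotoSite:
    """Asynchronous path: uploads land on the primary, the mirror catches up later."""

    def __init__(self):
        self.primary = []
        self.mirror = []
        self._pending = deque()

    def upload(self, photo):
        self.primary.append(photo)
        self._pending.append(photo)  # replication is deferred, not done here

    def replicate(self):
        # Stand-in for a background job that runs "whenever there is time".
        while self._pending:
            self.mirror.append(self._pending.popleft())


account = Account(100)
first = account.withdraw(80)   # succeeds against the current balance
second = account.withdraw(80)  # refused: only 20 is left

site = PhotoSite()
site.upload("beach.jpg")
stale_view = list(site.mirror)  # the friend in Japan sees nothing yet
site.replicate()
fresh_view = list(site.mirror)  # a little later, the photo shows up
```

The bank pays for correctness with a lock on every transaction, while the photo site accepts a briefly stale mirror in exchange for a fast upload path.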
But as more of the world’s data is created as unstructured data, more of the interesting problems push us towards realtime analytics of data stored not in SQL databases but in Hadoop file systems and other big data repositories. If I’m monitoring my company network to detect intrusions, I don’t want to just check against a pre-defined list of potential hacks. I really would like to look for dynamic patterns that may indicate an intrusion that I have never seen before, perhaps correlating observations with other external data sources. To use a simpler analogy, you want to identify the bank robber’s face as they walk in the door and automatically lock the safe, versus scanning the video an hour after the robbery and comparing it against images of known bank robbers. Of course, these days some of the worst robberies do not involve physically walking in the door of the bank; the robbers are electronically “walking” into the bank’s networks, servers, and databases. In the cloud, it is a lot easier to move money. Just ask Chris Reynolds and PayPal. Luckily the PayPal incident was accidental, non-malicious, and not at all initiated by Chris.
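As a rough illustration of the difference between checking a pre-defined list and looking for dynamic patterns, here is a small Python sketch. Everything in it is an assumption made for the example: the `KNOWN_BAD` blocklist, the `RollingAnomalyDetector` class, the window size and threshold, and the traffic numbers. A rolling z-score is just one very simple stand-in for the kind of streaming analysis a real intrusion detection system would do.

```python
import math
from collections import deque

# Hypothetical static blocklist (an address from the documentation-only range).
KNOWN_BAD = {"203.0.113.7"}


def on_blocklist(source_ip):
    """Static check: only catches sources we already know about."""
    return source_ip in KNOWN_BAD


class RollingAnomalyDetector:
    """Dynamic check: flag values far from the recent rolling average."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if value looks anomalous versus the window, then record it."""
        anomalous = False
        if len(self.window) >= self.min_samples:
            mean = sum(self.window) / len(self.window)
            variance = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(variance)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.window.append(value)
        return anomalous


# Made-up requests-per-second samples with one obvious spike.
traffic = [10, 12, 11, 9, 10, 11, 12, 10, 95, 11]
detector = RollingAnomalyDetector()
flags = [detector.observe(v) for v in traffic]
```

The blocklist never fires on a source it has not seen before, while the detector flags the spike as it arrives, without anyone having described that attack in advance.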
So it is only fitting that ISC Big Data’13 comes immediately after the ISC Cloud’13 conference which focuses on the use of Clouds for HPC and Manufacturing. Maybe too, Deepak was just a bit ahead of his time in moving from the HPC world into Big Data. I hope to see some of you in Germany this September for one or both of the ISC conferences.