Nvidia stretches Tesla GPU coprocessors from HPC to big data
'Anything a CPU can do, a GPU can do better'
GTC 2013 Graphics chip maker Nvidia has barely begun to put a dent in the traditional high performance computing segment with its Tesla GPU coprocessors and it is already gearing up to take on new markets. The next target is big data, and as with parallel supercomputing, Nvidia is hoping to get the jump on rivals Intel and AMD, which peddle their respective x86 and GPU coprocessors.
As it turns out, a GPU is not just good at doing floating point math in single precision or double precision; it can also be used to sift through streams of data to index and sort them. It takes a bit of work to repurpose these jobs, which have been run on CPUs for the most part, for a GPU, but Nvidia is seeing more and more companies using its Tesla GPUs to augment the indexing, sorting, and other chewing of large data sets.
This is one of the themes that Jen-Hsun Huang, co-founder and CEO at Nvidia, will cover during his opening keynote at the GPU Technology Conference in San Jose on Tuesday. Sumit Gupta, general manager of the Tesla Accelerated Computing business unit at Nvidia, gave El Reg some examples of the kind of adjacent big data jobs where GPUs are being deployed.
The first example comes from SaaSy CRM software vendor Salesforce.com. As it turns out, Salesforce.com is one of the six companies in the world that have full access to the Twitter firehose. The number of tweets per day was minuscule back in 2007 through 2009, but in 2010 it jumped to about 50 million tweets per day, busted through 200 million per day in 2011, and broke through 500 million last year.
If you want to chew on all that data to do sentiment analysis for CRM customers, as Salesforce.com most certainly wants to do, then you would need at least ten times the iron you needed three years ago to get the job done.
Just as sticking with regular server processors is not practical in terms of cost, bang for the buck, or performance per watt for a lot of traditional HPC shops, Salesforce.com ran into the same issues with the big data munchers it had for storing and indexing the raw Twitter feed.
For one thing, so-called real-time text searches on its CPU clusters holding the Twitter data were taking up to ten minutes to complete, which is better than running in batch mode, but cannot be considered real time.
So Salesforce.com engineers figured out how to index the raw Twitter feed using GPUs to accelerate the CPUs, and they also use GPU offload to match a search term against that index to create baby Twitter feeds that can in turn be pumped over to customers and meshed with their CRM apps. With the GPU assist, Salesforce.com has been able to reduce its Twitter text search down to one second or less, and that is basically real-time.
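Salesforce.com has not said how its index is built, so the following is only a toy sketch of the general idea of indexing a feed and then matching a search term against that index to carve out a smaller feed. It runs on a plain CPU in Python rather than on a GPU, and the names `tokenize`, `build_index`, and `search`, along with the sample tweets, are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase words, hashtags, and @mentions.
TOKEN = re.compile(r"[a-z0-9#@']+")


def tokenize(text):
    """Split a tweet into lowercase tokens."""
    return TOKEN.findall(text.lower())


def build_index(tweets):
    """Build an inverted index: token -> set of tweet ids containing it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for token in tokenize(text):
            index[token].add(tweet_id)
    return index


def search(index, query):
    """Return the ids of tweets that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look up each term's posting set, then intersect them.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up sample data standing in for the firehose.
tweets = {
    1: "Loving the new phone from Acme",
    2: "Acme support was slow today",
    3: "Great weather in San Jose",
}
index = build_index(tweets)

# The "baby feed" for a customer interested in one brand.
baby_feed = [tweets[i] for i in sorted(search(index, "acme"))]
print(baby_feed)
```

The point of the index is that a search touches only the posting sets for the query terms instead of rescanning every tweet. Building and intersecting those sets is the kind of data-parallel work that could in principle be handed to a coprocessor.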
Salesforce.com is not, however, talking about the infrastructure that makes up its Twitter search engine. Marc Benioff would have to kill you if you found out. The odds favor some kind of NoSQL data store running on an x86 cluster, of course.
The Shazam music identification service has also shifted to a ceepie-geepie architecture to sort through the digital fingerprints that it creates to identify songs from any snippet anywhere in the song. The move to GPUs was precipitated by a trebling of searches and records in the past year.
In 2011, the company supported 100 million user inquiries to identify a song snippet, chewing through 10 million records. Last year, the service added hundreds of the then-shipping "Fermi" Tesla GPU coprocessors to its cluster and was able to process 300 million inquiries and index and sort over 27 million records.
Gupta says that by offloading a lot of the work from CPUs to GPU coprocessors Shazam can grow its service to cover more music and do searches faster, too, all while keeping the server footprint down. No word on when Shazam will move to the "Kepler" family of Tesla GPU coprocessors, but like other vendors, Shazam has a qualification cycle for adopting new technologies and is working through that process now.
Over at Cortexica Vision Systems, the visual recognition that the company is bringing to the shopping experience was not even possible without the kind of cheap computing you can get from a coprocessor based on a parallel architecture.
The system that Cortexica has cooked up allows shoppers to take a picture of an item they want to comparison shop. Each photo is uploaded into the Cortexica parallel ceepie-geepie, with over 1,000 different points of identification taken from that photo.
Cortexica has built a database of over 1 million apparel items, and can find matches or near matches for the items in a matter of seconds based just on the photos. Cortexica is not a retailer itself, but rather offers its database and search algorithms as a service that shopping sites can embed in their services.
This photo matching technology has obvious applicability in other kinds of applications, some of them potentially of a dubious or nefarious nature. That has always been true of every technology humans have created.
Another use for GPU coprocessors that Huang will be talking about is the reformatting of live video feeds using ceepie-geepie appliances from Elemental.
This was the company that supplied the live encoding of video streams from the Summer Olympics in London last year, and is also used by the Weather Channel to do the same job for its video streams. The latest Elemental boxes have Nvidia's Tesla K10 GPU coprocessors paired one-for-one to an x86 processor in a two-socket rack server.
The Weather Channel serves up video to 38 million viewers on mobile devices each month, and during Superstorm Sandy last October the site handled 12 million concurrent live video streams over the Internet. Each one of those streams was re-encoded on the fly from the live feed from Weather Channel studios to meet the screen sizes and resolutions of various PCs, smartphones, and tablets. ®
Not sure why..
But I found the last example, the Weather Channel, the most impressive. Realtime encoding to every device sounds pretty impressive. Yet are they perhaps making it more difficult than it needs to be? It would seem that it would be more efficient to employ some CDN secret sauce. Make the CDN request a particular encoding if it was not already available locally so that you only have to encode once? Even if you were livestreaming it would seem like you could eliminate the redundant encoding jobs. I'm no expert mind you and I've no idea about the costs involved so maybe there is a financial reason. I'm just looking at it from an efficiency standpoint.
Still cool to hear what can be done.
Mostly just hype
Nvidia has been on a marketing campaign lately looking for new revenue streams from servers. When you investigate the details you find their server designs have a lot of bottlenecks and aren't anywhere near as impressive as they'd like people to believe.
Re: Not sure why..
Exactly what I was thinking. Impressive but profoundly stupid. Still if you have a hammer...