Original URL: http://www.theregister.co.uk/2010/10/01/gpu_sharing/

GPUs: Sharing good, islands bad

New words needed!

By Dan Olds, Gabriel Consulting

Posted in HPC, 1st October 2010 10:44 GMT

GPU Video Blog: I’ve talked to several folks at the 2010 GPU Tech Conference about the burgeoning need to dynamically share GPUs across multiple systems without having to re-cable boxes or bog down the system by moving processing from one box to another. Dell put forward a solution with its C410x PCIe extension box that allows eight systems to be connected to 16 PCIe devices – including GPUs.

While this is a good thing and a solid first step, it doesn’t quite get us to the point where these devices can be used with the flexibility of, say, a printer or other network attached device. Having this capability is important because it opens up GPUs to a much wider set of users in both HPC and enterprise data centers. It makes them good cloud citizens, too.

On the last day of the show, I visited NextIO and found Kyle Geisler, who gave me an overview of how they’re separating the traditional server, with its CPUs and memory, from I/O. What they’ve done is build a box that hosts up to 16 PCIe devices, like GPUs, with connectivity to 24 individual servers. The devices don’t have to be GPUs; they could be InfiniBand or Ethernet adapters or any other PCIe-based I/O gadget.

Each device can be devoted to any one of the 24 attached servers, or to none of them, and can be attached and detached logically. NextIO has implemented hot-plug PCIe on their system so that dynamic logical attach and detach isn’t a problem and doesn’t require reboots, or anything more than some clicks on the GUI management screen. But, as Kyle explains in the video, most customers are using APIs provided by NextIO to accomplish dynamic switching from within their programs or their own management stack.
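For the code-minded, here is a toy Python sketch of that attach/detach model. To be clear, this is not NextIO's API, which I haven't reproduced here; the `DevicePool` class and its method names are hypothetical, invented purely for illustration. The only real numbers are the 16 device slots and 24 servers mentioned above, and the one rule it enforces is that a device belongs to at most one server at a time.

```python
class DevicePool:
    """Hypothetical model of a shared PCIe box (not NextIO's actual API).

    Each device is attached to at most one server, or to none.
    """

    def __init__(self, num_devices=16, num_servers=24):
        self.num_servers = num_servers
        # None means the device is currently detached from every server.
        self.owner = {dev: None for dev in range(num_devices)}

    def attach(self, dev, server):
        """Logically attach a free device to one server."""
        if dev not in self.owner:
            raise KeyError(f"no such device: {dev}")
        if not 0 <= server < self.num_servers:
            raise ValueError(f"no such server: {server}")
        if self.owner[dev] is not None:
            raise RuntimeError(
                f"device {dev} is already attached to server {self.owner[dev]}"
            )
        self.owner[dev] = server

    def detach(self, dev):
        """Logically detach a device so another server can claim it."""
        if dev not in self.owner:
            raise KeyError(f"no such device: {dev}")
        self.owner[dev] = None

    def devices_of(self, server):
        """Return the sorted list of devices currently attached to a server."""
        return sorted(d for d, s in self.owner.items() if s == server)


if __name__ == "__main__":
    pool = DevicePool()
    pool.attach(0, server=3)   # give server 3 two GPUs
    pool.attach(1, server=3)
    pool.detach(0)             # hand one of them over to server 7
    pool.attach(0, server=7)
    print(pool.devices_of(3), pool.devices_of(7))
```

A management stack built along these lines would call something like `detach` and `attach` when a job finishes on one server and another job is waiting elsewhere, which is the kind of dynamic switching Kyle describes customers doing from their own code.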

The most recent news from NextIO is their introduction of a more bite-sized GPU solution with their vCORE Express 2070 product. It’s a 1U chassis that holds up to four Tesla M2050 or M2070 GPUs. It’s a GPU starter solution for customers who have modest needs or are looking for an easy (and less expensive) first step into GPU computing.

The NextIO product line (and other products that are on the way) is a big step toward virtualized GPUs, but we’re not quite there yet. The GPUs can be devoted to servers and hot-switched between systems, but they can’t be shared in the same way as a typical general-purpose processor. I think that we’ll see more advances on this front sooner rather than later, particularly as GPUs find their way deeper into the enterprise space.

As I’ve often said in these blogs, the biggest single trend I see happening in business is a move toward much greater use of analytics, particularly predictive analytics. My most complete rant on this topic is here, for anyone wanting to see my rationale.

If I’m right about the trend, then there’s a whole new world of pain coming at data centers as business units demand more data and the computing power to crunch it in wildly varying ways – oh, and fast enough gear to provide the results in near-real time. I think that GPUs will be a key component in enterprise analytics infrastructures. They’re very fast on this type of work, and the ecosystem has come a long way in just a few years. We’ll soon be at the point where there is vendor support for most of the analytic routines that a business would want to run.

I think that enterprises are going to adopt GPUs in piecemeal fashion. I don’t see most companies buying huge analytic infrastructures in a single swipe; rather they will add this capability over time on a project-by-project basis. The ability to make GPUs a shared resource will make justifying them – and the additional investment in time and code to utilize them – an easier decision to push through the organization.

In this final video from the GPU Tech Conference, it’s also noteworthy to see how my term “GPU-riffic” has come into common usage. It took only 10 minutes of cajoling to talk Kyle and his PR representative into actually saying it live for use at the tail end of the video.

While I can put together a good case for my former boss and myself being the first to use the term “server sprawl,” I don’t have much in the way of proof other than some old slides from 1996. With “GPU-riffic,” it’s different: I have evidence that I’m the first guy to say it. It’s annoying and it’s moronic and it’s mine, damn it! ®