Cycle Computing fires up 10,000-core HPC cloud on EC2
Only $1,060 per hour, management included
Setting up a 10,000-core physical server cluster to run supercomputing workloads is a tough task that can take weeks or months, and cost millions of dollars including the servers, storage, switching, and personnel overhead. And the thing about a cluster is that no matter how hard you try to share it, it is very tough to get anywhere near peak utilization over the course of the year.
Enter the cloud, which HPC customers have been skeptical about (remember the Sun Cloud?) but may start taking a shine to given the compelling economics and the management and security services that companies are layering on top of public clouds.
Cycle Computing has made a tidy little business for itself firing up HPC grids on various public clouds using its homegrown cloud management and security tools. Last month, Genentech, arguably the pioneer of the modern biotech industry and now a part of the Roche big pharma conglomerate, came to Cycle Computing to have it fire up 80,000 hours of computing time on a 10,000 virtual core HPC cluster on Amazon's EC2 compute cloud to run one of its protein analysis jobs.
Genentech didn't want to use the Amazon EC2 Virtual Private Cloud launched in July 2009, which takes virtualized servers and corrals them into their own virtual private network, or the specialized dedicated HPC VPC instances with 10 Gigabit Ethernet networking, which debuted last August. The VPC options are pricier than raw EC2 capacity, and Genentech wanted to have Cycle Computing deal with the setup, monitoring, and breakdown of the virtual cluster.
This is something that Cycle Computing now does for a living. The HPC startup was founded in 2005 by Jason Stowe, the company's chief executive officer, to provide services to HPC shops deploying the open source "Condor" grid management system developed at the University of Wisconsin. (Condor is now the key grid software used by Red Hat in its Enterprise Linux 6 distribution.) Stowe previously worked for Walt Disney Studios, where he helped manage movie production, so he was well aware of the need for computing capacity in film production before setting up his company.
Cycle Computing has created two tools to manage virtual HPC clusters. The first is CycleServer, a management and encryption layer that rides atop Condor, which Cycle Computing delivered in 2007 to help simplify its own life. The second is CycleCloud, a domain layer that rides atop the EC2 compute cloud at Amazon as well as the Rackspace Cloud from Rackspace Hosting and any public cloud that runs VMware's vSphere and vCloud combo. CycleServer monitors the jobs and tells Condor how to move workloads around the grid of virtual servers, while CycleCloud provisions the images onto EC2 and other public clouds.
Stowe tells El Reg that during December last year, Cycle Computing set up increasingly large clusters on behalf of customers to start testing the limits. First, it did a 2,000-core cluster in early December, and then a 4,096-core cluster in late December. The 10,000-core cluster that Cycle Computing set up and ran for eight hours on behalf of Genentech would have ranked at 114 on the Top 500 computing list from last November (the most current ranking), so it was not exactly a toy even if the cluster was ephemeral.
The configuration, which Cycle Computing refers to as "Tanuki" in case you want to run a job on the same setup, was set up on February 3 to run on top of 1,250 physical servers inside Amazon's North American data center. The cluster was configured with a 2PB file system. Cycle Computing has set up virtualized file systems for HPC jobs on EC2 that have spanned up to 14PB so far, according to Stowe. It is not clear how far it can push capacity; moving more data than this can take days or weeks, depending on the network connection between a company and Amazon's data center.
The Tanuki configuration includes 1,250 of Amazon's extra large instances, each of which has eight virtual cores and 7GB of main memory allocated to it. The Condor collector and negotiator as well as CycleServer ran on a four-core extra large instance with 17.1GB of memory, and the primary scheduler and disk filer ran on another extra large instance with eight cores and 7GB of memory. The auxiliary schedulers for the cluster were on two large two-core instances with 7.5GB of memory apiece. So the whole shebang had 1,254 servers, 10,014 virtual cores, 8.6TB of aggregate memory, and 2PB of disk. The virtual servers were configured with CentOS 5.4, the freebie clone of Red Hat Enterprise Linux.
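For anyone who wants to check the sums, here is a minimal Python sketch that tallies the configuration using only the per-instance figures quoted above. The role labels and the `tally` helper are illustrative names, not anything from Cycle Computing's tools, and memory is converted to terabytes at 1,024GB per TB, which is an assumption. Taken at face value, the quoted figures add up to 10,016 virtual cores, two more than the 10,014 stated, so one of the quoted per-instance core counts is presumably slightly off.

```python
# Tally of the Tanuki configuration from the figures quoted in the article.
# Each row: (role label, instance count, virtual cores each, memory in GB each).
# Role labels are illustrative only.
INSTANCES = [
    ("compute nodes", 1250, 8, 7.0),
    ("Condor collector/negotiator + CycleServer", 1, 4, 17.1),
    ("primary scheduler + disk filer", 1, 8, 7.0),
    ("auxiliary schedulers", 2, 2, 7.5),
]


def tally(rows):
    """Return (servers, virtual cores, aggregate memory in GB) for the rows."""
    servers = sum(count for _, count, _, _ in rows)
    cores = sum(count * c for _, count, c, _ in rows)
    memory_gb = sum(count * m for _, count, _, m in rows)
    return servers, cores, memory_gb


servers, cores, memory_gb = tally(INSTANCES)
memory_tb = memory_gb / 1024  # assumption: binary terabytes

print(f"servers: {servers}")
print(f"virtual cores: {cores}")
print(f"aggregate memory: {memory_tb:.2f} TB")
```

The server count and the aggregate memory (about 8.6TB) match the article's totals; only the core count differs.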
Using the CycleCloud and CycleServer tools, it took 15 minutes to fire up the first 2,000 cores in the virtual cluster, and within 45 minutes all of the 10,000 virtual cores on those 1,250 servers were up and running. Stowe used the open source Chef configuration management tool to tweak the settings on each virtual server instance on EC2. Then Genentech loaded up its code and ran the job for eight hours at a total cost of $8,480, including EC2 compute and S3 storage capacity charges from Amazon and the fee for using the Cycle Computing tools as a service. That works out to $1,060 per hour for a 10,000-core cluster. Cycle Computing charges a 20 per cent premium over the raw EC2 capacity as its management fee.
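The arithmetic behind those figures can be sketched in a few lines of Python. The function names are illustrative. The article does not break the $8,480 down into EC2, S3, and management components, so the sketch only derives an upper bound on the fee, assuming the 20 per cent premium applies to EC2 compute alone and S3 is passed through at cost.

```python
# Figures quoted in the article.
TOTAL_COST_USD = 8_480   # EC2 + S3 + Cycle Computing fee for the whole run
RUN_HOURS = 8
VIRTUAL_CORES = 10_000   # nominal cluster size
PREMIUM = 0.20           # management fee as a fraction of raw EC2 cost


def hourly_rate(total_usd, hours):
    """Cost of the whole cluster per wall-clock hour."""
    return total_usd / hours


def core_hours(cores, hours):
    """Total core-hours consumed by the run."""
    return cores * hours


def max_management_fee(total_usd, premium):
    """Upper bound on the fee.

    Assumption: total = ec2 * (1 + premium) + s3, with s3 >= 0,
    so ec2 <= total / (1 + premium) and fee = premium * ec2.
    """
    return premium * total_usd / (1 + premium)


rate = hourly_rate(TOTAL_COST_USD, RUN_HOURS)
used_core_hours = core_hours(VIRTUAL_CORES, RUN_HOURS)
per_core_hour = TOTAL_COST_USD / used_core_hours
fee_cap = max_management_fee(TOTAL_COST_USD, PREMIUM)

print(f"${rate:,.0f} per cluster-hour")
print(f"{used_core_hours:,} core-hours")
print(f"${per_core_hour:.3f} per core-hour")
print(f"management fee at most ${fee_cap:,.2f}")
```

The 80,000 core-hours line up with the amount of computing time Genentech asked for, and the all-in price comes to a little under 11 cents per core-hour.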
Stowe did some back-of-the-envelope math and says that to buy the physical servers, storage, and switching, plunk them into a data center, and operate them – including power, cooling, and people costs – would require around $5m. Assuming the data center is free – which it is not – that works out to about $571 per hour for the cluster if you write down the cost of the hardware over one year of round-the-clock operation. However, that is not the point. The point is that researchers in the HPC space want to run simulations every once in a while, not own clusters and be in the IT utility business, trying to keep their clusters busy and yet available for their own work.
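Stowe's envelope math is easy to reproduce, and it also shows where the crossover sits. The Python sketch below uses only the $5m and $1,060 figures from the article. The break-even calculation is an extrapolation of ours, not something Stowe offered, and it assumes a 365-day year, a one-year write-down, and a cloud rate that stays flat regardless of how many hours are bought.

```python
# Figures quoted in the article.
OWNED_COST_USD = 5_000_000   # Stowe's estimate to buy and operate the cluster
CLOUD_RATE_USD = 1_060       # per cluster-hour on EC2, management included
HOURS_PER_YEAR = 365 * 24    # assumption: non-leap year, round-the-clock use

# Hourly cost of the owned cluster if written down over one year.
owned_hourly = OWNED_COST_USD / HOURS_PER_YEAR

# Hours of cloud time the same money buys, and the share of a year
# that represents (assumption: flat cloud pricing at any volume).
break_even_hours = OWNED_COST_USD / CLOUD_RATE_USD
break_even_utilization = break_even_hours / HOURS_PER_YEAR

print(f"owned cluster: ${owned_hourly:,.0f} per hour")
print(f"break-even: {break_even_hours:,.0f} cloud hours per year")
print(f"break-even utilization: {break_even_utilization:.0%}")
```

Under those assumptions, the owned cluster only comes out cheaper if it is kept busy for roughly 54 per cent of the year or more, which is a tall order for the occasional simulation run.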
"We're really trying to democratize this so any researcher can get a grant and run their code," says Stowe. "Rather than buying two servers and waiting several months for a simulation to complete, they can spend less money and get their results in a matter of hours." ®