The cloud is great for HPC: Discuss
Scientists rejoice: It’s raining TeraFLOPS from the cloud
Sponsored: High-performance computing (HPC) environments are expensive. Government research facilities and commercial laboratories spend hundreds of thousands building out large, monolithic supercomputers and then jealously guard their compute cycles. This approach to HPC is restrictive. It creates a rarefied environment in which only the cream of the crop get the FLOPS they want.
Cloud computing has spent the last few years democratising commercial computing, driving new infrastructure efficiencies and making computing power more accessible. Could it do the same for HPC applications? Scientists have more in common with startup CEOs than you might think. They’re after knowledge, rather than market share, but they must increasingly act like entrepreneurs to get it, experimenting frequently, failing often, and recovering quickly.
The whole scientific method rests on constant experimentation, which means creating hypotheses and then testing them. Whether the results are what you expected or not, you always learn from them. This constant testing can often take scientists down different paths, though, and may require different kinds of tests as the results shake out.
Pivoting research in this way can be difficult in traditional HPC environments, points out Brendan Bouffler, Global Scientific Computing, Amazon Web Services (AWS). He describes merit allocations, a common practice in HPC under which researchers must write up a request for system access.
HPC provisioning problems
Research teams often have to fill out an applicant profile and project detail before they even get to the meat of the merit application process. Then it has to be adjudicated. It can take months to get through the entire thing and get your hands on a machine.
“They’re essentially waiting three to four months to do an iteration in their research that’s probably going to take twelve hours to run,” he says. The other problem with traditional HPC environments is that they don’t often lend themselves to workload customization.
“You pay a bunch of people to build a big datacentre, and they build one monolithic machine, which is one size that fits all,” he says.
That approach makes sense if you’re running an HPC system, because you’re often dealing with a small number of system admins. It may run all of your scientific research applications, but it won’t serve them all equally. “It’s really good if you’ve got the one scientific application that’s exactly optimized for that workload. For all the others, it’s suboptimal,” he argues.
The beauty of running HPC in the cloud is that you’re not dealing with a single type of node anymore, or a handful of harried sysadmins. You can pick and choose from a variety of node types, configured to suit your needs, and you can bolt them together into cluster configurations that make sense for your app.
Moreover, you can spin your infrastructure up in minutes and pay for just as much time as you want. This helps to reduce your procurement risk, points out Bouffler.
You may want to try throwing a set of GPUs at a particular computational fluid dynamics problem to see if that architecture can handle the workload in a more effective way. If it doesn’t deliver the gains you expected, then you haven’t sunk capital into a hardware investment.
That leaves you free to try other options, such as increasing the amount of shared memory, say, or varying the processor type to support a more advanced set of vector instructions.
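As a rough illustration of why a failed trial costs so little, the sketch below compares the bill for a short rented experiment with an up-front hardware purchase. Every figure in it is a hypothetical placeholder, not an AWS price: the node count, the hourly rate and the purchase price are assumptions chosen only to show the arithmetic, and the twelve-hour run length echoes the iteration time quoted earlier.

```python
def trial_cost(nodes: int, hours: float, hourly_rate_per_node: float) -> float:
    """Pay-as-you-go cost of one experiment: nodes x hours x hourly rate."""
    if nodes < 0 or hours < 0 or hourly_rate_per_node < 0:
        raise ValueError("inputs must be non-negative")
    return nodes * hours * hourly_rate_per_node


def share_of_purchase(trial: float, purchase_price: float) -> float:
    """Fraction of a hardware purchase price that a single trial represents."""
    if purchase_price <= 0:
        raise ValueError("purchase_price must be positive")
    return trial / purchase_price


# Hypothetical placeholders, NOT real prices.
GPU_NODES = 8
RUN_HOURS = 12
RATE_PER_NODE_HOUR = 3.0      # assumed currency units per node-hour
PURCHASE_PRICE = 200_000.0    # assumed cost of buying equivalent hardware

cost = trial_cost(GPU_NODES, RUN_HOURS, RATE_PER_NODE_HOUR)
print(f"Trial cost: {cost:.2f}")
print(f"Share of purchase price: {share_of_purchase(cost, PURCHASE_PRICE):.4%}")
```

Swap in your own quotes for both sides of the comparison. What matters is the shape of the calculation: the trial scales with hours used, while the purchase is sunk whether the architecture suits the workload or not.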
Optimized hardware configurations are only one part of the customization story. The other part lies in software. Large cloud-based platforms tend to attract partner ecosystems; software development companies will develop their own apps that sit atop the infrastructure platform and offer added value. In AWS’s case, there are a variety of partners serving applications that enhance HPC jobs, ranging from virtual machine configuration and workflow management through to visualization.
Nevertheless, researchers do face some challenges running HPC in a cloud computing environment. One of them involves the architectural requirements underpinning most high-performance computing workloads.
Because these workloads require high performance, they often demand operation ‘close to the metal’, running with as much access to the processor’s core physical computing resource as possible. The worry is that running atop a virtualized hypervisor may slow operations down.
Performance requirements drive another idiosyncrasy in HPC applications: they frequently call for very high-speed interconnects. In traditional HPC setups, this entails direct communication between processes outside the OS kernel.
The high-speed communication requirements apply to storage, too. Scientific research applications frequently deal with extremely large data volumes, which they must pull from persistent storage. These applications will often divide their data between multiple storage instances for lower-latency access, putting an even greater strain on I/O environments. ‘Vanilla’ cloud environments aren’t always set up for this kind of computing.
This doesn’t preclude HPC in a cloud environment, though. Much depends on which cloud computing infrastructure you use, and the specific nature of your application. Traditional HPC infrastructures have led to an ‘InfiniBand or bust’ attitude among scientific researchers, says Bouffler. “When you use an InfiniBand-connected computer all the time, you start to get the impression that you must always need that,” he says. The reality is far different. “Most of the workloads don’t need that specialised hardware.”
Advances in cloud technology
Cloud service providers are also making strides in supporting HPC environments, he says. Some, like AWS, have been busy creating faster networks to increase data flows between different nodes, while also developing placement strategies that group nodes used for the same application physically closer together in a cloud environment.
AWS has also increased the variety of base computing instances available to customers over the last few years, turning to advanced Xeon processors with Advanced Vector Extensions (AVX) instructions that apply a single mathematical operation to many data elements at once. This brings Single Instruction Multiple Data (SIMD), a processing type typically restricted to supercomputer users, within reach of cloud-based HPC users.
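To make the SIMD idea concrete, here is a toy model in plain Python. It does not issue real vector instructions; it only counts how many "instructions" an element-wise addition would need when each one handles a full register of values. The 256-bit register width matches AVX, but the function names and the counting scheme are illustrative assumptions.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """How many values of a given size fit in one vector register."""
    return register_bits // element_bits


def simd_add(a: list, b: list, width: int) -> tuple:
    """Add two equal-length lists, `width` elements per modelled instruction.

    Returns the element-wise sums and the number of modelled instructions.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if width < 1:
        raise ValueError("width must be at least 1")
    result = []
    instructions = 0
    for start in range(0, len(a), width):
        # One modelled instruction applies the same operation to a whole chunk.
        chunk = zip(a[start:start + width], b[start:start + width])
        result.extend(x + y for x, y in chunk)
        instructions += 1
    return result, instructions


values = list(range(20))
ones = [1] * 20

width = lanes(256, 32)                      # 32-bit floats in a 256-bit register
_, scalar_count = simd_add(values, ones, 1)
sums, vector_count = simd_add(values, ones, width)
print(f"lanes: {width}, scalar: {scalar_count}, vectorised: {vector_count}")
```

In this model, twenty additions take twenty scalar steps but only three vector steps, which is the kind of saving that makes vector instructions attractive for numerical workloads.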
That’s all well and good, but it still leaves HPC customers with one big challenge in a cloud computing environment: getting their data into the cloud in the first place. Scientific research applications often deal with data in the terabyte range. Uploading it to someone else’s server can be like pushing a data lake through a straw.
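A back-of-the-envelope calculation shows how narrow that straw can be. The sketch below estimates upload time from data volume and link speed; the data sizes and link speeds are assumed examples rather than figures from any particular site, and the model ignores protocol overhead unless you lower the hypothetical `efficiency` parameter.

```python
def upload_days(data_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to send `data_terabytes` over a `link_gbps` gigabit/s link.

    `efficiency` is the assumed fraction of the link actually achieved (0 to 1].
    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbit/s = 1e9 bits per second.
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must be non-negative")
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


# Assumed example scenarios: (terabytes, link speed in Gbit/s)
for terabytes, gbps in [(10, 1), (100, 1), (100, 10)]:
    days = upload_days(terabytes, gbps)
    print(f"{terabytes} TB over {gbps} Gbit/s: about {days:.1f} days")
```

Even with a fully saturated 1 Gbit/s link, 100 TB takes more than nine days in this model, and real-world throughput is usually lower.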
In many cases, good old-fashioned sneakernet comes to the rescue. AWS uses its Snowball appliance to slurp data at the customer’s premises and then transfers it physically to its own datacentre. For very high-volume data loads, it can use its Snowmobile, an 18-wheeler semi designed to transfer petabytes of data across the country.
Once the lion’s share of a research team’s historical data has been forklifted in this way, ongoing data updates can be made incrementally on the wire, points out Bouffler.
Sharing is caring
Once the data is in the cloud, it gives HPC customers a unique advantage: it doesn’t need to move anymore. Keeping it in a cloud environment makes collaboration easier. Scientific research teams often want to work with other teams in the same interest area, but they may be distributed around the world. That has historically made things difficult when dealing with large data sets, because moving them between sites has been prohibitive. In a cloud environment, research teams can work on the same data sets, held online. A European team can save a virtual machine at the end of its day and share it with a Californian team, who can work on the data while Europe is asleep, creating round-the-clock research.
All of this contributes to one of the biggest benefits of combining cloud and HPC: innovation. The ability to pivot research instantly using on-demand compute and storage instances, combined with the power to collaborate between teams, makes the entire scientific research process more fluid.