Evergrid ships 'little blue pill' for clusters

SC06 A small software start-up thinks it might have "the little blue pill" necessary to keep massive clusters up and running at all times.

Evergrid this week unveiled something called the Availability Management Suite, but you might as well call it cluster Viagra. The company’s software works to make sure that minor – or major – system failures don't hamper the overall completion of large computing jobs. In total, Evergrid helps you keep your cluster up, sturdy and strong for hours and even days at a time.

Okay, we'll stop the bad "jokes" for a moment and get to the point.

Evergrid is working off the premise that the high performance computing industry lacks the proper tools for restarting jobs on large clusters. Systems made up of thousands of boxes will have failures – plenty of them – and need a way to get jobs restarted automatically.

Today, administrators must keep a close eye on their clusters, break jobs down into different chunks or try running jobs on smaller systems in order to avoid the pains of system failures. Evergrid tries to sidestep all this mess by running continuous "checkpoints" that capture the state of servers and their applications. If a failure occurs, the system can roll back to the last known state and then get cranking away on jobs once again.
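The checkpoint-and-rollback idea can be sketched in a few lines. This is an application-level illustration only, not Evergrid's method: the function names, the pickle file format and the `fail_at` failure switch are all assumptions made for the example. It saves a job's state at a fixed interval and, after a simulated failure, resumes from the last saved state rather than from scratch.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename means a crash mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run_job(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) simulates a node failure at the given step.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

With an interval of 10 steps and a failure at step 57, a second call to `run_job` picks up from step 50, so only seven steps of work are repeated. The interval sets the trade-off: shorter intervals lose less work on a failure but spend more time writing checkpoints.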

Some companies and labs have developed their own checking systems over the years, but Evergrid believes that the industry is begging for a standard here.

Thankfully, customers will not need to rewrite any of their own software to make Evergrid's code work. The Evergrid "abstraction layer" slides in between an OS kernel and its applications. The checkpoints can then gather information on the state of memory, file I/O and the network at desired intervals.

Evergrid CEO Dave Anderson, speaking here at the Supercomputing conference, told us that the monitoring software could chew through as much as 5 per cent of a system's resources. He pitched that, however, as an Armageddon-type scenario and claimed that more often than not administrators will not notice the Evergrid code.

Come January, Evergrid plans to GA its Availability Services software and Resource Manager. Together, these products form the "flagship" Availability Management Suite.

The Availability Services package performs the checkpoint operations, while the Resource Manager handles a broad array of tasks, such as making sure certain jobs get a set amount of processing power and the right priority level.

We caught a demo of the Suite, and it worked as billed. An Evergrid administrator had three jobs spread across an eight-server cluster. The software detected idle systems, threw them at jobs as needed and caught a failure when we ripped out one server. It stopped the one job that was affected and then got it up and running again in a couple of seconds on the available hardware.
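The behaviour in the demo can be modelled with a toy scheduler. This is a rough sketch under stated assumptions, not Evergrid's Resource Manager: the `Cluster` class, its method names, the node names and the split of eight servers into three two-node jobs plus two idle machines are all invented for illustration. When a node fails, only the job that owned it is affected, and that job picks up an idle node as a stand-in for a restart from its last checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: each job holds a set of nodes; unassigned nodes sit idle."""
    idle: set = field(default_factory=set)
    jobs: dict = field(default_factory=dict)  # job name -> set of node names

    def fail_node(self, node):
        """Drop a failed node; return the name of the affected job, or None."""
        self.idle.discard(node)
        for name, nodes in self.jobs.items():
            if node in nodes:
                nodes.remove(node)
                return name
        return None

    def restart(self, job):
        """Hand the affected job one idle node, if any is available.

        Returns the replacement node's name, or None when nothing is idle.
        """
        if not self.idle:
            return None
        replacement = sorted(self.idle)[0]
        self.idle.remove(replacement)
        self.jobs[job].add(replacement)
        return replacement


# Assumed layout: eight servers, three jobs, two machines idle.
cluster = Cluster(
    idle={"n7", "n8"},
    jobs={"a": {"n1", "n2"}, "b": {"n3", "n4"}, "c": {"n5", "n6"}},
)
affected = cluster.fail_node("n3")        # pull one server out
replacement = cluster.restart(affected)   # job "b" continues on "n7"
```

Jobs "a" and "c" are never touched, which mirrors the demo stopping only the one affected job.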

By the second half of next year, Evergrid hopes to expand beyond the HPC market and target smaller business clusters. It's looking, in particular, at the database and application server markets. Such customers might flock to Evergrid's tools for stopping jobs and then restarting them on new servers. This would allow a company to juggle different jobs with more flexibility than it has today.

For now, however, Evergrid will center on the HPC crowd that has enormous clusters built out of thousands of machines. Many HPC users deal with jobs that take days, weeks and sometimes months to process. They often have to restart these jobs from scratch due to system failures.

Evergrid has one patent and another handful of patents pending that it thinks will protect its IP from overzealous Linux coders who might come up with something similar. Some of you will be familiar with the company's CTO, Dr. Srinidhi Varadarajan, who built the massive G5 cluster at Virginia Tech and is credited with banging out much of the special "checking" sauce.

Evergrid claims two major customers at this point – an unnamed financial services company and the University of Oklahoma. The company has yet to set official pricing for its software, although Anderson guessed it will come in around "$250 per node with large volume discounts."

There's more information available here. ®