Evergrid ships 'little blue pill' for clusters

SC06 A small software start-up thinks it might have "the little blue pill" necessary to keep massive clusters up and running at all times.

Evergrid this week unveiled something called the Availability Management Suite, but you might as well call it cluster Viagra. The company’s software works to make sure that minor – or major – system failures don't hamper the overall completion of large computing jobs. In total, Evergrid helps you keep your cluster up, sturdy and strong for hours and even days at a time.

Okay, we'll stop the bad "jokes" for a moment and get to the point.

Evergrid is working off the premise that the high performance computing industry lacks the proper tools for restarting jobs on large clusters. Systems made up of thousands of boxes will have failures – plenty of them – and need a way to get jobs restarted automatically.

Today, administrators must keep a close eye on their clusters, break jobs down into different chunks or try running jobs on smaller systems in order to avoid the pains of system failures. Evergrid tries to sidestep all this mess by running continuous "checkpoints" that capture the state of servers and their applications. If a failure occurs, the system can roll back to the last known state and then get cranking away on jobs once again.
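For readers who have not met the idea, the checkpoint-and-rollback pattern can be sketched in a few lines of Python. This is not Evergrid's code. Evergrid says its checkpoints happen underneath unmodified applications, whereas this toy does it inside the application itself. The function names, the pickle file format and the `fail_at` failure hook are all assumptions made up for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the last good
    checkpoint untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "total": 0}
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_job(path, steps, interval, fail_at=None):
    """Sum the integers 0..steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a node dying at
    that step; it is not part of any real product.
    """
    state = load_checkpoint(path)  # roll back to the last known state
    for step in range(state["next_step"], steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Run the job once with a simulated failure, then run it again with the same checkpoint path, and the second run resumes from the last saved step rather than from scratch. The work lost is bounded by the checkpoint interval, which is the trade-off any such scheme has to tune.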

Some companies and labs have developed their own checking systems over the years, but Evergrid believes that the industry is begging for a standard here.

Thankfully, customers will not need to rewrite any of their own software to make Evergrid's code work. The Evergrid "abstraction layer" slides in between an OS kernel and its applications. The checkpoints can then gather information on the state of memory, file I/O and the network at desired intervals.

Evergrid CEO Dave Anderson, speaking here at the Supercomputing conference, told us that the monitoring software could chew through as much as 5 per cent of a system's resources. He pitched that, however, as an Armageddon-type scenario and claimed that, more often than not, administrators will not notice the Evergrid code.

Come January, Evergrid plans to make its Availability Services software and Resource Manager generally available. Together, these products form the "flagship" Availability Management Suite.

The Availability Services package performs the checkpoint operations, while the Resource Manager handles a broad array of tasks, such as making sure certain jobs get a set amount of processing power and the right priority level.

We caught a demo of the Suite, and it worked as billed. An Evergrid administrator had three jobs spread across an eight-server cluster. The software detected idle systems, threw them at jobs as needed and caught a failure when we ripped out one server. It stopped the one job that was affected and then got it up and running again in a couple of seconds on the available hardware.
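The bookkeeping behind a demo like that might look roughly like the Python sketch below. It is a guess at the general shape, not a description of Evergrid's Resource Manager. The `Cluster` class, its method names, the server names and the way the three jobs are split across the eight servers are all invented for illustration.

```python
class Cluster:
    """Toy model of a cluster that tracks idle servers and running jobs."""

    def __init__(self, servers):
        self.idle = sorted(servers)  # servers not assigned to any job
        self.jobs = {}               # job name -> list of assigned servers

    def start(self, job, count):
        """Assign `count` idle servers to `job`."""
        if count > len(self.idle):
            raise ValueError("not enough idle servers")
        self.jobs[job] = [self.idle.pop(0) for _ in range(count)]
        return self.jobs[job]

    def fail(self, server):
        """Handle the loss of `server`.

        Only the job that was using the server is touched. It gets an
        idle replacement if one exists. Returns the affected job name,
        or None if the server was idle or unknown.
        """
        if server in self.idle:
            self.idle.remove(server)
            return None
        for job, servers in self.jobs.items():
            if server in servers:
                servers.remove(server)
                if self.idle:
                    servers.append(self.idle.pop(0))
                return job
        return None


# Hypothetical layout: eight servers, three jobs, one server left idle.
cluster = Cluster([f"n{i}" for i in range(1, 9)])
cluster.start("job_a", 2)
cluster.start("job_b", 2)
cluster.start("job_c", 3)
affected = cluster.fail("n3")  # pull the plug on one of job_b's servers
```

In this sketch only `job_b` is disturbed, and it picks up the one idle server. A real system would pair this with a checkpoint restore so the restarted job resumes from its last saved state.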

By the second half of next year, Evergrid hopes to extend out of the HPC market and target smaller business clusters. It's looking, in particular, at the database and application server markets. Such customers might flock to Evergrid's tools for stopping jobs and then restarting them on new servers. This would allow a company to juggle different jobs with more flexibility than they have today.

For now, however, Evergrid will center on the HPC crowd that has enormous clusters built out of thousands of machines. Many HPC users deal with jobs that take days, weeks and sometimes months to process. They often have to restart these jobs from scratch due to system failures.

Evergrid has one patent and another handful of patents pending that it thinks will protect its IP from overzealous Linux coders who might come up with something similar. Some of you will be familiar with the company's CTO, Dr. Srinidhi Varadarajan, who built the massive G5 cluster at Virginia Tech and is credited with banging out much of the special "checking" sauce.

Evergrid claims two major customers at this point – an unnamed financial services company and the University of Oklahoma. The company has yet to set official pricing for its software, although Anderson guessed it will come in around "$250 per node with large volume discounts."

There's more information available here. ®
