Steer clear of the desktop virtualisation bootstorm
Prepare to form an orderly queue
It is every IT administrator’s worst nightmare.
All the employees’ desktops have been virtualised and are running on a server. The pilot project worked well and everyone was happy, but then the team tried to scale it up and now it’s Monday morning and 3,000 users have just walked in with their lattes and croissants, sat down at their shiny thin-client machines and tried to log on.
The system grinds to a halt and people are waiting 20 minutes or more to access the system. Productivity is falling and in about 30 seconds the helpdesk will be flooded with angry calls. It's what is known as a bootstorm.
It isn’t just underpowered servers that can cause bootstorms. They can also stem from I/O bottlenecks in two areas: storage and the network. Disks in a storage area network can spin only so fast, and when too many people try to access their machines all at once, that is not fast enough.
Even if the disks can get the data off quickly enough it still has to get to the server. That can cause problems if the network connection between the SAN and the server is too slow.
What to do?
Ross Bentley, head of professional services at consulting firm Assist, suggests configuring the network to stagger virtual desktop bootups.
“You can get to the point where you know the whole environment can work by turning 50 virtual machines on at the same time,” he says.
Doing it in chunks of 50 could get all the users up in a few tens of minutes, although it would mean starting before the users get into work.
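To make the arithmetic concrete, here is a minimal sketch of that staggered approach — not anything Bentley describes in detail, and `start_vm` is a placeholder for whatever power-on call your hypervisor's API actually provides:

```python
import time

def boot_in_batches(vm_ids, batch_size, pause_seconds, start_vm):
    """Power on VMs in fixed-size chunks, pausing between chunks so the
    storage array and network never see the whole estate booting at once."""
    batches = [vm_ids[i:i + batch_size]
               for i in range(0, len(vm_ids), batch_size)]
    for batch in batches:
        for vm in batch:
            start_vm(vm)          # placeholder: hypervisor power-on call
        time.sleep(pause_seconds)  # assumed time for a batch to settle
    return len(batches)
```

At the scale in the article, 3,000 desktops in chunks of 50 is 60 batches; if each batch needs roughly a minute to boot, the whole estate is up in about an hour — which is why the process has to start before the users arrive with their lattes.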
George Crump, chief steward at analyst firm Storage Switzerland, disagrees with the solution. For one thing, bootstorms are not limited to the morning login.
“For example a virus scan might kick off at the same time or a big patch update. In any case, you can’t pre-log them in because you violate all kinds of security,” he says.
One approach is to use a single shared image, rather than replicating a new machine for every user, in a Remote Desktop Services scenario. You could then use folder redirection to access users’ personal data. This drastically reduces the number of separate images that are accessed on the disk.
“The other thing about a golden image is that it makes it more affordable to move that into solid state storage,” says Crump.
Removing the mechanical element of storage speeds up access. You might not want to store 3,000 virtual machines on expensive SSDs but a single image would work.
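A back-of-envelope comparison shows why the golden image changes the SSD economics. The 40 GB figure below is an assumption for illustration, not from the article:

```python
# Assumed sizes for illustration: a 40 GB desktop image, 3,000 users.
IMAGE_GB = 40
USERS = 3000

# Full clones: one copy of the image per user.
full_clones_gb = IMAGE_GB * USERS   # 120,000 GB -- far too much for SSD

# Golden image: one shared copy; user data lives elsewhere
# via folder redirection.
golden_image_gb = IMAGE_GB          # 40 GB -- easily fits on solid state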
Administrators also have another element to deal with: the network. A poorly configured network that can’t transport data from virtual machines to the server fast enough will cripple performance.
Moving to fibre channel for the SAN interface speeds things up, but that requires a different set of administrative skills – not to mention a whole new set of host bus adaptors – which ratchets up cost.
Another option is to use direct-attached storage to ease the bottleneck. However, this doesn’t give you a free pass, according to Sylvester de Koster, group technical manager at distributor CDG UK.
“Even with direct storage, if you don’t configure it right CPU or disk or memory will be overused, and that causes major issues,” he says.
Understanding the workers
Hamish Macarthur, founder of storage analyst Macarthur Stroud, points out that simply configuring storage and network links for high performance is not enough. You have to understand the types of users you have and their working patterns.
“There might be issues when people are in different time zones and that might flatten things out,” he says. “You need management tools to recognise the spikiness of the load.”
Baselining storage and network demand is therefore a crucial element of the desktop virtualisation process, and that includes knowing how much inventory you have.
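As a rough sketch of what baselining means in practice — the IOPS figures below are invented, and in a real deployment they would come from your storage array's monitoring tools:

```python
def baseline(samples):
    """Summarise sampled IOPS: steady-state average, peak, and the
    ratio between them -- the 'spikiness' the management tools must
    recognise."""
    avg = sum(samples) / len(samples)
    peak = max(samples)
    return avg, peak, peak / avg

# Hypothetical IOPS samples through a working day, spiking at login time.
iops = [800, 900, 12000, 11000, 1500, 1000, 950, 900]
avg, peak, ratio = baseline(iops)
# The ratio tells you how much headroom above steady state the
# login spike demands.
```

A system specced only for the average would be swamped by the spike, which is the gap Macarthur's management tools are meant to expose.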
“You might believe that you’re providing PCs or desktops for 1,000 users, but there might be another 500 to 600 out there connected via a supplier, or maybe a customer that ties in a bit more closely,” Macarthur says.
Balancing machine density is important to prevent port I/O and CPU bottlenecks, but there is no easy formula to balance the ratio between physical boxes and virtual machines. “It’s a trial-by-error process,” says Crump.
Part of the challenge of desktop virtualisation is maintaining the user experience – and employees allow little margin for error.
Avoiding bootstorms entails both hidden costs and a high degree of configuration expertise. Are you ready for that? ®
Just think about it...
...you could even load up the entire OS locally, saving huge bandwidth and server power. Some kind of "personal" computer.
Why hasn't anyone thought of that before?
Only a surprise to noobs
This effect has been around since the beginning of time.
Whether it's hundreds of users coming in on a Monday morning and all trying to access their email (off the one single server, that was only capacity-planned for a steady-state load) at the same time.
Or the hundreds of call centre staff who all go <click> when their shift starts: like the email or Windows servers "storm", but much more intense, as they all start within a few minutes of each other.
Or (worst of all) bringing up a system after a crash when EVERYone tries to log in continuously just as soon as they see the login screen.
It's even been a problem in the days of mainframes when everyone tried to fill in their weekly timesheets at 16:30 on a Friday afternoon (they had to be done before you left, and you couldn't fill them in earlier - 'cos you didn't know what you'd be doing - well you did: you'd be waiting for CROVM4 to respond for about 15 minutes).
But, of course, no manager is prepared to shell out for a system that's specc'd at 500% of their steady-state capacity requirements, to handle a workload that will only exist for a few minutes once or twice a week.
I wonder if there's mileage in caching some of the OS components on the client machines. Sure, you'd need slightly more expensive, complex and powerful clients, but on the other hand if they booted using a local kernel and userland and then accessed files and applications over the network, think of the bandwidth saving!