Complexity killed the IT quality of service ...

Original URL: https://www.theregister.com/2011/07/19/cloud_service/

So many dependencies, so much to go wrong

By Dale Vile, Freeform Dynamics

Posted in On-Prem, 19th July 2011 13:00 GMT

“Ring, ring...”

“IT support. How can I help you?”

“The system’s really slow. It’s taking about a minute to save stuff. Normally it’s only a few seconds.”

“How long has this been going on?”

“About 20 minutes, and the backlog is building up. It slowed down yesterday too, but then seemed to right itself after about 10 minutes so I didn’t bother to call. But the boss didn’t seem too happy about it.”

This day-in-the-life snippet might be all too familiar to some of you, but it allows us to talk about a few of the challenges that arise from the complexity of today’s IT environment.

Smoking can kill

It doesn’t matter if the system is an order entry function, a customer service call management function, or whatever: if systems are not performing well, the first the IT department hears about it is when a user calls.

And if there is a tangible impact on throughput or productivity, that’s when supervisors, managers and execs get involved and the stakes are raised.

This is why it is important to nip problems in the bud, or better still prevent them from happening in the first place. Letting things drag on, especially when incidents are occurring frequently, leads directly to problems between IT and management, not to mention a difficult life on the front line of IT.

OK, I hear you thinking, but users often don’t help. In our little scenario above, the truth might be that when problems occurred the day before, the accounts clerk took the opportunity to sneak off for a quick cigarette on the fire escape.

Had she reported the problem to IT instead, there would have been an opportunity to investigate and avoid the next day’s more persistent performance issues.

Wouldn’t it have been better if the IT team had visibility of such problems beforehand, instead of relying on users to pick up the phone?

Warning signal

This is clearly where monitoring and management software comes into play. Sadly, research tells us that it is something many IT departments are lacking, and also that getting the budget to invest in better tools is hard.

Even where automated monitoring is in place, the chances are that it is focused on individual components. You might get a red flashing light on a dashboard when a server is struggling, for example, but it often takes the knowledge of the IT ops guys to figure out how significant that is.

If it is running some vital component of your authentication infrastructure, then the failure could be about to cause widespread chaos. If it is one server in a web server farm, then it might not be such a big deal.
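To make that difference concrete, here is a minimal sketch of how the same red light could be weighed by what depends on it. Everything in it is an assumption for illustration: the `Pool` structure, the server names, the farm size and the `min_healthy` thresholds are hypothetical, not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """A group of interchangeable components serving one role (hypothetical model)."""
    name: str
    members: frozenset
    min_healthy: int  # fewest healthy members needed for the role to keep working


def pool_state(pool, failed):
    """Return 'ok', 'degraded' or 'down' for one pool, given the failed components."""
    healthy = len(pool.members - failed)
    if healthy < pool.min_healthy:
        return "down"
    if healthy < len(pool.members):
        return "degraded"
    return "ok"


def service_impact(services, failed):
    """Map each service name to the worst state among the pools it depends on."""
    rank = {"ok": 0, "degraded": 1, "down": 2}
    failed = frozenset(failed)
    return {
        name: max(
            (pool_state(pool, failed) for pool in pools),
            key=rank.__getitem__,
            default="ok",
        )
        for name, pools in services.items()
    }


# Hypothetical inventory: a single authentication server and a four-node web farm.
auth = Pool("authentication", frozenset({"auth-01"}), min_healthy=1)
web = Pool("web farm", frozenset({"web-01", "web-02", "web-03", "web-04"}), min_healthy=2)
services = {"order entry": [auth, web]}

print(service_impact(services, {"web-03"}))   # {'order entry': 'degraded'}
print(service_impact(services, {"auth-01"}))  # {'order entry': 'down'}
```

One struggling web server leaves the service degraded, while the lone authentication server takes it down. That is the kind of knowledge that otherwise lives only in the heads of the ops team.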

Coming back to our skiving, cigarette-smoking user, the system she relies on to do her job is likely to be underpinned by a pretty complex infrastructure.

Go figure

The PC on the desk is running a web browser that connects via a web server to an app server that in turn reads and writes to one or more database servers hooked into a storage area network, with various switches, routers and other network equipment managing the traffic flow.

So when the system is running slowly, where do you begin to investigate? If you are lucky enough to have a monitoring dashboard, it is not unusual for it to be telling you that everything is OK, even though there is obviously a problem. This is what is known as “watermelon metrics” – green on the surface hiding a lot of red underneath.

The bit that’s often lacking is an active understanding (beyond the out-of-date architecture diagram on the wall) of the way things work end to end.
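As a rough illustration of an end-to-end view, the sketch below sets component checks beside one measurement of what the user actually experiences, such as the time taken to save a record. The function name, the status labels and the `tolerance` multiplier are all assumptions made for the example, and the figures simply echo the phone call at the top of this article, with "a few seconds" taken as three.

```python
def service_view(component_status, save_seconds, normal_seconds, tolerance=3.0):
    """Combine per-component checks with one end-to-end measurement.

    component_status: dict of component name -> True if its own check is green.
    save_seconds: measured time for a user-level action (e.g. saving a record).
    normal_seconds: the usual time for that same action.
    tolerance: assumed multiple of normal that users will put up with.
    """
    components_green = all(component_status.values())
    service_slow = save_seconds > normal_seconds * tolerance

    if service_slow and components_green:
        return "watermelon"  # green on the dashboard, red for the user
    if service_slow:
        return "degraded"    # slow, and at least one component agrees
    if not components_green:
        return "at risk"     # users are fine so far, but something is red
    return "healthy"


# Hypothetical readings: every tier reports green, yet a save takes a minute.
dashboard = {
    "web server": True,
    "app server": True,
    "database": True,
    "storage network": True,
}
print(service_view(dashboard, save_seconds=60, normal_seconds=3))  # watermelon
```

The point of the sketch is only that the user-level measurement has the final say: a dashboard full of green lights cannot overrule a one-minute save.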

As a starting point for tackling this problem, the trendy thing to do is to define the capability presented to the user as a service, then refer to the job of making everything work together at the right level of performance and availability as “service management”.

Recently, some vendors and analysts have coined the term “service assurance”, which is a fancy way of acknowledging that users really don’t care what happens behind the screen in front of them; they just want assurance that the capability will be working well enough when they need it.

Normal chat

The vendors can then talk about “services” and “service levels”, which is often a more natural conversation than the systems one.

The challenge is that the IT department is increasingly expected to just shut up and deal with all that complexity behind the scenes, and all the rhetoric about cloud is reinforcing this services view of the world.

The implications when it comes to architecture, monitoring, management, troubleshooting and even how you organise teams within your IT department can be significant, which means some aspects of IT need a bit of a rethink.

Fortunately, we are here to help with our upcoming service assurance workshop, so watch this space as we tackle some of these hard questions over the coming weeks. ®