Original URL: https://www.theregister.com/2013/12/16/storage_quality_of_service/

Why storage needs Quality of Service

Makes shared storage play nicely

By Bryan Betts, Freeform Dynamics

Posted in Storage, 16th December 2013 10:29 GMT

Storage consolidation looks great on paper. Direct-attached storage is notorious for its inefficiency, with some arrays being just 40 per cent occupied or even less.

Providing an Oracle database with 10,000 IOPS could mean aggregating dozens of 15,000 RPM drives, and unless the database is several terabytes in size that is a lot of wasted space.
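To put rough numbers on it (assuming around 180 random IOPS and 600GB per 15,000 RPM drive, which are illustrative planning figures rather than any vendor's specification), the back-of-the-envelope arithmetic runs something like this:

    # Spindle-count arithmetic for a 10,000 IOPS target.
    # Assumes roughly 180 random IOPS and 600 GB per 15,000 RPM drive --
    # illustrative planning figures only, not vendor specifications.
    TARGET_IOPS = 10_000
    IOPS_PER_DRIVE = 180
    DRIVE_CAPACITY_GB = 600

    drives_needed = -(-TARGET_IOPS // IOPS_PER_DRIVE)        # ceiling division: 56 drives
    raw_capacity_tb = drives_needed * DRIVE_CAPACITY_GB / 1000

    print(f"{drives_needed} drives, about {raw_capacity_tb:.1f}TB raw")
    # A database of a couple of terabytes on that pool leaves most of the capacity stranded.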

The alternative is shared storage, probably with virtualisation and thin provisioning to allocate physical disk capacity more efficiently, and perhaps with a Flash tier or cache to boost performance.

As well as reducing wastage, shared storage can also bring other advantages, not least a reduction in the number of points of management.

But what happens if one of your clients or applications doesn't play well with its fellows – if it is badly behaved and greedy and doesn't realise that in shared storage “shared” is the operative word?

In many systems, it is all too easy for one application to become the bully in the playground, grabbing too much for itself and leaving the other children crying in the dust.

Key element

“Quality of service is more crucial than people give it credit for. It's a small thing but without it a lot of the value propositions of shared storage go away,” says John Rollason, NetApp's director of product, solutions and alliances marketing.

“Essentially, if you don't have QoS on a shared storage platform, you can't guarantee overall QoS when users move to a virtualised environment. Virtualisation also makes the I/O a lot more random.”

A common example is applications that were not designed to share storage and have different access patterns, says Alex D’Anna, director of solutions consulting EMEA at Virtual Instruments.

“A really interesting use-case is service-hungry applications,” he says. He cites the example of a customer which had problems with its crucial SAP installation, despite apparently having plenty of SAN capacity to hand.

“SAP is there to help you manufacture, but you also need data warehousing and business analytics for forecasting. The amazing thing to us was that the customer had completely different read/write patterns and the data warehousing was completely eating up its 8Gbps Fibre Channel SAN,” D’Anna says.

He adds that the challenge is magnified once you move into the cloud. "With cloud storage, people are looking for ease of provisioning and so on. We work on the assumption that a share-everything philosophy will eventually dominate. On that platform you need a picture of what is happening," he says.

"For example, when there are performance problems people ask to be put back onto dedicated storage. But in the cloud you can't do that any more."

Feed the hungry

Frank Reichart, senior director of product marketing at Fujitsu, agrees. “QoS is necessary for storage consolidation. There is no way around that,” he says.

“If you do nothing, the server that demands the most performance will get it – and if that's your business intelligence system, then response times for the more time-critical production system will suffer. A lack of QoS also impedes the service level agreement-driven organisation, and if you cannot set QoS, you punish the user who has simple applications.”

The business intelligence (BI) problem is a big one because more and more BI users want to run their queries against the production data, not least because of the cost of setting up a dedicated data warehouse and the time needed to copy data there.

It is not the only example, though. A heavy database query could also easily soak up all the I/O available, starving the web and email servers that are sharing the same storage. As for the impact of a VDI boot storm on shared storage, anyone else trying to use that storage might as well take a coffee break because they are not going to get a lot done.

All of this is especially true for public-cloud operators, whose very existence and profitability are predicated on being able to share resources such as storage across multiple customers or tenants.

Increasingly this also applies to IT departments, as they too must service multiple internal clients – and typically for less and less money.

So what are the storage developers doing to deal with the issue and ensure equitable and appropriate access, without forcing you to solve the problem by expensively throwing storage at it?

The first thing is obviously to add QoS mechanisms, assigning priorities to applications. Stopping rogue applications or clients requires other approaches. One of the simpler ways to do it is to apply I/O rate limits to badly behaved applications so they don't grab everything available.
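As a rough sketch of how such a per-application rate limit might work, here is a token-bucket throttle in Python (the class, names and figures are illustrative assumptions, not code from any particular array):

    import time

    class IopsLimiter:
        """Token-bucket limiter: an application may issue at most
        max_iops requests per second, with short bursts allowed up to
        the bucket capacity."""

        def __init__(self, max_iops, burst=None):
            self.rate = max_iops
            self.capacity = burst or max_iops
            self.tokens = float(self.capacity)
            self.last = time.monotonic()

        def allow_io(self):
            now = time.monotonic()
            # Refill tokens in proportion to the time elapsed.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False        # caller queues or delays the request

    # One limiter per noisy client; well-behaved clients stay unthrottled.
    bi_limiter = IopsLimiter(max_iops=2_000)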

That can be too simple, advises Jesper Matthiesen, the CTO at Debriefing Software. “I don't consider bandwidth throttling to be a good thing because if the capacity is there you should use it,” he says.

Another route, and the one chosen by a number of leading-edge developers such as Fujitsu, NetApp and NexGen (now part of FusionIO), is to enforce minimum levels of application data throughput rather than maximums.

Extra portions

The idea is that once the total demand for storage performance has exceeded the system's ability to deliver IOPS or Mbps, then instead of granting I/O requests on a first-come first-served basis, the system ensures each server gets its minimum number of IOPS. After that, it uses any remaining headroom to deliver additional IOPS to high-priority workloads.
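A simplified sketch of that allocation logic might look like this (the workload names, figures and proportional-share rule are illustrative assumptions rather than any vendor's actual algorithm):

    def allocate_iops(capacity, workloads):
        """Give every workload its guaranteed minimum first, then share
        any remaining headroom in proportion to priority weight."""
        # Step 1: satisfy the guaranteed minimums.
        allocation = {name: w["min_iops"] for name, w in workloads.items()}
        headroom = capacity - sum(allocation.values())   # assumes minimums fit

        # Step 2: hand out the leftover IOPS by priority weight.
        total_weight = sum(w["weight"] for w in workloads.values())
        for name, w in workloads.items():
            allocation[name] += int(headroom * w["weight"] / total_weight)
        return allocation

    # Illustrative mix: production database, email and a BI workload.
    print(allocate_iops(50_000, {
        "oracle_prod": {"min_iops": 20_000, "weight": 5},
        "exchange":    {"min_iops": 5_000,  "weight": 3},
        "bi_reports":  {"min_iops": 2_000,  "weight": 1},
    }))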

The incorporation of Flash is also important in making storage QoS feasible, as far more IOPS are available for prioritisation when data is cached in Flash.

The final key element is automation, according to Reichart. "Actually setting QoS parameters is quite another matter," he says.

"Typically the metrics would be response time, but to get say sub-5ms for a database is a very complex task. You could have to play with 20 different parameters. Even then it's a moving target because once you have set up the QoS you want, another application could come in and you have to start again from scratch.

"What you want is automated or semi-automated systems that are self-optimising so the administrators can just define the requirements and let the system do the rest. It also needs more reporting and monitoring – which LUNs use which storage, which application is on which storage tier.

“The alternative is that people may even de-consolidate or look for point solutions, which is clearly inefficient and leads to over-provisioning."

That means capacity planning and modelling, with careful attention to all the performance data that your storage systems are already generating.

Start by modelling the physical workload capacity, then model the hosts; for example, you can say that a database server needs X IOPS.

Then you can begin to define monitoring policies and migrate high-performance hosts onto a high-performance storage tier or low-performance hosts onto low-performance storage.

A good technique can be to divide your primary storage into tiers, typically for high, medium and low performance. Next, you define service-level agreements for each tier: how many I/Os the storage can handle, what limit it should have on latency, what availability levels it should offer and so on.
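In practice those tiers and service levels can be captured as a simple policy table, along these lines (the tier names and figures are purely illustrative):

    # Illustrative tier definitions -- the IOPS, latency and availability
    # figures are examples, not recommendations.
    TIERS = {
        "gold":   {"max_latency_ms": 2,  "min_iops": 20_000, "availability": "99.999%"},
        "silver": {"max_latency_ms": 10, "min_iops": 5_000,  "availability": "99.99%"},
        "bronze": {"max_latency_ms": 25, "min_iops": 500,    "availability": "99.9%"},
    }

    def place_host(required_iops, latency_target_ms):
        """Pick the cheapest tier whose service levels still meet the host's needs."""
        for name in ("bronze", "silver", "gold"):        # cheapest first
            tier = TIERS[name]
            if (tier["min_iops"] >= required_iops
                    and tier["max_latency_ms"] <= latency_target_ms):
                return name
        return "gold"   # nothing cheaper fits, so escalate to the top tier

    print(place_host(required_iops=8_000, latency_target_ms=5))   # -> gold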

Danger ahead

Ideally, you then want the system to do as much as possible of the repetitive grunt work of data movement for you – and thankfully, the nuts-and-bolts tools exist to automate pretty much all of it.

Matthiesen warns, however, that although some of these processes can and should be automated, the tools involved are powerful and can be dangerous. It is a case of great power coming with great responsibility – and great risk.

For example, moving a host from one tier to another can involve hundreds of megabytes of data and a lot of I/O, and shifting that much data around is going to affect your other systems.

"You have to go with caution because this is the core of your company," Matthiesen says. "You can do horrible things moving data around. It needs great care.

“Major configuration changes still need human knowledge because it takes business understanding as well as technical insight." ®