Why should storage arrays manage server flash?
One butt to kick if it all goes, erm, pear-shaped
EMC's Project Lightning has a storage array managing flash cache in servers networked to the storage array. Dell is thinking along similar lines. This is supposed to provide better storage service to the servers. Really? How?
An enterprise infrastructure architect working in the insurance area got in touch to give me a use case in which it makes perfect sense.
His scenario envisages an ESXi server farm, masses of virtual machines (VMs), and a NetApp filer with PAM cache in its controller, running NetApp's A-SIS deduplication. This is what he wrote:
ESXi Farm and a filer
"Take an ESXi farm which is connected to a NetApp filer. The volume on the filer has an A-SIS job run on it every night which consolidates the identical blocks down to a single instance. This pays big dividends as the space utilisation doesn't grow linearly with the number of VMs you provision.
"You can deploy PAM read cache in the filers and cache the actual blocks on disk rather than the dehydrated blocks served up, so yes – you get a high cache hit rate, meaning also you don't need to grow the number of physical spindles for performance with the number of VMs you provision.
"The problems lie in the scalability of the filer heads and the latency incurred by the network stack. The drawback is that the filers need lots of CPU to service the number of requests coming from the hundreds of VMs residing in the same blocks. This limits the number of VMs you can provision on 31xx and 60xx filers to around 300-500 before the CPU in the filers get really hot, and limits the performance of the VMs themselves due the 5-10ms latency of a typical storage request incurred by the network stack.
"You can upgrade your filers to the latest and greatest and spend life-changing sums of money – bang – CPU problem solved for another three years, until the same issue occurs again and you fork out another life changing sum because the filers can't keep up with the growth of your sprawling VM estate. This doesn't fix the network stack latency however."
Put PAM contents into server flash cache
"If you could take the contents of the PAM card in the filer and replicate this into flash cache in the ESX host, it serves two purposes. First, it reduces the network stack latency back to the filers which improves VM performance and subsequent consolidation ratios on each ESXi host, and it increases the length of time before the next big spend cycle when you need to upgrade your storage and spend another huge sum to fix the CPU issue. Some people need shared storage, but want the performance of local SSD. It is also much easier to sign off a couple of grand per ESXi host as you purchase them than to spend huge sums of money every few of years on new storage controllers."
Server agents for rehydration
At this point I thought that rehydration might need agent software in this use case, with my thinking going like this:
1) A deduplicated file equals unique data segments plus pointers to master copies of duplicated data segments.
2) In this use case example, we have the master copies in the ESXi server's flash cache and the unique data segments in the storage array – the file having parts in two locations. The I/O request for a VM then involves the storage array-held data and the server flash cache data being combined as the deduplicated file is rehydrated. Where is the rehydration done?
3) I'm guessing it is executed by the storage array, but wouldn't it need an agent in the ESXi server to combine the cached data with the data coming from the array to build the rehydrated file?
Our correspondent said: "Yes, the cache in the hosts would require some intelligence or an agent to do the rehydration – ie, a table which says, when I access this reference, actually go and get it from this other reference. If I don't have it, get it from the shared storage and cache it for next time. Some component of the NetApp filer's caching algorithm [would] need to exist in the host."
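As a rough sketch of that lookup table, here is what such a host-side agent might look like in Python. The class and method names are hypothetical; this illustrates the indirection he describes, not NetApp's or EMC's actual code.

```python
# Hedged sketch of the host-side agent the correspondent describes:
# a reference-indirection table plus a local flash cache. All names
# here are hypothetical illustrations.

class HostRehydrationAgent:
    def __init__(self, array):
        self.array = array        # handle to the shared storage array
        self.dedupe_map = {}      # logical ref -> master (deduped) ref
        self.flash_cache = {}     # master ref -> block data, in server flash

    def read(self, logical_ref):
        # "When I access this reference, actually go and get it from
        # this other reference": follow the indirection to the master copy.
        master_ref = self.dedupe_map.get(logical_ref, logical_ref)

        # Serve from local flash if we have it...
        if master_ref in self.flash_cache:
            return self.flash_cache[master_ref]

        # ..."if I don't have it, get it from the shared storage and
        # cache it for next time". The array stays the authoritative copy.
        data = self.array.read_block(master_ref)
        self.flash_cache[master_ref] = data
        return data
```

One design consequence worth noting: because the array remains authoritative, the host-side table and cache are disposable hints that could, in principle, be rebuilt from the array's dedupe metadata after a host reboot.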
He emphasises that the storage array and flash cache should come from a single supplier, so you should have one throat to choke if problems occur.
It doesn't have to be a NetApp filer. This use case will work in principle with an EMC VNX array controlling server-located flash cache: that is Project Lightning as EMC describes it.
A second use case
Our correspondent devised a second use case:
"Another application is something like a Data Warehouse (DWH) system. This requires vast amounts of disk performance and very low latency. I know of a DWH team that use servers with locally attached SAS disk as this gives them a solution that cannot be affected by other tasks, meaning they get predicable (but not optimal) results.
"The system copies an entire database dump from a very large line of business system every day and runs huge amounts of number crunching on it.
"The speed this system can complete the process has a direct impact on the profitability and competitiveness of their organisation. The trade-off of the current DWH model is that they spend a lot of time copying working set data around over the network, when they could take advantage of snapshot technology in shared storage to get data where it needs to be instantly. They don't want to incur the extra latency and unpredictable performance of shared storage however.
"Having local flash cache could be the way forward if it could keep a large proportion of the working set in cache and smooth out the load on the shared storage. They would not use deduplication as in the ESX case above as it would be little benefit, but they would only sign up to this if it was 100 per cent supported by one vendor with one butt to kick if it all went pear-shaped."
All this makes good sense to me. Does it to you? Can you see the sense in storage arrays managing server-located flash cache and loading them with data? ®
It may work, but it's just a stopgap
There are certainly advantages to server-side SSD caching, the biggest of which is that it reduces load on storage arrays that are these days taxed far beyond what they were originally designed for. But in the long run I think we'll see server-side SSD caching as nothing but a complex stopgap making up for deficiencies in current array designs.
If you look at "why" it's claimed server-side cache is necessary, it basically boils down to:
- The array can't handle all the I/O load from the servers, particularly when flash is used with advanced features like dedupe
- The reduction in latency from a local flash cache
The first is a clear indication that current array designs aren't going to scale to cloud workloads and all-solid-state (or mostly solid-state) levels of performance. Scale-out architectures are going to be required to deliver the controller performance needed to really benefit from flash.
The second is based on the assumption that the network or network stack itself is responsible for the 5-10ms of latency that he's reporting. The reality is that a 10G or FC storage network and network stack will introduce well under 1ms of latency; the bulk of the latency is coming from the controller and the media. Fix the controller issues and put in all-SSD media, and suddenly network storage doesn't seem so "slow". Architectures designed for SSD like TMS, Violin, and SolidFire have proven this. Local flash, particularly PCI-attached, will still be lower, but that microsecond-class performance is really only needed for a small number of applications.
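A back-of-the-envelope latency budget illustrates the point. The figures below are assumptions for illustration, not measurements:

```python
# Illustrative latency budget for a networked storage request.
# All numbers are assumed round figures, not benchmark results.
budget = {
    "10GbE/FC network + stack (round trip)": 0.3,   # well under 1 ms
    "array controller (dedupe, RAID, queueing)": 4.0,
    "15k disk media (seek + rotation)": 5.0,        # ~0.1 ms for SSD
}
total = sum(budget.values())
for component, ms in budget.items():
    print(f"{component}: {ms:.1f} ms ({ms / total:.0%} of total)")
# Swap the disk media for SSD and fix the controller path, and the
# network's share of the 5-10 ms a VM sees today is a rounding error.
```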
EMC and NetApp have huge investments in their current architectures, and are going to try every trick they can to keep them relevant as flash becomes more and more dominant in primary storage, but eventually architectures designed for flash from the start will win out.
Mr Leon has some valid points, but only *some*
He is talking specifically about NetApp:
“This limits the number of VMs you can provision on 31xx and 60xx filers to around 300-500 before the CPU in the filers gets really hot” – that's factually incorrect: http://www.vmware.com/files/pdf/VMware-View-50kSeatDeployment-WP-EN.pdf (50,000 VMs on ten FAS3170 clusters, i.e. 20 controllers => 50,000 / 20 = 2,500 VMs per storage controller)
“and limits the performance of the VMs themselves due to the 5-10ms latency of a typical storage request incurred by the network stack” – it varies depending on what networking gear is used and, in most mass deployment scenarios (VDI for a typical office worker), it is irrelevant.
That being said:
- NetApp has their own server-side caching project: http://www.theregister.co.uk/2011/06/17/netapp_project_mercury/
- VMware View 4.5 (& XenDesktop 5 as well) can utilise a ‘standard’ local SSD drive for caching purposes: http://www.vmware.com/go/statelessvirtualdesktopsRA; funnily enough, they have used a NetApp FAS2050 for testing :)
Yes, but where?
Assuming that you can create a distributed coherent cache (which EMC/NetApp has been claiming is impossible for the last ten years), where would you put the SSD cache?
On the motherboard? How would the local cache software communicate back to the remote array, and how often would the cache update (EMC updates their flash cache once per day)? This would most likely need a kernel driver in the OS (VMware, for example) for the host to use the cache.
On the CNA/HBA? Making it part of the storage infrastructure would require support in the driver. And at what price would this highly custom piece of silicon come, bathed in Unicorn Tears and individually blessed by a virginal Tech priest as it left the factory? I'd expect it to be orders of magnitude more expensive than the Fusion-io product. Fusion-io is a goodish flash drive built to use the PCI-E bus in certain computers, but an entirely custom CNA with flash and a handy CPU/software combination is quite different.
More questions than answers here.