IO, IO, it's profiling we do: Nimble architect talks flash storage tests
Bi-modal IO distributions a purer way to see array performance
Interview We interviewed Dimitris Krekoukias, Nimble Storage's global technology and strategy architect, on the subject of storage array performance claims – he has some strong opinions – particularly about Pure Storage's approach to performance.
Pure provided a response to Krekoukias' points which has been added after the interview. Both points of view stress that one IO size does not fit all performance characterisation needs.
El Reg: Tell me about storage array performance claims.
Dimitris Krekoukias: Storage performance (or any kind of performance, really) is one of those things often exaggerated for maximum marketing impact. Sometimes exaggerated by a whole lot, in conjunction with Reality Distortion math to make it more believable.
Case in point: The dual assertion by Pure Storage that not only is the average storage I/O block size 32KB, but that its arrays also offer high performance at the 32KB I/O size.
El Reg: What do you mean?
Dimitris Krekoukias: Here's one of the easiest analogies regarding how averages can be extremely misleading when it comes to performance: The average speed of a supercar (insert your favorite) is 60mph. This may well be a true statement given how fast most supercars are driven on average (combined city/highway/track), but it is not particularly useful in that it doesn't explain how the car actually performs in different real conditions.
Another car analogy: Pure claims a family car should be designed for a person who is four feet tall and weighs 100 pounds – because that's the average of a two adult, 2.3-child family. Such a car would be unusable – you need a car that feels comfortable for 5'8" adults weighing 160 pounds, as well as four-foot, 45-pound kids.
El Reg: So does Nimble know better about storage IO characteristics?
Dimitris Krekoukias: Nimble Storage has extremely comprehensive analytics from InfoSight, which collects far more information than any other storage array in order to build an extremely accurate picture of real-world array usage. Hundreds of billions of data points across thousands of arrays are processed every day by a massive, really Big Data back-end.
These data points contain enough information to not only predict and prevent problems, but also to point out many other things – for instance, how different applications actually carry out I/O in real life.
David Adamson (Nimble's principal data scientist) has published a nice blog post on this subject (here), plus some very thorough research (here, PDF) that breaks things down by application. To summarize:
- Large numbers of operations use a small I/O size. Small I/O sizes are more efficient for latency-critical workloads. Applications mostly do small block I/O in the less-than 16KB range, and primarily around 4/8KB. This kind of I/O is usually more latency-sensitive and often quite random. Across all customers and all applications, the bulk of all operations (52 per cent of all reads, 74 per cent of all writes) falls in the small I/O area.
- Large amounts of data tend to be transferred in large I/O sizes (greater than 64K). Large I/O sizes are more efficient for high-throughput data transmission. Indeed, the bulk of the data (84 per cent of all data read and 72 per cent of data written) uses a larger-than 64KB I/O size. Such I/O is more sequential than random and is not nearly as latency-critical.
- A good example is SQL Server. Strongly bimodal in I/O density – most I/O happens either around 8K or >64K, with a greater shift toward smaller I/Os in transaction-heavy, latency-sensitive OLTP environments, and a bigger concentration in larger I/O sizes in OLAP-type environments.
- Since the data distribution is bimodal (even within the same application), with extreme values at either end, using a single average number to define I/O size is not very useful. Actual I/O is simply not centred around that average number.
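The gap between "most operations are small" and "most data moves in large I/Os" can be shown with a minimal Python sketch. The operation counts below are made-up illustrative values, not InfoSight data. The sketch compares two possible definitions of "average I/O size" (weighted by operation count, and weighted by bytes moved) and checks how many operations actually sit near the count-weighted average.

```python
# Hypothetical bimodal workload: I/O size in KB -> number of operations.
# These counts are illustrative assumptions, not measured data.
ops_by_size_kb = {4: 40, 8: 30, 128: 10, 256: 20}

total_ops = sum(ops_by_size_kb.values())
total_kb = sum(size * n for size, n in ops_by_size_kb.items())

# Share of operations that are small (< 16 KB).
small_ops_share = sum(n for size, n in ops_by_size_kb.items() if size < 16) / total_ops

# Share of all data that is moved by large I/Os (> 64 KB).
large_data_share = sum(size * n for size, n in ops_by_size_kb.items() if size > 64) / total_kb

# Average I/O size when every operation counts equally.
count_weighted_mean_kb = total_kb / total_ops

# Average I/O size when every operation is weighted by the bytes it moves.
size_weighted_mean_kb = sum(size * size * n for size, n in ops_by_size_kb.items()) / total_kb

# Number of operations whose size is within +/-50% of the count-weighted mean.
low, high = 0.5 * count_weighted_mean_kb, 1.5 * count_weighted_mean_kb
ops_near_mean = sum(n for size, n in ops_by_size_kb.items() if low <= size <= high)

print(f"small I/Os as share of operations: {small_ops_share:.0%}")
print(f"large I/Os as share of data moved: {large_data_share:.0%}")
print(f"count-weighted mean I/O size: {count_weighted_mean_kb:.1f} KB")
print(f"size-weighted mean I/O size:  {size_weighted_mean_kb:.1f} KB")
print(f"operations near the count-weighted mean: {ops_near_mean}")
```

With this particular mix, no operation falls near the count-weighted mean, and the two averages differ by roughly a factor of three. A single average can therefore describe a size at which the workload does no I/O.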
El Reg: How does this affect array benchmarking?
Dimitris Krekoukias: Since we now have more useful statistical data and know the count, type and size of I/Os across the spectrum and across various applications, we can more accurately benchmark storage. Clearly, benchmarking needs to follow the bi-modal aspect of real applications:
- Smaller block random with a high percentage of writes. A good array should be able to do lots of these at very low and consistent latencies.
- Larger block sequential (with a bit more reads vs writes). A good array needs high throughput.
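One way to express these two profiles is as parameter sets for a load generator. The sketch below assumes the open-source fio tool, which the interview does not mention. It only builds and prints command lines and does not run them. The block sizes, read percentages, runtime and target path are hypothetical placeholders chosen to match the descriptions above, not values from Nimble.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    pattern: str       # fio --rw value: "randrw" = mixed random, "rw" = mixed sequential
    block_size: str    # fio --bs value
    read_percent: int  # fio --rwmixread value


# Illustrative assumptions only: write-heavy small random, slightly read-heavy large sequential.
SMALL_RANDOM = Profile("small-random", "randrw", "4k", 30)
LARGE_SEQUENTIAL = Profile("large-sequential", "rw", "256k", 60)


def fio_args(profile, target, runtime_s=60):
    """Build a fio argument list for one profile (not executed here)."""
    return [
        "fio",
        f"--name={profile.name}",
        f"--filename={target}",
        f"--rw={profile.pattern}",
        f"--rwmixread={profile.read_percent}",
        f"--bs={profile.block_size}",
        "--direct=1",
        f"--runtime={runtime_s}",
        "--time_based",
    ]


for profile in (SMALL_RANDOM, LARGE_SEQUENTIAL):
    # "/path/to/test/file" is a placeholder; writing to a real device destroys its data.
    print(" ".join(fio_args(profile, "/path/to/test/file")))
```

For the small-block run, the latency percentiles matter as much as the IOPS total. For the large-block run, the figure of interest is throughput.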
El Reg: Have you used this approach in a real POC?
Dimitris Krekoukias: Yes. Pure publicly claims a performance potential of 300,000 32K IOPS for its //m70 array. Recently, a customer performed a performance bakeoff between a Nimble AF7000 (not Nimble's fastest model) and the Pure //m70 (Pure's fastest model as of August 2016).
One of the tests ran I/O at 50 per cent reads and 50 per cent writes, with a small 4KB block size. The Pure array performed at about half the number of IOPS that Pure claims it should be able to do. The Nimble array beat it, but that's not the point.
The main point here is that Pure utterly failed to meet its own performance claims for exactly the transactional type of performance All-Flash Arrays are supposed to help with, using realistic I/O sizes for transactional I/O. So its stated high 32KB IOPS number (which we believe is all reads) isn't useful in real-world environments.
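An IOPS figure only translates into data moved once the block size is known, since throughput is IOPS multiplied by block size. The short Python sketch below applies that arithmetic to the two figures in this account. The 150,000 value is an approximation of "about half" of the claimed 300,000, not a number reported from the test, and KB is treated as KiB for simplicity.

```python
KIB_PER_MIB = 1024


def throughput_mib_s(iops, block_kib):
    """Throughput in MiB/s implied by an IOPS figure at a given block size."""
    return iops * block_kib / KIB_PER_MIB


# Claimed figure: 300,000 IOPS at 32K.
claimed_mib_s = throughput_mib_s(300_000, 32)

# Approximation of the bakeoff result: about half the claimed IOPS, at 4K (assumed value).
observed_mib_s = throughput_mib_s(150_000, 4)

print(f"300,000 IOPS at 32K  = {claimed_mib_s:.1f} MiB/s")
print(f"~150,000 IOPS at 4K = {observed_mib_s:.1f} MiB/s")
```

The arithmetic shows only that the two IOPS numbers describe very different amounts of data moved, which is why an IOPS claim needs its block size and read/write mix stated alongside it.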
El Reg: And the moral of this story is...
Dimitris Krekoukias: Just because something has flash in it doesn't mean it will necessarily be fast for what you need it to be. At a minimum, ask vendors for performance numbers for both small block random and large block sequential I/O, including during heavy writes. And don't forget to also ask for latency figures for the small block performance.
Pure Storage and the 32K IO issue
A Pure spokesperson responded at length to Krekoukias' points, and said that Pure "agrees 100 per cent that the storage industry has been guilty of exaggerating storage performance for marketing purposes using hero benchmarks at specific block sizes."
"These fixed block sizes are not true representations of workloads running on an array. This is why Pure took the leadership step over two years ago to stop publishing 4K IOPS and moved to 32K IOPS numbers, as we felt they were more representative of real-world workloads (more on this below). This is also the reason we always encourage customers to test using copies of their real workloads while evaluating storage systems."
"Nimble seems to assert that Pure believes that the entire world is 32K IO size, or averages to that," the spokesperson added.
"If you check out our blog, it depends quite a bit on whether you are looking at simple IO size or actual data transfer (size-weighted IO). The blog explains quite clearly why we decided to talk about 32K, rather than 4K or 8K; however, that is only the start of the conversation with a customer around sizing.
"There is no 32K IO size assumption or optimization in FlashArray. Unlike most vendors who prefer to optimize for a particular class of workload using a fixed block metadata architecture (4K, 8K, 16K, 32K etc.), we designed Purity from the beginning around a VARIABLE block size, with the realization that IO is variable size on ALL arrays (esp. in mixed/virtualised workloads).
"The advantage of a Pure Storage FlashArray has always been that it handles mixed IO sizes well without tuning or worry, with consistent performance and with maximum data reduction. We don't have a fundamental block size in our architecture. In other words, we don’t have the problem inherent in classic storage systems where splits and wastes occur even in moderate IO sizes."
Pure's spokesperson continued: "Why do we publish 32K IOPS in our specifications although our architecture is not designed around, nor optimized for, a specific IO size? In the early days the market still benchmarked storage devices on a fixed block size. In fact, most standardized benchmarking tests called for a fixed block size. Since we had to choose one size, we chose a block size more representative of the workloads consolidated in an array.
"Naturally this is the size-weighted average of IO in the array which happens to be closer to 32K. As we have shown in our recent blog published by our data science team, the size-weighted average IO for an array comes to around 32K when you consolidate numerous workloads.
"At the macro level, when you consider workloads across our entire fleet of storage systems, block sizes of 16K or lower dominate for IOPS whereas block sizes greater than 64K dominate for throughput. And the data distribution is indeed bi-modal as you can see in these two charts from our blog.
"Kudos to other vendors for starting to do similar data-driven research. We have gone further and asked much deeper questions that influenced our variable block design across all data services based on these findings. We have been openly discussing and encouraging the market to look at block sizes holistically for a long time."
Find blog articles here where Pure people dig deeper into block sizes for various workloads. They all have a common theme: it is meaningless to compare 4K-8K-based vanity benchmarks because application IO is more than just repeated transactions at a single block size. ®