OASIS: Refreshment for dehydrated secondary storage users?
Cohesity to converge secondary storage into its platform
Comment Cohesity has launched its all-embracing secondary storage offering on what looks like an embedded hyper-converged infrastructure appliance platform.
We first learned about this startup's technology in June.
The intent is to provide a kind of super-secondary storage repository covering the many different secondary storage use cases for which customers typically use several different systems. Cohesity says it saves customers CAPEX and OPEX costs associated with having different systems by converging them all onto its single system and avoiding capacity-based licensing.
The Cohesity Data Platform uses a scale-out cluster architecture that ranges from 4 to 40 nodes or blocks. Each block has four clustered servers in a 2U chassis with direct-attached disk and PCIe flash drives.
There are two models in a C2000 product line. The C2300 offers 48TB of raw disk capacity and 3.2TB of PCIe flash storage (12TB and 800GB respectively per node). A C2500 has 96TB of disk and 6.4TB of flash (24TB and 1.6TB per node).
A 40-block C2500 system (40 x 96TB) would have 3.84PB of raw disk capacity and need at least two racks.
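The arithmetic behind that figure can be checked with a short sketch. The per-block disk and flash capacities, the four servers per block and the 2U chassis height come from the specs above; the 42U rack height, the dictionary layout and the function name are our own assumptions for illustration, not anything Cohesity has published.

```python
import math

# Per-block raw capacities in TB, as quoted for the C2000 line.
BLOCK_SPECS_TB = {
    "C2300": {"disk": 48.0, "flash": 3.2},
    "C2500": {"disk": 96.0, "flash": 6.4},
}
NODES_PER_BLOCK = 4   # four clustered servers per chassis
BLOCK_HEIGHT_U = 2    # each block is a 2U chassis
RACK_HEIGHT_U = 42    # assumption: a standard 42U rack, nothing else in it


def cluster_totals(model, blocks):
    """Return raw disk/flash totals (TB) and the minimum rack count."""
    spec = BLOCK_SPECS_TB[model]
    return {
        "disk_tb": spec["disk"] * blocks,
        "flash_tb": spec["flash"] * blocks,
        "disk_tb_per_node": spec["disk"] / NODES_PER_BLOCK,
        "racks": math.ceil(blocks * BLOCK_HEIGHT_U / RACK_HEIGHT_U),
    }


if __name__ == "__main__":
    print(cluster_totals("C2500", 40))
```

Forty C2500 blocks come to 3,840TB of raw disk, which is the 3.84PB quoted, and occupy 80U, which is why one rack is not enough.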
The operating system is OASIS, meaning Open Architecture for Scalable Intelligent Storage. OASIS data services include:
- Data protection (backup and recovery) for virtualised environments through unlimited snapshots and thin cloning
- DevOps support via clones from backup data
- File services with NFS and SMB support
- Deduplication which is global and policy-driven
- Analytics - built-in and programmable
  - Native analytics capabilities provide real-time metrics and forecasting
  - A Programmable Analytics Workbench runs custom queries against datasets
  - Analytics run in-place on the Cohesity cluster
  - Analytics are based on MapReduce
C2000 hardware blocks
Analytics clearly need CPU cycles when they run, and not when they don't. So what are those CPU cycles doing when analytics are not running? Unless Cohesity says otherwise, they are sitting there idle, which would be a waste. If they are not sitting there idle then they are being used for storage tasks. Which leads to the question: how do you maintain storage performance when analytics are running? Is there some sort of quality of service functionality?
Delphix and Actifio and Catalogic copy-reduction technologies now have a friend with OASIS's thin-cloning of files for developers.
The system is designed, Cohesity says, to work with existing on-premises storage, so customers don't have to rip-and-replace any existing secondary storage systems. So it starts out in this case as yet another secondary storage silo, albeit one which can replace others. Only then would you start saving money.
Customers can scale out one node at a time; Cohesity calls this a pay-as-you-grow approach. It's apparent that, at the moment, you can't scale compute and storage separately. We would imagine there is some form of load-balancing so that, when a node or block is added, the overall workload is redistributed equally across the blocks. If there isn't, then there should be.
We asked Cohesity CEO Mohit Aron some questions about the points that intrigued us with this product launch.
El Reg: Would you agree the scale-out hardware design looks like a hyper-converged infrastructure appliance dedicated to storage?
Mohit Aron: Yes, we are taking advantage of the dense, standards-based servers that web-scale companies like Google, Amazon, etc have been popularizing. For me, it’s my third time around building distributed systems software designed for this type of hardware. The first time was pioneering this type of architecture with the Google File System at Google and the second time was adapting this for primary storage virtualization with the hyper converged architecture I built at Nutanix.
That brought compute, networking and storage hardware together. Now I’m taking on a bigger problem in secondary storage in terms of the volume of data and types of workloads for data protection, DevOps, file services and analytics. Cohesity doesn't just converge hardware components, it also converges secondary storage workflows together.
El Reg: How do you maintain storage performance when analytics are running? Is there some sort of quality of service functionality? Or are there dedicated analytics CPU resources which are idle when analytics are not running?
Mohit Aron: Highly granular QoS is a fundamental feature in the platform, configurable on a per workload process basis. We allow admins to allocate resources to particular workloads, e.g., analytics and backups, to isolate their performance from one another. This allocation, however, is work-conserving - which means that if they're not being used 100 per cent of the time, we'll reallocate those resources to other workloads.
Our built-in storage analytics will provide useful statistics over time to guide customers in growing their Cohesity cluster to optimize for the unique mix of secondary storage workloads in their environment. We also provide policies that can be tuned per workload. For example, our deduplication functionality can be configured to be inline for backups, and post-process for test/dev workloads.
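For readers unfamiliar with the term, the work-conserving allocation Aron describes can be sketched in a few lines. This is a generic illustration of the idea, not Cohesity's scheduler: the function name, the workload names, the resource units and the choice to split spare capacity equally among workloads that still want more are all our assumptions. It also assumes the reservations do not add up to more than the total capacity.

```python
def work_conserving_allocate(capacity, reserved, demand):
    """Grant each workload up to its reservation, then lend out the rest.

    capacity: total resource units available (hypothetical units)
    reserved: dict of workload name -> guaranteed units
    demand:   dict of workload name -> units the workload wants right now
    Assumes sum(reserved.values()) <= capacity.
    """
    eps = 1e-9

    # Step 1: isolation. Nobody is pushed below min(demand, reservation).
    grant = {w: min(demand[w], reserved[w]) for w in reserved}
    spare = capacity - sum(grant.values())

    # Step 2: work conservation. Unused units go to workloads that
    # still have unmet demand, split equally (an arbitrary choice).
    while spare > eps:
        hungry = [w for w in grant if demand[w] - grant[w] > eps]
        if not hungry:
            break
        share = spare / len(hungry)
        handed_out = 0.0
        for w in hungry:
            extra = min(share, demand[w] - grant[w])
            grant[w] += extra
            handed_out += extra
        spare -= handed_out
    return grant


if __name__ == "__main__":
    reserved = {"analytics": 40, "backup": 60}
    # Analytics idle: backup borrows from the analytics reservation.
    print(work_conserving_allocate(100, reserved, {"analytics": 0, "backup": 90}))
    # Both busy: each is held to its own reservation.
    print(work_conserving_allocate(100, reserved, {"analytics": 80, "backup": 90}))
```

When analytics are idle, backup gets the 90 units it asks for; when both are busy, each is capped at its reservation, so neither starves the other.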
El Reg: Will there be an HDFS connector added or some other way of linking to Hadoop-style data?
Mohit Aron: Yes, support for HDFS and Apache Spark is in the works. Stay tuned.
El Reg: How would you compare and contrast Cohesity's product ideas with those of Actifio, Delphix and Catalogic? I'm thinking of file copy reduction, for example, here.
Mohit Aron: These vendors are tackling the data sprawl issue by focusing on one particular symptom, which is copies of that data for a particular use case, e.g. for databases with Delphix, or backup at Actifio. A key difference is that these products are added on top of your existing siloed storage products, and in the case of several of them, do not scale out to handle large data capacities in a simple to manage manner.
Cohesity is looking at the root of the problem, which is why are those storage silos there in the first place when each system is not at 100 per cent utilization? With backup, this is particularly horrendous because a backup target just sits there idle as an insurance policy. We’ve designed the Cohesity Data Platform to be a storage architecture that is built to handle diverse workloads simultaneously and manage QoS accordingly, so you can run a backup stream, spin up clones for development, deliver file services and run analytics, all at the same time.
It’s a combination of really smart distributed systems software and use of a hyper converged hardware platform with PCIe flash that enables this. We liken it to what’s happened with server consolidation. Cohesity delivers secondary storage consolidation to deliver new levels of efficiency and simplicity to your data centre.
El Reg: How would you compare and contrast Cohesity's product ideas with those of Coho, which also does in-array processing of storage system-level application functionality, such as video transcoding?
Mohit Aron: The concept of running compute services closer to where the data is stored is definitely growing in popularity in primary storage as people realize they are producing way too much data to move around efficiently.
We believe it makes more sense to run compute services such as analytics on secondary storage so that (a) These non-mission critical services do not contend with mission critical ones running in primary storage, (b) Secondary storage today is anyways idle and is being used only as an insurance policy - we improve utilization by using the idle compute cycles there to drive additional value to our customers, (c) Secondary storage contains far more data than primary storage.
El Reg: It appears that you can't scale compute and storage separately? Is this a problem?
Mohit Aron: Cohesity's software has been architected to deal with heterogeneous hardware. Thus, to scale compute, one needs to add compute-heavy nodes to the cluster, and to scale storage, one can add more storage-heavy nodes to the cluster. Today we have two hardware models with different storage capacities that can be mixed in the same cluster. Over time we will provide more options. Cohesity protects a customer's investment because they only have to buy what they need when they need it, and the system grows with their needs over time.
Cohesity's Data Platform is available now, at a starting price under $120,000. ®