Apeiron claims NVMe fabric speed without NVMe over fabrics - but how?
The secret's in the special HBA hardware sauce
Backgrounder Apeiron Data Systems' external ADS1000 array uses NVMe media to deliver block storage access using NVMe over Ethernet (NOE), but not NVMe over Fabrics (NVMeF) technology, which can also run over Ethernet. How does this subtle distinction work, and what is the difference between NOE and NVMeF?
NVMeF is a way of using the NVMe protocol, based on a standard PCIe flash card driver, to link a server with an external storage array at near-PCIe speed across an InfiniBand, Ethernet or Fibre Channel connection. To the accessing server it looks like a local, direct-attached PCIe flash drive, only reached across a network.
Data is transferred to and from the array using RDMA (Remote Direct Memory Access). One way of doing this is iWARP (internet Wide Area RDMA Protocol), which runs across an IP network and can use Ethernet or InfiniBand cabling. Another is RoCE (RDMA over Converged Ethernet), which works over layer 2 and layer 3 DCB-capable (DCB - Data Centre Bridging) switches.
NOE is a different way of using the Ethernet "plumbing": it needs proprietary HBAs and storage software in the accessing servers to set up a so-called data fabric linking those servers to the ADS1000 array and its NVMe flash drives. Crossing this fabric is said to take 1.5 microseconds each way (a three-microsecond round trip), which is added to the NVMe flash drive's access latency.
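The arithmetic behind that claim is simple enough to sketch. The fabric figures below are Apeiron's quoted numbers; the drive latency is a hypothetical example for illustration, not a measured Apeiron figure.

```python
# Illustrative arithmetic only: how the quoted fabric-crossing time adds to
# an NVMe drive's own access latency.
FABRIC_ONE_WAY_US = 1.5                          # Apeiron's claimed crossing time
FABRIC_ROUND_TRIP_US = 2 * FABRIC_ONE_WAY_US     # 3.0 microseconds

def effective_read_latency_us(drive_latency_us: float) -> float:
    """Total latency seen by the accessing server for one read."""
    return drive_latency_us + FABRIC_ROUND_TRIP_US

# Example: a hypothetical NVMe flash drive with a 90 us read latency.
print(effective_read_latency_us(90.0))  # 93.0 - the fabric adds ~3 per cent
```

The point of the claim is that the fabric's contribution is small relative to the flash medium's own latency.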
The Apeiron converged system, with Apeiron compute nodes linked by Apeiron HBAs to ADS1000 storage enclosures fitted with Ethernet switches, has a distributed storage architecture which virtualises the accessing servers' connections to the array. Both the HBAs and the switches use Intel Altera FPGAs. The accessing server nodes handle storage processing. Storage is virtualised at the block-level interface, and Apeiron software uses policy-based user configurations to manage the physical mapping.
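Block-level virtualisation with policy-driven physical mapping can be sketched loosely as follows. The striping policy, volume names and mapping scheme here are assumptions for illustration only, not Apeiron's actual implementation.

```python
# Loose sketch of a virtual-to-physical block mapping under an assumed
# striping policy. Everything here is hypothetical and illustrative.
from typing import Dict, List, Tuple

# Assumed policy: each virtual volume is striped across a set of NVMe drives.
STRIPE_DRIVES: Dict[str, List[int]] = {"vol0": [0, 1, 2, 3]}
STRIPE_BLOCKS = 8   # hypothetical stripe unit, in blocks

def virtual_to_physical(volume: str, vlba: int) -> Tuple[int, int]:
    """Map a virtual block address to a (drive, physical LBA) pair."""
    drives = STRIPE_DRIVES[volume]
    stripe = vlba // STRIPE_BLOCKS
    drive = drives[stripe % len(drives)]
    offset = vlba % STRIPE_BLOCKS
    plba = (stripe // len(drives)) * STRIPE_BLOCKS + offset
    return drive, plba

print(virtual_to_physical("vol0", 0))   # (0, 0) - first stripe, first drive
print(virtual_to_physical("vol0", 8))   # (1, 0) - next stripe, next drive
```

In Apeiron's design this mapping work is said to live in the FPGAs rather than in a software layer on the data path.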
Overall, the system passes NVMe commands from the application servers to the drives, with the FPGAs handling the NVMe/PCIe-to-Ethernet conversion.
An ESG paper states: "The driver and HBAs virtualise each storage command by inserting a fixed, four-byte identifier and sending it over hardened Layer-2 Ethernet to the integrated switch and FPGA component; the packet is identified and either sent directly to the appropriate drive via native NVMe, or to the dedicated server-to-server communication channel."
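ESG's description of the identifier scheme can be sketched loosely like this. The field layout, channel markers and routing rule are assumptions for illustration, not Apeiron's actual wire format.

```python
import struct

# Hypothetical channel markers inside the four-byte identifier; the real
# encoding is not public.
DRIVE_CHANNEL = 0x00    # assumed: route to an NVMe drive
SERVER_CHANNEL = 0x01   # assumed: dedicated server-to-server channel

def virtualise_command(channel: int, target: int, nvme_command: bytes) -> bytes:
    """Insert a fixed four-byte identifier in front of the raw NVMe command."""
    ident = struct.pack(">BBH", channel, 0, target)   # 4 bytes total
    return ident + nvme_command

def route(packet: bytes):
    """Switch-side logic, sketched: peek at the identifier and dispatch."""
    channel, _, target = struct.unpack(">BBH", packet[:4])
    payload = packet[4:]
    if channel == DRIVE_CHANNEL:
        return ("drive", target, payload)    # forward as native NVMe
    return ("server", target, payload)       # out-of-band server channel

pkt = virtualise_command(DRIVE_CHANNEL, 7, b"read-cmd")
print(route(pkt))  # ('drive', 7, b'read-cmd')
```

The key idea is that the switch only ever inspects the fixed four-byte prefix, so the NVMe command itself passes through untouched.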
It continues: “This out-of-band server network eliminates server 'chatter' from the data path. To ensure network robustness, Apeiron has integrated all of the error checking and correction capabilities of Layer 3, but without the overhead. By moving switching and storage functions (such as the physical-to-virtual mapping) to the FPGAs, and eliminating server-to-server traffic from the switching infrastructure, Apeiron has developed ultra-high performance NVMe networking at petabyte scale.”
This is how Apeiron compares and positions the two, summarised in its "NOE versus NVMe over Fabrics" comparison chart:
Apeiron's system is proprietary, although it uses commodity NVMe drives and x86-based servers. If you have workloads that need its level of performance, the product alternatives are few - DSSD and Mangstor, to name two, with E8 coming along. Tegile, Kaminario and HPE will also provide NVMe over Fabrics systems in the future - in 12-24 months, perhaps - as may NetApp, so your choices will increase.
Price, reliability, availability, resilience and scalability will all come into play, as will NVMeF standards, and that standardisation may be a strong influence on buying decisions. However, NVMeF is beset by implementation choices, such as iSER, RoCE and iWARP, and these may prove confusing to buyers, who could then be grateful for NOE's clarity.
Check out this ESG paper for a performance view of Apeiron's ADS1000 system, and position the product in a DSSD-like corner of your storage landscape. It has NVMe over Fabrics speed but doesn't use NVMe over Fabrics to get it, basing its speed on FPGA-accelerated Ethernet instead. ®