Flexi-Plexistor's software-defined memory roadmap
Storage-class memory emerging in software
Comment Startup Plexistor's SDM software is said to run any application at near-memory speed by using caching and tiering. It has a file system that covers DRAM, NVDIMM-N (byte-addressable flash DIMMs fully mapped to memory space and accessed at cache-line granularity), NVDIMM-F (block-addressable flash DIMM on memory bus), forthcoming XPoint, and SSDs.
The open source software runs on Linux (Red Hat, CentOS and Ubuntu) and, according to Plexistor, converges memory (DRAM or NVDIMM-N) and flash storage into a single, unified memory address space. DRAM and NVDIMM-N form tier 1, with SSD/PCIe flash forming tier 2, which is still seen as memory by applications such as MongoDB, Cassandra and Couchbase.
The second tier is limited to 12.5 times the capacity of the first tier. In tier 2 Plexistor says, "NVMe devices are preferred, but aggregating several SSDs via Linux LVM or using an AFA LUN are also valid options."
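To put that ratio into numbers, here is a minimal sizing sketch. Only the 12.5x figure comes from Plexistor; the function names and the use of gigabytes are our own assumptions for illustration.

```python
# Sizing sketch: tier 2 may be at most 12.5 times the capacity of tier 1.
# Only the 12.5 ratio comes from the article; names and units are assumptions.
TIER2_MAX_RATIO = 12.5


def max_tier2_capacity_gb(tier1_gb: float) -> float:
    """Return the largest tier 2 capacity (GB) allowed for a given tier 1 size."""
    if tier1_gb < 0:
        raise ValueError("tier 1 capacity cannot be negative")
    return tier1_gb * TIER2_MAX_RATIO


def tier2_fits(tier1_gb: float, tier2_gb: float) -> bool:
    """Check whether a proposed tier 2 capacity stays within the ratio limit."""
    return tier2_gb <= max_tier2_capacity_gb(tier1_gb)
```

On this arithmetic, a server with 64GB of DRAM/NVDIMM-N in tier 1 could front up to 800GB of flash in tier 2.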
The software eliminates the block abstraction layer imposed by the OS and the flash media, and cuts latency by allowing the application to access the storage media directly without creating an additional copy of the data in DRAM*. Write accesses to the flash part of SDM's memory space are immediately made persistent.
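For readers unfamiliar with the programming model, the sketch below shows what accessing file data as memory looks like from an application, using Python's standard mmap module. It is not Plexistor's code and says nothing about SDM's internals: an ordinary Linux mapping like this one still goes through the page cache, which is precisely the extra DRAM copy SDM claims to avoid. The function names, the 4KB file size and the temporary file are all our own assumptions.

```python
import mmap
import os
import tempfile


def update_in_place(path: str, offset: int, payload: bytes) -> None:
    """Overwrite bytes in a file through a memory mapping, then flush them."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            if offset < 0 or offset + len(payload) > len(m):
                raise ValueError("write would fall outside the mapping")
            # A plain slice assignment: no explicit read()/write() calls.
            m[offset:offset + len(payload)] = payload
            # Ask the OS to push the modified pages to the backing file.
            m.flush()


def demo() -> bytes:
    """Create a scratch file, update it via the mapping, and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * 4096)
        os.close(fd)
        update_in_place(path, 10, b"hello")
        with open(path, "rb") as f:
            f.seek(10)
            return f.read(5)
    finally:
        os.remove(path)
```

The point of the example is the access pattern: the application treats the file as a byte array, and durability is a flush away rather than a trip through a block IO request.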
Traditional Linux IO stack. Feel the complexity
A Plexistor white paper has this to say about running MongoDB on its software:
"Plexistor's Software Defined Memory (SDM) accelerates performance for MongoDB by liberating it from the overhead of the ordinary Linux operating system's I/O stack and the constraints of decades-old conventional storage architectures, which are overdue to be replaced. Through the evolutionary approach of Plexistor's SDM and NVDIMM-N memory cards, MongoDB performance no longer must be sacrificed to ensure data persistency or durability."
Plexistor SDM "IO" stack
Amit Golander, Plexistor's CTO, told us that XPoint, if/when it arrives, would be a tier 1 medium for Plexistor. The second tier can include NVMe over Fabrics-connected all-flash arrays. There is also a tier 3 for (relatively) cold data, which is NFS disk or the AWS public cloud. Auto-tiering is a Plexistor SDM data service and SDM is NUMA-optimised. These services are being extended.
Currently SDM is at v1.71 level. The coming v1.8, to be released later this month, will introduce mirroring as a basis for future availability.
A later version this quarter will introduce cluster-wide high-availability (slide 11) based on mirroring between nodes. Asynchronous mirroring will add one microsecond of latency while synchronous mirroring will add three. The node interconnect is 100Gbit/s Ethernet, with a switch linking servers running the SDM software and so-called "PM and Flash Bricks," where PM stands for Persistent Memory**.
The PM Brick is a dumb, passive brick allowing RDMA access: an NVMe over Fabrics appliance providing remote memory. Plexistor is working on a protocol and library for that, which would support things like requests to allocate and free memory units. The brick could be a hybrid of XPoint and 3D NAND and, Golander says, tier data up and down in 2MB chunks.
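Plexistor has not published that protocol, so the following is purely a hypothetical sketch of what allocate and free requests against a brick could look like. The class name, its methods and the assumption that memory units are the same 2MB size as the tiering chunks are all ours, not Plexistor's.

```python
# Hypothetical sketch only: Plexistor's PM Brick protocol is not public.
# We assume memory units are 2MB, matching the tiering chunk size quoted above.
CHUNK_BYTES = 2 * 1024 * 1024


class BrickAllocator:
    """Toy allocator handing out fixed-size chunks from a remote memory brick."""

    def __init__(self, total_chunks: int) -> None:
        self._free = list(range(total_chunks))
        self._used = set()

    def allocate(self, n_bytes: int) -> list:
        """Reserve enough whole chunks to hold n_bytes and return their ids."""
        if n_bytes <= 0:
            raise ValueError("allocation size must be positive")
        needed = -(-n_bytes // CHUNK_BYTES)  # ceiling division
        if needed > len(self._free):
            raise MemoryError("not enough free chunks on the brick")
        chunks = [self._free.pop() for _ in range(needed)]
        self._used.update(chunks)
        return chunks

    def free(self, chunks: list) -> None:
        """Return previously allocated chunks to the free pool."""
        for chunk in chunks:
            if chunk not in self._used:
                raise ValueError(f"chunk {chunk} is not allocated")
            self._used.remove(chunk)
            self._free.append(chunk)

    @property
    def free_chunks(self) -> int:
        return len(self._free)
```

A 3MB request against this toy brick would consume two chunks, which illustrates the rounding cost of working in fixed 2MB units.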
That will be followed by a single namespace so you could access data or files from wherever. Golander said: "We want to keep compute and data on the same node to keep latency down. Our clustering technique is around data and metadata locality ... You assume data is on the local node and go off node if there is a misprediction."
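Golander's description amounts to a local-first read path, which can be sketched as follows. This is our own illustration of the idea, not Plexistor's clustering code: the Node and Cluster classes are hypothetical, and the linear scan of remote nodes is a placeholder for whatever metadata lookup the real software uses.

```python
class Node:
    """Hypothetical cluster node holding some data in a local key-value store."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store = {}


class Cluster:
    """Toy single namespace: try the local node first, go off node on a miss."""

    def __init__(self, nodes: list) -> None:
        self.nodes = nodes

    def read(self, local: Node, key: str) -> tuple:
        """Return (value, name of the node that served it)."""
        # Fast path: assume the data lives on the local node.
        if key in local.store:
            return local.store[key], local.name
        # Misprediction: fall back to the other nodes (placeholder lookup).
        for node in self.nodes:
            if node is not local and key in node.store:
                return node.store[key], node.name
        raise KeyError(key)
```

The design bet is that the fast path hits most of the time, so the cost of going over the interconnect is only paid on the occasional miss.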
V2.0 will introduce snapshots, clones and mirroring, and Plexistor also has ideas around Docker integration. The enhanced single namespace will be in a later version. V1.8 will feature public domain testing. The v2.0 mirroring software will likely have a test restricted to a few customers.
With Plexistor, we're seeing Linux software being developed to provide storage-class memory, the concept that Fusion-io originally popularised but failed to turn into an available software product, and which acquirer SanDisk seems to have more-or-less abandoned. That's why we're paying what must seem like a lot of attention to it. ®
* MongoDB's storage engine, WiredTiger, buffers data writes in memory because storage media is traditionally orders of magnitude slower than memory. A second copy of the data is kept in the MongoDB journal for persistency. Were MongoDB to be amended so as not to do this, Plexistor reckons it could be accelerated 19 per cent more than the 450 per cent Plexistor achieves "out of the box."
** See US patent application number 14/658264 dated 12/10/2015 by Amit Golander and others. It describes a method for data placement based on a file-level operation.