Original URL: http://www.theregister.co.uk/2010/12/13/making_mlc_flash_last_longer/

'Wear levelling' - a bedroom aid for multi-level cell Flash

Helps it last longer

By Chris Mellor

Posted in Storage, 13th December 2010 09:15 GMT

Comment Multi-level cell NAND flash doesn't last as long as single-level cell flash but there are ways to increase its life, under the generic heading of wear-levelling.

NAND flash cells have a finite life, in that they only support a specific number of writes before failing to return valid data from a read request. It's necessary to understand some background characteristics of flash before moving on to ways of overcoming this limitation.

Flash is not byte-addressable, unlike disk drives and DRAM. Instead it is block-addressable, with pages of 4KB or so making up blocks of typically 512KB to 1MB, and data is written a page at a time. A further property of flash is that two access cycles are needed to write data to a block that already contains written data.

Access one is used to erase all the data in the block's cells, a block erase, and access two is used to write the fresh data there. A common problem is that not all the data in a block is invalid and the valid data must be retained.

This means that the write cycle is further complicated by the need to somehow retain the valid data in a cache, delete the block's cells, and then write in the saved valid data and the new fresh data. The whole process is called a read-erase-modify-write cycle, and obviously takes far longer than writing to empty flash cells.

In effect, the single write needed for new data is increased or amplified by additional writes to save the older data. Different flash products have different values for such write amplification and, generally speaking, the lower the value the better. The ratio between the amount of data a host wants to write - say 10MB - and the actual data the SSD controller has to write to achieve that - say 27MB - is the write amplification factor.

A factor as close to one as possible is ideal. Our numbers give a 2.7 write amplification factor, which sounds bad but is the very figure an SSD startup called Anobit works with, as we shall see.
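
As a minimal illustration, the sum can be expressed as a tiny Python function - the names here are ours, not any vendor's API:

    # Write amplification factor: bytes physically written to flash divided
    # by the bytes the host asked to write. Function name is illustrative.
    def write_amplification(host_mb, flash_mb):
        return flash_mb / host_mb

    # The example above: the host writes 10MB, the controller writes 27MB.
    print(write_amplification(10, 27))  # 2.7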

The greater the write amplification factor, the shorter the working life of the flash and the more its performance will degrade over time.

Garbage collection

All this erasing and writing takes time, and so what generally happens is that the erase part of the write cycle is carried out before an actual data write is needed. The process is called garbage collection, and occurs in the background while the flash is not being used.

The solid state drive (SSD) controller has a map of the NAND it controls and the blocks and their usage. When the host system deletes data stored in the flash it is marked as being invalid in the map.

In quieter periods the controller reviews the map - or patrols the flash, as some suppliers say - and erases those blocks with the most invalid data. As it wants to reduce the number of write cycles a block will undergo, we understand it would generally choose the blocks containing the oldest invalid data and erase those. Such erased blocks are then added to a list of available blocks, the free block pool.

Now, when a write request comes in the data is written to blocks from the free block pool and thus the write is accomplished in a shorter time. Again, we would expect the oldest blocks in the pool to be used as a way of reducing the total of writes over a time period to any one block.
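
A toy sketch of that idle-time pass might look like the following; the Block record, the threshold and the choice of erase counts as the tie-breaker are our assumptions, not any real controller's design:

    from dataclasses import dataclass

    @dataclass
    class Block:
        invalid_pages: int = 0   # pages marked invalid in the controller's map
        erase_count: int = 0     # how many times this block has been erased

    def collect_garbage(blocks, pages_per_block=128, threshold=0.5):
        # Idle-time pass: erase blocks whose share of invalid pages crosses
        # the threshold, least-worn first, and return them as the free pool.
        victims = sorted(
            (b for b in blocks
             if b.invalid_pages / pages_per_block >= threshold),
            key=lambda b: b.erase_count)
        free_pool = []
        for b in victims:
            b.invalid_pages = 0      # the block erase itself
            b.erase_count += 1
            free_pool.append(b)      # ready for a fast, erase-free write
        return free_pool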

Pliant says it uses background patrol reads and memory reclaim, and offers unlimited writes - no restrictions limiting the frequency of writing over the rated lifetime of its products, such as the LB 200M and LB 400M MLC NAND family.

Wear-levelling

Wear-levelling algorithms are used to reduce the likelihood of particular blocks being used up - having their maximum number of writes reached - and so falling out of service, reducing both the capacity of the SSD and its ability to free up blocks for fresh writes.

With dynamic wear-levelling, written data is put in blocks drawn from the free block pool. Garbage collection patrols the existing blocks, allocates those with deleted (invalid) data to the background erase processing, and then adds them to the free block pool.

However, blocks containing static, unchanging data just sit there and don't get rewritten. We could imagine that over a period of, say, six months such blocks receive zero writes whilst other blocks receive - let's be dramatic - 200 writes, creating an imbalance. Static wear-levelling locates these static data blocks and moves their data to the more often written blocks, transferring those blocks' data to the previously static data blocks - in effect a swap.

Then, over the next six months the previously static blocks get 200 writes and the previously well-used blocks get no writes. At the end of a year both sets of blocks have had 200 writes; a levelled wear number between the two sets of blocks.
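
In code, the swap could be sketched like this; the 100-erase gap is an arbitrary figure of ours:

    from dataclasses import dataclass

    @dataclass
    class Block:
        data: bytes
        erase_count: int = 0

    def level_static_wear(blocks, max_gap=100):
        # If the wear gap between the hottest and coldest blocks grows too
        # wide, swap their contents: the static data lands on the worn
        # block, and the barely-used block rejoins the write rotation.
        coldest = min(blocks, key=lambda b: b.erase_count)
        hottest = max(blocks, key=lambda b: b.erase_count)
        if hottest.erase_count - coldest.erase_count > max_gap:
            # Note the relocation itself costs a write to each block.
            coldest.data, hottest.data = hottest.data, coldest.data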

All the foregoing applies both to SLC and MLC NAND. The problem with MLC NAND is that its endurance is less than SLC NAND's. For example, Samsung has suggested SLC flash can support up to 100,000 writes; 2-bit MLC is a tenth of that at 10,000 writes; and 3-bit MLC is a tenth of that again at 1,000 writes. Extending this trend would have 4-bit MLC supporting 100 writes - clearly a complete no-no for its deployment unless radical measures are taken.
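
Extrapolating those cited figures, the trend is roughly a tenfold endurance cost per extra bit stored in a cell:

    # Extrapolating the Samsung-suggested figures: each extra bit per cell
    # costs about a factor of ten in rated write cycles.
    for bits in range(1, 5):
        print(f"{bits}-bit cell: ~{100_000 // 10 ** (bits - 1):,} writes")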

SandForce says its controllers level the amount of writes across flash blocks. The controllers have a recycler for garbage collection, and it says its DuraWrite technology optimises the number of program cycles; this can, SandForce claims, extend the endurance of its flash by up to 20 times compared to other controllers.

Over-provisioning

The main point of basic wear-levelling is to ensure an equality of write numbers across the blocks in the SSD. Above and beyond that there are other techniques used to extend endurance.

One technique is to over-provision the flash - the opposite of the thin-provisioning idea seen in shared storage arrays. An SSD with a nominal capacity of 200GB may actually have 250GB of capacity, with the extra 50GB hidden from the host system and used solely at the discretion of the SSD controller. As flash blocks in the SSD wear out they are mapped out of use by the controller, and a new block is added to the general free block pool from the 50GB reserve.
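
A minimal sketch of the remapping, assuming a simple logical-to-physical block map; class and method names are ours:

    class OverProvisionedSSD:
        # A nominal-200GB device built from 250GB of flash: the host sees
        # only the visible blocks; the rest form a hidden replacement reserve.
        def __init__(self, visible_blocks, reserve_blocks):
            self.map = {i: i for i in range(visible_blocks)}  # logical -> physical
            self.reserve = list(range(visible_blocks,
                                      visible_blocks + reserve_blocks))

        def retire(self, logical_block):
            # Map a worn-out physical block out of use, substituting one
            # from the reserve; once the reserve is gone, capacity dies.
            if not self.reserve:
                raise RuntimeError("reserve exhausted")
            self.map[logical_block] = self.reserve.pop(0)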

There is a limit to how long this will work because eventually the 50GB reserve is used up, and the SSD then faces a slow death as blocks fail one after another. If the SSD is targeted at a known application, such as a consumer media player, then its makers know most writes will be of long sequential files - music tracks or videos - and they can predict how long a given amount of flash will last if they assume an average number of bytes written per day.

With a combination of wear-levelling and over-provisioning they can produce flash for a consumer device that could last, say, five years with 500GB of data being written per day.
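
The back-of-envelope sum behind such a promise, assuming perfectly uniform wear - all the figures below are placeholders, not vendor data:

    def lifetime_years(capacity_gb, rated_cycles, gb_written_per_day, waf):
        # Total data the flash can absorb, divided by the daily physical
        # write load (host writes inflated by the amplification factor).
        return (capacity_gb * rated_cycles) / (gb_written_per_day * waf * 365)

    # e.g. 250GB of 10,000-cycle MLC taking 500GB/day at a WAF of 2.7:
    print(round(lifetime_years(250, 10_000, 500, 2.7), 1))  # ~5.1 years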

TRIM

It is harder for SSDs used in business and desktop computing applications, as large numbers of small writes can occur - ones of 4KB or less with Windows, according to SanDisk spokesperson Don Barnetson. Host operating systems could manage SSD storage better, by batching up small writes or suchlike, and some SSD controllers do that by caching writes until a large amount of data to be written has built up.

However, host operating systems for notebooks, desktops and servers assume their online storage outside main memory is disk and write to what they think are disk devices. The SSD controller simply has to cope.

One way that operating systems have improved is with the TRIM command. This command is used by operating systems like Windows to tell the SSD controller that certain sets of data are no longer valid. Normally an O/S file system tracks which files are valid or deleted.

If a file is deleted the file system sends a note to itself saying certain disk sectors are now available for use again. It does not tell the hard disk drive controller that though; there is no need. So, up until the TRIM command came along it didn't tell the SSD controller either.

When TRIM is supported the SSD controller can then work out the flash blocks involved when a disk sector holds invalid data - it already maps disk sectors to flash blocks as part of its day job, as it were - and add them to the background garbage collection pool. This way it can build up a buffer of free blocks that can be written to far more quickly than if an entire read-erase-modify-write cycle were needed for a block.
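
A toy TRIM handler might look like this; the sector-to-page map and the structures around it are our assumptions:

    def handle_trim(sector_map, invalid_pages, trimmed_sectors):
        # sector_map: logical sector -> (block, page), the controller's
        # existing day-job mapping. invalid_pages: per-block sets of pages
        # now awaiting background erase.
        for sector in trimmed_sectors:
            block, page = sector_map.pop(sector)
            invalid_pages.setdefault(block, set()).add(page)
        # The garbage collector later erases blocks whose invalid sets have
        # grown large, topping up the free pool ahead of host writes.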

Signal processing

Flash cells don't simply stop working when they have received one write request more than they can manage. Instead the quality of information they return to a read request degrades over time until the flash controller can no longer use it. Israeli startup Anobit is using software digital signal processing (DSP) technology to better extract a flash cell's information from what channel processing people call noise.

There is a signal to noise ratio (SNR) for the channel to a flash cell, and Anobit says its DSP algorithms can extract the signal at lower SNR levels than existing hardware methods. As the noise component grows with the mounting write-erase cycles, Anobit says its controllers can continue to use the NAND chips long after other controllers have given up.
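
A toy illustration of the principle - generic signal processing, emphatically not Anobit's algorithm: a worn cell's read-back level is the stored level plus noise, and a soft decision built from repeated samples can recover the bit at a lower SNR than a single hard threshold:

    import random

    def read_cell(stored_level, noise_sigma, samples=1):
        # Average several noisy reads; averaging n samples cuts the noise
        # standard deviation by a factor of sqrt(n).
        readings = [stored_level + random.gauss(0, noise_sigma)
                    for _ in range(samples)]
        return sum(readings) / len(readings)

    random.seed(1)
    level, sigma = 1.0, 2.0   # made-up figures for a heavily cycled cell
    # A single hard read at these figures is wrong roughly 40% of the time;
    # averaging 64 samples cuts the noise eight-fold, making errors rare.
    print(read_cell(level, sigma) > 0.5)
    print(read_cell(level, sigma, samples=64) > 0.5)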

It claims it can produce 2-bit MLC flash drives supporting 50,000 write/erase cycles, five times the Samsung number above. This is with its first generation technology. Anobit says its product has a five-year endurance for its 200GB SSD with a 2TB/day write rate, and a 10-year endurance for its 400GB product at the same rate, or five years at 4TB/day.

It calculates this from ten full-drive writes a day - 2TB/day on a 200GB drive - multiplied by the number of days in a year, by five for five years, and by a 2.7 write amplification factor. Each cell in the SSD is written to 10 x 365 x 5 x 2.7 = 49,275 times, or roughly 50,000 cycles.
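
Rechecking the sum:

    # Anobit's arithmetic: 2TB/day on a 200GB drive is ten full-drive
    # writes a day, inflated by the 2.7 write amplification factor.
    drive_writes_per_day = 2000 / 200          # 10
    cycles = drive_writes_per_day * 365 * 5 * 2.7
    print(f"{cycles:,.0f}")  # 49,275 - roughly the 50,000-cycle claim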

Anobit reckons its second generation product will support up to 100,000 cycles for 3-bit MLC using sub-20nm process technologies. Data reading takes longer this way than using pure hardware to read the flash, but Anobit suggests reading multiple pages in one read event could compensate for that.

If these claims are justified then DSP could become a standard part of a flash controller's technology. It's interesting to note that STEC's latest CellCare technology uses signal processing as well as data management algorithms to improve MLC endurance.

The promise of MLC

The great promise of MLC flash is that it will make flash affordable, engendering much wider use. But this is only the case if it has an acceptable working life.

Its endurance is being extended by wear-levelling methods, over-provisioning (which unfortunately adds to its cost), TRIM support and signal processing. Of course there are other necessary factors, such as a low error rate, but that is not the focus of this article.

The net of all this is that, as far as possible, each flash cell should be written to the same number of times as its neighbours, and the overall number of writes minimised by better organising the SSD's resources and its tracking of cell, page and NAND block status.

Lastly, DSP techniques promise to dramatically increase cell content readability as wear levels go up. ®