How ECC Error Correction Works on an SSD

You have probably heard or read about ECC (error correction code) in connection with many hardware components, all of them related to memory (either RAM or storage), yet few people understand its importance. In this article we explain how ECC works in an SSD controller, and how it makes a big difference to the useful life of an SSD.

Every device that uses NAND flash memory requires error correction for random bit errors (known as “soft” errors). This is because a lot of electrical noise is produced inside a NAND chip, and the signal levels of the bits that pass through a string of NAND cells are very weak.

One of the reasons NAND has become the cheapest memory of all is that it leaves error correction to a component outside the NAND chip itself; in the case of SSDs, ECC is performed by the controller.
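To make the idea of controller-side correction concrete, here is a minimal sketch using a Hamming(7,4) code, which can repair one flipped bit in every 7 stored bits. This is only an illustration: real SSD controllers use far stronger codes over whole pages, and the function names below are hypothetical, not taken from any controller firmware.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (illustrative only)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if clean."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3   # 1-based index of the bad bit
    if error_position:
        c[error_position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


stored = hamming74_encode([1, 0, 1, 1])   # what the controller writes
noisy = list(stored)
noisy[4] ^= 1                             # one bit flipped by noise
recovered, position = hamming74_decode(noisy)
print(recovered, position)
```

The NAND chip only stores the 7 bits; all of the parity arithmetic happens outside it, which is the division of labor described above.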

How ECC works on an SSD controller

This same error correction also helps correct bit errors caused by wear on the memory cells themselves. Wear can cause bits to “get stuck” in one state or the other (known as a “hard” error) and can increase the frequency of “soft” errors.

Although it is not a widely understood concept, flash memory endurance is a measure of how many erase/write cycles a flash block can withstand before “hard” errors begin to appear. Very often these failures affect only individual bits, and it is very rare for the entire block to fail. With a high enough number of erase/write cycles, the “soft” error rate also increases due to a number of other mechanisms in the SSD itself.

If ECC can be used to correct these “hard” errors and the “soft” errors do not increase, then the life of the entire block is greatly extended, well beyond the endurance specified by the manufacturer.

Let’s take an example: say that an unused NAND chip has enough “soft” errors to require 8 bits of ECC, that is, each page read can have up to 8 bits that have been corrupted randomly (generally due to the electrical noise we mentioned at the start). The ECC used with this chip can correct 12 bit errors, so for the ECC to be unable to solve the problem, a page would need the 8 “soft” errors related to electrical noise plus another 5 “hard” errors due to wear.
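The arithmetic of this example can be written out as a short sketch. The numbers (12 correctable bits, up to 8 noise errors per page) come from the example above; the function and constant names are made up for illustration, and the sketch assumes the worst case where every read hits the full 8 noise errors.

```python
ECC_CORRECTABLE_BITS = 12   # from the example: the ECC corrects up to 12 bit errors
NOISE_ERRORS = 8            # from the example: worst-case noise errors per page read


def page_is_correctable(wear_errors, noise_errors=NOISE_ERRORS,
                        capability=ECC_CORRECTABLE_BITS):
    """A page can be recovered while total bit errors fit within the ECC capability."""
    return wear_errors + noise_errors <= capability


def wear_errors_to_fail(noise_errors=NOISE_ERRORS,
                        capability=ECC_CORRECTABLE_BITS):
    """Smallest number of wear-related errors that overwhelms the ECC."""
    return capability - noise_errors + 1


for wear in range(6):
    print(wear, "wear errors ->", "ok" if page_is_correctable(wear) else "uncorrectable")

print(wear_errors_to_fail())  # 5
```

In other words, the 4 bits of headroom between what the noise needs and what the ECC can fix are what absorb the first wear-related failures.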

Flash memory manufacturers guarantee that the first of those 5 failures will occur sometime after the endurance specification has been reached. This means that no bit will fail due to wear until the erase/write cycles specified by the manufacturer are exceeded. Keep in mind, though, that the specification is not precise enough to predict when the next bit will fail, and it can actually take several thousand erase/write cycles beyond the spec for this to happen. Remember that the manufacturer guarantees that it will not happen before X cycles, but not when it will happen once they are exceeded.

This means that it can take a long time before a block becomes so corrupt that it needs to be removed from service (and for that case SSDs usually have spare blocks to replace the corrupt ones). In turn, the endurance of an error-corrected block could be many times greater than the specified endurance, depending on how many excess errors the ECC is designed to correct.

What impact does the error correction code have on an SSD?

As we explained earlier, flash memory is so cheap because the ECC is not included in the chips themselves but is integrated into other hardware external to them, and as you might expect, this has a price. A more sophisticated ECC requires more processing power in the controller and can be slower if the algorithms are not very modern. Also, the number of errors that can be corrected depends on the size of the memory sector being corrected, so an SSD controller with a sophisticated ECC algorithm is likely to use a lot of resources, which can reduce overall SSD performance. These enhancements also make the controller more expensive.
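As a rough illustration of how correction strength and sector size interact, the sketch below estimates the parity overhead of a binary BCH code, a family widely used with NAND flash. The article does not say which code any given controller uses, so BCH is an assumption here, as are the sector sizes and the function name. For a binary BCH code over GF(2^m), correcting t bit errors costs at most m × t parity bits, and the field must be large enough that data plus parity fit in 2^m − 1 bits.

```python
def bch_parity_bits(sector_bytes, t):
    """Estimate (m, parity_bits) for a binary BCH code correcting t errors.

    Uses the standard upper bound parity <= m * t, where the codeword
    (data + parity) must fit in 2**m - 1 bits. Illustrative only.
    """
    data_bits = sector_bytes * 8
    m = 1
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return m, m * t


# Hypothetical configurations: (sector size in bytes, correctable bits)
for sector_bytes, t in [(512, 8), (512, 12), (1024, 12), (1024, 24)]:
    m, parity = bch_parity_bits(sector_bytes, t)
    overhead = 100 * parity / (sector_bytes * 8)
    print(f"{sector_bytes} B sector, t={t}: m={m}, "
          f"{parity} parity bits ({overhead:.1f}% overhead)")
```

Each extra correctable bit adds m parity bits to store and more work for the decoder, which is where the cost in spare area and controller processing comes from.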

Each controller implements its own ECC mathematics (in other words, there is no standard), and even well-known ECC codes (such as Reed-Solomon and LDPC) are quite complicated to understand. When someone talks about the Shannon limit (the theoretical ceiling on how much error correction is possible), they are referring to a quantity that is extremely difficult to calculate, since manufacturers do not publish it in their technical specifications.

The takeaway is this: more correction bits lead to a longer lifespan for the SSD, but they also have some impact on performance, and even on the price of the product, because a more powerful controller is needed.