MISC-TN-017: Persistent storage and read-write file systems

From DAVE Developer's Wiki


History[edit | edit source]

Version | Date         | Notes
1.0.0   | January 2021 | First public release

Introduction[edit | edit source]

In many cases, embedded systems based on Application Processors such as the NXP i.MX6 make use of read/write file systems. In turn, these file systems use non-volatile flash technologies integrated into several different devices (NOR flashes, raw NAND flashes, eMMCs, etc.).

By nature, these components are subject to several issues that need to be handled properly; otherwise, their reliability and/or lifetime can be negatively affected.

This Technical Note deals with the use of read/write file systems in combination with such memories, providing some real-world examples as well.

Embedded Linux systems with NOR flashes or raw NAND flashes[edit | edit source]

Some of the following examples refer to embedded Linux systems making use of NOR flashes or raw NAND flashes. Such systems are commonly managed by MTD/UBI subsystems and, on top of them, UBIFS to manage files.

Therefore, before diving into these examples, we suggest taking a look at our Memory Technology Device (MTD) article, where these subsystems are explained in more detail.

Wear-out[edit | edit source]

One of the most important factors to take into account is wear-out. Simply put, this is a degradation of the memory device due to repeated erase/write cycles (aka P/E cycles), resulting in a limited lifetime.

In order to mitigate this phenomenon, erase and write operations have to be distributed uniformly all over the memory. Please note that this process, known as wear leveling, can be implemented either in the host (in the case of a raw NAND memory, for example) or in the memory device itself (for instance, in the case of eMMCs).

Even when wear-out is properly managed, some degradation is unavoidable whenever write operations are performed. That being said, how can the lifetime of such a device be estimated in practice? Manufacturers provide the number of guaranteed P/E cycles. For more details about this number, please refer to the specifications of your device, which detail the test conditions it refers to. Once the guaranteed P/E cycles are known, and assuming a proper wear-leveling algorithm is in place, the expected lifetime can be determined as follows.

First of all, the Total Bytes Written (TBW) has to be calculated:

TBW = [capacity * P/E cycles] / WAF

where WAF is the Write Amplification Factor. WAF takes into account the actual amount of data written to the memory when performing write operations. This is due to the fact that non-volatile flash memories are organized as an array of sectors that can be individually erased or written. Often, the sizes of erase sectors and write sectors are different. That is why, in the case of NAND flashes for instance, they are named differently (blocks and pages, respectively). WAF varies widely depending on the workload. If it is not known for the application under discussion, it can also be measured experimentally (see the following example for more details).

Once the TBW is calculated, the expected lifetime can be estimated with this equation:

LT = TBW / D

where D is the amount of data written in the unit of time of interest (month, year, etc.).

Example: embedded Linux system equipped with a raw NAND flash memory and UBIFS file system[edit | edit source]

This example shows how to estimate the lifetime of a raw NAND flash memory used in an embedded Linux system making use of the UBIFS file system. Specifically, the memory p/n is W29N08GVSIAA by Winbond, which is a 1-bit ECC Single-Level Cell (SLC) component. In this case, the wear leveling algorithm is implemented at the Linux kernel level.

According to the datasheet:

  • erase block size is 128KiB
  • the number of P/E cycles is 100000
  • the capacity is 1 GiByte (8 Gibit).

For the sake of simplicity, it is assumed that the file system makes use of the entire memory. Otherwise, only the capacity of the partition of interest has to be taken into account. Regarding the WAF, it is assumed to be 5. This means that for each byte written by the user-space applications and daemons, five bytes are actually saved onto the memory.

TBW = (1 GiByte * 100000) / 5 = 20000 GiByte ~ 19.5 TiByte 

Assuming that the user-space software writes 650 GiB every year, the expected lifetime is

LT = 20000 / 650 = 30.8 years
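As a cross-check of the arithmetic above, the same computation can be scripted with a short shell snippet (a minimal sketch: awk is used only as a calculator, and all figures are those of this example, including the assumed WAF of 5):

```shell
# Lifetime estimation for the example above (Winbond W29N08GVSIAA).
# Capacities are expressed in GiB; WAF=5 is the assumed value.
CAPACITY=1          # device capacity [GiB]
PE_CYCLES=100000    # guaranteed P/E cycles
WAF=5               # assumed Write Amplification Factor
D=650               # data written per year [GiB/year]

awk -v c=$CAPACITY -v pe=$PE_CYCLES -v waf=$WAF -v d=$D 'BEGIN {
    tbw = c * pe / waf                    # Total Bytes Written [GiB]
    printf "TBW = %d GiB (~%.1f TiB)\n", tbw, tbw / 1024
    printf "LT  = %.1f years\n", tbw / d  # expected lifetime
}'
```

It prints TBW = 20000 GiB (~19.5 TiB) and LT = 30.8 years, matching the figures above.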
Experimental measurement of the actual written data[edit | edit source]

In many cases, WAF is unknown and cannot be estimated either. As stated previously, the system integrator can nevertheless determine the life expectancy by adopting an experimental approach. The following procedure describes how to determine the amount of data actually written for the system used in this example.

The main indicator of how much data has been written to a NAND device is how many blocks have been erased, assuming that a block is erased only if:

  • it has already been written (even if not completely)
  • it needs to be written again (this is not completely true, because UBI has a background task that erases dirty LEBs while the system is idle).

Assuming that TEC is the sum of the PEB erase counters and DAYS is the number of days the test has been run, the estimated amount of data written per year can be computed as:

D = (TEC * PEBsize) * (365 / DAYS)

This figure already includes the WAF and, thus, the lifetime, in years, can be estimated as:

LF = [capacity * P/E cycles] / D

In the same case as above, if the TEC grows by 30000 per day, we have

LF = (1GiB * 100k) / ((30k * 128KiB) * (365 / 1)) ~ 74 years
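The same numbers can be plugged into a small shell sketch (again, awk is used just as a calculator; 30000 is the measured daily TEC increase and 128 KiB is the PEB size of this device):

```shell
# Experimental lifetime estimation from the measured erase-counter growth.
TEC_PER_DAY=30000   # measured daily increase of the summed PEB erase counters
PEB_KIB=128         # physical erase block size [KiB]
CAPACITY=1          # partition capacity [GiB]
PE_CYCLES=100000    # guaranteed P/E cycles

awk -v tec=$TEC_PER_DAY -v peb=$PEB_KIB -v cap=$CAPACITY -v pe=$PE_CYCLES 'BEGIN {
    d = tec * peb * 365 / (1024 * 1024)   # data written per year [GiB/year]
    printf "D  = %.1f GiB/year\n", d
    printf "LF = %.1f years\n", cap * pe / d
}'
```

This reports D = 1336.7 GiB/year and LF = 74.8 years, consistent with the ~74 years quoted above.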

Power failures[edit | edit source]

Even though modern file systems are usually tolerant w.r.t. power failures (*), in general, sudden power cuts should be avoided. The system should always be turned off cleanly. As this is not always possible, several techniques can be put in place to mitigate the effects of a power failure. For instance, see this section of the carrier board design guidelines.

(*) Roughly speaking, this means that these file systems are able to keep their consistency across such events. They cannot avoid data loss if a power failure occurs in the middle of a write operation, however. For this reason, further countermeasures, such as data redundancy and/or the use of error-detecting/correcting codes, should be implemented at the application level for particularly important data. At the hardware level, DAVE Embedded Systems products usually leverage the "write protect" feature of flash memories in order to prevent erase/program operations during power transitions.

Example: embedded Linux system equipped with a raw NAND flash memory and UBIFS file system over UBI partition[edit | edit source]

Even though both UBI and UBIFS are designed with power-cut tolerance in mind, without requiring support from additional hardware (e.g. a supercap, a battery-backed power supply, and so on), some data might be lost and some unexpected effects may occur when the system is not shut down cleanly.


Additional failures, such as UBIFS being mounted read-only at boot time, usually do not depend only on power cuts but are a symptom of more serious problems (a buggy MTD device driver, a storage device hardware failure, device wear-out, major EMI, and so on).

When designing an application to be as safe as possible w.r.t. power cuts, please also take care of the following aspects.

Memory health monitoring[edit | edit source]

Although implementing a mechanism for monitoring the health of flash memories is not strictly required, it is recommended. Think of it as a sort of life insurance to cope with unpredictable events that might occur during the life of the product. As a result of an on-the-field software upgrade, for example, new features could be added, leading to an increase in the data rate written to the flash memories. Consequently, the lifetime expectancy calculated when the product was designed would no longer be valid. In such a case, a properly designed monitoring system would alert the maintenance personnel, who could take measures before it is too late (see, for instance, the case of the eMMCs used in Tesla cars). The following section details an example of such a system.

Example: embedded Linux system equipped with a raw NAND flash memory and UBIFS file system[edit | edit source]

There are two main indicators of NAND device health:

  • the current number of ECC-corrected errors
  • the block erase counters.

We will focus on the latter because it is easy to extract and gives a good estimate of the device's life expectancy.

UBI puts its header on top of each NAND physical erase block (PEB) and there, among other fields, the user can find the erase counter (EC). By comparing the sum of the ECs of all PEBs with the nominal maximum erase count, the user can estimate the usage of the whole NAND device.

To read the ECs directly from the PEBs at runtime, the user can rely on the ubidumpec tool: this is not yet merged into the mtd-utils package, but it has been submitted as an RFC on the linux-mtd mailing list (it is also provided by default on most DAVE Linux embedded development kits).

The expected remaining life of a UBI partition, expressed as a percentage, can be calculated with a simple formula:

RL = (((MaxEC * nr_blocks) - sum(EC)) / (MaxEC * nr_blocks)) * 100

where:

  • MaxEC is the maximum erase count supported by the raw NAND
  • nr_blocks is the number of PEBs contained in this partition.

E.g., in the case of a "standard" SLC NAND, which usually has a maximum erase count of 100k, this can be implemented as a simple bash pipe between ubidumpec and awk:

ubidumpec /dev/ubi0 | awk -v MAXEC=100000 '{ s+=$1; n=n+1} END {print s, n*MAXEC, (((n*MAXEC)-s)/(n*MAXEC))*100 }'

This command prints:

  • the sum of the ECs (in this /dev/ubi0 partition)
  • the total number of erase/program cycles allowed for this partition
  • the expected remaining lifetime (as a percentage).

Running it on a (nearly) 1 GiB partition of a brand-new SLC NAND flash gives:

root@axel:~# ubinfo /dev/ubi0
Volumes count:                           3
Logical eraseblock size:                 126976 bytes, 124.0 KiB
Total amount of logical eraseblocks:     8112 (1030029312 bytes, 982.3 MiB)
Amount of available logical eraseblocks: 0 (0 bytes)
Maximum count of volumes                 128
Count of bad physical eraseblocks:       0
Count of reserved physical eraseblocks:  160
Current maximum erase counter value:     2
Minimum input/output unit size:          2048 bytes
Character device major/minor:            248:0
Present volumes:                         0, 1, 2

root@axel:~# ubidumpec /dev/ubi0 | awk -v MAXEC=100000 '{ s+=$1; n=n+1} END {print s, n*MAXEC, (((n*MAXEC)-s)/(n*MAXEC))*100 }'
8161 811200000 99.999

As a confirmation of this data, the maximum EC of a given UBI partition can be read directly from sysfs:

root@axel:~# cat /sys/class/ubi/ubi0/max_ec
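For a quick, pessimistic estimate that does not require ubidumpec, the max_ec attribute alone can be used. The sketch below assumes (worst case) that every PEB has been erased as many times as the most-worn one; MAXEC=100000 is the SLC figure used throughout this example:

```shell
# Worst-case remaining-life bound from sysfs only (no ubidumpec needed).
# Pessimistic assumption: every PEB is as worn as the most-erased one.
MAXEC=100000                                # maximum erase count of the device
max_ec=$(cat /sys/class/ubi/ubi0/max_ec)    # highest EC seen on this partition

awk -v max=$max_ec -v lim=$MAXEC 'BEGIN {
    printf "worst-case remaining life: %.3f%%\n", (lim - max) / lim * 100
}'
```

On the brand-new device above (max_ec = 2), this would report 99.998%.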