MISC-TN-017: Persistent storage and read-write file systems

Applies to: MITO 8M


History

Version | Date         | Notes
1.0.0   | January 2021 | First public release
2.0.0   | January 2022 | Added the sections "Embedded Linux systems with eMMC or SD cards" and "Example: embedded Linux system equipped with SanDisk SDINBDG4-8G-XI1 eMMC and ext4 file system"
2.0.1   | January 2022 | Minor changes
3.0.0   | May 2022     | Added detailed analysis of e.MMC accesses (SanDisk SDINBDG4-8G-XI1)
3.1.0   | June 2022    | Added video of technical presentation by Lauterbach Italy

Introduction

In many cases, embedded systems based on Application Processors such as the NXP i.MX6 make use of read/write file systems. In turn, these file systems use non-volatile flash technologies integrated into several different devices (NOR flashes, raw NAND flashes, eMMC's, etc.).

By nature, these components are subject to several issues that need to be handled properly. If they are not, their performance in terms of reliability and/or lifetime can be negatively affected.

This Technical Note deals with the use of read/write file systems in combination with such memories, providing some real-world examples as well.

Embedded Linux systems with NOR flashes or raw NAND flashes

Some of the following examples refer to embedded Linux systems making use of NOR flashes or raw NAND flashes. Such systems are commonly managed by the MTD/UBI subsystems and, on top of them, UBIFS to manage files.

Therefore, before diving into these examples, we suggest taking a look at our Memory Technology Device (MTD) article, where these subsystems are explained in more detail.

Embedded Linux systems with eMMC or SD cards

Another typical use case refers to eMMC's and SD cards. As explained here, these components are FTL devices, where FTL stands for Flash Translation Layer. This layer emulates a block device on top of the flash hardware. Therefore, these storage devices are used in tandem with file systems such as ext4 and FAT32. In addition to a raw NAND flash memory, eMMC's and SD cards integrate a microcontroller implementing the FTL and other important tasks, as detailed in the rest of the document. All things considered, eMMC's and SD cards therefore appear to the host as managed-NAND block devices.

Regardless of the file system used, e.MMC devices provide some functionalities conceived to monitor their health while operating. As these functionalities are defined by JEDEC standards, all the vendors implement them.

In practice, e.MMC's integrate some registers providing specific information about the health status. These registers can be accessed with the mmc-utils, which are documented here. Interestingly, the JEDEC standard also defines a set of registers (VENDOR_PROPRIETARY_HEALTH_REPORT) that vendors are free to use for providing further, fine-grained information about the device's health status. Engineers and system integrators are supposed to contact the e.MMC manufacturer to get the required tools for accessing such registers.

The sections related to eMMC-based use cases are the result of a joint effort between Western Digital (which purchased SanDisk in 2016), Lauterbach Italy, and DAVE Embedded Systems. Parts of such sections are taken from the White Paper "TRACE32 log method for analysing accesses to an eMMC device" by Lauterbach, which is freely available for download here.


Wear-out

One of the most important factors to take into account is wear-out. Simply put, this is the degradation of the memory device caused by repeated erasing/writing cycles — aka P/E cycles — resulting in a limited lifetime.

In order to mitigate this phenomenon, erasing and writing operations have to be distributed uniformly all over the memory. Please note that this process, known as wear leveling, can be implemented either in the host (in the case of a raw NAND memory, for example) or in the memory device itself (for instance, in the case of eMMC's).

Even when wear-out is properly managed, it is unavoidable as long as write operations are performed. That being said, how can the lifetime of such a device be estimated in practice? Manufacturers provide the number of guaranteed P/E cycles. For more details about this number, please refer to the specifications of your device, which detail the test conditions this number refers to. Once the guaranteed P/E cycles are known and assuming a proper wear-leveling algorithm is in place, the expected lifetime can be determined as follows.

First of all, the Total Bytes Written (TBW) has to be calculated:

TBW = [capacity * P/E cycles] / WAF

where WAF is the Write Amplification Factor. WAF takes into account the actual amount of data written to the memory when performing write operations. This is due to the fact that non-volatile flash memories are organized as an array of sectors that can be individually erased or written. Often, the sizes of erase sectors and write sectors differ. That is why, in the case of NAND flashes for instance, they are named differently (blocks and pages, respectively). WAF varies widely depending on the workload. If it is not known for the application under discussion, it can also be measured experimentally (see the following example for more details).

Once the TBW is calculated, the expected lifetime can be estimated with this equation:

LT = TBW / D

where D is the amount of data written in the unit of time of interest (month, year, etc.).
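
As a quick illustration, the two formulas above can be combined into a small shell calculation. This is only a sketch: the capacity, P/E cycles, WAF, and yearly written-data figures below are placeholders to be replaced with the values of the actual device and workload.

CAPACITY_GIB=1          # device (or partition) capacity [GiB]
PE_CYCLES=100000        # guaranteed P/E cycles from the datasheet
WAF=5                   # Write Amplification Factor (assumed or measured)
D_GIB_PER_YEAR=650      # data written by the application per year [GiB/year]

awk -v c="$CAPACITY_GIB" -v pe="$PE_CYCLES" -v waf="$WAF" -v d="$D_GIB_PER_YEAR" '
BEGIN {
    tbw = c * pe / waf                      # TBW = [capacity * P/E cycles] / WAF
    printf "TBW = %.0f GiB (~%.1f TiB)\n", tbw, tbw / 1024
    printf "LT  = %.1f years\n", tbw / d    # LT = TBW / D
}'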

Example: embedded Linux system equipped with a raw NAND flash memory and UBIFS file system

This example shows how to estimate the lifetime of a raw NAND flash memory used in an embedded Linux system making use of the UBIFS file system. Specifically, the memory p/n is W29N08GVSIAA by Winbond, which is a 1-bit ECC Single-Level Cell (SLC) component. In this case, the wear leveling algorithm is implemented at the Linux kernel level.

According to the datasheet:

  • erase block size is 128KiB
  • the number of P/E cycles is 100000
  • the capacity is 1 GiByte (8 Gibit).

For the sake of simplicity, it is assumed that the file system makes use of the entire memory. Otherwise, only the capacity of the partition of interest has to be taken into account. The WAF is assumed to be 5, meaning that for each byte written by user-space applications and daemons, five bytes are actually saved onto the memory.

TBW = (1 GiByte * 100000) / 5 = 20000 GiByte ~ 19.5 TiByte 

Assuming that the user-space software writes 650 GiB every year, the expected lifetime is

LT = 20000 / 650 = 30.8 years

Experimental measurement of actual written data

In many cases, WAF is unknown and cannot be estimated either. As stated previously, the system integrator can nevertheless determine the lifetime expectancy by adopting an experimental approach. The following procedure describes how to determine the actual amount of written data for the system used in this example.

For NAND devices, the main indicator of how much data has been written is how many blocks have been erased, assuming that a block is erased only if:

  • it has already been written (even if not completely)
  • it needs to be written again (this is not completely true, because UBI has a background task that erases dirty LEBs while the system is idle).

Assuming that TEC is the sum of the PEB erase counters and DAYS is the number of days the test has been run, the estimated amount of data written per year can be computed as:

D = (TEC * PEBsize) * (365 / DAYS)

This figure already includes the WAF and, thus, the lifetime (in years) can be estimated as:

LF = [capacity * P/E cycles] / D

For the same system described above, assuming 30000 erase operations per day (i.e. TEC = 30000 and DAYS = 1), we have:

LF = (1 GiB * 100k) / ((30k * 128 KiB) * (365 / 1)) ~ 74 years
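
On a running system, the same estimation can be scripted. The following sketch assumes that the ubidumpec tool discussed later in this document is available and that the UBI device under test is /dev/ubi0; the remaining parameters reflect the example above and must be adapted to the actual device.

DAYS=1                  # duration of the measurement [days]
PEB_SIZE_KIB=128        # physical erase block size [KiB]
CAPACITY_GIB=1          # device (or partition) capacity [GiB]
PE_CYCLES=100000        # guaranteed P/E cycles

TEC=$(ubidumpec /dev/ubi0 | awk '{ s += $1 } END { print s }')   # sum of the PEB erase counters

awk -v tec="$TEC" -v days="$DAYS" -v peb="$PEB_SIZE_KIB" -v c="$CAPACITY_GIB" -v pe="$PE_CYCLES" '
BEGIN {
    d  = tec * peb * (365 / days)           # D in KiB/year
    lf = (c * 1024 * 1024 * pe) / d         # LF = [capacity * P/E cycles] / D
    printf "D  = %.1f GiB/year\n", d / (1024 * 1024)
    printf "LF = %.1f years\n", lf
}'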

Example: embedded Linux system equipped with SanDisk SDINBDG4-8G-XI1 eMMC and ext4 file system

Introduction

As stated previously, eMMC's and SD cards are block devices. As such, they are operated in tandem with file systems that have been developed for hard disks and solid-state drives. ext4 is one of them and one of the most popular in the Linux world.

e.MMC internal architecture (diagram)


From the system integrator's perspective, eMMC's and SD cards are easier to use than raw NAND's because they hide most of the complexity of managing the underlying memory. On the other hand, the architecture of these devices can make it difficult to retrieve data about the actual usage of the memory. There are some techniques available, however, to address this issue when working with an embedded Linux platform. This section will illustrate the following ones:

  • Logging the accesses to the storage device: The idea of this approach is to log all the accesses triggered by the host and isolate the write operations in order to determine the actual amount of data written onto the device. Two different methods are compared. The first one makes use of a hardware-based tracing tool while the other exploits a software tracer, namely the Linux kernel's Function Tracer (aka ftrace).
  • Exploiting the storage device's built-in advanced functionalities.

These approaches are illustrated in more detail in the rest of the document with the help of specific tests conducted on a real target.

Testbed

Specifically, these tests were run on the Evaluation Kit of the Mito8M SoM running Yocto Linux and featuring a SanDisk SDINBDG4-8G-XI1 eMMC operated with an ext4 file system. It is worth remembering that the same testbed was used for this Application Note as well.

The evaluation kit consists of three boards: the SoM, the SBCX carrier board, and an adapter board. This setup provides off-chip trace via a parallel trace port or a PCIe interface. The SoM is equipped with the NXP i.MX8M SoC, which in turn is based on the quad-core Arm® Cortex®-A53 CPU. The SoC features two Ultra Secured Digital Host Controllers (uSDHC) supporting SD/SDIO/MMC cards and devices. For the purpose of the tests under discussion, the uSDHC ports were used as depicted in the following image.

eMMC and microSD card interfacing

The microSD card connected to uSDHC1 was used for the bootloader, the Linux kernel, and the root file system. The eMMC device connected to uSDHC2 was used for the main workload to be analyzed. The Linux kernel version is 4.14.98.

Logging the accesses

As is known, the specific architecture of a managed-NAND device can be extremely sensitive to certain read and write access sequences performed by the host processor under the direction of the application software, especially if these are frequently iterated.

A classic software-based method of recording (logging) these accesses requires the implementation of additional code that captures the information and saves it securely. The information can be saved on another permanent storage device, for example an external USB drive. This software method is intrusive: in addition to the overhead of monitoring the eMMC accesses, further overhead is added in order to save the data.

Besides a traditional software-based approach, this example also shows a different method of capturing and saving such information through the use of a hardware-based trace tool. This can be done with minimal intrusion on the software and, in some cases, almost none. This tool captures the program and data trace transmitted by the cores of a system-on-chip (SoC) through a dedicated trace port and records it to its own dedicated memory. To do that, advanced hardware functionalities of modern SoC's are exploited.

Arm CoreSight™

Many embedded microprocessors and microcontrollers are able to trace information related to the program execution flow. This allows the sequence of instructions executed by the program to be reconstructed and examined in great detail. In some configurations it is also possible to record the data related to the read and/or write cycles performed by the program.

CoreSight™ is the name of the on-chip debug and trace technology provided by Arm®. CoreSight™ is not intended as a default logic block but, like a construction kit, it provides many different components. This allows the SoC designer to define the debug and trace resources that they want to provide. Program flow (and sometimes data flow) information is output through a resource called ETM (Embedded Trace Macrocell). The ETM trace information flow can be stored internally (on-chip trace) or can be exported outside of the SoC (off-chip trace). Arm® provides several ways for exporting a trace flow: through a parallel trace port (TPIU, Trace Port Interface Unit), or serial trace port (HSSTP, High-Speed Serial Trace Port) or through a PCIe interface.

When data trace is not available, Arm® provides the Context ID register. This is often used by an Operating System (OS) to indicate that a task switch has occurred. This is done by code in the OS kernel writing the task identifier to this register. In a multicore Arm®/Cortex® SoC, each core implements this register.

Lauterbach TRACE32 development tools

Lauterbach's TRACE32 development tools enable hardware-based debug and trace of a wide range of embedded microprocessors and microcontrollers and support debug technologies such as JTAG or SWD, as well as trace technologies such as NEXUS or ETM.

The TRACE32 tools support all Arm® CoreSight™ configurations. A TRACE32 development tool for debug and trace typically comprises the following units:

  • a universal PowerDebug module connected to the host computer via USB3 or Ethernet;
  • a debugger (debug cable) for the specific architecture of the microprocessor or microcontroller under debug;
  • for the off-chip trace, a universal PowerTrace II or PowerTrace III module providing 4GB or 8GB memory, complemented by a parallel or serial pre-processor to access the trace data;
  • or a dedicated PowerTrace Serial module for serial or PCIe trace data.

TRACE32-based eMMC access log solution

In all operating systems or device drivers that manage an eMMC memory device, some functions are provided for device access which incorporate the eMMC JEDEC standard commands. Long-term monitoring of the execution of these commands and their parameters is the best way to collect the data necessary for the access analysis. After accessing the eMMC device, a function or a code point is usually available where the eMMC command is completed. Monitoring this code point allows the detection of additional information, such as the execution time of the command.

The TRACE32 trace tool can sample the code points where eMMC accesses start and finish. By adding a tiny amount of instrumentation to your source code, you can also trace device access data. In cases where data trace is not available, the instrumentation code writes the access data to the ContextID register, allowing both types of system to be adapted to use this technique.

The following data is traced in the TRACE32-based log solution:

  • at the beginning of eMMC access: eMMC device id, command executed and related flags, access address, number of accessed memory blocks and their size;
  • at the end of the eMMC access: eMMC device id, command executed, result code and other return codes;
  • access duration.

A possible example of access monitoring is shown below, as it appears in the trace views available in TRACE32:

2| ptrace  \\vmlinux\core_core\mmc_start_request  24.228827980s
2| info                                           24.228828005s         31636D6D
2| info                                           24.228828030s         00000019
2| info                                           24.228828055s         01620910
2| info                                           24.228828080s         000000B5
2| info                                           24.228828105s         00000200
2| info                                           24.228828130s         00000010
0| ptrace  \\vmlinux\core_core\mmc_request_done   24.231239610s
0| info                                           24.231241385s         31636D6D
0| info                                           24.231241410s         00000019
0| info                                           24.231241435s         00000000
0| info                                           24.231308085s         00000900
0| info                                           24.231308210s         00000000

This is, typically, a few trace records for each eMMC access. Stress tests have verified that logging an eMMC access (functions mmc_start_request() and mmc_request_done() with related data) requires about 416 trace records in the PowerTrace memory, and that these accesses occur on average every 4 ms.

This corresponds to approximately 1GB/416 = 2.5 million eMMC logs, or approximately 10,000 seconds (2h45min) for each gigabyte of trace storage. The PowerTrace family provides either 10 million eMMC logs (11h) for a 4GB PowerTrace or 20 million (22h) for an 8GB module. By extending the trace duration with trace streaming, the limit becomes the size of the computer hard-disk/SSD or the TRACE32 limit which is 1 Tera-frame, i.e., 2.5 billion eMMC logs (over 100 days!).
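
For reference, these figures follow directly from the numbers above: 1 GB of trace memory divided by roughly 416 per logged access gives about 2.5 million logs, and at one access every 4 ms this corresponds to roughly 10,000 s of recorded activity per gigabyte of trace storage.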

The trace data can be filtered and saved on disk, and then converted into a more suitable format for analysis using a TRACE32 script (PRACTICE script), Python script, or an external conversion program.

For example, the trace shown above can be converted into the format shown below, which is more suitable for importing into specific eMMC analysis tools:

24.228827980 mmc_start_req_cmd: host=mmc1 CMD25 arg=01620910 flags=000000B5 blksz=00000200 blks=00000010
24.231239610 mmc_request_done: host=mmc1 CMD25 err=00000000 resp1=00000900 resp2=00000000

These tools perform a complete analysis of the eMMC device application accesses, in terms of addresses accessed, frequency and access methods.

The end goal is to calculate the Write Amplification Factor (WAF) seen by the eMMC (or by any other managed-NAND block device). WAF is defined as the ratio of physical data written onto the NAND and the data written by the host.

When the host writes logical sectors of the eMMC, the internal eMMC controller erases and re-programs physical pages of the NAND device. This could cause a management overhead. Large sequential writes aligned to physical page boundaries typically result in minimal overhead and optimal NAND write activity (WAF=~1). Small chunks of random writes could result in a higher overhead (WAF>>1).

This becomes important when considering the life of the raw-NAND memory inside the eMMC, which has a finite number of Program/Erase cycles. See the example below:

Example of the WAF impact on eMMC write activity (diagram)

To estimate the WAF for any particular eMMC device, and hence its expected lifetime on your application, you can capture the log file of the activity.

Once a log is obtained, it's recommended to contact your eMMC vendor to get more information about the log analysis tools required for analyzing the specific eMMC product.

Implementation example for GNU/Linux o.s.

Below is an example of how the TRACE32-based log method can be applied to a Linux system. The solution is based on light instrumentation of the mmc_start_request() and mmc_request_done() functions defined in the Linux drivers/mmc/core/core.c source code file. Relevant eMMC device accesses are captured through the instrumentation code and they are written to a static data structure making them immediately traceable if data trace is available in the SoC. If data tracing is not possible, the instrumentation code writes the data to the Arm®/Cortex® Context ID register.

The solution was successfully tested on the aforementioned embedded platform. The instrumentation code is provided in Appendix 1. The zero initialization of the T32_mmc structure is guaranteed by Linux, since this variable is allocated in the bss section. The instrumentation is normally disabled but can be enabled by writing the value "1" into the enable field of the T32_mmc structure. The identifier of the eMMC device to be traced must be written into the dev field. Both of these operations can be performed from a TRACE32 script with the following commands:

Var.set T32_mmc.enable = 1
Var.set T32_mmc.dev = 0x30636D6D   // e.g.: "mmc0" in reverse ASCII order

The infoBit field can be written as follows:

Var.set T32_mmc.infoBit = 0x80000000

This allows the user and the tools to distinguish the data written to the Context ID register by the instrumentation code from the data written by Linux for task switches. In this case, the range of values must also be reserved so that these values are not interpreted as task switch identifiers. The command to do this is shown below:

ETM.ReserveContextID 0x80000000--0xffffffff

It’s important to note that the Linux kernel must be compiled for debugging (see the Training Linux Debugging manual at [1]). The TRACE32 debugger also offers extensions for many different operating systems, known as an “OS awareness”. These add OS-specific features to the TRACE32 debugger such as the display of OS resources (tasks, queues, semaphores, ...) or support for MMU management in the OS. In TRACE32, the ability to trace tasks and execute code is based on task switch information in the trace flow. The command ETM.ReserveContextID allows simultaneous use of the Linux OS awareness support and the instrumentation for eMMC access analysis.

To reduce the amount of trace information generated by the target and to allow long-term trace via streaming, filters can be applied to isolate just the instrumentation code and its writes to the Context ID register. For example:

Break.REset
Break.Set  mmc_request_done     /Program /TraceON
Break.Set  mmc_request_done\94  /Program /TraceOFF
Break.Set  mmc_start_request    /Program /TraceON
Break.Set  mmc_start_request\38 /Program /TraceOFF

where the filters marked as /TraceOFF are mapped to program addresses immediately after the instrumentation.

To ensure the task switch data generated by the OS is included in the filtered trace flow, add an additional filter to the __switch_to() function (arch/arm64/kernel/process.c) where it calls the static inline contextidr_thread_switch() function:

Break.Set     __switch_to+0x74 /Program /TraceON
Break.Set     __switch_to+0x80 /Program /TraceOFF

The trace flow recorded by TRACE32 can be arranged into a view suitable for exporting by post-processing with the command:

Trace.FindAll , Address address.offset(mmc_start_request) OR Address address.offset(mmc_request_done) OR Cycle info OR Cycle task /List run cycle symbol %TimeFixed TIme.Zero data

NOTE: ‘OR Cycle task’ is optional.

This implementation along with the software-based method was tested on the following use case:

  • Read/write workload to the mmc0 device was issued by using the stressapptest application (stressapptest -s 20 -f /mnt/mmc0/file1 -f /mnt/mmc0/file2), resulting in the creation of two files of 8 MiByte (8388608 bytes) each:
-rw-r--r-- 1 root root 8388608 Dec  3 16:30 file1
-rw-r--r-- 1 root root 8388608 Dec  3 16:30 file2
  • To set up ftrace, the following commands were run (please note that the ftrace pipe output is purposely saved to a file on a different storage device, i.e. mmc1):
echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 1 > /sys/kernel/debug/tracing/events/mmc/enable
echo 20000 > /sys/kernel/debug/tracing/buffer_size_kb        # 20 MB buffer size
echo > /sys/kernel/debug/tracing/trace
cat /sys/kernel/debug/tracing/trace_pipe > /home/root/prove/ftrace.txt
Verification

To verify the implementation of the TRACE32-based method, a specific test was run. In essence, the testbed was configured in order to run TRACE32 and ftrace tracing — more details in the following section — simultaneously for analyzing the same workload. The logs produced by the two methods were then compared to ensure they match.

Results and comparison with the software-based method (ftrace)

In Linux, eMMC access log solutions based on purely software methods are already available. The ftrace framework provides this capability, as well as being able to log many other events. The term ftrace stands for “function tracer” and basically allows you to examine and record the execution flow of kernel functions. The dynamic tracing mode of ftrace is implemented through dynamic probes injected into the code, which allow the definition of the code to be traced at runtime. When tracing is enabled, all the collected data is stored by ftrace in a circular memory buffer. In the framework, there is a virtual filesystem called tracefs (usually mounted in /sys/kernel/tracing) which is used to configure ftrace and collect the trace data. All management is done with simple operations on the files in this directory.

Comparative tests performed on the DAVE Embedded Systems “MITO 8M Evaluation Kit” target showed that the ftrace impact compared to the TRACE32-based log solution is considerably higher in several respects. This is understandable, considering that ftrace is a general-purpose trace framework designed to trace many possible events, while the instrumentation required for the TRACE32 log method is specific and limited to the pertinent functions. Moreover, ftrace requires some buffering (ring buffer) and saving data to a target's persistent memory, while the solution based on TRACE32 uses off-chip trace to save the data externally in real-time. The following tables show a comparison between ftrace and the TRACE32 solution.

Table 1: Instrumentation size

Setup                         | vmlinux code size [MByte] (increment) | vmlinux data size [MByte] (increment) | vmlinux source files (increment)                  | Instrumentation code size (*) | Instrumentation data size (*)
Baseline (no instrumentation) | 12.79 (n/a)                           | 10.78 (n/a)                           | 4640 (n/a)                                        | n/a                           | n/a
TRACE32                       | 12.79 (+0%)                           | 10.78 (+0%)                           | 4640 (+0; 41 source code lines in the mmc driver) | 372 bytes                     | 64 bytes
ftrace                        | 14.78 (+15.6%)                        | 11.77 (+9%)                           | 5476 (+836)                                       | 1.99 MByte                    | 0.99 MByte + ?? MByte ring buffer (**)

(*)    ftrace instrumentation applies to the whole Linux kernel. TRACE32 instrumentation applies to the functions mmc_start_request() and mmc_request_done() only.

(**)    The actual size of the ftrace ring buffer can be configured at runtime but is typically in the 10—100 MByte range.

In the ftrace-based solution, an increase in kernel size of approximately 15% (code) and 9% (data) is observed compared to the kernel without ftrace. During the execution of ftrace it’s also necessary to reserve additional memory for the ring buffer. The number of source files used in building the kernel increases by 18% when the ftrace framework is included. The weight of the instrumentation required by TRACE32, on the other hand, is practically negligible both in terms of code and data.

Table 2: Instrumentation overhead (average duration of the measuring points (*), in us)

Measuring point   | No ftrace, no TRACE32 instr. (baseline) | No ftrace, with TRACE32 instr. | Increment w.r.t. baseline | ftrace enabled, no TRACE32 instr. | Increment w.r.t. baseline
mmc_start_request | 6.950                                   | 8.108                          | 1.158                     | 36.875                            | 29.925
mmc_request_done  | 0.770                                   | 1.364                          | 0.594                     | 63.031                            | 62.261

(*) The measuring points are the parts of the functions where the instrumentation is added.

The analysis of the average duration of the eMMC access functions highlights the greater overhead introduced by ftrace. Additional, detailed charts are shown in the following section. They show that using ftrace also involves a greater dispersion of the execution times compared to both the kernel without ftrace and the kernel instrumented only with the code for TRACE32. In particular, the functions mmc_start_request() and mmc_request_done() have a nearly constant execution time of a few microseconds without ftrace, and show a very variable execution time with ftrace, with maximum times up to 279 us and 285 us respectively.

In conclusion, the hardware method based on TRACE32 provides the same log data as recorded by ftrace but with minimal changes to the kernel (a few lines in a file) and a tiny time penalty. It also does not use any additional memory (in terms of RAM and file system storage) and allows for extremely long measurement times.

The following table summarizes the advantages and disadvantages of the two considered solutions: TRACE32 vs ftrace.

Table 3: Pros and cons

TRACE32
  Pros:
  • Light kernel instrumentation
  • No additional memory required
  • Long-term analysis (a few hours up to over 100 days)
  • Can be ported to other OS's / eMMC device drivers.
  Cons:
  • Hardware-based solution: requires a debug and trace tool plus an off-chip-trace capable processor and target.

ftrace
  Pros:
  • Software-based solution (no dedicated trace hardware required).
  Cons:
  • Available for the Linux kernel only
  • Heavy kernel instrumentation
  • Time intrusion in eMMC operations
  • Kernel program and data size increase
  • 10—100 MB of RAM required for the ring buffer
  • Additional storage device needed to save the ring buffer
  • For each eMMC operation, ftrace saves roughly 876 bytes of log information.
Detailed time analysis

mmc_start_request

Execution time charts (one per configuration): no ftrace / no TRACE32 instrumentation; no ftrace / with TRACE32 instrumentation; with ftrace / no TRACE32 instrumentation.

mmc_request_done

Execution time charts (one per configuration): no ftrace / no TRACE32 instrumentation; no ftrace / with TRACE32 instrumentation; with ftrace / no TRACE32 instrumentation.

Analysis of the logs and conclusions

No matter how the accesses to the e.MMC are traced, once the logs are available they can be processed thoroughly to produce reports that are very useful for analyzing how the host actually operates the device.

The following are some such reports from a test conducted on an e.MMC partition (mmcblk0p1) formatted with the ext4 file system:

root@imx8mqevk:~# mkfs.ext4 /dev/mmcblk0p1

Please note that this formatting results in an ext4 4-kByte block size:

root@imx8mqevk:~# dumpe2fs -x /dev/mmcblk0p1

dumpe2fs 1.43.5 (04-Aug-2017)
Filesystem volume name:   <none>
...
Block size:               4096
...

The ext4 block must not be confused with the e.MMC blocks, which are 512 bytes as per JEDEC specifications and are addressed according to the LBA scheme.
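
As a quick cross-check on the target, both figures can be read back at run time. This is just a sketch based on the device names used on this testbed (mmcblk0 and mmcblk0p1); they may differ on other systems:

dumpe2fs -h /dev/mmcblk0p1 | grep "Block size"       # ext4 block size (4096 bytes, see above)
cat /sys/block/mmcblk0/queue/logical_block_size      # eMMC logical block size (512 bytes as per JEDEC)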

The analyzed workload is the result of a combination of different tools performing read and write accesses (/mnt/mmc0 is the mount point of the partition being tested):

root@imx8mqevk:~# stressapptest -s 20 -f /mnt/mmc0/file1 -f /mnt/mmc0/file2
root@imx8mqevk:~# find / -name "*" > /mnt/mmc0/find_results.txt
root@imx8mqevk:~# dd if=/dev/urandom of=/mnt/mmc0/dummyfile.bin bs=4k count=25000
root@imx8mqevk:~# rm /mnt/mmc0/dummyfile.bin
root@imx8mqevk:~# dd if=/dev/zero of=/mnt/mmc0/dummyfile.bin bs=4k count=25000
root@imx8mqevk:~# sync

The following chart shows the e.MMC accesses over time during the execution of the workload along with other measurements such as read/write throughput.

e.MMC accesses over time

It is also possible to extract the latency of the individual operations.

Latency

Another extremely useful graphical depiction is the chunk size distribution. For instance, this information is often used to understand how efficient the user application is when it comes to optimizing the write operations to maximize the e.MMC lifetime. The pie chart on the left refers to the read operations, while the one on the right refers to the write operations.

Chunk size distribution

To interpret the result, one needs to take into account how the workload was implemented. In the example under discussion, the workload basically makes use of two applications: dd and stressapptest. dd was instructed to use 4-kByte data chunks (bs=4k). stressapptest uses 512-byte chunks instead, because the --write-block-size parameter was not used (for more details please refer to the source code). As a result, one would expect the majority of accesses to be 512 bytes and 4 kByte. The charts clearly show that this is not the case: most of the accesses are 512 kByte instead. This is a clear example of how the algorithms of the file system and the kernel block driver can alter the accesses issued at the application level for optimization purposes.
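
One contributing factor, assuming a standard Linux block layer configuration, is request merging: adjacent requests queued by the file system are merged before reaching the eMMC driver, up to the per-device limit exposed by sysfs (the device name mmcblk0 is an assumption):

cat /sys/block/mmcblk0/queue/max_sectors_kb          # maximum size of a single merged request [KiB]

On many platforms this limit is 512 KiB, which is consistent with the dominant chunk size observed in the charts above.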

Appendix 1: source code example

/*
 * Static data structure holding the values traced for each eMMC access.
 * It is allocated in the BSS section, hence zero-initialized by Linux:
 * the instrumentation stays disabled until the enable field is set.
 */
static struct T32_mmc_struct {
	unsigned int   enable;
	unsigned int   infoBit;
	unsigned int   dev;
	unsigned int * pHost;
	unsigned int   cmd;
	unsigned int   arg;
	unsigned int   flags;
	unsigned int   blksz;
	unsigned int   blocks;
	unsigned int   err;
	unsigned int   resp0;
	unsigned int   resp1;
	unsigned int   resp2;
	unsigned int   resp3;
} T32_mmc;

int mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
{
	int err;

	mmc_retune_hold(host);

	if (mmc_card_removed(host->card))
		return -ENOMEDIUM;

	mmc_mrq_pr_debug(host, mrq, false);

	WARN_ON(!host->claimed);

	/*
	 * TRACE32 instrumentation: if enabled and the request targets the selected
	 * device, write the command parameters to the Context ID register
	 * (CONTEXTIDR_EL1) so that they appear in the trace flow.
	 */
	if (T32_mmc.enable) {
		T32_mmc.pHost = (unsigned int *)mmc_hostname(host);
		if ((*T32_mmc.pHost)==T32_mmc.dev) {
			if (mrq->cmd) {
				write_sysreg((*T32_mmc.pHost)|T32_mmc.infoBit, 	
						  contextidr_el1);
				isb();
				T32_mmc.cmd = (mrq->cmd->opcode)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.cmd, contextidr_el1);
				isb();
				T32_mmc.arg = (mrq->cmd->arg)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.arg, contextidr_el1);
				isb();
				T32_mmc.flags = (mrq->cmd->flags)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.flags, contextidr_el1);
				isb();
			}

			if (mrq->data) {
				T32_mmc.blksz = (mrq->data->blksz)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.blksz, contextidr_el1);
				isb();
				T32_mmc.blocks = (mrq->data->blocks)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.blocks, contextidr_el1);
				isb();
			}
		}

	}

	err = mmc_mrq_prep(host, mrq);
	if (err)
		return err;
...




void mmc_request_done(struct mmc_host *host, struct mmc_request *mrq)
{
	struct mmc_command *cmd = mrq->cmd;
	int err = cmd->error;
...

...

	if (!err || !cmd->retries || mmc_card_removed(host->card)) {
		mmc_should_fail_request(host, mrq);

		if (!host->ongoing_mrq)
			led_trigger_event(host->led, LED_OFF);

		if (mrq->sbc) {
			pr_debug("%s: req done <CMD%u>: %d: %08x %08x %08x %08x\n",
				mmc_hostname(host), mrq->sbc->opcode,
				mrq->sbc->error,
				mrq->sbc->resp[0], mrq->sbc->resp[1],
				mrq->sbc->resp[2], mrq->sbc->resp[3]);
		}

		pr_debug("%s: req done (CMD%u): %d: %08x %08x %08x %08x\n",
			mmc_hostname(host), cmd->opcode, err,
			cmd->resp[0], cmd->resp[1],
			cmd->resp[2], cmd->resp[3]);

		if (mrq->data) {
			pr_debug("%s:     %d bytes transferred: %d\n",
				mmc_hostname(host),
				mrq->data->bytes_xfered, mrq->data->error);
		}

		if (mrq->stop) {
			pr_debug("%s:     (CMD%u): %d: %08x %08x %08x %08x\n",
				mmc_hostname(host), mrq->stop->opcode,
				mrq->stop->error,
				mrq->stop->resp[0], mrq->stop->resp[1],
				mrq->stop->resp[2], mrq->stop->resp[3]);
		}

		/*
		 * TRACE32 instrumentation: log the completed command, its error code
		 * and the first response word through the Context ID register.
		 */
		if (T32_mmc.enable) {
			T32_mmc.pHost = (unsigned int *)mmc_hostname(host);
			if ((*T32_mmc.pHost)==T32_mmc.dev) {
				write_sysreg((*T32_mmc.pHost)|T32_mmc.infoBit,
						  contextidr_el1);
				isb();
				T32_mmc.cmd = (cmd->opcode)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.cmd, contextidr_el1);
				isb();
				T32_mmc.err = (err)|T32_mmc.infoBit;
				write_sysreg(T32_mmc.err, contextidr_el1);
				isb();
				T32_mmc.resp0 = (cmd->resp[0])|T32_mmc.infoBit;
				write_sysreg(T32_mmc.resp0, contextidr_el1);
				isb();
			}
		}
	}
	/*
	 * Request starter must handle retries - see
	 * mmc_wait_for_req_done().
	 */
	if (mrq->done)
		mrq->done(mrq);
}

Appendix 2: Video

Technical Note presentation by Lauterbach (Language: Italian; Subtitles: English and Italian)

Persistent storage and read-write file systems

Device's built-in advanced functionalities

e.MMC's feature advanced functionalities that are useful for monitoring wear-out and, in general, the health of the device. For more details, please see this section.

Power failures

Even though modern file systems are usually tolerant w.r.t. power failures (*), in general, sudden power cuts should be avoided. The system should always be turned off cleanly. As this is not always possible, several techniques can be put in place to mitigate the effects of a power failure. For instance, see this section of the carrier board design guidelines.

(*) Roughly speaking, this means that these file systems are able to keep their consistency across such events. They can not avoid data loss if a power failure occurs in the middle of a write operation, however. For this reason, further countermeasures, such as data redundancy and/or the use of error-detecting/correcting codes, should be implemented at the application level for particularly important data. At the hardware level, DAVE Embedded Systems products usually leverage the "write protect" feature of flash memories in order to prevent erase/program operations during power transitions.

Example: embedded Linux system equipped with a raw NAND flash memory and UBIFS file system over UBI partition

Even though both UBI and UBIFS are designed with power-cut tolerance in mind, without support from additional hardware (e.g. a supercapacitor, a battery-backed power supply, and so on) some data might be lost and some unexpected effects may show up when the system is not shut down cleanly.

Additional failures, such as UBIFS being mounted read-only at boot time, usually do not depend on power cuts alone but are a symptom of major failures (buggy MTD device driver, storage device hardware failure, device wear-out, major EMI, and so on).

When designing an application to be as safe as possible with respect to power cuts, further precautions have to be taken at the application level as well.

Memory health monitoring

Although implementing a mechanism for monitoring the health of flash memories is not strictly required, it is recommended. Think of it as a sort of life insurance to cope with unpredictable events that might occur during the life of the product. As a result of an in-field software upgrade, for example, new features could be added, leading to an increase in the data rate written to the flash memories. Consequently, the lifetime expectancy calculated when the product was designed is not valid anymore. In such a case, a properly designed monitoring system would alert the personnel in charge of maintenance, who could take measures before it is too late (see for instance the case of the eMMC's used in Tesla cars). The following section details an example of such a system.

Example: embedded Linux system equipped with a raw NAND flash memory and UBIFS file system

There are two main indicators of NAND device health:

  • current ECC corrected errors
  • block erase counter.

We will focus on the latter because it is easy to extract and gives a good estimate of the device's remaining lifetime.

UBI puts its header at the beginning of each NAND physical erase block (PEB) and there, among the other fields, the user can find the erase counter (EC). By comparing the sum of the ECs of all PEBs with the nominal maximum erase count, the user can estimate the usage of the whole NAND device.

To read the ECs directly from the PEBs at runtime, the user can rely on the ubidumpec tool: this is not yet merged into the mtd-utils package, but it has been submitted as an RFC on the linux-mtd mailing list (it is also provided by default on most DAVE Embedded Systems Linux development kits).

The expected remaining life of a UBI partition, expressed as a percentage, can be calculated with a simple formula:

RL = ((MaxEC * nr_blocks) - sum(EC)) / (MaxEC * nr_blocks) * 100

Where:

  • MaxEC is the maximum erase count supported by the raw NAND
  • nr_blocks is the number of PEBs contained in the partition.

For example, in the case of a "standard" SLC NAND, which usually has a maximum erase count of 100k, this can be implemented as a simple Bash pipeline combining ubidumpec and awk:

ubidumpec /dev/ubi0 | awk -v MAXEC=100000 '{ s+=$1; n=n+1} END {print s, n*MAXEC, (((n*MAXEC)-s)/(n*MAXEC))*100 }'

This command prints:

  • the sum of the ECs (for this /dev/ubi0 partition)
  • the total number of erase/program cycles allowed for this partition
  • the expected remaining lifetime (as a percentage).

Running on a (nearly) 1GiB partition on a brand new SLC NAND flash gives:

root@axel:~# ubinfo /dev/ubi0
ubi0
Volumes count:                           3
Logical eraseblock size:                 126976 bytes, 124.0 KiB
Total amount of logical eraseblocks:     8112 (1030029312 bytes, 982.3 MiB)
Amount of available logical eraseblocks: 0 (0 bytes)
Maximum count of volumes                 128
Count of bad physical eraseblocks:       0
Count of reserved physical eraseblocks:  160
Current maximum erase counter value:     2
Minimum input/output unit size:          2048 bytes
Character device major/minor:            248:0
Present volumes:                         0, 1, 2

root@axel:~# ubidumpec /dev/ubi0 | awk -v MAXEC=100000 '{ s+=$1; n=n+1} END {print s, n*MAXEC, (((n*MAXEC)-s)/(n*MAXEC))*100 }'
8161 811200000 99.999

As a confirmation of this data, the maximum EC of a given UBI partition can be read directly from sysfs:

root@axel:~# cat /sys/class/ubi/ubi0/max_ec 
2

Example: embedded Linux system equipped with an e.MMC

As explained in this section, e.MMC's provide specific functionalities for monitoring the device's health. In practice, these components expose some registers that make health-related information available to the host.

Following is a dump of such registers regarding the wear-out status of the device, namely DEVICE_LIFE_TIME_EST_TYP_A, DEVICE_LIFE_TIME_EST_TYP_B, and PRE_EOL_INFO:

root@desk-mx8mp:~# mmc extcsd read /dev/mmcblk2 | grep LIFE
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
root@desk-mx8mp:~# mmc extcsd read /dev/mmcblk2 | grep EOL
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01

This dump refers to the same testbed described here.

Some manufacturers also use additional proprietary registers to provide information about the amount of data actually written onto the NAND. If available, this number allows the WAF to be calculated, provided that the amount of data written by the applications of the test workload is known too.
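
For instance, with purely hypothetical numbers: if such a proprietary register reported 30 GiB physically written to the NAND while the host-side log (e.g. one of the logs described earlier in this document) showed 10 GiB written by the applications, the resulting WAF would be 30 / 10 = 3.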

The health status registers can be exploited to implement a monitoring mechanism as well. For example, a user-space application can periodically poll the status of the device and take actions accordingly if the wear-out exceeds predefined thresholds.
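
A minimal sketch of such a monitoring loop is shown below, based on the mmc-utils output format shown above. The device node, the polling period, the alert threshold, and the use of logger as the alert action are all assumptions to be adapted to the actual system. Per JEDEC, the life time estimation values range from 0x01 (0%-10% of the estimated lifetime used) to 0x0B (estimated lifetime exceeded), while PRE_EOL_INFO values of 0x02 and 0x03 indicate the warning and urgent levels respectively.

#!/bin/sh
DEV=/dev/mmcblk2        # eMMC device to monitor (assumption)
PERIOD=86400            # polling period [s]: once a day
THRESHOLD=0x08          # alert when the life time estimation reaches 70%-80% used

while true; do
    LIFE=$(mmc extcsd read "$DEV" | grep LIFE_TIME_EST | grep -o "0x0[0-9A-Fa-f]" | sort | tail -n 1)
    EOL=$(mmc extcsd read "$DEV" | grep PRE_EOL_INFO | grep -o "0x0[0-9A-Fa-f]")

    if [ $((LIFE)) -ge $((THRESHOLD)) ] || [ $((EOL)) -ge 2 ]; then
        logger -t emmc-health "WARNING: wear-out threshold reached (life=$LIFE, pre-EOL=$EOL)"
    fi
    sleep "$PERIOD"
done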

Last but not least, it is worth remembering that advanced proprietary off-line tools may also be available for health monitoring. For instance, Western Digital provides such tools for its devices. For more information, please contact our Sales Department.

References

Credits

Lauterbach Italian branch office

Lauterbach SRL

Via Caldera 21

20153 Milan (Italy)

Tel. +39 02 45490282

email info_it@lauterbach.it

Web www.lauterbach.it