MISC-TN-010: Using NXP eIQ Machine Learning Development Environment with Mito8M SoM

From DAVE Developer's Wiki
Revision as of 15:50, 29 January 2020 by U0001 (talk | contribs) (Testbed general configuration)

Applies to MITO 8M
Applies to Machine Learning
Warning: This technical note was validated against specific versions of hardware and software. What is described here may not work with other versions.

History

Version Date Notes
1.0.0 January 2020 First public release

Introduction

According to NXP documentation, eIQ Machine Learning Software is a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network AI models on embedded systems.

This Technical Note (TN for short) illustrates how to use eIQ in combination with Mito8M, DAVE Embedded Systems' latest SoM, which is built upon the i.MX8M processor by NXP.

Testbed

With regard to the hardware, the testbed consists of the same platform described in MISC-TN-008: Running Debian Buster (armbian) on Mito8M.

Concerning the software, the following combination was used:

  • U-Boot 2018.03 retrieved from the standard Mito8M Yocto-based Board Support Package (BSP)
  • Device tree retrieved from the standard Mito8M Yocto-based BSP
  • Linux kernel imx8qmmek 4.14.98-imx_4.14.98_2.0.0
  • eIQ-enabled Yocto-based root file system.

For more information about the kernel and the root file system, please refer to the following section.

Building NXP eIQ software

NXP document UM11226 describes how to build


The processor can run at either 800 MHz or 1.3 GHz. The tests were performed at both frequencies in order to determine how the core frequency affects the RAM bandwidth.

The following table details the characteristics of the SDRAM bank connected to the SoC.

Subsystem  Feature                        Mito8M
SoC        SoC                            NXP i.MX8M Quad
SoC        ARM core(s)                    4 x Cortex A53
SoC        ARM core frequency [MHz]       800 or 1300
SoC        L1 cache (D) [kB]              32
SoC        L1 cache (I) [kB]              32
SoC        L2 cache [MB]                  1
SDRAM      Type                           LPDDR4
SDRAM      Frequency [MHz]                1600
SDRAM      Bus width [bit]                32
SDRAM      Theoretical bandwidth [Gb/s]   102.4
SDRAM      Theoretical bandwidth [GB/s]   12.8
SDRAM      Size [MB]                      3072
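
The two theoretical bandwidth figures in the table follow from the SDRAM frequency and bus width. The sketch below reproduces them under the assumption that LPDDR4 performs two transfers per clock cycle (double data rate); the function and parameter names are illustrative only.

```python
# Theoretical peak bandwidth of the SDRAM bank, from the table values.
# Assumption: LPDDR4 transfers data on both clock edges (double data rate),
# so a 1600 MHz clock corresponds to 3200 mega-transfers per second.

def theoretical_bandwidth(clock_mhz, bus_width_bit, transfers_per_clock=2):
    """Return the peak bandwidth as a (Gb/s, GB/s) pair."""
    mbit_per_s = clock_mhz * transfers_per_clock * bus_width_bit
    gbit_per_s = mbit_per_s / 1000
    gbyte_per_s = gbit_per_s / 8
    return gbit_per_s, gbyte_per_s


gbit, gbyte = theoretical_bandwidth(clock_mhz=1600, bus_width_bit=32)
print(f"{gbit:.1f} Gb/s, {gbyte:.1f} GB/s")
```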

Software configuration

  • Linux kernel: 4.14.98
  • Root file system: Debian GNU/Linux 10 (buster)
  • Architecture: aarch64
  • Governor: userspace @ 800 MHz or 1300 MHz
root@Mito8M:~# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
800000
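
When the two frequencies have to be selected repeatedly, the same sysfs files can be driven from a script. The following is only a sketch: it assumes the standard Linux cpufreq attribute scaling_setspeed is available once the userspace governor is active (the console session above shows only the governor selection), and the helper names are hypothetical. It must be run as root on the target.

```python
from pathlib import Path

# sysfs directory used in the console session above. It is passed as a
# parameter so the helpers can also be tried against a scratch directory.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def set_fixed_frequency(khz, base=CPUFREQ_DIR):
    """Select the userspace governor, then request a fixed frequency in kHz."""
    (base / "scaling_governor").write_text("userspace\n")
    # Assumption: scaling_setspeed is the attribute used to request the
    # frequency; it is only meaningful with the userspace governor.
    (base / "scaling_setspeed").write_text(f"{khz}\n")


def read_state(base=CPUFREQ_DIR):
    """Return (governor name, current frequency in kHz)."""
    governor = (base / "scaling_governor").read_text().strip()
    cur_khz = int((base / "scaling_cur_freq").read_text().strip())
    return governor, cur_khz
```

On the target, calling set_fixed_frequency(800000) followed by read_state() would be expected to mirror the session above, provided the kernel exposes these attributes.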


Some benchmarks were built natively on the platform under test. For the sake of completeness, the version of the GCC compiler is reported below:

armbian@Mito8M:~/devel/lmbench$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)

Overall results

This section illustrates the overall results achieved by the benchmarks.

STREAM

Overall results (ARM core frequency = 800 MHz)
Function  Mito8M best rate [MB/s]  Mito8M efficiency [%]  Axel Lite efficiency [%]
Copy      6770                     51.7                   14.0
Scale     6093                     46.5                   13.8
Add       5263                     40.1                   14.6
Triad     4820                     36.8                   14.9
Overall results (ARM core frequency = 1300 MHz)
Function  Mito8M best rate [MB/s]  Mito8M efficiency [%]  Axel Lite efficiency [%]
Copy      7125                     54.3                   14.0
Scale     7501                     57.2                   13.8
Add       6762                     51.6                   14.6
Triad     6354                     48.5                   14.9

Apart from the increase over Axel Lite in absolute terms, it is noteworthy that Mito8M exhibits a significant improvement in terms of efficiency too, as shown in the above tables. This is especially true in the case of ARM core frequency set to 1300 MHz.
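
The text does not state how the efficiency column is computed. The Mito8M figures are, however, consistent (to within about 0.1 percentage points) with dividing the best rate by the theoretical bandwidth expressed as 12.8 x 1024 MB/s. The sketch below relies on that inferred formula, which should be treated as an assumption rather than as the author's documented method.

```python
# Assumption (inferred from the table, not stated in the text):
# efficiency = best rate / theoretical bandwidth,
# with the 12.8 GB/s theoretical figure converted as 12.8 * 1024 MB/s.
THEORETICAL_MB_S = 12.8 * 1024


def efficiency_percent(best_rate_mb_s):
    """Return the achieved fraction of the theoretical bandwidth, in percent."""
    return best_rate_mb_s / THEORETICAL_MB_S * 100


# Mito8M best rates [MB/s] from the two tables above.
best_rates = {
    800: {"Copy": 6770, "Scale": 6093, "Add": 5263, "Triad": 4820},
    1300: {"Copy": 7125, "Scale": 7501, "Add": 6762, "Triad": 6354},
}

for mhz, rates in best_rates.items():
    for function, rate in rates.items():
        print(f"{mhz} MHz {function:<5} {efficiency_percent(rate):.1f} %")
```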

Another interesting thing to note is how the bandwidth is affected by the ARM core frequency. If it scaled linearly, the improvement from 800 to 1300 MHz would be 62.5%. The average of the four best rates is about 5737 MB/s at 800 MHz and about 6935 MB/s at 1300 MHz, an increase of about 20.9%. For the STREAM benchmark, therefore, the achieved bandwidth does not scale linearly with the ARM core frequency.
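
The comparison between linear scaling and the measured increase can be recomputed directly from the best rates in the two STREAM tables. This is a minimal sketch; the helper names are illustrative, and it also prints the per-function increase, which the text does not discuss.

```python
# Mito8M best rates [MB/s] from the STREAM tables above.
stream_800 = {"Copy": 6770, "Scale": 6093, "Add": 5263, "Triad": 4820}
stream_1300 = {"Copy": 7125, "Scale": 7501, "Add": 6762, "Triad": 6354}


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new / old - 1) * 100


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Increase expected if bandwidth scaled linearly with the core frequency.
expected = percent_increase(800, 1300)
# Increase of the average best rate across the four STREAM functions.
measured = percent_increase(mean(stream_800.values()), mean(stream_1300.values()))

print(f"linear scaling would give {expected:.1f} %")
print(f"average best rate increases by {measured:.1f} %")
for function in stream_800:
    delta = percent_increase(stream_800[function], stream_1300[function])
    print(f"{function:<5} {delta:.1f} %")
```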

Please see this page for more details about the STREAM benchmark.

LMbench

As regards memory bandwidth, LMbench provides many results organized in different categories. For the sake of simplicity, the following tables detail just a couple of categories. The full results are available for download here (ARM core frequency set to 800 MHz) and here (ARM core frequency set to 1300 MHz).

Memory read bandwidth
Buffer size  Bandwidth [MB/s] (ARM core frequency = 800 MHz)  Bandwidth [MB/s] (ARM core frequency = 1300 MHz)
512B 1553 2521
1kB 1567 2546
2kB 1575 2560
4kB 1575 2564
8kB 1577 2564
16kB 1577 2567
32kB 1528 2490
64kB 1531 2494
128kB 1547 2530
256kB 1552 2526
512kB 1514 2518
1MB 1318 2181
2MB 1430 2148
4MB 1420 2108
8MB 1423 2038
16MB 1420 2116
32MB 1365 2117
64MB 1393 2035
128MB 1382 2035
256MB 1372 2050
512MB 1367 1998
Memory write bandwidth
Buffer size  Bandwidth [MB/s] (ARM core frequency = 800 MHz)  Bandwidth [MB/s] (ARM core frequency = 1300 MHz)
512B 2932 4771
1kB 3048 4956
2kB 3100 5046
4kB 3136 5097
8kB 3135 5101
16kB 3150 5120
32kB 2864 5127
64kB 3033 5071
128kB 3093 4886
256kB 2956 5056
512kB 3024 5054
1MB 3075 5092
2MB 3095 5116
4MB 3121 5118
8MB 3137 5120
16MB 3145 5121
32MB 3146 5120
64MB 3146 5125
128MB 3147 5123
256MB 3150 5124
512MB 3144 5125
1GB 3146 5124

There are some interesting facts to stress:

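  • Write bandwidth is practically unaffected by the buffer size. Read bandwidth is flat up to 512kB and then drops moderately (to about 1370 MB/s at 800 MHz and about 2000 MB/s at 1300 MHz) once the buffer no longer fits in the 1 MB L2 cache.
  • Both are significantly affected by the ARM core frequency. For instance, the improvement of the write bandwidth (about 63% when the buffer is 1GB) is practically the same as the increase in frequency (62.5%).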
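
To inspect how the read figures vary with the buffer size, the read-bandwidth table can be scanned for the first buffer size whose rate falls noticeably below the peak. The sketch below uses a 90% threshold, which is an arbitrary choice made for illustration; the function name is hypothetical.

```python
# LMbench memory read bandwidth [MB/s] from the table above:
# (buffer size, rate at 800 MHz, rate at 1300 MHz).
READ_BW = [
    ("512B", 1553, 2521), ("1kB", 1567, 2546), ("2kB", 1575, 2560),
    ("4kB", 1575, 2564), ("8kB", 1577, 2564), ("16kB", 1577, 2567),
    ("32kB", 1528, 2490), ("64kB", 1531, 2494), ("128kB", 1547, 2530),
    ("256kB", 1552, 2526), ("512kB", 1514, 2518), ("1MB", 1318, 2181),
    ("2MB", 1430, 2148), ("4MB", 1420, 2108), ("8MB", 1423, 2038),
    ("16MB", 1420, 2116), ("32MB", 1365, 2117), ("64MB", 1393, 2035),
    ("128MB", 1382, 2035), ("256MB", 1372, 2050), ("512MB", 1367, 1998),
]


def first_drop(rows, column, threshold=0.90):
    """Return the first buffer size whose rate is below threshold * peak."""
    peak = max(row[column] for row in rows)
    for row in rows:
        if row[column] < threshold * peak:
            return row[0]
    return None  # the rate never falls below the threshold


print("800 MHz :", first_drop(READ_BW, column=1))
print("1300 MHz:", first_drop(READ_BW, column=2))
```

With the table values, both columns cross the threshold at the 1MB row, which matches the L2 cache size listed for the SoC.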

For more information regarding LMbench, please see this page.

pmbw

As defined by the author, pmbw is "a set of assembler routines to measure the parallel memory (cache and RAM) bandwidth of modern multi-core machines." It performs a myriad of tests. Luckily, it comes with a handy tool that plots the results, which are stored in a text file, as a series of charts. Again, the benchmark was run at two different ARM core frequencies, 800 and 1300 MHz.

The complete results and the charts are available at the following links:

Generally speaking, the charts exhibit a significant drop in performance when the array size approaches the L1 and L2 cache sizes.

For more details about pmbw, please refer to this page.

stressapptest

According to the documentation, stressapptest—which was developed at Google—is "a memory interface test. It tries to maximize randomized traffic to memory from processor and I/O, with the intent of creating a realistic high load situation in order to test the existing hardware devices in a computer."

Test         Bandwidth [MB/s] (ARM core frequency = 800 MHz)  Bandwidth [MB/s] (ARM core frequency = 1300 MHz)
Memory copy  5483                                             5804


The above table lists the results achieved when the benchmark was run as detailed in this section. In this case, the difference between the two ARM core frequencies is very small (about 6%).

For more information about stressapptest, please refer to this page.

Useful links

Appendix A: Detailed testing procedures

This section details how the benchmarks were configured and run on the testbed.

STREAM

Building

To build STREAM:

  • clone its git repository
  • modify the Makefile as shown below
  • issue the make command.