{{WarningMessage|text=This technical note was validated against specific versions of hardware and software. What is described here may not work with other versions.}}
{| class="wikitable"
!Version
!Date
!Notes
|-
|1.0.0
|January 2020
|First public release
|-
|1.0.1
|March 2020
|Added more details about the software configuration
|}
==Introduction==
According to [https://community.nxp.com/docs/DOC-343798 NXP documentation], ''eIQ Machine Learning Software is a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network AI models on embedded systems.''
This Technical Note (TN for short) illustrates how to use [https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ eIQ] in combination with Mito8M, one of DAVE Embedded Systems' latest SoMs, which is built upon the [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i.mx-applications-processors/i.mx-8-processors/i.mx-8m-family-armcortex-a53-cortex-m4-audio-voice-video:i.MX8M i.MX8M processor by NXP].
==Testbed==
With regard to the hardware, the testbed consists of the same platform described [[MISC-TN-008:_Running_Debian_Buster_(armbian)_on_Mito8M |here]].
Concerning the software, the following combination was used:
* U-Boot 2018.03 retrieved from the standard Mito8M Yocto-based Board Support Package (BSP)
* Device tree retrieved from the standard Mito8M Yocto-based BSP
* Linux kernel 4.14.98-imx_4.14.98_2.0.0 for the <code>imx8qmmek</code> machine (built with the Linux L4.14.98 GA Yocto BSP release for the i.MX 8 family of devices with support for NXP eIQ software)
* eIQ-enabled Yocto-based root file system (built with the same BSP release); as such, this root file system includes the following packages:
** OpenCV 4.0.1
** Arm Compute Library 19.02
** Arm NN 19.02
** ONNX Runtime 0.3.0
** TensorFlow 1.12
** TensorFlow Lite 1.12
For more information about the kernel and the root file system, please refer to the following section.
==Building NXP eIQ software==
NXP document [https://www.nxp.com/docs/en/nxp/user-guides/UM11226.pdf UM11226 Rev. 2, 06/2019] illustrates how to build eIQ software support using Yocto Project tools. Even though the official procedure was tested against Ubuntu 16.04, the build was also completed successfully on a host running Ubuntu 18.04. Following the official procedure step by step produces several binary artifacts. Regarding this TN, two of them are of interest: the Linux kernel image (<code>Image--4.14.98-r0-imx8qmmek-20200127085034.bin</code>) and the ext4 root file system image (<code>fsl-image-qt5-imx8qmmek-20200128141054.rootfs.ext4</code>).
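As a rough reference, the build boils down to a handful of commands. The manifest, branch, and setup-script names below are typical of the UM11226-era i.MX BSPs and are given here only as an indicative sketch; they must be double-checked against UM11226 before use:

```shell
# Fetch the Yocto BSP with machine-learning (eIQ) support
# (manifest and branch names are indicative, verify against UM11226)
repo init -u https://source.codeaurora.org/external/imx/imx-manifest \
          -b imx-linux-sumo -m imx-4.14.98-2.0.0_machinelearning.xml
repo sync

# Configure the build for the imx8qmmek machine with the xwayland distro
MACHINE=imx8qmmek DISTRO=fsl-imx-xwayland source ./fsl-setup-release.sh -b build-xwayland

# Build the Qt5-enabled image (this is the step that takes several hours)
bitbake fsl-image-qt5
```

Note that the machine name (<code>imx8qmmek</code>), the build directory (<code>build-xwayland</code>), and the image name (<code>fsl-image-qt5</code>) match the artifacts mentioned above.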
Please note that the building process takes several hours to complete and that almost 180 GB of disk space are required:
<pre class="board-terminal">
~/devel/eIQ/fsl-arm-yocto-bsp$ du -ch --max-depth=1
313M ./.repo
99M ./sources
30G ./downloads
147G ./build-xwayland
177G .
177G total
</pre>
==Configuring the target==
The procedure described by NXP makes use of an SD card to store all the software. For convenience, a different approach was followed to test eIQ with Mito8M. While the internal eMMC was used to store U-Boot, the device tree and the Linux kernel image were retrieved via TFTP over the Ethernet connection. Also, the board was configured to mount the root file system via NFS. The resulting configuration resembles the one described [[Deploying_Embedded_Linux_Systems#The_development_environment|here]].
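By way of illustration, such a setup typically relies on a U-Boot environment along these lines. Every address, server IP, path, and file name below is a hypothetical placeholder, not the actual configuration used for this test:

```shell
# Hypothetical U-Boot environment: kernel and DTB over TFTP, root file system over NFS
setenv serverip 192.168.0.1
setenv nfsroot /export/mito8m-rootfs
setenv bootargs console=ttymxc0,115200 root=/dev/nfs rw \
    nfsroot=${serverip}:${nfsroot},v3,tcp ip=dhcp
tftp ${loadaddr} Image
tftp ${fdt_addr} mito8m.dtb
booti ${loadaddr} - ${fdt_addr}
```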
For a detailed dump of the full bootstrap process, please refer to the following section.
===Bootstrap process===
Please click on ''Expand'' in the top right corner to open the box.
==Running TensorFlow and TensorFlow Lite examples==
To verify that the root file system was generated properly, a couple of ready-to-use examples were run. Again, to execute them, please follow the procedure described in [https://www.nxp.com/docs/en/nxp/user-guides/UM11226.pdf UM11226].
The first example makes use of TensorFlow:
<pre class="board-terminal">
root@imx8qmmek:~/devel/tensorflow# ls -la
0.00784314: 835 suit
</pre>
== Overall results ==
This section illustrates the overall results achieved by the benchmarks.
===STREAM===
{| class="wikitable"
|+
Overall results (ARM core frequency = 800 MHz)
! rowspan="2" |Function
! colspan="2" |Mito8M
! rowspan="2" |Axel Lite
efficiency
[%]
|-
!Best rate
[MB/s]
!Efficiency
[%]
|-
|Copy
|6770
|51.7
|14.0
|-
|Scale
|6093
|46.5
|13.8
|-
|Add
|5263
|40.1
|14.6
|-
|Triad
|4820
|36.8
|14.9
|}
{| class="wikitable"
|+
Overall results (ARM core frequency = 1300 MHz)
! rowspan="2" |Function
! colspan="2" |Mito8M
! rowspan="2" |Axel Lite
efficiency
[%]
|-
!Best rate
[MB/s]
!Efficiency
[%]
|-
|Copy
|7125
|54.3
|14.0
|-
|Scale
|7501
|57.2
|13.8
|-
|Add
|6762
|51.6
|14.6
|-
|Triad
|6354
|48.5
|14.9
|}
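The tables do not state how the efficiency column is computed. Assuming it is the ratio between the measured best rate and the theoretical peak memory bandwidth (an assumption, not stated in this TN), the implied peak can be back-computed from the Copy rows of the two tables:

```shell
# Implied theoretical peak = best rate / efficiency (Copy rows of both tables)
peak_800=$(awk 'BEGIN { printf "%.0f", 6770 / 0.517 }')
peak_1300=$(awk 'BEGIN { printf "%.0f", 7125 / 0.543 }')
echo "implied peak: ${peak_800} MB/s (800 MHz table), ${peak_1300} MB/s (1300 MHz table)"
```

Both figures land around 13.1 GB/s, consistent with the two tables sharing a single theoretical peak.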
Apart from the increase over Axel Lite in absolute terms, it is noteworthy that Mito8M exhibits a significant improvement in terms of efficiency too, as shown in the above tables. This is especially true in the case of ARM core frequency set to 1300 MHz.
Another interesting thing to note is how the bandwidth is affected by the ARM core frequency. If it scaled linearly, there should be an improvement of 62.5% from 800 to 1300 MHz. The average best rate at 800 MHz is about 5737 MB/s, while at 1300 MHz it is about 6936 MB/s: the increase is therefore only about 20.9%. With regard to the STREAM benchmark, the achieved bandwidth thus does not scale linearly with the ARM core frequency.
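The arithmetic can be double-checked directly from the best-rate columns of the tables above:

```shell
# Average STREAM best rate at each frequency and the relative increase
avg800=$(awk 'BEGIN { printf "%.1f", (6770 + 6093 + 5263 + 4820) / 4 }')
avg1300=$(awk 'BEGIN { printf "%.1f", (7125 + 7501 + 6762 + 6354) / 4 }')
gain=$(awk -v a="$avg800" -v b="$avg1300" 'BEGIN { printf "%.1f", (b / a - 1) * 100 }')
echo "average best rate: ${avg800} MB/s @ 800 MHz, ${avg1300} MB/s @ 1300 MHz (+${gain}%)"
```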
Please see [https://www.cs.virginia.edu/stream/ this page] for more details about STREAM benchmark.
===LMbench===
Regarding memory bandwidth, LMbench provides many results organized in different categories. For the sake of simplicity, the following tables detail just a couple of them. The full results are available for download [http://mirror.dave.eu/mito/Mito8M/lmbench-Mito8M.0-800MHz.txt here (ARM core frequency set to 800 MHz)] and [http://mirror.dave.eu/mito/Mito8M/lmbench-Mito8M.0-1300MHz.txt here (ARM core frequency set to 1300 MHz)].
{| class="wikitable"
|+Memory read bandwidth
! rowspan="2" |Buffer size
! colspan="2" |Bandwidth
[MB/s]
|-
!ARM core frequency = 800 MHz
!ARM core frequency = 1300 MHz
|-
|512B
|1553
|2521
|-
|1kB
|1567
|2546
|-
|2kB
|1575
|2560
|-
|4kB
|1575
|2564
|-
|8kB
|1577
|2564
|-
|16kB
|1577
|2567
|-
|32kB
|1528
|2490
|-
|64kB
|1531
|2494
|-
|128kB
|1547
|2530
|-
|256kB
|1552
|2526
|-
|512kB
|1514
|2518
|-
|1MB
|1318
|2181
|-
|2MB
|1430
|2148
|-
|4MB
|1420
|2108
|-
|8MB
|1423
|2038
|-
|16MB
|1420
|2116
|-
|32MB
|1365
|2117
|-
|64MB
|1393
|2035
|-
|128MB
|1382
|2035
|-
|256MB
|1372
|2050
|-
|512MB
|1367
|1998
|}
{| class="wikitable"
|+Memory write bandwidth
! rowspan="2" |Buffer size
! colspan="2" |Bandwidth
[MB/s]
|-
!ARM core frequency = 800 MHz
!ARM core frequency = 1300 MHz
|-
|512B
|2932
|4771
|-
|1kB
|3048
|4956
|-
|2kB
|3100
|5046
|-
|4kB
|3136
|5097
|-
|8kB
|3135
|5101
|-
|16kB
|3150
|5120
|-
|32kB
|2864
|5127
|-
|64kB
|3033
|5071
|-
|128kB
|3093
|4886
|-
|256kB
|2956
|5056
|-
|512kB
|3024
|5054
|-
|1MB
|3075
|5092
|-
|2MB
|3095
|5116
|-
|4MB
|3121
|5118
|-
|8MB
|3137
|5120
|-
|16MB
|3145
|5121
|-
|32MB
|3146
|5120
|-
|64MB
|3146
|5125
|-
|128MB
|3147
|5123
|-
|256MB
|3150
|5124
|-
|512MB
|3144
|5125
|-
|1GB
|3146
|5124
|}
There are some interesting facts to stress:
* Write bandwidth is practically unaffected by the buffer size, while read bandwidth drops only moderately once the buffer exceeds the cache sizes.
* Both are significantly affected by the ARM core frequency. For instance, the improvement of the write bandwidth (about 63% when the buffer is 1GB) is practically the same as the increase in frequency (62.5%).
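The frequency-scaling point can be verified with the 1GB write-bandwidth figures from the table above:

```shell
# Compare the write-bandwidth gain (1 GB buffer) with the core-frequency gain
bw_gain=$(awk 'BEGIN { printf "%.1f", (5124 / 3146 - 1) * 100 }')
freq_gain=$(awk 'BEGIN { printf "%.1f", (1300 / 800 - 1) * 100 }')
echo "write bandwidth: +${bw_gain}%, core frequency: +${freq_gain}%"
```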
For more information regarding LMbench, please see [http://lmbench.sourceforge.net/ this page].
===pmbw===
As defined by the author, <code>pmbw</code> is "a set of assembler routines to measure the parallel memory (cache and RAM) bandwidth of modern multi-core machines." It performs a myriad of tests. Luckily, it comes with a handy tool that plots the results (which are stored in a text file) as a series of charts. Again, the benchmark was run at two different ARM core frequencies, 800 and 1300 MHz.
The complete results and the charts are available at the following links:
Generally speaking, the charts exhibit significant drops in performance when the array size approaches the L1 and L2 cache sizes.
For more details about <code>pmbw</code>, please refer to [https://panthema.net/2013/pmbw/ this page].
===stressapptest===
According to the documentation, stressapptest—which was developed at Google—is "a memory interface test. It tries to maximize randomized traffic to memory from processor and I/O, with the intent of creating a realistic high load situation in order to test the existing hardware devices in a computer."
{| class="wikitable"
|+
! rowspan="2" |Test
! colspan="2" |Bandwidth
[MB/s]
|-
!ARM core frequency = 800 MHz
!ARM core frequency = 1300 MHz
|-
|Memory copy
|5483
|5804
|}
The above table lists the results achieved when the benchmark was run as detailed in [[#Running_the_tests_4|this section]]. In this case, the difference between the two ARM core frequencies is very small.
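That gap can be quantified from the table values:

```shell
# Relative memory-copy bandwidth difference between the two core frequencies
diff_pct=$(awk 'BEGIN { printf "%.1f", (5804 / 5483 - 1) * 100 }')
echo "memory copy: +${diff_pct}% going from 800 MHz to 1300 MHz"
```

A gain of roughly 6% against a 62.5% frequency increase suggests that this test is limited by the memory subsystem rather than by the CPU clock.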
For more information about stressapptest, please refer to [https://github.com/stressapptest/stressapptest this page].
==Useful links==
*Joshua Wyatt Smith and Andrew Hamilton, [http://inspirehep.net/record/1424637/files/1719033_626-630.pdf Parallel benchmarks for ARM processors in the high energy context]
*T. Wrigley, G. Harmsen and B. Mellado, [http://inspirehep.net/record/1424631/files/1719033_275-280.pdf Memory performance of ARM processors and its relevance to High Energy Physics]
*G. T. Wrigley, R. G. Reed, B. Mellado, [http://inspirehep.net/record/1424637/files/1719033_626-630.pdf Memory benchmarking characterisation of ARM-based SoCs]
==Appendix A: Detailed testing procedures==
This section details how the benchmarks were configured and run on the testbed.