{{InfoBoxTop}}
{{AppliesToMito8M}}
{{AppliesToMachineLearning}}
{{InfoBoxBottom}}
{{WarningMessage|text=This technical note was validated against specific versions of hardware and software. What is described here may not work with other versions.}}
[[Category:MISC-AN-TN]]
[[Category:MISC-TN]]

== History ==
{| class="wikitable" border="1"
!Version
!Date
!Notes
|-
|1.0.0
|January 2020
|First public release
|}
==Introduction==
Mito8M is the first system-on-module (SoM) by DAVE Embedded Systems based on a core implementing the [https://en.wikipedia.org/wiki/ARM_architecture#64/32-bit_architecture ARMv8-A] architecture. Traditionally, ARM cores based on the 32-bit [https://en.wikipedia.org/wiki/ARM_architecture#AArch32 ARMv7-A] architecture exhibit limited RAM bandwidth even when they are coupled with 64-bit-wide SDRAM banks. As an example, please see [[SBCX-TN-006:_Characterizing_the_RAM_bandwidth_of_Axel_Lite_SoM#Testbed_general_configuration|this Technical Note]], where we characterized the SDRAM bandwidth of the Cortex-A9-based Axel Lite SoM. When dealing with computationally heavy tasks, limited RAM bandwidth efficiency may turn out to be a severe bottleneck that bounds the overall performance.

Besides offering intrinsically greater computational power than their predecessors, ARMv8-A-based SoCs are also expected to improve RAM bandwidth significantly. This technical note (TN for short) illustrates several benchmark tests that were run on the Mito8M SoM to characterize its RAM bandwidth. It is worth remembering that Mito8M is built upon the [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i.mx-applications-processors/i.mx-8-processors/i.mx-8m-family-armcortex-a53-cortex-m4-audio-voice-video:i.MX8M i.MX8M processor by NXP].

==Testbed general configuration==
This section illustrates the configuration settings common to all the tests performed. The testbed is the same as the one described in [[MISC-TN-008:_Running_Debian_Buster_(armbian)_on_Mito8M|this TN]].

===SoC and SDRAM bank===
The SoC model is i.MX8M Quad:
<pre class="board-terminal">
armbian@Mito8M:~/devel/lmbench/tmp$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1300.0000
CPU min MHz: 800.0000
BogoMIPS: 16.66
L1d cache: unknown size
L1i cache: unknown size
L2 cache: unknown size
NUMA node0 CPU(s): 0-3
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
</pre>

This processor can run at either 800 MHz or 1.3 GHz. The tests were performed at both frequencies to determine how the core frequency affects the RAM bandwidth.

The following table details the characteristics of the SDRAM bank connected to the SoC.

{| class="wikitable"
|+
! rowspan="2" |Subsystem
! rowspan="2" |Feature
!Platform
|-
!Mito8M
|-
| rowspan="6" |SoC
|SoC
|NXP i.MX8M Quad
|-
|ARM core(s)
|4 x Cortex A53
|-
|ARM core frequency
[MHz]
|800 or 1300
|-
|L1 cache (D)
[kB]
|32
|-
|L1 cache (I)
[kB]
|32
|-
|L2 cache
[MB]
|1
|-
| rowspan="6" |SDRAM
|Type
|LPDDR4
|-
|Frequency
[MHz]
|1600
|-
|Bus width
[bit]
|32
|-
|Theoretical bandwidth
[Gb/s]
|102.4
|-
|Theoretical bandwidth
[GB/s]
|12.8
|-
|Size
[MB]
|3072
|}
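
The two theoretical bandwidth rows follow from the frequency and bus width rows. The sketch below reproduces them, assuming two transfers per clock cycle (double data rate) and decimal prefixes (1 Gb = 10<sup>9</sup> bit); the function name is only illustrative.

```python
# Peak (theoretical) bandwidth of the SDRAM bank, from the table values.
CLOCK_MHZ = 1600          # SDRAM clock frequency [MHz]
BUS_WIDTH_BITS = 32       # data bus width [bit]
TRANSFERS_PER_CLOCK = 2   # assumption: double data rate, two transfers per clock


def peak_bandwidth_gbit_s(clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Return the peak bandwidth in Gb/s (1 Gb = 10**9 bit)."""
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bits / 1e9


gbit_s = peak_bandwidth_gbit_s(CLOCK_MHZ, BUS_WIDTH_BITS, TRANSFERS_PER_CLOCK)
gbyte_s = gbit_s / 8  # 8 bits per byte
print(f"{gbit_s:.1f} Gb/s = {gbyte_s:.1f} GB/s")
```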

===Software configuration===

* Linux kernel: 4.14.98
*Root file system: Debian GNU/Linux 10 (buster)
* Architecture: aarch64
* Governor: userspace @ 800 MHz or 1300 MHz
<pre class="board-terminal">
root@Mito8M:~# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
800000
</pre>
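
For scripted test runs, the same sysfs operations can be wrapped in a small helper. This is only a sketch: it assumes the standard Linux cpufreq sysfs layout and uses <code>scaling_setspeed</code>, the file through which the userspace governor normally accepts a target frequency in kHz, which is not shown in the session above. The function name and its parameters are hypothetical.

```python
from pathlib import Path

# Assumption: standard Linux cpufreq sysfs layout, same directory as in the
# terminal session above (only the cpu0 policy is touched).
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def set_userspace_frequency(freq_khz, cpufreq_dir=CPUFREQ_DIR):
    """Select the userspace governor, request freq_khz and return the
    frequency (in kHz) read back from scaling_cur_freq."""
    cpufreq_dir = Path(cpufreq_dir)
    # Equivalent to: echo userspace > scaling_governor
    (cpufreq_dir / "scaling_governor").write_text("userspace\n")
    # Assumption: the userspace governor takes the target via scaling_setspeed
    (cpufreq_dir / "scaling_setspeed").write_text(f"{freq_khz}\n")
    # Equivalent to: cat scaling_cur_freq
    return int((cpufreq_dir / "scaling_cur_freq").read_text().strip())
```

When run as root on the target, <code>set_userspace_frequency(800000)</code> would be expected to return <code>800000</code>, matching the output shown above. The <code>cpufreq_dir</code> parameter exists only so that the helper can be tried against a scratch directory first.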


Some benchmarks were built natively on the platform under test. For the sake of completeness, the version of the GCC compiler is reported as well:
<pre class="board-terminal">
armbian@Mito8M:~/devel/lmbench$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)
</pre>

== Overall results ==
This section illustrates the overall results achieved by the benchmarks.

===STREAM===
{| class="wikitable"
|+
Overall results (ARM core frequency = 800 MHz)
! rowspan="2" |Function
! colspan="2" |Mito8M
! rowspan="2" |Axel Lite
efficiency

[%]
|-
!Best rate
[MB/s]
!Efficiency

[%]
|-
|Copy
|6770
|51.7
|14.0
|-
|Scale
|6093
|46.5
|13.8
|-
|Add
|5263
|40.1
|14.6
|-
|Triad
|4820
|36.8
|14.9
|}

{| class="wikitable"
|+
Overall results (ARM core frequency = 1300 MHz)
! rowspan="2" |Function
! colspan="2" |Mito8M
! rowspan="2" |Axel Lite
efficiency

[%]
|-
!Best rate
[MB/s]
!Efficiency

[%]
|-
|Copy
|7125
|54.3
|14.0
|-
|Scale
|7501
|57.2
|13.8
|-
|Add
|6762
|51.6
|14.6
|-
|Triad
|6354
|48.5
|14.9
|}

Apart from the increase over Axel Lite in absolute terms, it is noteworthy that Mito8M exhibits a significant improvement in efficiency too, as shown in the tables above. This is especially true when the ARM core frequency is set to 1300 MHz.
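
The efficiency columns can be cross-checked with a few lines of code. The sketch below assumes that efficiency is the best rate divided by the theoretical bandwidth, with 12.8 GB/s expressed as 12.8 × 1024 = 13107.2 MB/s. This divisor is an inference: it reproduces the table values to within about 0.1 percentage points, but it is not stated explicitly in this note.

```python
# Assumption: the theoretical bandwidth of 12.8 GB/s is taken as 13107.2 MB/s.
PEAK_MB_S = 12.8 * 1024

# Best rates [MB/s] copied from the two STREAM tables above.
stream_800 = {"Copy": 6770, "Scale": 6093, "Add": 5263, "Triad": 4820}
stream_1300 = {"Copy": 7125, "Scale": 7501, "Add": 6762, "Triad": 6354}


def efficiency_percent(best_rate_mb_s, peak_mb_s=PEAK_MB_S):
    """Return the achieved bandwidth as a percentage of the peak bandwidth."""
    return 100.0 * best_rate_mb_s / peak_mb_s


for name in stream_800:
    e800 = efficiency_percent(stream_800[name])
    e1300 = efficiency_percent(stream_1300[name])
    print(f"{name:6s} 800 MHz: {e800:5.2f} %   1300 MHz: {e1300:5.2f} %")
```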

Another interesting thing to note is how the bandwidth is affected by the ARM core frequency. If it scaled linearly, we would see an improvement of 62.5% from 800 to 1300 MHz. The average bandwidth over the four functions is 5736.5 MB/s at 800 MHz and 6935.5 MB/s at 1300 MHz, so the actual increase is about 20.9%. As far as the STREAM benchmark is concerned, therefore, the achieved bandwidth does not scale linearly with the ARM core frequency.
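
The comparison can be recomputed directly from the two tables. The sketch below assumes that the "average bandwidth" is the plain arithmetic mean of the Copy, Scale, Add and Triad best rates; the helper name is only illustrative.

```python
from statistics import mean

# Best rates [MB/s] for Copy, Scale, Add and Triad, from the tables above.
stream_800 = [6770, 6093, 5263, 4820]
stream_1300 = [7125, 7501, 6762, 6354]


def gain_percent(before, after):
    """Return the relative increase from 'before' to 'after', in percent."""
    return 100.0 * (after - before) / before


avg_800 = mean(stream_800)
avg_1300 = mean(stream_1300)
freq_gain = gain_percent(800, 1300)        # what linear scaling would give
bw_gain = gain_percent(avg_800, avg_1300)  # what was actually measured

print(f"average @ 800 MHz:  {avg_800:.1f} MB/s")
print(f"average @ 1300 MHz: {avg_1300:.1f} MB/s")
print(f"frequency gain: {freq_gain:.1f} %, bandwidth gain: {bw_gain:.1f} %")
```

The measured gain comes out at roughly a third of what linear scaling with the core frequency would give.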

Please see [https://www.cs.virginia.edu/stream/ this page] for more details about the STREAM benchmark.

===LMbench===
As regards memory bandwidth, LMbench provides many results organized in different categories. For the sake of simplicity, the following tables detail just a couple of categories. The full results are available for download [http://mirror.dave.eu/mito/Mito8M/lmbench-Mito8M.0-800MHz.txt here (ARM core frequency set to 800 MHz)] and [http://mirror.dave.eu/mito/Mito8M/lmbench-Mito8M.0-1300MHz.txt here (ARM core frequency set to 1300 MHz)].

{| class="wikitable"
|+Memory read bandwidth
! rowspan="2" |Buffer size
! colspan="2" |Bandwidth
[MB/s]
|-
!ARM core frequency = 800 MHz
!ARM core frequency = 1300 MHz
|-
|512B
|1553
|2521
|-
|1kB
|1567
|2546
|-
|2kB
|1575
|2560
|-
|4kB
|1575
|2564
|-
|8kB
|1577
|2564
|-
|16kB
|1577
|2567
|-
|32kB
|1528
|2490
|-
|64kB
|1531
|2494
|-
|128kB
|1547
|2530
|-
|256kB
|1552
|2526
|-
|512kB
|1514
|2518
|-
|1MB
|1318
|2181
|-
|2MB
|1430
|2148
|-
|4MB
|1420
|2108
|-
|8MB
|1423
|2038
|-
|16MB
|1420
|2116
|-
|32MB
|1365
|2117
|-
|64MB
|1393
|2035
|-
|128MB
|1382
|2035
|-
|256MB
|1372
|2050
|-
|512MB
|1367
|1998
|}

{| class="wikitable"
|+Memory write bandwidth
! rowspan="2" |Buffer size
! colspan="2" |Bandwidth
[MB/s]
|-
!ARM core frequency = 800 MHz
!ARM core frequency = 1300 MHz
|-
|512B
|2932
|4771
|-
|1kB
|3048
|4956
|-
|2kB
|3100
|5046
|-
|4kB
|3136
|5097
|-
|8kB
|3135
|5101
|-
|16kB
|3150
|5120
|-
|32kB
|2864
|5127
|-
|64kB
|3033
|5071
|-
|128kB
|3093
|4886
|-
|256kB
|2956
|5056
|-
|512kB
|3024
|5054
|-
|1MB
|3075
|5092
|-
|2MB
|3095
|5116
|-
|4MB
|3121
|5118
|-
|8MB
|3137
|5120
|-
|16MB
|3145
|5121
|-
|32MB
|3146
|5120
|-
|64MB
|3146
|5125
|-
|128MB
|3147
|5123
|-
|256MB
|3150
|5124
|-
|512MB
|3144
|5125
|-
|1GB
|3146
|5124
|}

There are some interesting facts to stress:
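* Read and write bandwidths are only mildly affected by the buffer size: the write bandwidth is essentially flat over the whole range, while the read bandwidth decreases somewhat for buffers of 1 MB (the L2 cache size) and larger.
* They are significantly affected by the ARM core frequency. For instance, the improvement of the write bandwidth (about 63% when the buffer is 1 GB) is practically the same as the increase in frequency (62.5%).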
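
Both observations can be checked numerically on a subset of the rows above. The sketch below is only illustrative: the chosen rows and the "spread" metric, (max - min) / max, are assumptions made for this example and are not part of LMbench.

```python
# (buffer size, MB/s @ 800 MHz, MB/s @ 1300 MHz): a subset of the tables above.
read_bw = [("512B", 1553, 2521), ("16kB", 1577, 2567),
           ("1MB", 1318, 2181), ("512MB", 1367, 1998)]
write_bw = [("512B", 2932, 4771), ("16kB", 3150, 5120),
            ("1MB", 3075, 5092), ("1GB", 3146, 5124)]


def gain_percent(low, high):
    """Relative increase from the 800 MHz value to the 1300 MHz value."""
    return 100.0 * (high - low) / low


def spread_percent(values):
    """Difference between best and worst value, relative to the best one."""
    return 100.0 * (max(values) - min(values)) / max(values)


for label, rows in (("read", read_bw), ("write", write_bw)):
    spread_800 = spread_percent([r[1] for r in rows])
    spread_1300 = spread_percent([r[2] for r in rows])
    print(f"{label}: spread {spread_800:.1f} % @ 800 MHz, "
          f"{spread_1300:.1f} % @ 1300 MHz")
    for size, bw_800, bw_1300 in rows:
        print(f"  {size:>5s}: +{gain_percent(bw_800, bw_1300):.1f} %")
```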

For more information regarding LMbench, please see [http://lmbench.sourceforge.net/ this page].

===pmbw===
As defined by the author, <code>pmbw</code> is "a set of assembler routines to measure the parallel memory (cache and RAM) bandwidth of modern multi-core machines." It performs a large number of tests. Conveniently, it comes with a handy tool that plots the results, which are stored in a text file, as a series of charts. Again, the benchmark was run at two different ARM core frequencies, 800 and 1300 MHz.

The complete results and the charts are available at the following links:
*http://mirror.dave.eu/mito/Mito8M/pmbw-stats-Mito8M-800MHz.txt
*http://mirror.dave.eu/mito/Mito8M/pmbw-plots-Mito8M-800MHz.pdf
*http://mirror.dave.eu/mito/Mito8M/pmbw-stats-Mito8M-1300MHz.txt
*http://mirror.dave.eu/mito/Mito8M/pmbw-plots-Mito8M-1300MHz.pdf

Generally speaking, the charts exhibit significant drops in performance when the array size is around the L1 and L2 cache sizes.

For more details about <code>pmbw</code>, please refer to [https://panthema.net/2013/pmbw/ this page].

===stressapptest===
According to the documentation, stressapptest, which was developed at Google, is "a memory interface test. It tries to maximize randomized traffic to memory from processor and I/O, with the intent of creating a realistic high load situation in order to test the existing hardware devices in a computer."
{| class="wikitable"
|+
! rowspan="2" |Test
! colspan="2" |Bandwidth
[MB/s]
|-
!ARM core frequency = 800 MHz
!ARM core frequency = 1300 MHz
|-
|Memory copy
|5483
|5804
|}


The above table lists the results achieved when the benchmark was run as detailed in [[#Running_the_tests_4|this section]]. In this case, the difference between the two ARM core frequencies is very small.

For more information about stressapptest, please refer to [https://github.com/stressapptest/stressapptest this page].

==Useful links==
*Joshua Wyatt Smith and Andrew Hamilton, [http://inspirehep.net/record/1424637/files/1719033_626-630.pdf Parallel benchmarks for ARM processors in the high energy context]
*T. Wrigley, G. Harmsen and B. Mellado, [http://inspirehep.net/record/1424631/files/1719033_275-280.pdf Memory performance of ARM processors and its relevance to High Energy Physics]
*G. T. Wrigley, R. G. Reed and B. Mellado, [http://inspirehep.net/record/1424637/files/1719033_626-630.pdf Memory benchmarking characterisation of ARM-based SoCs]

==Appendix A: Detailed testing procedures==
This section details how the benchmarks were configured and run on the testbed.
===STREAM===

====Building====
To build STREAM:
* clone its git repository
*modify the <code>Makefile</code> as shown below
*issue the <code>make</code> command.