MISC-TN-009: Characterizing the RAM bandwidth of Mito8M SoM

7,165 bytes added, 11:07, 15 January 2020
{{InfoBoxTop}}
{{AppliesToMito8M}}
{{InfoBoxBottom}}
{{WarningMessage|text=This technical note was validated against specific versions of hardware and software. What is described here may not work with other versions.}}
[[Category:MISC-AN-TN]]
[[Category:MISC-TN]]

__FORCETOC__
== History ==
{| class="wikitable" border="1"
!Version
!Date
!Notes
|-
|1.0.0
|January 2020
|First public release
|}
==Introduction==
Mito8M is the first DAVE Embedded Systems product based on a core implementing the [https://en.wikipedia.org/wiki/ARM_architecture#64/32-bit_architecture ARMv8-A] architecture. Traditionally, ARM cores based on the 32-bit [https://en.wikipedia.org/wiki/ARM_architecture#AArch32 ARMv7-A] architecture exhibit limited RAM bandwidth, even when they are coupled with 64-bit wide SDRAM banks. In computationally heavy tasks, this can become a severe bottleneck that limits overall performance.

Besides an intrinsic increase in computational power, ARMv8-A-based SoCs are also expected to improve RAM bandwidth significantly. This technical note (TN for short) illustrates several benchmarking tests that were run on the Mito8M SoM, which is built upon the [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i.mx-applications-processors/i.mx-8-processors/i.mx-8m-family-armcortex-a53-cortex-m4-audio-voice-video:i.MX8M NXP i.MX8M Quad].

==Testbed general configuration==
This section illustrates the configuration settings common to all the tests performed.

====SoC and SDRAM bank organization====
{| class="wikitable"
|+
!
!
!Mito8M
!
|-
| rowspan="2" |SoC
|SoC
|NXP i.MX8M Quad
|
|-
|ARM frequency
[MHz]
|800
|
|-
| rowspan="5" |SDRAM
|Type
|LPDDR4
|
|-
|Frequency
[MHz]
|1600
|
|-
|Bus width
[bit]
|32
|
|-
|Theoretical bandwidth
[Gb/s]
|102.4
|
|-
|Size
[MB]
|3072
|
|}
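
The theoretical bandwidth in the table can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the 1600 MHz figure is the LPDDR4 clock and that data is transferred on both clock edges (two transfers per clock); the function name is hypothetical and only used for illustration.

```python
def theoretical_bandwidth_gbit_s(clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak bandwidth in Gb/s of a DDR-style memory interface.

    Assumption: transfers_per_clock=2 models double data rate signalling.
    """
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bits / 1e9


# Values taken from the table above: LPDDR4 @ 1600 MHz, 32-bit bus.
peak_gbit_s = theoretical_bandwidth_gbit_s(1600, 32)
peak_gbyte_s = peak_gbit_s / 8  # 8 bits per byte

print(f"Peak: {peak_gbit_s:.1f} Gb/s = {peak_gbyte_s:.1f} GB/s")
```

Under these assumptions the result matches the 102.4 Gb/s listed in the table, which corresponds to 12.8 GB/s. This is the figure the measured STREAM rates (reported in MB/s) should be compared against.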

====Software configuration====

* Linux kernel: 4.14.98
* Architecture: aarch64
* Governor: userspace @ 800 MHz
<pre class="board-terminal">
root@Mito8M:~# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
800000
</pre>
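
When running long benchmark sessions it can be handy to verify programmatically that the governor and frequency are still the expected ones. The sketch below reads the same two sysfs files shown above; the function names and the <code>base</code> parameter are hypothetical, introduced only so the check can be pointed at a different directory.

```python
from pathlib import Path


def read_cpufreq(cpu=0, base="/sys/devices/system/cpu"):
    """Return (governor, current frequency in kHz) for the given CPU."""
    cpufreq = Path(base) / f"cpu{cpu}" / "cpufreq"
    governor = (cpufreq / "scaling_governor").read_text().strip()
    cur_khz = int((cpufreq / "scaling_cur_freq").read_text().strip())
    return governor, cur_khz


def is_pinned(expected_khz=800000, cpu=0, base="/sys/devices/system/cpu"):
    """True if the CPU uses the userspace governor at the expected frequency."""
    governor, cur_khz = read_cpufreq(cpu, base)
    return governor == "userspace" and cur_khz == expected_khz
```

On the testbed, calling <code>is_pinned()</code> before each run would be expected to return <code>True</code> once the commands above have been issued.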

* GCC: 8.3.0 (Debian 8.3.0-6)
<pre class="board-terminal">
armbian@Mito8M:~/devel/lmbench$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)
</pre>


==Results==
This section details the results that were achieved by the different benchmarks.

===General configuration===

===Testbed #1===

{| class="wikitable"
|+
!
!
!Mito8M
!
|-
|
|ARM frequency
[MHz]
|792
|
|-
|
|SDRAM frequency
[MHz]
|1600
|
|-
|
|SDRAM bus width
[bit]
|32
|
|}

==Detailed testing procedures==
This section details how the benchmarks were configured and run on the testbed.
===STREAM===

====Building====
<pre class="board-terminal">
git clone https://github.com/jeffhammond/STREAM.git
cd STREAM
make
</pre>

<syntaxhighlight lang="makefile" line="line">
# Makefile used on the testbed (~/devel/STREAM/Makefile)
CC = gcc
CFLAGS = -O2 -fopenmp

FC = gfortran-4.9
FFLAGS = -O2 -fopenmp

# Only the C version of the benchmark is built by default
all: stream_c.exe

stream_f.exe: stream.f mysecond.o
	$(CC) $(CFLAGS) -c mysecond.c
	$(FC) $(FFLAGS) -c stream.f
	$(FC) $(FFLAGS) stream.o mysecond.o -o stream_f.exe

stream_c.exe: stream.c
	$(CC) $(CFLAGS) stream.c -o stream_c.exe

clean:
	rm -f stream_f.exe stream_c.exe *.o

# an example of a more complex build line for the Intel icc compiler
stream.icc: stream.c
	icc -O3 -xCORE-AVX2 -ffreestanding -qopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 stream.c -o stream.omp.AVX2.80M.20x.icc
</syntaxhighlight>
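
The Makefile above does not override <code>STREAM_ARRAY_SIZE</code>, so the build uses the default of 10,000,000 elements defined in <code>stream.c</code>. The following sketch estimates the resulting memory footprint and checks it against the STREAM guideline that each array should be at least four times the size of the last-level cache. The function names are hypothetical, and the 1 MiB L2 cache size used in the example is an assumption about the i.MX8M Quad that should be verified against the SoC documentation.

```python
def stream_footprint_mib(array_size, bytes_per_element=8, n_arrays=3):
    """Return (MiB per array, total MiB) for the three STREAM arrays."""
    per_array = array_size * bytes_per_element / 2**20
    return per_array, per_array * n_arrays


def min_array_size(cache_bytes, bytes_per_element=8, factor=4):
    """Smallest element count satisfying the 'factor x cache' guideline."""
    return factor * cache_bytes // bytes_per_element


per_array, total = stream_footprint_mib(10_000_000)
print(f"Per array: {per_array:.1f} MiB, total: {total:.1f} MiB")

# Assumption: 1 MiB of shared L2 cache (not stated in this note).
assumed_l2_bytes = 1 * 2**20
print("Default size large enough:",
      10_000_000 >= min_array_size(assumed_l2_bytes))
```

If a larger working set is desired, the array size can be changed at build time by adding <code>-DSTREAM_ARRAY_SIZE=...</code> to <code>CFLAGS</code>, as done in the <code>icc</code> example target.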

====Running====
<pre class="board-terminal">
armbian@Mito8M:~/devel/STREAM$ ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 46427 microseconds.
(= 46427 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6770.5     0.024010     0.023632     0.025117
Scale:           6093.2     0.027474     0.026259     0.029142
Add:             5263.5     0.046008     0.045597     0.046230
Triad:           4820.0     0.050297     0.049793     0.050723
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
</pre>
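
The reported rates can be reproduced from the array size and the minimum times, and then related to the theoretical peak. The sketch below assumes the usual STREAM accounting (Copy and Scale touch two arrays per iteration, Add and Triad touch three; 1 MB = 10<sup>6</sup> bytes) and a peak of 12800 MB/s, which is the 102.4 Gb/s from the testbed table divided by 8. The names used are hypothetical.

```python
ARRAY_SIZE = 10_000_000     # elements, from the STREAM output above
BYTES_PER_ELEMENT = 8       # "This system uses 8 bytes per array element"
PEAK_MB_S = 102.4e3 / 8     # assumed peak: 102.4 Gb/s expressed in MB/s

# Number of arrays read or written by each kernel per iteration.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column of the STREAM output above, in seconds.
MIN_TIME_S = {"Copy": 0.023632, "Scale": 0.026259,
              "Add": 0.045597, "Triad": 0.049793}


def best_rate_mb_s(kernel):
    """Bytes moved per iteration divided by the best time, in MB/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * ARRAY_SIZE
    return 1e-6 * bytes_moved / MIN_TIME_S[kernel]


def efficiency(kernel):
    """Fraction of the assumed theoretical peak achieved by a kernel."""
    return best_rate_mb_s(kernel) / PEAK_MB_S


for name in ARRAYS_TOUCHED:
    print(f"{name:6s} {best_rate_mb_s(name):8.1f} MB/s "
          f"{100 * efficiency(name):5.1f}% of peak")
```

Because the minimum times are printed with six decimal places, the recomputed rates can differ from the reported ones in the last digit. With these assumptions, Copy reaches a little over half of the theoretical peak and Triad a little under 40%.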

==Useful links==
*[https://www.cs.virginia.edu/stream/ STREAM benchmark]
*[http://lmbench.sourceforge.net/ LM Bench benchmark]
*[https://panthema.net/2013/pmbw/ pmbw benchmark]
*Joshua Wyatt Smith and Andrew Hamilton, [http://inspirehep.net/record/1424637/files/1719033_626-630.pdf Parallel benchmarks for ARM processors in the high energy context]
*T. Wrigley, G. Harmsen and B. Mellado, [http://inspirehep.net/record/1424631/files/1719033_275-280.pdf Memory performance of ARM processors and its relevance to High Energy Physics]
*G. T. Wrigley, R. G. Reed and B. Mellado, [http://inspirehep.net/record/1424637/files/1719033_626-630.pdf Memory benchmarking characterisation of ARM-based SoCs]