Difference between revisions of "MISC-TN-009: Characterizing the RAM bandwidth of Mito8M SoM"

Revision as of 11:09, 15 January 2020

HOME	SOMs	SBCs	ToloMEO Embedded Assistant	GET A QUOTE	ONLINE HELPDESK
	Roadmap		IoT Services			ML/AI services	Embedded Design Services

Info Box

Applies to MITO 8M

This technical note was validated against specific versions of hardware and software. What is described here may not work with other versions.

History[edit | edit source]

Version	Date	Notes
1.0.0	January 2020	First public release

Introduction[edit | edit source]

Mito8M is the first DAVE Embedded Systems' product based on a core implementing the ARMv8-A architecture. Traditionally, ARM cores that are based on 32-bit ARMv7-A architecture exhibit a limited RAM bandwidth even if they are coupled with 64-bit witdh SDRAM banks. When dealing with computationally heavy tasks, this factor may turn out to be a severe bottleneck limiting the overall performance.

Beside an intrinsic increased computational power, ARMv8-A-based SoC's are expected to improve significantly RAM bandwidth as well. This technical note (TN for short) illustrates several benchmarking tests that were run on Mito8M SoM, which is built upon NXP i.MX8M Quad.

Testbed general configuration[edit | edit source]

This section illustrates the configuration settings common to all the tests performed.

SoC and SDRAM bank organization[edit | edit source]


		Mito8M
SoC	SoC	NXP i.MX8M Quad
SoC	ARM frequency [MHz]	800
SDRAM	Type	LPDDR4
	Frequency [MHz]	1600
	Bus witdth [bit]	32
	Theoretical bandiwidth [Gb/s]	102.4
	Size [MB]	3072

Software configuration[edit | edit source]

Linux kernel: 4.14.98
Architecture: aarch64
Governor: userspace @ 800 MHz

root@Mito8M:~# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
800000

GCC

armbian@Mito8M:~/devel/lmbench$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)

Results[edit | edit source]

This section details the results that were achieved by the different benchmarks

General configuration[edit | edit source]

Testbed #1[edit | edit source]


		Mito8M
	ARM frequency [MHz]	792
	Frequency [MHz]	1600
	Bus witdth [bit]	32

Detailed testing procedures[edit | edit source]

This sections details how the benchmarks were configured and run on the testbed.

STREAM[edit | edit source]

Building[edit | edit source]

git clone https://github.com/jeffhammond/STREAM.git
make

 1 armbian@Mito8M:~/devel/STREAM$ cat Makefile 
 2 CC = gcc
 3 CFLAGS = -O2 -fopenmp
 4 
 5 FC = gfortran-4.9
 6 FFLAGS = -O2 -fopenmp
 7 
 8 all: stream_c.exe
 9 
10 stream_f.exe: stream.f mysecond.o
11         $(CC) $(CFLAGS) -c mysecond.c
12         $(FC) $(FFLAGS) -c stream.f
13         $(FC) $(FFLAGS) stream.o mysecond.o -o stream_f.exe
14 
15 stream_c.exe: stream.c
16         $(CC) $(CFLAGS) stream.c -o stream_c.exe
17 
18 clean:
19         rm -f stream_f.exe stream_c.exe *.o
20 
21 # an example of a more complex build line for the Intel icc compiler
22 stream.icc: stream.c
23         icc -O3 -xCORE-AVX2 -ffreestanding -qopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 stream.c -o stream.omp.AVX2.80M.20x.icc

Running[edit | edit source]

armbian@Mito8M:~/devel/STREAM$ ./stream_c.exe 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 46427 microseconds.
   (= 46427 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6770.5     0.024010     0.023632     0.025117
Scale:           6093.2     0.027474     0.026259     0.029142
Add:             5263.5     0.046008     0.045597     0.046230
Triad:           4820.0     0.050297     0.049793     0.050723
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

LM Bench[edit | edit source]

armbian@Mito8M:~/devel/lmbench$ sudo lmbench-run
[sudo] password for armbian: 
/usr/lib/lmbench/scripts/gnu-os: unable to guess system type

This script, last modified 2004-08-18, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

    ftp://ftp.gnu.org/pub/gnu/config/

If the version you run (/usr/lib/lmbench/scripts/gnu-os) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2004-08-18

uname -m = aarch64
uname -r = 4.14.98-g4c94e1dbaec2
uname -s = Linux
uname -v = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019

/usr/bin/uname -p = 
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = 
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = aarch64
UNAME_RELEASE = 4.14.98-g4c94e1dbaec2
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019
=====================================================================

                L M B E N C H   C ON F I G U R A T I O N
                ----------------------------------------

You need to configure some parameters to lmbench.  Once you have configured
these parameters, you may do multiple runs by saying

        "make rerun"

in the src subdirectory.

NOTICE: please do not have any other activity on the system if you can
help it.  Things like the second hand on your xclock or X perfmeters
are not so good when benchmarking.  In fact, X is not so good when
benchmarking.

=====================================================================

If you are running on an MP machine and you want to try running
multiple copies of lmbench in parallel, you can specify how many here.

Using this option will make the benchmark run 100x slower (sorry).

NOTE:  WARNING! This feature is experimental and many results are 
        known to be incorrect or random!

MULTIPLE COPIES [default 1]: 
=====================================================================

Options to control job placement
1) Allow scheduler to place jobs
2) Assign each benchmark process with any attendent child processes
   to its own processor
3) Assign each benchmark process with any attendent child processes
   to its own processor, except that it will be as far as possible
   from other processes
4) Assign each benchmark and attendent processes to their own
   processors
5) Assign each benchmark and attendent processes to their own
   processors, except that they will be as far as possible from
   each other and other processes
6) Custom placement: you assign each benchmark process with attendent
   child processes to processors
7) Custom placement: you assign each benchmark and attendent
   processes to processors

Note: some benchmarks, such as bw_pipe, create attendent child
processes for each benchmark process.  For example, bw_pipe
needs a second process to send data down the pipe to be read
by the benchmark process.  If you have three copies of the
benchmark process running, then you actually have six processes;
three attendent child processes sending data down the pipes and 
three benchmark processes reading data and doing the measurements.

Job placement selection [default 1]: 
=====================================================================

Hang on, we are calculating your timing granularity.
OK, it looks like you can time stuff down to 5000 usec resolution.

Hang on, we are calculating your timing overhead.
OK, it looks like your gettimeofday() costs 0 usecs.

Hang on, we are calculating your loop overhead.
OK, it looks like your benchmark loop costs 0.00000136 usecs.

=====================================================================

Several benchmarks operate on a range of memory.  This memory should be
sized such that it is at least 4 times as big as the external cache[s]
on your system.   It should be no more than 80% of your physical memory.

The bigger the range, the more accurate the results, but larger sizes
take somewhat longer to run the benchmark.

MB [default 2097]: 1024
Checking to see if you have 1024 MB; please wait for a moment...
1024MB OK
1024MB OK
1024MB OK
Hang on, we are calculating your cache line size.
OK, it looks like your cache line is 64 bytes.

=====================================================================

lmbench measures a wide variety of system performance, and the full suite
of benchmarks can take a long time on some platforms.  Consequently, we
offer the capability to run only predefined subsets of benchmarks, one
for operating system specific benchmarks and one for hardware specific
benchmarks.  We also offer the option of running only selected benchmarks
which is useful during operating system development.

Please remember that if you intend to publish the results you either need
to do a full run or one of the predefined OS or hardware subsets.

SUBSET (ALL|HARWARE|OS|DEVELOPMENT) [default all]: 
=====================================================================

This benchmark measures, by default, memory latency for a number of
different strides.  That can take a long time and is most useful if you
are trying to figure out your cache line size or if your cache line size
is greater than 128 bytes.

If you are planning on sending in these results, please don't do a fast
run.

Answering yes means that we measure memory latency with a 128 byte stride.  

FASTMEM [default no]: 
=====================================================================

This benchmark measures, by default, file system latency.  That can
take a long time on systems with old style file systems (i.e., UFS,
FFS, etc.).  Linux' ext2fs and Sun's tmpfs are fast enough that this
test is not painful.

If you are planning on sending in these results, please don't do a fast
run.

If you want to skip the file system latency tests, answer "yes" below.

SLOWFS [default no]: yes
=====================================================================

This benchmark can measure disk zone bandwidths and seek times.  These can
be turned into whizzy graphs that pretty much tell you everything you might
need to know about the performance of your disk.  

This takes a while and requires read access to a disk drive.  
Write is not measured, see disk.c to see how if you want to do so.

If you want to skip the disk tests, hit return below.

If you want to include disk tests, then specify the path to the disk
device, such as /dev/sda.  For each disk that is readable, you'll be
prompted for a one line description of the drive, i.e., 

        Iomega IDE ZIP
or
        HP C3725S 2GB on 10MB/sec NCR SCSI bus

DISKS [default none]: 
=====================================================================

If you are running on an idle network and there are other, identically
configured systems, on the same wire (no gateway between you and them),
and you have rsh access to them, then you should run the network part
of the benchmarks to them.  Please specify any such systems as a space
separated list such as: ether-host fddi-host hippi-host.

REMOTE [default none]: 
=====================================================================

Calculating mhz, please wait for a moment...
I think your CPU mhz is 

        798 MHz, 1.2531 nanosec clock

but I am frequently wrong.  If that is the wrong Mhz, type in your
best guess as to your processor speed.  It doesn't have to be exact,
but if you know it is around 800, say 800.  

Please note that some processors, such as the P4, have a core which
is double-clocked, so on those processors the reported clock speed
will be roughly double the advertised clock rate.  For example, a
1.8GHz P4 may be reported as a 3592MHz processor.

Processor mhz [default 798 MHz, 1.2531 nanosec clock]: 
=====================================================================

We need a place to store a 1024 Mbyte file as well as create and delete a
large number of small files.  We default to /var/tmp.  If /var/tmp is a
memory resident file system (i.e., tmpfs), pick a different place.
Please specify a directory that has enough space and is a local file
system.

FSDIR [default /var/tmp/lmbench]: /tmp/lmbench
=====================================================================

lmbench outputs status information as it runs various benchmarks.
By default this output is sent to /dev/tty, but you may redirect
it to any file you wish (such as /dev/null...).

Status output file [default /dev/tty]: 
=====================================================================

There is a database of benchmark results that is shipped with new
releases of lmbench.  Your results can be included in the database
if you wish.  The more results the better, especially if they include
remote networking.  If your results are interesting, i.e., for a new
fast box, they may be made available on the lmbench web page, which is

        http://www.bitmover.com/lmbench

Mail results [default yes]: no
OK, no results mailed.
=====================================================================

Confguration done, thanks.

There is a mailing list for discussing lmbench hosted at BitMover. 
Send mail to majordomo@bitmover.com to join the list.

/usr/lib/lmbench/scripts/gnu-os: unable to guess system type

This script, last modified 2004-08-18, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

    ftp://ftp.gnu.org/pub/gnu/config/

If the version you run (/usr/lib/lmbench/scripts/gnu-os) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2004-08-18

uname -m = aarch64
uname -r = 4.14.98-g4c94e1dbaec2
uname -s = Linux
uname -v = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019

/usr/bin/uname -p = 
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = 
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = aarch64
UNAME_RELEASE = 4.14.98-g4c94e1dbaec2
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019
Using config in CONFIG.Mito8M
Wed Jan 15 10:56:54 CET 2020
Latency measurements
Wed Jan 15 10:57:29 CET 2020
Local networking
Wed Jan 15 10:58:36 CET 2020
Bandwidth measurements
Wed Jan 15 11:03:02 CET 2020
Calculating context switch overhead
Wed Jan 15 11:03:09 CET 2020
Calculating effective TLB size
Wed Jan 15 11:03:10 CET 2020
Calculating memory load parallelism
Wed Jan 15 11:14:34 CET 2020
McCalpin's STREAM benchmark
Wed Jan 15 11:15:30 CET 2020
Calculating memory load latency
Wed Jan 15 11:35:54 CET 2020
Benchmark run finished....
Remember you can find the results of the benchmark 
under /var/lib/lmbench/results

Useful links[edit | edit source]

STREAM benchmark
LM Bench benchmark
pmbw benchmark
Joshua Wyatt Smith and Andrew Hamilton, Parallel benchmarks for ARM processors in the highenergy context
T Wrigley, G Harmsen and B Mellado, Memory performance of ARM processors and itsrelevance to High Energy Physics
G. T. Wrigley, R. G. Reed, B. Mellado, Memory benchmarking characterisation of ARM-based SoCs

HOME	SOMs	SBCs	ToloMEO Embedded Assistant	GET A QUOTE	ONLINE HELPDESK
	Roadmap		IoT Services			ML/AI services	Embedded Design Services

Difference between revisions of "MISC-TN-009: Characterizing the RAM bandwidth of Mito8M SoM"

Revision as of 11:09, 15 January 2020

Contents

History[edit | edit source]

Introduction[edit | edit source]

Testbed general configuration[edit | edit source]

SoC and SDRAM bank organization[edit | edit source]

Software configuration[edit | edit source]

Results[edit | edit source]

General configuration[edit | edit source]

Testbed #1[edit | edit source]

Detailed testing procedures[edit | edit source]

STREAM[edit | edit source]

Building[edit | edit source]

Running[edit | edit source]

LM Bench[edit | edit source]

Useful links[edit | edit source]

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Quick Links

Contact us

How to use wiki

Advanced Search

Tools