MISC-TN-009: Characterizing the RAM bandwidth of Mito8M SoM

Info Box
Applies to MITO 8M
This technical note was validated against specific versions of hardware and software. What is described here may not work with other versions.


History

Version   Date           Notes
1.0.0     January 2020   First public release

Introduction

Mito8M is DAVE Embedded Systems' first system-on-module (SoM) based on a core implementing the ARMv8-A architecture. Traditionally, ARM cores based on the 32-bit ARMv7-A architecture exhibit limited RAM bandwidth, even when coupled with 64-bit-wide SDRAM banks. When dealing with computationally heavy tasks, this factor may turn out to be a severe bottleneck limiting overall performance.

Besides providing greater intrinsic computational power than their predecessors, ARMv8-A-based SoCs are also expected to improve RAM bandwidth significantly. This technical note (TN for short) illustrates several benchmarking tests that were run on the Mito8M SoM. It is worth remembering that this product is built upon the i.MX8M processor by NXP.

Testbed general configuration

This section illustrates the configuration settings common to all the tests performed. The testbed is essentially the same as the one described in this TN.

SoC and SDRAM bank

The SoC model is i.MX8M Quad:

armbian@Mito8M:~/devel/lmbench/tmp$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1300.0000
CPU min MHz:         800.0000
BogoMIPS:            16.66
L1d cache:           unknown size
L1i cache:           unknown size
L2 cache:            unknown size
NUMA node0 CPU(s):   0-3
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

This processor is capable of running either at 800 MHz or 1.3 GHz. All the tests were conducted at 800 MHz.

The following table details the characteristics of the SDRAM bank connected to the SoC.


                                       Mito8M
SoC     Model                          NXP i.MX8M Quad
        ARM frequency [MHz]            800
SDRAM   Type                           LPDDR4
        Frequency [MHz]                1600
        Bus width [bit]                32
        Theoretical bandwidth [Gb/s]   102.4
        Size [MB]                      3072
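
As a cross-check, the theoretical bandwidth figure above follows from the usual DDR arithmetic: 1600 MHz I/O clock × 2 transfers per clock × 32-bit bus = 102.4 Gb/s, i.e. 12.8 GB/s. The short C snippet below is only a worked example of this calculation; it is not part of the test suite.

#include <stdio.h>

int main(void)
{
    const double io_clock_mhz  = 1600.0;  /* LPDDR4 I/O clock [MHz] */
    const double bus_width_bit = 32.0;    /* data bus width [bit]   */

    /* DDR memories perform two transfers per clock cycle. */
    double transfer_rate_mts = io_clock_mhz * 2.0;                  /* 3200 MT/s */
    double peak_gbit_s  = transfer_rate_mts * bus_width_bit / 1000.0;
    double peak_gbyte_s = peak_gbit_s / 8.0;

    printf("Theoretical peak: %.1f Gb/s (%.1f GB/s)\n",
           peak_gbit_s, peak_gbyte_s);    /* 102.4 Gb/s (12.8 GB/s) */
    return 0;
}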

Software configuration

  • Linux kernel: 4.14.98
  • Architecture: aarch64
  • Governor: userspace @ 800 MHz
root@Mito8M:~# echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
userspace
root@Mito8M:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
800000
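
For convenience, the same check can also be scripted. The following C sketch, added here purely as an illustration (it is not part of the original procedure), reads the two cpufreq sysfs entries queried above and prints them, for instance right before launching a benchmark.

#include <stdio.h>

/* Print the current cpufreq governor and frequency of cpu0,
 * using the same sysfs entries shown above. */
static void dump_file(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    dump_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    dump_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    return 0;
}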

GCC

armbian@Mito8M:~/devel/lmbench$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/8/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 8.3.0-6' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=aarch64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --disable-libphobos --enable-multiarch --enable-fix-cortex-a53-843419 --disable-werror --enable-checking=release --build=aarch64-linux-gnu --host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Debian 8.3.0-6)

Results

This section details the results achieved by the different benchmarks.

General configuration

Testbed #1

                         Mito8M
ARM frequency [MHz]      792
SDRAM frequency [MHz]    1600
Bus width [bit]          32

Detailed testing procedures

This section details how the benchmarks were configured and run on the testbed.

STREAM

Building

git clone https://github.com/jeffhammond/STREAM.git
make
armbian@Mito8M:~/devel/STREAM$ cat Makefile 
CC = gcc
CFLAGS = -O2 -fopenmp

FC = gfortran-4.9
FFLAGS = -O2 -fopenmp

all: stream_c.exe

stream_f.exe: stream.f mysecond.o
        $(CC) $(CFLAGS) -c mysecond.c
        $(FC) $(FFLAGS) -c stream.f
        $(FC) $(FFLAGS) stream.o mysecond.o -o stream_f.exe

stream_c.exe: stream.c
        $(CC) $(CFLAGS) stream.c -o stream_c.exe

clean:
        rm -f stream_f.exe stream_c.exe *.o

# an example of a more complex build line for the Intel icc compiler
stream.icc: stream.c
        icc -O3 -xCORE-AVX2 -ffreestanding -qopenmp -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 stream.c -o stream.omp.AVX2.80M.20x.icc

Running

armbian@Mito8M:~/devel/STREAM$ ./stream_c.exe 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 46427 microseconds.
   (= 46427 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6770.5     0.024010     0.023632     0.025117
Scale:           6093.2     0.027474     0.026259     0.029142
Add:             5263.5     0.046008     0.045597     0.046230
Triad:           4820.0     0.050297     0.049793     0.050723
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
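
For reference, the Copy, Scale, Add, and Triad figures above correspond to the four STREAM kernels. The simplified C sketch below shows what each kernel does (the real stream.c adds OpenMP pragmas, timing, and validation around these loops); the per-iteration byte counts in the comments are the ones STREAM uses to convert the measured times into MB/s.

#include <stdio.h>

#define STREAM_ARRAY_SIZE 10000000      /* matches "Array size = 10000000" above */

static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

int main(void)
{
    const double s = 3.0;               /* scalar used by Scale and Triad */
    long j;

    for (j = 0; j < STREAM_ARRAY_SIZE; j++) {   /* initialization */
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
    }

    for (j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Copy:  16 bytes/iteration */
        c[j] = a[j];
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Scale: 16 bytes/iteration */
        b[j] = s * c[j];
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Add:   24 bytes/iteration */
        c[j] = a[j] + b[j];
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)     /* Triad: 24 bytes/iteration */
        a[j] = b[j] + s * c[j];

    printf("%f\n", a[0]);               /* keep the loops from being optimized away */
    return 0;
}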

LMbench

Running the test

armbian@Mito8M:~/devel/lmbench$ sudo lmbench-run
[sudo] password for armbian: 
/usr/lib/lmbench/scripts/gnu-os: unable to guess system type

This script, last modified 2004-08-18, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

    ftp://ftp.gnu.org/pub/gnu/config/

If the version you run (/usr/lib/lmbench/scripts/gnu-os) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2004-08-18

uname -m = aarch64
uname -r = 4.14.98-g4c94e1dbaec2
uname -s = Linux
uname -v = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019

/usr/bin/uname -p = 
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = 
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = aarch64
UNAME_RELEASE = 4.14.98-g4c94e1dbaec2
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019
=====================================================================

                L M B E N C H   C ON F I G U R A T I O N
                ----------------------------------------

You need to configure some parameters to lmbench.  Once you have configured
these parameters, you may do multiple runs by saying

        "make rerun"

in the src subdirectory.

NOTICE: please do not have any other activity on the system if you can
help it.  Things like the second hand on your xclock or X perfmeters
are not so good when benchmarking.  In fact, X is not so good when
benchmarking.

=====================================================================

If you are running on an MP machine and you want to try running
multiple copies of lmbench in parallel, you can specify how many here.

Using this option will make the benchmark run 100x slower (sorry).

NOTE:  WARNING! This feature is experimental and many results are 
        known to be incorrect or random!

MULTIPLE COPIES [default 1]: 
=====================================================================

Options to control job placement
1) Allow scheduler to place jobs
2) Assign each benchmark process with any attendent child processes
   to its own processor
3) Assign each benchmark process with any attendent child processes
   to its own processor, except that it will be as far as possible
   from other processes
4) Assign each benchmark and attendent processes to their own
   processors
5) Assign each benchmark and attendent processes to their own
   processors, except that they will be as far as possible from
   each other and other processes
6) Custom placement: you assign each benchmark process with attendent
   child processes to processors
7) Custom placement: you assign each benchmark and attendent
   processes to processors

Note: some benchmarks, such as bw_pipe, create attendent child
processes for each benchmark process.  For example, bw_pipe
needs a second process to send data down the pipe to be read
by the benchmark process.  If you have three copies of the
benchmark process running, then you actually have six processes;
three attendent child processes sending data down the pipes and 
three benchmark processes reading data and doing the measurements.

Job placement selection [default 1]: 
=====================================================================

Hang on, we are calculating your timing granularity.
OK, it looks like you can time stuff down to 5000 usec resolution.

Hang on, we are calculating your timing overhead.
OK, it looks like your gettimeofday() costs 0 usecs.

Hang on, we are calculating your loop overhead.
OK, it looks like your benchmark loop costs 0.00000136 usecs.

=====================================================================

Several benchmarks operate on a range of memory.  This memory should be
sized such that it is at least 4 times as big as the external cache[s]
on your system.   It should be no more than 80% of your physical memory.

The bigger the range, the more accurate the results, but larger sizes
take somewhat longer to run the benchmark.

MB [default 2097]: 1024
Checking to see if you have 1024 MB; please wait for a moment...
1024MB OK
1024MB OK
1024MB OK
Hang on, we are calculating your cache line size.
OK, it looks like your cache line is 64 bytes.

=====================================================================

lmbench measures a wide variety of system performance, and the full suite
of benchmarks can take a long time on some platforms.  Consequently, we
offer the capability to run only predefined subsets of benchmarks, one
for operating system specific benchmarks and one for hardware specific
benchmarks.  We also offer the option of running only selected benchmarks
which is useful during operating system development.

Please remember that if you intend to publish the results you either need
to do a full run or one of the predefined OS or hardware subsets.

SUBSET (ALL|HARWARE|OS|DEVELOPMENT) [default all]: 
=====================================================================

This benchmark measures, by default, memory latency for a number of
different strides.  That can take a long time and is most useful if you
are trying to figure out your cache line size or if your cache line size
is greater than 128 bytes.

If you are planning on sending in these results, please don't do a fast
run.

Answering yes means that we measure memory latency with a 128 byte stride.  

FASTMEM [default no]: 
=====================================================================

This benchmark measures, by default, file system latency.  That can
take a long time on systems with old style file systems (i.e., UFS,
FFS, etc.).  Linux' ext2fs and Sun's tmpfs are fast enough that this
test is not painful.

If you are planning on sending in these results, please don't do a fast
run.

If you want to skip the file system latency tests, answer "yes" below.

SLOWFS [default no]: yes
=====================================================================

This benchmark can measure disk zone bandwidths and seek times.  These can
be turned into whizzy graphs that pretty much tell you everything you might
need to know about the performance of your disk.  

This takes a while and requires read access to a disk drive.  
Write is not measured, see disk.c to see how if you want to do so.

If you want to skip the disk tests, hit return below.

If you want to include disk tests, then specify the path to the disk
device, such as /dev/sda.  For each disk that is readable, you'll be
prompted for a one line description of the drive, i.e., 

        Iomega IDE ZIP
or
        HP C3725S 2GB on 10MB/sec NCR SCSI bus

DISKS [default none]: 
=====================================================================

If you are running on an idle network and there are other, identically
configured systems, on the same wire (no gateway between you and them),
and you have rsh access to them, then you should run the network part
of the benchmarks to them.  Please specify any such systems as a space
separated list such as: ether-host fddi-host hippi-host.

REMOTE [default none]: 
=====================================================================

Calculating mhz, please wait for a moment...
I think your CPU mhz is 

        798 MHz, 1.2531 nanosec clock

but I am frequently wrong.  If that is the wrong Mhz, type in your
best guess as to your processor speed.  It doesn't have to be exact,
but if you know it is around 800, say 800.  

Please note that some processors, such as the P4, have a core which
is double-clocked, so on those processors the reported clock speed
will be roughly double the advertised clock rate.  For example, a
1.8GHz P4 may be reported as a 3592MHz processor.

Processor mhz [default 798 MHz, 1.2531 nanosec clock]: 
=====================================================================

We need a place to store a 1024 Mbyte file as well as create and delete a
large number of small files.  We default to /var/tmp.  If /var/tmp is a
memory resident file system (i.e., tmpfs), pick a different place.
Please specify a directory that has enough space and is a local file
system.

FSDIR [default /var/tmp/lmbench]: /tmp/lmbench
=====================================================================

lmbench outputs status information as it runs various benchmarks.
By default this output is sent to /dev/tty, but you may redirect
it to any file you wish (such as /dev/null...).

Status output file [default /dev/tty]: 
=====================================================================

There is a database of benchmark results that is shipped with new
releases of lmbench.  Your results can be included in the database
if you wish.  The more results the better, especially if they include
remote networking.  If your results are interesting, i.e., for a new
fast box, they may be made available on the lmbench web page, which is

        http://www.bitmover.com/lmbench

Mail results [default yes]: no
OK, no results mailed.
=====================================================================

Confguration done, thanks.

There is a mailing list for discussing lmbench hosted at BitMover. 
Send mail to majordomo@bitmover.com to join the list.

/usr/lib/lmbench/scripts/gnu-os: unable to guess system type

This script, last modified 2004-08-18, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

    ftp://ftp.gnu.org/pub/gnu/config/

If the version you run (/usr/lib/lmbench/scripts/gnu-os) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2004-08-18

uname -m = aarch64
uname -r = 4.14.98-g4c94e1dbaec2
uname -s = Linux
uname -v = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019

/usr/bin/uname -p = 
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = 
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = aarch64
UNAME_RELEASE = 4.14.98-g4c94e1dbaec2
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019
Using config in CONFIG.Mito8M
Wed Jan 15 10:56:54 CET 2020
Latency measurements
Wed Jan 15 10:57:29 CET 2020
Local networking
Wed Jan 15 10:58:36 CET 2020
Bandwidth measurements
Wed Jan 15 11:03:02 CET 2020
Calculating context switch overhead
Wed Jan 15 11:03:09 CET 2020
Calculating effective TLB size
Wed Jan 15 11:03:10 CET 2020
Calculating memory load parallelism
Wed Jan 15 11:14:34 CET 2020
McCalpin's STREAM benchmark
Wed Jan 15 11:15:30 CET 2020
Calculating memory load latency
Wed Jan 15 11:35:54 CET 2020
Benchmark run finished....
Remember you can find the results of the benchmark 
under /var/lib/lmbench/results


Results

/usr/lib/lmbench/scripts/gnu-os: unable to guess system type

This script, last modified 2004-08-18, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

    ftp://ftp.gnu.org/pub/gnu/config/

If the version you run (/usr/lib/lmbench/scripts/gnu-os) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2004-08-18

uname -m = aarch64
uname -r = 4.14.98-g4c94e1dbaec2
uname -s = Linux
uname -v = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019

/usr/bin/uname -p = 
/bin/uname -X     = 

hostinfo               = 
/bin/universe          = 
/usr/bin/arch -k       = 
/bin/arch              = 
/usr/bin/oslevel       = 
/usr/convex/getsysinfo = 

UNAME_MACHINE = aarch64
UNAME_RELEASE = 4.14.98-g4c94e1dbaec2
UNAME_SYSTEM  = Linux
UNAME_VERSION = #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019
[lmbench3.0 results for Linux Mito8M 4.14.98-g4c94e1dbaec2 #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019 aarch64 GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m 256m 512m 1024m]
[DISKS: ]
[DISK_DESC: ]
[ENOUGH: 5000]
[FAST: ]
[FASTMEM: NO]
[FILE: /tmp/lmbench/XXX]
[FSDIR: /tmp/lmbench]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m 256m 512m]
[INFO: INFO.Mito8M]
[LINE_SIZE: 64]
[LOOP_O: 0.00000136]
[MB: 1024]
[MHZ: 798 MHz, 1.2531 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 4]
[REMOTE: ]
[SLOWFS: YES]
[OS: ]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-20200115]
[USER: root]
[HOSTNAME: Mito8M]
[NODENAME: Mito8M]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: aarch64]
[RELEASE: 4.14.98-g4c94e1dbaec2]
[VERSION: #1 SMP PREEMPT Mon Sep 30 14:46:22 CEST 2019]
[Wed Jan 15 10:56:54 CET 2020]
[ 10:56:54 up 1:19, 2 users, load average: 0.14, 0.08, 0.09]
[net: Kernel Interface table]
[net: Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg]
[net: eth0      1500    69223      0      0 0          4476      0      0      0 BMRU]
[if: eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500]
[if: inet 192.168.0.81  netmask 255.255.255.0  broadcast 192.168.0.255]
[if: inet6 fe80::250:c2ff:fe1e:afb2  prefixlen 64  scopeid 0x20<link>]
[if: ether 00:50:c2:1e:af:b2  txqueuelen 1000  (Ethernet)]
[if: RX packets 69223  bytes 7960607 (7.5 MiB)]
[if: RX errors 0  dropped 0  overruns 0  frame 0]
[if: TX packets 4476  bytes 526438 (514.0 KiB)]
[if: TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0]
[if: ]
[net: lo       65536        0      0      0 0             0      0      0      0 LRU]
[if: lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536]
[if: inet 127.0.0.1  netmask 255.0.0.0]
[if: inet6 ::1  prefixlen 128  scopeid 0x10<host>]
[if: loop  txqueuelen 1000  (Local Loopback)]
[if: RX packets 0  bytes 0 (0.0 B)]
[if: RX errors 0  dropped 0  overruns 0  frame 0]
[if: TX packets 0  bytes 0 (0.0 B)]
[if: TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0]
[if: ]
[mount: /dev/mmcblk1p2 on / type ext4 (rw,relatime,data=ordered)]
[mount: devtmpfs on /dev type devtmpfs (rw,relatime,size=1042644k,nr_inodes=260661,mode=755)]
[mount: sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)]
[mount: proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)]
[mount: tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)]
[mount: devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)]
[mount: tmpfs on /run type tmpfs (rw,nosuid,nodev,mode=755)]
[mount: tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)]
[mount: tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)]
[mount: cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)]
[mount: cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)]
[mount: pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)]
[mount: cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)]
[mount: cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)]
[mount: cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)]
[mount: cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)]
[mount: cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)]
[mount: cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)]
[mount: cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)]
[mount: cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)]
[mount: hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)]
[mount: mqueue on /dev/mqueue type mqueue (rw,relatime)]
[mount: debugfs on /sys/kernel/debug type debugfs (rw,relatime)]
[mount: configfs on /sys/kernel/config type configfs (rw,relatime)]
[mount: tmpfs on /tmp type tmpfs (rw,nosuid,relatime)]
[mount: /dev/mmcblk1p2 on /var/log.hdd type ext4 (rw,relatime,data=ordered)]
[mount: armbian-ramlog on /var/log type tmpfs (rw,nosuid,nodev,noexec,relatime,size=51200k,mode=755)]
[mount: tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=306960k,mode=700)]
[mount: tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=306960k,mode=700,uid=1000,gid=1000)]
Simple syscall: 0.4661 microseconds
Simple read: 0.9448 microseconds
Simple write: 0.6909 microseconds
Simple stat: 3.9654 microseconds
Simple fstat: 0.8820 microseconds
Simple open/close: 9.8745 microseconds
Select on 10 fd's: 1.5150 microseconds
Select on 100 fd's: 9.7926 microseconds
Select on 250 fd's: 23.1519 microseconds
Select on 500 fd's: 45.9008 microseconds
Select on 10 tcp fd's: 1.9205 microseconds
Select on 100 tcp fd's: 28.1860 microseconds
Select on 250 tcp fd's: 71.2308 microseconds
Select on 500 tcp fd's: 143.9744 microseconds
Signal handler installation: 0.7864 microseconds
Signal handler overhead: 6.5095 microseconds
Protection fault: 0.6260 microseconds
Pipe latency: 123.2715 microseconds
AF_UNIX sock stream latency: 244.8182 microseconds
Process fork+exit: 789.7143 microseconds
Process fork+execve: 835.7143 microseconds
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
sh: 1: /var/tmp/lmbench/hello: not found
Process fork+/bin/sh -c: 3027.0000 microseconds
integer bit: 0.84 nanoseconds
integer add: 1.25 nanoseconds
integer mul: 0.04 nanoseconds
integer div: 7.54 nanoseconds
integer mod: 7.95 nanoseconds
int64 bit: 0.84 nanoseconds
uint64 add: 1.25 nanoseconds
int64 mul: 0.04 nanoseconds
int64 div: 11.91 nanoseconds
int64 mod: 9.21 nanoseconds
float add: 5.02 nanoseconds
float mul: 5.01 nanoseconds
float div: 16.29 nanoseconds
double add: 5.01 nanoseconds
double mul: 5.01 nanoseconds
double div: 27.61 nanoseconds
float bogomflops: 30.26 nanoseconds
double bogomflops: 41.62 nanoseconds
integer bit parallelism: 2.54
integer add parallelism: 1.74
integer mul parallelism: 16.00
integer div parallelism: 1.00
integer mod parallelism: 1.16
int64 bit parallelism: 1.30
int64 add parallelism: 1.74
int64 mul parallelism: 16.00
int64 div parallelism: 1.00
int64 mod parallelism: 1.19
float add parallelism: 7.65
float mul parallelism: 1.99
float div parallelism: 1.30
double add parallelism: 7.65
double mul parallelism: 1.99
double div parallelism: 1.16
File /tmp/lmbench/XXX write bandwidth: 514687 KB/sec
Pagefaults on /tmp/lmbench/XXX: 1.7197 microseconds

"mappings
0.524288 24
1.048576 34
2.097152 81
4.194304 139
8.388608 255
16.777216 520
33.554432 823
67.108864 1558
134.217728 3012
268.435456 5890
536.870912 12064
1073.741824 24550

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 152.9154 microseconds
TCP latency using localhost: 165.0752 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
: RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 238.0435 microseconds

Socket bandwidth using localhost
0.000001 0.36 MB/sec
0.000064 19.56 MB/sec
0.000128 38.65 MB/sec
0.000256 74.58 MB/sec
0.000512 107.61 MB/sec
0.001024 183.67 MB/sec
0.001437 225.20 MB/sec
10.000000 519.78 MB/sec

Avg xfer: 3.2KB, 41.8KB in 9.6060 millisecs, 4.35 MB/sec
AF_UNIX sock stream bandwidth: 1151.04 MB/sec
Pipe bandwidth: 648.43 MB/sec

"read bandwidth
0.000512 178.52
0.001024 322.46
0.002048 535.57
0.004096 807.32
0.008192 971.25
0.016384 1010.48
0.032768 1057.22
0.065536 1108.18
0.131072 1119.05
0.262144 1122.57
0.524288 1094.55
1.05 938.91
2.10 904.07
4.19 886.18
8.39 886.18
16.78 890.84
33.55 886.96
67.11 889.93
134.22 891.07
268.44 891.93
536.87 887.85
1073.74 887.79

"read open2close bandwidth
0.000512 38.80
0.001024 75.98
0.002048 141.66
0.004096 266.11
0.008192 431.55
0.016384 590.86
0.032768 786.29
0.065536 940.36
0.131072 1020.48
0.262144 1074.57
0.524288 1056.45
1.05 924.02
2.10 896.22
4.19 872.18
8.39 882.73
16.78 884.45
33.55 875.52
67.11 889.55
134.22 879.22
268.44 890.73
536.87 890.38
1073.74 890.74


"Mmap read bandwidth
0.000512 2602.40
0.001024 2959.46
0.002048 3053.70
0.004096 3105.90
0.008192 3132.37
0.016384 3136.92
0.032768 2947.17
0.065536 2956.24
0.131072 3029.47
0.262144 3018.66
0.524288 2980.50
1.05 2376.83
2.10 2085.34
4.19 2007.80
8.39 1882.12
16.78 1984.06
33.55 1600.57
67.11 1680.20
134.22 1882.81
268.44 1666.52
536.87 1791.82
1073.74 1797.60

"Mmap read open2close bandwidth
0.000512 19.78
0.001024 40.29
0.002048 78.50
0.004096 153.44
0.008192 278.99
0.016384 458.79
0.032768 665.80
0.065536 925.98
0.131072 1077.11
0.262144 1257.16
0.524288 1166.60
1.05 1091.70
2.10 1014.75
4.19 1016.31
8.39 1064.00
16.78 1080.03
33.55 1101.01
67.11 1113.45
134.22 1130.64
268.44 1131.09
536.87 1144.06
1073.74 1144.29


"libc bcopy unaligned
0.000512 3551.59
0.001024 3722.84
0.002048 3962.06
0.004096 4104.18
0.008192 4177.65
0.016384 4209.88
0.032768 4162.83
0.065536 3524.59
0.131072 3373.95
0.262144 3463.24
0.524288 3492.59
1.05 2583.68
2.10 1920.82
4.19 1847.71
8.39 1881.06
16.78 1900.89
33.55 1902.93
67.11 1921.57
134.22 1911.66
268.44 1842.60
536.87 1869.13

"libc bcopy aligned
0.000512 3550.95
0.001024 3722.84
0.002048 3961.20
0.004096 4098.96
0.008192 4168.56
0.016384 4203.77
0.032768 3620.61
0.065536 3517.67
0.131072 3506.09
0.262144 3500.30
0.524288 3497.25
1.05 2784.78
2.10 1919.41
4.19 1851.79
8.39 1897.02
16.78 1891.24
33.55 1916.08
67.11 1831.27
134.22 1865.95
268.44 1872.73
536.87 1843.40

Memory bzero bandwidth
0.000512 6004.59
0.001024 5740.71
0.002048 6220.07
0.004096 6494.53
0.008192 6481.06
0.016384 6437.86
0.032768 6417.37
0.065536 6673.53
0.131072 6313.59
0.262144 6413.60
0.524288 6359.03
1.05 6340.31
2.10 6342.99
4.19 6347.53
8.39 6351.40
16.78 6349.39
33.55 6353.80
67.11 6350.80
134.22 6353.80
268.44 6350.80
536.87 6353.20
1073.74 6353.05

"unrolled bcopy unaligned
0.000512 1551.08
0.001024 1567.98
0.002048 1580.01
0.004096 1582.92
0.008192 1580.63
0.016384 1576.81
0.032768 1526.29
0.065536 1547.63
0.131072 1542.32
0.262144 1531.61
0.524288 1522.16
1.05 1477.91
2.10 1396.47
4.19 1399.27
8.39 1384.26
16.78 1393.11
33.55 1360.79
67.11 1355.46
134.22 1347.74
268.44 1346.34
536.87 1349.81

"unrolled partial bcopy unaligned
0.000512 5366.87
0.001024 5732.92
0.002048 5950.96
0.004096 6061.81
0.008192 6062.68
0.016384 4149.98
0.032768 2630.01
0.065536 2331.18
0.131072 2374.49
0.262144 2283.65
0.524288 1714.02
1.05 823.70
2.10 700.92
4.19 712.23
8.39 705.34
16.78 715.20
33.55 710.54
67.11 702.98
134.22 702.67
268.44 702.66
536.87 702.67

Memory read bandwidth
0.000512 1553.33
0.001024 1567.32
0.002048 1575.47
0.004096 1575.84
0.008192 1577.44
0.016384 1577.32
0.032768 1528.31
0.065536 1531.16
0.131072 1547.31
0.262144 1552.68
0.524288 1514.46
1.05 1318.73
2.10 1430.28
4.19 1420.11
8.39 1423.49
16.78 1420.11
33.55 1365.61
67.11 1393.40
134.22 1382.68
268.44 1372.59
536.87 1367.69
1073.74 1383.41

Memory partial read bandwidth
0.000512 5754.53
0.001024 5952.22
0.002048 6061.44
0.004096 6116.52
0.008192 6110.93
0.016384 6146.99
0.032768 5219.52
0.065536 4957.59
0.131072 4960.55
0.262144 4864.33
0.524288 4747.76
1.05 3665.68
2.10 2242.14
4.19 2065.82
8.39 2085.68
16.78 2082.32
33.55 2090.62
67.11 2054.21
134.22 2034.93
268.44 2022.62
536.87 2010.74
1073.74 2035.27

Memory write bandwidth
0.000512 2932.69
0.001024 3048.35
0.002048 3100.01
0.004096 3136.82
0.008192 3135.21
0.016384 3150.89
0.032768 2864.03
0.065536 3033.63
0.131072 3093.21
0.262144 2956.69
0.524288 3024.21
1.05 3075.53
2.10 3095.43
4.19 3121.69
8.39 3137.88
16.78 3145.92
33.55 3146.51
67.11 3146.81
134.22 3147.55
268.44 3150.76
536.87 3144.84
1073.74 3146.70

Memory partial write bandwidth
0.000512 8862.29
0.001024 9370.34
0.002048 9667.70
0.004096 9789.29
0.008192 9742.84
0.016384 9860.45
0.032768 6392.60
0.065536 5964.29
0.131072 5614.61
0.262144 5208.16
0.524288 4765.41
1.05 4142.23
2.10 1082.49
4.19 1022.63
8.39 1110.04
16.78 1216.71
33.55 1299.30
67.11 1393.17
134.22 1392.62
268.44 1377.13
536.87 1372.36
1073.74 1342.90

Memory partial read/write bandwidth
0.000512 3958.33
0.001024 4058.12
0.002048 4114.18
0.004096 4139.05
0.008192 4130.17
0.016384 4138.21
0.032768 3709.83
0.065536 3598.56
0.131072 3662.55
0.262144 3611.19
0.524288 3581.97
1.05 2936.60
2.10 1064.36
4.19 1098.27
8.39 1111.81
16.78 1142.94
33.55 1275.88
67.11 1231.78
134.22 1296.18
268.44 1295.56
536.87 1278.34
1073.74 1270.79



"size=0k ovr=3.79
2 56.44
4 56.59
8 124.81
16 58.93
24 110.52
32 60.00
64 61.85
96 62.82

"size=4k ovr=5.00
2 57.77
4 56.95
8 82.19
16 59.53
24 59.88
32 61.65
64 64.32
96 66.80

"size=8k ovr=6.43
2 56.43
4 57.34
8 58.47
16 59.56
24 60.87
32 62.95
64 65.05
96 68.21

"size=16k ovr=9.51
2 56.54
4 57.72
8 58.46
16 60.25
24 63.60
32 122.58
64 147.96
96 213.98

"size=32k ovr=15.74
2 55.58
4 126.61
8 58.38
16 62.03
24 62.60
32 171.45
64 76.13
96 190.07

"size=64k ovr=28.57
2 265.77
4 55.57
8 95.64
16 63.61
24 93.98
32 100.57
64 89.26
96 89.15

tlb: 10 pages

Memory load parallelism
0.001024 3.00
0.002048 3.00
0.004096 3.00
0.008192 3.00
0.016384 3.64
0.032768 1.66
0.065536 1.55
0.131072 1.49
0.262144 1.57
0.524288 1.68
1.048576 1.70
2.097152 2.59
4.194304 2.73
8.388608 2.80
16.777216 2.80
33.554432 2.78
67.108864 2.78
134.217728 2.73
268.435456 2.79
536.870912 2.78

STREAM copy latency: 5.68 nanoseconds
STREAM copy bandwidth: 2816.97 MB/sec
STREAM scale latency: 10.43 nanoseconds
STREAM scale bandwidth: 1533.70 MB/sec
STREAM add latency: 16.36 nanoseconds
STREAM add bandwidth: 1467.01 MB/sec
STREAM triad latency: 19.24 nanoseconds
STREAM triad bandwidth: 1247.44 MB/sec
STREAM2 fill latency: 2.52 nanoseconds
STREAM2 fill bandwidth: 3179.01 MB/sec
STREAM2 copy latency: 5.66 nanoseconds
STREAM2 copy bandwidth: 2828.07 MB/sec
STREAM2 daxpy latency: 19.93 nanoseconds
STREAM2 daxpy bandwidth: 1204.13 MB/sec
STREAM2 sum latency: 5.52 nanoseconds
STREAM2 sum bandwidth: 1448.21 MB/sec

Memory load latency
"stride=16
0.00049 3.774
0.00098 3.774
0.00195 3.768
0.00293 3.768
0.00391 3.766
0.00586 3.765
0.00781 3.764
0.01172 3.766
0.01562 3.781
0.02344 3.779
0.03125 3.829
0.04688 4.245
0.06250 3.855
0.09375 3.991
0.12500 3.872
0.18750 3.874
0.25000 3.877
0.37500 3.879
0.50000 3.881
0.75000 4.945
1.00000 4.956
1.50000 5.936
2.00000 7.350
3.00000 7.674
4.00000 7.709
6.00000 7.675
8.00000 7.675
12.00000 7.689
16.00000 7.682
24.00000 7.691
32.00000 7.687
48.00000 7.732
64.00000 7.672
96.00000 7.731
128.00000 7.706
192.00000 7.742
256.00000 7.690
384.00000 7.781
512.00000 7.808
768.00000 7.680
1024.00000 7.739

"stride=32
0.00049 3.774
0.00098 3.774
0.00195 3.774
0.00293 3.774
0.00391 3.768
0.00586 3.768
0.00781 3.766
0.01172 3.765
0.01562 3.764
0.02344 3.765
0.03125 3.794
0.04688 5.374
0.06250 5.448
0.09375 5.505
0.12500 5.561
0.18750 5.567
0.25000 5.554
0.37500 5.630
0.50000 5.610
0.75000 7.099
1.00000 7.991
1.50000 11.211
2.00000 14.416
3.00000 15.328
4.00000 15.526
6.00000 15.409
8.00000 15.383
12.00000 15.366
16.00000 15.366
24.00000 15.382
32.00000 15.326
48.00000 15.382
64.00000 15.542
96.00000 15.480
128.00000 15.327
192.00000 15.358
256.00000 15.389
384.00000 15.418
512.00000 15.508
768.00000 15.563
1024.00000 15.503

"stride=64
0.00049 3.774
0.00098 3.774
0.00195 3.774
0.00293 3.775
0.00391 3.775
0.00586 3.774
0.00781 3.768
0.01172 3.768
0.01562 3.767
0.02344 3.770
0.03125 3.790
0.04688 8.755
0.06250 9.246
0.09375 9.513
0.12500 9.662
0.18750 9.906
0.25000 10.109
0.37500 10.350
0.50000 10.461
0.75000 13.299
1.00000 14.995
1.50000 22.196
2.00000 28.417
3.00000 30.733
4.00000 30.937
6.00000 30.878
8.00000 30.714
12.00000 30.718
16.00000 30.670
24.00000 30.858
32.00000 30.684
48.00000 30.711
64.00000 30.965
96.00000 30.775
128.00000 30.880
192.00000 30.766
256.00000 30.742
384.00000 30.793
512.00000 30.677
768.00000 30.766
1024.00000 30.829

"stride=128
0.00049 3.774
0.00098 3.777
0.00195 3.775
0.00293 3.774
0.00391 3.774
0.00586 3.775
0.00781 3.775
0.01172 3.775
0.01562 3.769
0.02344 3.787
0.03125 3.800
0.04688 8.927
0.06250 9.363
0.09375 9.554
0.12500 9.705
0.18750 9.938
0.25000 10.172
0.37500 10.418
0.50000 10.539
0.75000 14.301
1.00000 16.553
1.50000 24.201
2.00000 30.302
3.00000 32.699
4.00000 32.816
6.00000 32.822
8.00000 32.771
12.00000 32.656
16.00000 32.742
24.00000 32.623
32.00000 32.660
48.00000 32.643
64.00000 32.613
96.00000 32.652
128.00000 32.632
192.00000 32.610
256.00000 32.662
384.00000 32.663
512.00000 32.690
768.00000 32.667
1024.00000 32.655

"stride=256
0.00049 3.774
0.00098 3.774
0.00195 3.774
0.00293 3.774
0.00391 3.774
0.00586 3.774
0.00781 3.774
0.01172 3.774
0.01562 3.774
0.02344 3.775
0.03125 3.793
0.04688 9.306
0.06250 9.544
0.09375 9.767
0.12500 9.954
0.18750 10.175
0.25000 10.324
0.37500 10.450
0.50000 10.530
0.75000 13.414
1.00000 15.577
1.50000 27.112
2.00000 34.680
3.00000 36.844
4.00000 36.874
6.00000 36.947
8.00000 36.965
12.00000 36.938
16.00000 36.921
24.00000 36.950
32.00000 36.949
48.00000 36.940
64.00000 36.938
96.00000 36.948
128.00000 36.948
192.00000 36.956
256.00000 36.953
384.00000 36.955
512.00000 36.940
768.00000 36.954
1024.00000 36.954

"stride=512
0.00049 3.774
0.00098 3.774
0.00195 3.774
0.00293 3.774
0.00391 3.774
0.00586 3.774
0.00781 3.774
0.01172 3.774
0.01562 3.774
0.02344 3.774
0.03125 3.973
0.04688 11.946
0.06250 14.065
0.09375 16.036
0.12500 16.886
0.18750 17.677
0.25000 18.112
0.37500 18.531
0.50000 18.806
0.75000 34.679
1.00000 67.490
1.50000 93.222
2.00000 122.788
3.00000 138.216
4.00000 138.596
6.00000 138.892
8.00000 139.118
12.00000 139.110
16.00000 139.136
24.00000 139.142
32.00000 139.196
48.00000 139.158
64.00000 139.170
96.00000 139.181
128.00000 139.170
192.00000 139.158
256.00000 139.161
384.00000 139.179
512.00000 139.175
768.00000 139.191
1024.00000 139.190

"stride=1024
0.00098 3.774
0.00195 3.774
0.00293 3.774
0.00391 3.774
0.00586 3.774
0.00781 3.774
0.01172 3.774
0.01562 3.775
0.02344 3.777
0.03125 4.059
0.04688 12.131
0.06250 13.972
0.09375 15.913
0.12500 16.793
0.18750 17.656
0.25000 18.123
0.37500 18.563
0.50000 18.770
0.75000 28.357
1.00000 40.553
1.50000 96.994
2.00000 129.405
3.00000 138.209
4.00000 138.525
6.00000 138.829
8.00000 138.928
12.00000 139.041
16.00000 139.089
24.00000 139.174
32.00000 139.179
48.00000 139.187
64.00000 139.178
96.00000 139.204
128.00000 139.231
192.00000 139.212
256.00000 139.251
384.00000 139.243
512.00000 139.265
768.00000 139.243
1024.00000 139.243


Random load latency
"stride=16
0.00049 3.774
0.00098 3.774
0.00195 3.768
0.00293 3.768
0.00391 3.766
0.00586 3.765
0.00781 3.764
0.01172 3.764
0.01562 3.774
0.02344 3.777
0.03125 7.474
0.04688 14.559
0.06250 17.335
0.09375 19.694
0.12500 20.071
0.18750 20.085
0.25000 20.076
0.37500 20.080
0.50000 20.131
0.75000 25.664
1.00000 40.659
1.50000 100.242
2.00000 131.377
3.00000 142.919
4.00000 143.486
6.00000 143.804
8.00000 144.001
12.00000 144.033
16.00000 144.182
24.00000 144.215
32.00000 144.321
48.00000 144.168
64.00000 144.308
96.00000 144.381
128.00000 144.247
192.00000 144.276
256.00000 144.213
384.00000 144.338
512.00000 144.268
768.00000 144.337
1024.00000 144.361



[Wed Jan 15 11:35:54 CET 2020]
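
The "Memory load latency" and "Random load latency" tables above are generated by lmbench's lat_mem_rd program: it walks a chain of pointers laid out with the given stride across an array of the size listed in the first column (MB) and reports the average time per dependent load in nanoseconds (second column). The initial ~3.8 ns plateau therefore roughly corresponds to L1 hits (about 3 cycles at 800 MHz), while the ~139-144 ns plateau at the largest sizes reflects LPDDR4 accesses. The C sketch below illustrates the pointer-chasing idea only; the working-set size and stride are arbitrary example values, and the real lat_mem_rd is considerably more careful about timing and memory layout.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t size   = 64 * 1024 * 1024;   /* working set: 64 MB (example value) */
    const size_t stride = 128;                /* bytes between successive loads     */
    const size_t n      = size / stride;
    const long   iters  = 10 * 1000 * 1000;

    /* Build a circular chain of pointers, one every 'stride' bytes. */
    char **chain = malloc(size);
    if (!chain)
        return 1;
    for (size_t i = 0; i < n; i++)
        *(char **)((char *)chain + i * stride) =
            (char *)chain + ((i + 1) % n) * stride;

    /* Chase the chain: every iteration is one dependent load. */
    struct timespec t0, t1;
    char **p = chain;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = (char **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the loop is not optimized away. */
    printf("%p: ~%.2f ns per load\n", (void *)p, ns / iters);

    free(chain);
    return 0;
}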

Useful links