ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 4

Info Box
Applies to: Machine Learning
Status: Work in progress


History

Version | Date           | Notes
1.0.0   | September 2020 | First public release

Introduction

This Technical Note (TN for short) belongs to the series introduced here. In particular, it illustrates the execution of different versions of an inference application (a fruit classifier) that makes use of the model described in this section, when run on the NXP i.MX8M Plus EVK. In addition, this document compares the results achieved with those produced by the platforms considered in the previous articles of this series.

Specifically, the following versions of the application were tested:

  • Version 1: This version is the same as the one described in this article. As such, inference is implemented in software and is applied to images retrieved from files.
  • Version 2: This version is functionally equivalent to version 1, but it leverages the Neural Processing Unit (NPU) to hardware-accelerate the inference.
  • Version 3: This is like version 2, but the inference is applied to frames captured from an image sensor.

Test Bed

The kernel and the root file system of the tested platform were built with the L5.4.24_2.1.0 release of the Yocto Board Support Package (BSP) for the i.MX 8 family of devices. They were built with support for eIQ: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".

The following table details the relevant specs of the test bed.

NXP Linux BSP release                 | L5.4.24_2.1.0
Inference engine                      | TensorFlow Lite 2.1
Maximum ARM cores frequency [MHz]     | 1800
SDRAM memory frequency (LPDDR4) [MHz] | TBD
Governor                              | ondemand

Model deployment and inference applications

Version 1

The C++ application previously used and described here was adapted to work with the new NXP Linux BSP release. It now uses OpenCV 4.2.0 to pre-process the input image and TensorFlow Lite (TFL) 2.1 as the inference engine. It still supports all three TFL models previously tested on the Mito8M SoM (a sketch of the overall flow is shown after the following list):

  • 32-bit floating-point model;
  • half-quantized model (post-training 8-bit quantization of the weights only);
  • fully-quantized model (TensorFlow v1 quantization-aware training and 8-bit quantization of the weights and activations).
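
To make the structure of this version more concrete, below is a minimal sketch of such an application using the TFL 2.1 C++ API and OpenCV. It is not the actual source code: the model file name, input resolution, normalization, and thread count are placeholders chosen for illustration only.

// Minimal sketch of the version 1 flow: CPU inference on an image file.
// Model file name, input size, and normalization are assumptions, not the
// actual parameters of the fruit classifier.
#include <cstring>
#include <memory>
#include <opencv2/opencv.hpp>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main(int argc, char* argv[]) {
  // Load the TFL flatbuffer (the 32-bit floating-point variant is assumed here).
  auto model = tflite::FlatBufferModel::BuildFromFile("fruit_classifier.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->SetNumThreads(2);             // the thread parameter tweaked in the tests
  interpreter->AllocateTensors();

  // Pre-processing with OpenCV: load, resize, convert to RGB, and normalize.
  cv::Mat img = cv::imread(argv[1]);
  cv::resize(img, img, cv::Size(224, 224));  // input size: assumption
  cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
  img.convertTo(img, CV_32FC3, 1.0 / 255.0); // normalization: assumption

  // Copy the pre-processed pixels into the input tensor and run the inference.
  std::memcpy(interpreter->typed_input_tensor<float>(0), img.data,
              img.total() * img.elemSize());
  interpreter->Invoke();

  // Post-processing: the output tensor holds one score per class.
  const float* scores = interpreter->typed_output_tensor<float>(0);
  (void)scores;  // the best-scoring class would be reported here
  return 0;
}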

Version 2

The version 1 application was then modified to accelerate the inference using the NPU (ML module) of the i.MX8M Plus SoC. This is possible because "the TensorFlow Lite library uses the Android NN API driver implementation from the GPU/ML module driver for running inference using the GPU/ML module".

Neither the floating-point model nor the half-quantized model works on the NPU. Moreover, "the GPU/ML module driver does not support per-channel quantization yet. Therefore post-training quantization of models with TensorFlow v2 cannot be used if the model is supposed to run on the GPU/ML module (inference on CPU does not have this limitation). TensorFlow v1 quantization-aware training and model conversion is recommended in this case".

So, only the fully-quantized model was tested with the version 2 application.
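
As a reference, the sketch below shows one way the NPU can be engaged from the TFL 2.1 C++ API, namely by attaching the Android NN API delegate to the interpreter. This is an assumption about the mechanism, not the application's actual code; the interpreter is built as in the version 1 sketch, but from the fully-quantized model.

// Sketch of enabling NPU acceleration through the Android NN API delegate
// (TFL 2.1). Assumption: the interpreter was built from the fully-quantized
// model, the only one supported by the GPU/ML module.
#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"

tflite::StatefulNnApiDelegate nnapi_delegate;
if (interpreter->ModifyGraphWithDelegate(&nnapi_delegate) != kTfLiteOk) {
  // The delegate could not take over the graph: inference falls back to the CPU.
}
interpreter->AllocateTensors();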

Version 3

A new C++ application was written to apply the inference to frames captured from an image sensor (OV5640) instead of to images retrieved from files. Like version 2, inference runs on the NPU, so only the fully-quantized model was tested with the version 3 application.

Note that with this image sensor, the frame rate is capped at 30 fps.
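
A possible structure of the capture-and-classify loop is sketched below. The use of OpenCV's VideoCapture, the device index, and the pre-processing parameters are assumptions made for illustration; the actual capture pipeline of the application may differ.

// Sketch of the version 3 main loop: grab frames from the camera and run the
// (NPU-accelerated) inference on each of them. Device index, input size, and
// pre-processing are assumptions.
#include <cstdint>
#include <cstring>
#include <opencv2/opencv.hpp>
#include "tensorflow/lite/interpreter.h"

void classify_stream(tflite::Interpreter* interpreter) {
  cv::VideoCapture cap(0);                         // OV5640 exposed as /dev/video0: assumption
  cv::Mat frame;
  while (cap.read(frame)) {                        // the sensor caps the rate at 30 fps
    cv::resize(frame, frame, cv::Size(224, 224));  // input size: assumption
    cv::cvtColor(frame, frame, cv::COLOR_BGR2RGB);
    // Fully-quantized model: the input tensor is 8-bit.
    std::memcpy(interpreter->typed_input_tensor<uint8_t>(0), frame.data,
                frame.total() * frame.elemSize());
    interpreter->Invoke();
    // ... post-process the output tensor and report the detected class ...
  }
}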

Running the applications

As stated in the first article of this series, one of the goals is to evaluate the performance of the inference applications. Before and after the execution of the inference, other operations, generally referred to as pre/post-processing, are performed as well. Strictly speaking, these operations are not part of the actual inference and are therefore measured separately.

In order to have reproducible and reliable results, some measures were taken:

  • When possible, the inference was repeated several times and the average execution time was computed (see the timing sketch after this list).
  • All the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
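
As an illustration of the first point, the measurement can be implemented with a simple timing loop around the inference call, so that pre/post-processing is excluded; the snippet below is only a sketch and the iteration count is arbitrary.

// Sketch: repeat the inference and compute the average execution time.
// Only Invoke() is timed; pre/post-processing is measured separately.
#include <chrono>
#include <iostream>

constexpr int kRuns = 100;                 // arbitrary number of repetitions
const auto start = std::chrono::steady_clock::now();
for (int i = 0; i < kRuns; ++i)
  interpreter->Invoke();
const auto stop = std::chrono::steady_clock::now();
std::cout << "Average inference time: "
          << std::chrono::duration<double, std::milli>(stop - start).count() / kRuns
          << " ms" << std::endl;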

Version 1

The following sections detail the execution of the first version of the classifier on the embedded platform. The number of threads was also tweaked in order to test different configurations. During the execution, the well-known htop utility was used to monitor the system. This tool is very convenient for retrieving useful information such as core allocation, processor load, and the number of running threads.

Floating-point model

TBD
Tweaking the number of threads

The following screenshots show the system status while executing the application with different values of the thread parameter.

[IMAGE "Thread parameter unspecified"]

[IMAGE "Thread parameter set to 1"]

[IMAGE "Thread parameter set to 2"]

Half-quantized model

TBD

The following screenshot shows the system status while executing the application. In this case, the thread parameter was unspecified.

[IMAGE "Thread parameter unspecified"]

Fully-quantized model

TBD
Tweaking the number of threads

The following screenshots show the system status while executing the application with different values of the thread parameter.

[IMAGE "Thread parameter unspecified"]

[IMAGE "Thread parameter set to 4"]

Version 2

TBD

- "The first execution of model inference using the NN API always takes many times longer, because of model graph initialization needed by the GPU/ML module"

- Profiling?
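
Because of this behavior, a warm-up inference is typically executed and excluded from the measurements. A minimal sketch, assuming the same timing loop shown earlier, follows.

// Sketch: discard the first NN API inference, which includes the model graph
// initialization performed by the GPU/ML module, before the timed loop.
interpreter->Invoke();   // warm-up run, not measured
// ... timed loop over kRuns invocations as in the earlier sketch ...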

Version 3

Results