ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 4

Applies to: Machine Learning
Status: work in progress


History

Version | Date           | Notes
1.0.0   | September 2020 | First public release

Introduction

This Technical Note (TN for short) belongs to the series introduced here. In particular, it illustrates the execution, on the NXP i.MX8M Plus EVK, of different versions of an inference application (a fruit classifier) that makes use of the model described in this section. In addition, this document compares the results achieved with those produced by the platforms considered in the previous articles of this series.

Specifically, the following versions of the application were tested:

  • Version 1: This version is the same as the one described in this article. As such, inference is implemented in software and is applied to images retrieved from files.
  • Version 2: This version is functionally equivalent to version 1, but it leverages the Neural Processing Unit (NPU) to hardware-accelerate the inference.
  • Version 3: This is like version 2, but the inference is applied to frames captured from an image sensor.

Test Bed

The kernel and the root file system of the tested platform were built with the L5.4.24_2.1.0 release of the Yocto Board Support Package (BSP) for the i.MX 8 family of devices. They were built with support for eIQ: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".

The following table details the relevant specs of the test bed.

NXP Linux BSP release                 | L5.4.24_2.1.0
Inference engine                      | TensorFlow Lite 2.1
Maximum ARM cores frequency [MHz]     | 1800
SDRAM memory frequency (LPDDR4) [MHz] | TBD
Governor                              | ondemand

Model deployment and inference applications

Version 1

The C++ application previously used and described here was adapted to work with the new NXP Linux BSP release. It now uses OpenCV 4.2.0 to pre-process the input image and TensorFlow Lite (TFL) 2.1 as the inference engine. It still supports all three TFL models previously tested on the Mito8M SoM (a minimal sketch of this flow follows the list below):

  • 32-bit floating-point model;
  • half-quantized model (post-training 8-bit quantization of the weights only);
  • fully-quantized model (TensorFlow v1 quantization-aware training and 8-bit quantization of the weights and activations).
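The snippet below is only an illustrative sketch of that flow, not the actual application: the model and image file names, the 224x224 input size, the [0, 1] normalization, the thread count, and the number of classes are all assumptions.

<pre>
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <memory>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load the 32-bit floating-point model (the file name is a placeholder).
  auto model = tflite::FlatBufferModel::BuildFromFile("fruit_classifier_float.tflite");
  if (!model) return 1;

  // Build the interpreter and set the number of CPU threads under test.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->SetNumThreads(4);
  interpreter->AllocateTensors();

  // Pre-processing with OpenCV: resize to the assumed 224x224 input size,
  // convert BGR to RGB and scale the pixels to [0, 1].
  cv::Mat img = cv::imread("apple.jpg");
  cv::resize(img, img, cv::Size(224, 224));
  cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
  img.convertTo(img, CV_32FC3, 1.0 / 255.0);

  // Copy the pre-processed pixels into the input tensor and run the inference.
  std::memcpy(interpreter->typed_input_tensor<float>(0),
              img.ptr<float>(0), 224 * 224 * 3 * sizeof(float));
  if (interpreter->Invoke() != kTfLiteOk) return 1;

  // Post-processing: pick the class with the highest score (class count assumed).
  const int kNumClasses = 6;
  const float* scores = interpreter->typed_output_tensor<float>(0);
  int best = static_cast<int>(std::max_element(scores, scores + kNumClasses) - scores);
  std::printf("Predicted class index: %d\n", best);
  return 0;
}
</pre>

The same structure applies to the quantized models; for the fully-quantized one the input and output tensors are 8-bit, so the tensor accessors and the pre-processing change accordingly.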

Version 2

The version 1 application was then modified to accelerate the inference using the NPU of the i.MX8M Plus SoC.

Neither the floating-point model nor the half-quantized model works on the NPU (ML module). Moreover, "the GPU/ML module driver does not support per-channel quantization yet. Therefore post-training quantization of models with TensorFlow v2 cannot be used if the model is supposed to run on the GPU/ML module (inference on CPU does not have this limitation). TensorFlow v1 quantization-aware training and model conversion is recommended in this case".

So, only the fully-quantized model was tested with the version 2 application.
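Assuming the NNAPI delegate is the path to the GPU/ML module (as the NN API quote later in this article suggests), the sketch below shows how a TFLite interpreter can be bound to that delegate. It is an illustrative sketch, not the code of the version 2 application, and the delegate options are left at their defaults.

<pre>
#include <memory>

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Given an already loaded model (which must outlive the returned interpreter),
// builds an interpreter and hands the graph to the NNAPI delegate.
std::unique_ptr<tflite::Interpreter> BuildNpuInterpreter(const tflite::FlatBufferModel& model) {
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(model, resolver)(&interpreter);
  if (!interpreter) return nullptr;

  // Supported operators are executed through the GPU/ML module driver;
  // unsupported ones fall back to the CPU. The delegate must outlive the
  // interpreter, hence the static storage duration in this sketch.
  static tflite::StatefulNnApiDelegate nnapi_delegate;
  if (interpreter->ModifyGraphWithDelegate(&nnapi_delegate) != kTfLiteOk)
    return nullptr;

  if (interpreter->AllocateTensors() != kTfLiteOk) return nullptr;
  return interpreter;
}
</pre>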

Version 3

A new C++ application was written to apply the inference to frames captured from an image sensor instead of images retrieved from files. Like version 2, the inference runs on the NPU, so only the fully-quantized model was tested with the version 3 application.
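The loop below is a minimal sketch of such a per-frame pipeline; the OpenCV capture device (index 0), the 224x224 input size, and the 8-bit input tensor are assumptions, and the actual application may acquire frames through a different path (e.g. a GStreamer pipeline).

<pre>
#include <cstdint>
#include <cstring>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"

// Grabs frames from the camera and classifies each one with an already
// initialized interpreter (fully-quantized model, uint8 input assumed).
void RunOnCamera(tflite::Interpreter* interpreter) {
  cv::VideoCapture cap(0);  // assumed V4L2 capture device /dev/video0
  if (!cap.isOpened()) return;

  cv::Mat frame, rgb;
  while (cap.read(frame)) {
    // Same pre-processing as the file-based versions, applied to the live frame.
    cv::resize(frame, rgb, cv::Size(224, 224));
    cv::cvtColor(rgb, rgb, cv::COLOR_BGR2RGB);

    std::memcpy(interpreter->typed_input_tensor<uint8_t>(0), rgb.data, 224 * 224 * 3);
    if (interpreter->Invoke() != kTfLiteOk) break;

    // ... post-processing / display of the predicted class goes here ...
  }
}
</pre>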

Running the applications

As stated in the first article of this series, one of the goals is to evaluate the performance of the inference applications. Before and after the inference itself, other operations, generally referred to as pre/post-processing, are performed; technically, they are not part of the actual inference and are therefore measured separately.

In order to have reproducible and reliable results, the following measures were taken:

  • When possible, the inference was repeated several times and the average execution time was computed (a minimal timing sketch is shown after this list).
  • All the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
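The helper below sketches how such measurements can be taken: pre-processing is timed separately from the inference, and the inference time is averaged over repeated runs. The run count and the 224x224 resize are arbitrary illustrative values.

<pre>
#include <chrono>
#include <cstdio>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"

// Times pre-processing and inference separately; the inference is repeated
// several times and averaged to smooth out scheduling jitter.
void TimeOneImage(tflite::Interpreter* interpreter, const cv::Mat& image, int runs = 100) {
  using clock = std::chrono::steady_clock;

  // Pre-processing is measured on its own: it is not part of the inference figure.
  auto p0 = clock::now();
  cv::Mat resized;
  cv::resize(image, resized, cv::Size(224, 224));
  auto p1 = clock::now();

  // The inference is repeated and the average execution time is reported.
  auto t0 = clock::now();
  for (int i = 0; i < runs; ++i) interpreter->Invoke();
  auto t1 = clock::now();

  std::printf("pre-processing: %.3f ms, inference (avg of %d runs): %.3f ms\n",
              std::chrono::duration<double, std::milli>(p1 - p0).count(), runs,
              std::chrono::duration<double, std::milli>(t1 - t0).count() / runs);
}
</pre>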

Version 1

The following sections detail the execution of the first version of the classifier on the embedded platform. The number of threads was also tweaked in order to test different configurations. During the execution, the well-known htop utility was used to monitor the system. This tool is very convenient for gathering useful information such as core allocation, processor load, and the number of running threads.

Floating-point model
TBD


Version 2

"The first execution of model inference using the NN API always takes many times longer, because of model graph initialization needed by the GPU/ML module"

Version 3

Results