ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 4


Applies to: Machine Learning
Status: Work in progress


History

Version   Date             Notes
1.0.0     September 2020   First public release

Introduction

This Technical Note (TN for short) belongs to the series introduced here. In particular, it illustrates the execution, on the NXP i.MX8M Plus EVK, of different versions of an inference application (a fruit classifier) that makes use of the model described in this section. It also compares the results achieved with those produced by the platforms considered in the previous articles of this series.

Specifically, the following versions of the application were tested:

  • Version 1: This version is the same as the one described in this article. As such, inference is implemented in software and is applied to images retrieved from files.
  • Version 2: This version is functionally equivalent to version 1, but it leverages the Neural Processing Unit (NPU) to hardware-accelerate the inference.
  • Version 3: This is like version 2, but the inference is applied to the frames captured from an image sensor.

Test Bed

The kernel and the root file system of the tested platform were built with the L5.4.24_2.1.0 release of the Yocto Board Support Package (BSP) for the i.MX 8 family of devices. They were built with support for eIQ: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".

The following table details the relevant specs of the test bed.

NXP Linux BSP release                  L5.4.24_2.1.0
Inference engine                       TensorFlow Lite 2.1
Maximum ARM cores frequency [MHz]      1800
SDRAM memory frequency (LPDDR4) [MHz]  TBD
Governor                               ondemand

Model deployment and inference applications

Version 1

The C++ application previously used and described here was adapted to work with the new NXP Linux BSP release. It now uses OpenCV 4.2.0 to pre-process the input image and TensorFlow Lite (TFL) 2.1 as the inference engine. It still supports all three TFL models previously tested on the Mito8M SoM (a minimal sketch of such an application is shown after the following list):

  • 32-bit floating-point model;
  • half-quantized model (post-training 8-bit quantization of the weights only);
  • fully-quantized model (TensorFlow v1 quantization-aware training and 8-bit quantization of the weights and activations).
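For reference, the following minimal sketch illustrates how such a file-based classifier can be structured with OpenCV and the TFL 2.1 C++ API. The 224x224 RGB floating-point input, the [0, 1] normalization, and the command-line interface are illustrative assumptions and do not necessarily match the actual application.

// Minimal sketch of a file-based classifier (illustrative, not the actual application).
// Assumptions: 224x224 RGB floating-point input normalized to [0, 1]; model and image
// paths are passed on the command line.
#include <cstdio>
#include <cstring>
#include <memory>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main(int argc, char* argv[]) {
  if (argc < 3) {
    std::fprintf(stderr, "usage: %s <model.tflite> <image>\n", argv[0]);
    return 1;
  }

  // Load the TFL flatbuffer model and build the interpreter.
  auto model = tflite::FlatBufferModel::BuildFromFile(argv[1]);
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->AllocateTensors();

  // Pre-processing with OpenCV: load, resize, convert to RGB and normalize.
  cv::Mat img = cv::imread(argv[2]);
  cv::resize(img, img, cv::Size(224, 224));
  cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
  img.convertTo(img, CV_32FC3, 1.0 / 255);

  // Copy the pre-processed image into the input tensor and run the inference
  // (in version 1 the inference runs entirely in software on the Arm cores).
  std::memcpy(interpreter->typed_input_tensor<float>(0),
              img.data, img.total() * img.elemSize());
  interpreter->Invoke();

  // Post-processing: pick the class with the highest score.
  const TfLiteTensor* output = interpreter->tensor(interpreter->outputs()[0]);
  const float* scores = interpreter->typed_output_tensor<float>(0);
  int num_classes = output->dims->data[output->dims->size - 1];
  int best = 0;
  for (int i = 1; i < num_classes; ++i)
    if (scores[i] > scores[best]) best = i;
  std::printf("Best class: %d (score %.3f)\n", best, scores[best]);
  return 0;
}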

Version 2

The version 1 application was then modified to accelerate the inference using the NPU (ML module) of the i.MX8M Plus SoC, as sketched below. This is possible because "the TensorFlow Lite library uses the Android NN API driver implementation from the GPU/ML module driver for running inference using the GPU/ML module".
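As an indication of how this is typically done, the following hedged sketch binds a TensorFlow Lite interpreter to the NN API delegate; whether the actual application uses this exact call sequence is an assumption.

// Sketch: binding the TFL interpreter to the NN API delegate so that supported
// operations are offloaded to the GPU/ML module (unsupported ones fall back to the CPU).
// The exact call sequence used by the actual application is an assumption.
#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"

bool enable_npu(tflite::Interpreter* interpreter) {
  // The delegate must outlive the interpreter; a static instance is used here for brevity.
  static tflite::StatefulNnApiDelegate nnapi_delegate;
  return interpreter->ModifyGraphWithDelegate(&nnapi_delegate) == kTfLiteOk;
}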

Neither the floating-point nor the half-quantized models work in NPU. Moreover, "the GPU/ML module driver does not support per-channel quantization yet. Therefore post-training quantization of models with TensorFlow v2 cannot be used if the model is supposed to run on the GPU/ML module (inference on CPU does not have this limitation). TensorFlow v1 quantization-aware training and model conversion is recommended in this case".

Therefore, only the fully-quantized model was tested with the version 2 application.

Version 3

A new C++ application was written to apply the inference to the frames captured from an image sensor (OV5640) instead of to images retrieved from files. Like version 2, inference runs on the NPU, so only the fully-quantized model was tested with the version 3 application.

Note that with this image sensor, the frame rate is capped at 30 fps.
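The following sketch outlines such a capture-and-classify loop; the capture device index, the input resolution, and the uint8 input tensor layout are assumptions made for illustration.

// Sketch of the capture-and-classify loop of a camera-based classifier.
// Assumptions: the OV5640 is exposed as a V4L2 capture device opened with index 0,
// and the fully-quantized model expects a 224x224 RGB uint8 input tensor.
#include <cstdint>
#include <cstring>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"

void capture_loop(tflite::Interpreter* interpreter) {
  cv::VideoCapture cap(0);            // V4L2 capture device (index assumed)
  cv::Mat frame;
  while (cap.read(frame)) {           // the OV5640 caps the frame rate at 30 fps
    // Pre-processing: resize and convert the captured frame to the model input format.
    cv::resize(frame, frame, cv::Size(224, 224));
    cv::cvtColor(frame, frame, cv::COLOR_BGR2RGB);
    std::memcpy(interpreter->typed_input_tensor<uint8_t>(0),
                frame.data, frame.total() * frame.elemSize());
    // Inference: runs on the NPU when the NN API delegate has been applied.
    interpreter->Invoke();
    // Post-processing of the output tensor (classification result) goes here.
  }
}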

Running the applications

As stated in the first article of this series, one of the goals is to evaluate the performance of the inference applications. Before and after the inference itself, other operations, generally referred to as pre- and post-processing, are performed. Strictly speaking, these operations are not part of the actual inference and are therefore measured separately.

In order to have reproducible and reliable results, some measures were taken:

  • When possible, the inference was repeated several times and the average execution time was computed (see the timing sketch after this list).
  • All the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
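As an illustration of this methodology, the sketch below times the pre-processing and the inference separately with std::chrono and averages the inference over an arbitrary number of repetitions; it is not the actual measurement code.

// Sketch: timing pre-processing and inference separately, with the inference time
// averaged over several repetitions (the repetition count is arbitrary here).
#include <chrono>
#include <cstdio>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"

void time_classification(tflite::Interpreter* interpreter, const cv::Mat& image,
                         int repetitions = 10) {
  using clock = std::chrono::steady_clock;
  using ms = std::chrono::duration<double, std::milli>;

  // Pre-processing is timed on its own...
  auto t0 = clock::now();
  cv::Mat resized;
  cv::resize(image, resized, cv::Size(224, 224));
  auto t1 = clock::now();
  std::printf("pre-processing: %.2f ms\n", ms(t1 - t0).count());

  // ...while the inference time is averaged over several runs.
  double total = 0.0;
  for (int i = 0; i < repetitions; ++i) {
    auto start = clock::now();
    interpreter->Invoke();
    total += ms(clock::now() - start).count();
  }
  std::printf("inference (average): %.2f ms\n", total / repetitions);
}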

Version 1

The following sections detail the execution of the first version of the classifier on the embedded platform. The number of threads was also tweaked in order to test different configurations (a minimal sketch of how this parameter is set is shown below). During the execution, the well-known htop utility was used to monitor the system. This tool is very convenient for retrieving useful information such as core allocation, processor load, and the number of running threads.
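For reference, the following minimal sketch shows how the thread parameter can be applied through the TFL C++ API; the parameter handling of the actual application is not necessarily identical.

// Sketch: setting the number of threads used by the TFL interpreter for software inference.
// When the parameter is left unspecified, the interpreter picks its own default.
#include "tensorflow/lite/interpreter.h"

void configure_threads(tflite::Interpreter* interpreter, int num_threads) {
  if (num_threads > 0)
    interpreter->SetNumThreads(num_threads);  // e.g. 1, 2, or 4 in the tests below
  // A value <= 0 leaves the default behavior unchanged.
}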

Floating-point model

TBD
Tweaking the number of threads

The following screenshots show the system status while executing the application with different values of the thread parameter.

[IMAGE "Thread parameter unspecified"]

[IMAGE "Thread parameter set to 1"]

[IMAGE "Thread parameter set to 2"]

Half-quantized model

TBD

The following screenshot shows the system status while executing the application. In this case, the thread parameter was unspecified.

[IMAGE "Thread parameter unspecified"]

Fully-quantized model

TBD
Tweaking the number of threads

The following screenshots show the system status while executing the application with different values of the thread parameter.

[IMAGE "Thread parameter unspecified"]

[IMAGE "Thread parameter set to 4"]

Version 2

The execution of the second version of the classifier on the embedded platform is detailed below. During the execution, htop was used to monitor the system. Note that "the first execution of model inference using the NN API always takes many times longer, because of model graph initialization needed by the GPU/ML module". Therefore, the time needed for the first inference (warm-up) is measured separately (see the sketch below).
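A possible way to separate the warm-up measurement is sketched below; the repetition count and the reporting format are arbitrary choices made for illustration.

// Sketch: measuring the warm-up (first) inference separately from the averaged
// steady-state inferences; the repetition count is arbitrary.
#include <chrono>
#include <cstdio>

#include "tensorflow/lite/interpreter.h"

void measure_with_warmup(tflite::Interpreter* interpreter, int repetitions = 10) {
  using clock = std::chrono::steady_clock;
  using ms = std::chrono::duration<double, std::milli>;

  // First inference: includes the model graph initialization on the GPU/ML module.
  auto t0 = clock::now();
  interpreter->Invoke();
  std::printf("warm-up: %.1f ms\n", ms(clock::now() - t0).count());

  // Subsequent inferences: averaged to obtain the steady-state figure.
  double total = 0.0;
  for (int i = 0; i < repetitions; ++i) {
    auto start = clock::now();
    interpreter->Invoke();
    total += ms(clock::now() - start).count();
  }
  std::printf("average: %.1f ms\n", total / repetitions);
}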

TBD

The following screenshot shows the system status while executing the application.

[IMAGE]

Profiling model execution on NPU

The following block shows the profiler log. "The log captures detailed information of the execution clock cycles and DDR data transmission in each layer". Note that the inference takes longer than usual when profiling is enabled, because of the added overhead.

TBD

$ export CNN_PERF=1 NN_EXT_SHOW_PERF=1 VIV_VX_DEBUG_LEVEL=1 VIV_VX_PROFILE=1
$ build/image_classifier_cv ... > viv_test_app_profile.log 2>&1

Version 3

Results