ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 2

Applies to: Machine Learning, MITO 8M


==History==

{| class="wikitable"
!Version
!Date
!Notes
|-
|1.0.0
|October 2020
|First public release
|}

==Introduction==

This Technical Note (TN for short) belongs to the series introduced [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1|here]]. Specifically, it illustrates the execution of [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|this inference application (fruit classifier)]] on the [[:Category:Mito8M|Mito8M SoM]], a system-on-module based on the NXP [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-8m-family-armcortex-a53-cortex-m4-audio-voice-video:i.MX8M i.MX8M SoC].

=== Test bed ===

The kernel and the root file system of the tested platform were built with the L4.14.98_2.0.0 release of the Yocto Board Support Package for the i.MX 8 family of devices. They were built with support for [https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ eIQ]: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".

The following table details the relevant specs of the test bed.

{| class="wikitable" style="margin: auto;"
|-
|'''NXP Linux BSP release'''
|L4.14.98_2.0.0
|-
|'''Inference engine'''
|TensorFlow Lite 1.12
|-
|'''Maximum ARM cores frequency [MHz]'''
|1300
|-
|'''SDRAM memory frequency (LPDDR4) [MHz]'''
|1600
|-
|'''[https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt Governor]'''
|ondemand
|}

==Model deployment and inference application==

To run the model on the target, a new C++ application was written. After debugging this application on a host PC, it was migrated to the edge device where it was built natively. The root file system for eIQ, in fact, provides the native C++ compiler as well.

The application uses OpenCV 4.0.1 to pre-process the input image and TensorFlow Lite (TFL) 1.12 as the inference engine. The model, originally created and trained with Keras of TensorFlow (TF) 1.15, was therefore converted into the TFL format.
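As a reference, the core of such an application comes down to a handful of TFL and OpenCV calls. The following minimal sketch is illustrative only: the file names are hypothetical, error handling is omitted, and it assumes the standard TensorFlow Lite 1.12 C++ API (whose headers were still located under <code>tensorflow/contrib/lite</code> at that version); it is not the exact source code of the application used for the measurements.

<pre>
// Illustrative sketch of the inference flow (hypothetical file names).
// Assumes the TensorFlow Lite 1.12 C++ API and OpenCV; error handling omitted.
#include <cstdio>
#include <cstring>
#include <memory>

#include <opencv2/opencv.hpp>

#include "tensorflow/contrib/lite/kernels/register.h"  // tensorflow/lite/... in newer releases
#include "tensorflow/contrib/lite/model.h"

int main() {
  // 1. Load the converted .tflite model and build the interpreter.
  auto model = tflite::FlatBufferModel::BuildFromFile("my_converted_model.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->AllocateTensors();

  // 2. Pre-process the input image with OpenCV: BGR->RGB, resize to the
  //    224x224 network input resolution, scale pixel values to the 0-1 range.
  cv::Mat image = cv::imread("testdata/red-apple1.jpg");
  cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
  cv::resize(image, image, cv::Size(224, 224));
  cv::Mat input;
  image.convertTo(input, CV_32FC3, 1.0 / 255.0);

  // 3. Fill the (float) input tensor and run the inference.
  float* input_tensor = interpreter->typed_input_tensor<float>(0);
  std::memcpy(input_tensor, input.ptr<float>(0), input.total() * input.elemSize());
  interpreter->Invoke();

  // 4. Read the softmax output: one score per class
  //    (Red Apple, Orange, Avocado, Hand, Banana).
  const float* scores = interpreter->typed_output_tensor<float>(0);
  for (int i = 0; i < 5; ++i)
    std::printf("class %d: %g\n", i, scores[i]);

  return 0;
}
</pre>

The same flow applies to the quantized models as well; for the fully-quantized one, the data type of the input and output tensors changes, as discussed below.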

Then, the same model was recreated and retrained with Keras of TF 1.12. This made it possible to convert it to TFL with post-training quantization of the weights without compatibility issues with the target inference engine version.

After that, it was also recreated and retrained with quantization-aware training of TF 1.15. In this way, a fully quantized model was obtained after conversion.

So, in the end, three converted models were obtained: a regular 32-bit floating-point one, an 8-bit half-quantized (only the weights, not the activations) one, and a fully-quantized one.


The following images show the graphs of the models before conversion (click to enlarge):

[[File:ML - Keras1.15 fruitsmodel.png|thumb|center|Originally created model (Keras of TF 1.15)]]

[[File:ML - Keras1.12 fruitsmodel.png|thumb|center|Recreated model (Keras of TF 1.12)]]

[[File:ML - TF1.15QAT fruitsmodel.png|thumb|center|Quantization-aware trained model (TF 1.15)]]

The following images show the graphs of the models after conversion (click to enlarge):

[[File:ML - TFL float fruitsmodel.png|thumb|center|Floating-point model (TFL)]]

[[File:ML - TFL halfquant fruitsmodel.png|thumb|center|Half-quantized model (TFL)]]

[[File:ML - TFL QAT fruitsmodel.png|thumb|center|Fully-quantized model (TFL)]]

==Running the application==

In order to have reproducible and reliable results, some measures were taken:

* The inference was repeated several times and the average execution time was computed.
* All the files required to run the test—the executable, the image files, etc.—were stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.

The following sections detail the execution of the classifier on the embedded platform. The [https://www.tensorflow.org/lite/performance/best_practices#tweak_the_number_of_threads number of threads] was also tweaked in order to test different configurations. During the execution, the well-known [https://en.wikipedia.org/wiki/Htop <code>htop</code>] utility was used to monitor the system. This tool is very convenient for retrieving useful information such as core allocation, processor load, and the number of running threads.
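The thread parameter is assumed to be forwarded to the TFL interpreter, and the execution times to be measured roughly as in the following hypothetical excerpt (standard <code>SetNumThreads</code> API and <code>std::chrono</code>; not the exact code of the application):

<pre>
// Hypothetical excerpt: apply the optional thread parameter and compute the
// average inference time over a few repetitions.
#include <chrono>
#include <cstdio>

#include "tensorflow/contrib/lite/interpreter.h"  // tensorflow/lite/... in newer releases

void run_inference(tflite::Interpreter* interpreter, int num_threads, int repetitions) {
  if (num_threads > 0)
    interpreter->SetNumThreads(num_threads);  // if unspecified, keep the TFL default

  double total_ms = 0.0;
  for (int i = 0; i < repetitions; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    interpreter->Invoke();
    auto t1 = std::chrono::steady_clock::now();
    total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
  }
  std::printf("Average inference time: %.3f ms\n", total_ms / repetitions);
}
</pre>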

=== Floating-point model ===

<pre class="board-terminal">
root@imx8qmmek:~/devel/image_classifier_eIQ# ./image_classifier_cv 2 my_converted_model.tflite labels.txt testdata/red-apple1.jpg
Number of threads: undefined
Warmup time: 233.403 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 1
Input tensor name: conv2d_8_input
Selected order of channels: RGB
Selected pixel values range: 0-1
Filling time: 1.06354 ms
Inference time 1: 219.723 ms
Inference time 2: 220.512 ms
Inference time 3: 221.897 ms
Average inference time: 220.711 ms
Total prediction time: 221.774 ms
Output tensor index: 0
Output tensor name: Identity
Top results:
 1      Red Apple
 1.13485e-10    Orange
 5.58774e-18    Avocado
 7.49395e-20    Hand
 1.40372e-22    Banana
</pre>

==== Tweaking the number of threads ====

The following screenshots show the system status while executing the application with different values of the thread parameter.


[[File:ML-TN-001 2 float default.png|thumb|center|600px|Thread parameter unspecified]]

[[File:ML-TN-001 2 float 1thread.png|thumb|center|600px|Thread parameter set to 1]]

[[File:ML-TN-001 2 float 2threads.png|thumb|center|600px|Thread parameter set to 2]]

=== Half-quantized model ===

<pre class="board-terminal">
root@imx8qmmek:~/devel/image_classifier_eIQ# ./image_classifier_cv 2 my_fruits_model_1.12_quant.tflite labels.txt testdata/red-apple1.jpg
Number of threads: undefined
Warmup time: 328.374 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 12
Input tensor name: conv2d_input
Selected order of channels: RGB
Selected pixel values range: 0-1
Filling time: 1.10302 ms
Inference time 1: 322.839 ms
Inference time 2: 322.694 ms
Inference time 3: 339.768 ms
Average inference time: 328.434 ms
Total prediction time: 329.537 ms
Output tensor index: 18
Output tensor name: dense_1/Softmax
Top results:
 1      Red Apple
 1.53349e-07    Orange
 1.67772e-15    Avocado
 7.44711e-18    Banana
 2.47029e-18    Hand
</pre>


The following screenshot shows the system status while executing the application. In this case, the thread parameter was unspecified.

[[File:ML-TN-001 2 weightsquant default.png|thumb|center|600px|Thread parameter unspecified]]

=== Fully-quantized model ===

<pre class="board-terminal">
root@imx8qmmek:~/devel/image_classifier_eIQ# ./image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg
Number of threads: undefined
Warmup time: 201.551 ms
Original image size: 600x600x3
Cropped image size: 600x600x3
Resized image size: 224x224x3
Input tensor index: 14
Input tensor name: conv2d_input
Selected order of channels: RGB
Selected pixel values range: NA
Filling time: 0.45083 ms
Inference time 1: 198.342 ms
Inference time 2: 199.043 ms
Inference time 3: 198.543 ms
Average inference time: 198.643 ms
Total prediction time: 199.093 ms
Output tensor index: 5
Output tensor name: activation_5/Softmax
Top results:
 1      Red Apple
</pre>
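Note the <code>Selected pixel values range: NA</code> field and the shorter filling time: the fully-quantized model expects an 8-bit input tensor, so the resized image can be copied as-is, without the conversion to floating point and the 0-1 scaling performed for the other models. A minimal sketch of this difference, under the same assumptions as the previous snippets (hypothetical helper, not the actual application code):

<pre>
// Hypothetical excerpt: filling the 8-bit input tensor of the fully-quantized model.
#include <cstdint>
#include <cstring>

#include <opencv2/core.hpp>

#include "tensorflow/contrib/lite/interpreter.h"

// "image" is the 224x224 RGB cv::Mat produced by the OpenCV pre-processing,
// still in CV_8UC3 format (raw 0-255 pixel values).
void fill_quantized_input(tflite::Interpreter* interpreter, const cv::Mat& image) {
  uint8_t* input_tensor = interpreter->typed_input_tensor<uint8_t>(0);
  std::memcpy(input_tensor, image.ptr<uint8_t>(0), image.total() * image.elemSize());
}
</pre>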

==== Tweaking the number of threads ====

The following screenshots show the system status while executing the application with different values of the thread parameter.

[[File:ML-TN-001 2 fullquant default.png|thumb|center|600px|Thread parameter unspecified]]

[[File:ML-TN-001 2 fullquant 4threads.png|thumb|center|600px|Thread parameter set to 4]]

== Results ==

The following table lists the prediction times for a single image depending on the model and the thread parameter.

{| class="wikitable" style="margin: auto;"
|+Prediction times
!Model
!Threads parameter
!Prediction time [ms]
!Notes
|-
| rowspan="3" |'''Floating-point'''
|unspecified
|220
|
|-
|1
|220
|
|-
|2
|390
|
|-
|'''Half-quantized'''
|unspecified
|330
|
|-
| rowspan="2" |'''Fully-quantized'''
|unspecified
|200
|Four threads are created besides the main process (presumably, this number is set according to the number of physical cores available). Nevertheless, they seem to be constantly in the sleep state.
|-
|4
|84
|Interestingly, seven processes are created besides the main one. Four of them, however, seem to be constantly in the sleep state.
|}

The prediction time '''takes into account the time needed to fill the input tensor with the image'''. Furthermore, it is averaged over several predictions.

The same tests were repeated using a network file system (NFS) over an Ethernet connection, too. No significant variations in the prediction times were observed.

In conclusion, to maximize the performance in terms of execution time, the model has to be fully-quantized and the number of threads has to be specified explicitly.