{{InfoBoxTop}}
{{AppliesToMachineLearning}}
{{AppliesTo Machine Learning TN}}
{{InfoBoxBottom}}
 
[[File:TBD.png|thumb|center|200px|Work in progress]]
__FORCETOC__
==History==
{| class="wikitable" style="margin: auto;"
!Version
!Date
!Notes
|-
|1.0.0
|September 2020
|First public release
|-
|1.1.0
|November 2020
|Added application written in Python (version 2B)
|}
==Introduction==
This Technical Note (TN for short) belongs to the series introduced [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1|here]].
In particular, it illustrates the execution of different versions of an inference application (fruit classifier) that makes use of the model described in [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|this section]], when executed on the [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-8m-plus-arm-cortex-a53-machine-learning-vision-multimedia-and-industrial-iot:IMX8MPLUS NXP i.MX8M Plus EVK] board. In addition, this document compares the results achieved to the ones produced by the i.MX8M-powered [[:Category:Mito8M|Mito8M SoM]] detailed [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 2|here]].
Specifically, the following versions of the application were tested:
* Version 1: This version is the same described in [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 2|this article]]. As such, inference is implemented in software and is applied to images retrieved from files.
* Version 2A: This version is functionally equivalent to version 1, but it leverages the Neural Processing Unit (NPU) to hardware-accelerate the inference.
* Version 2B: This is a Python alternative to version 2A.
* Version 3: This is like version 2A, but the inference is applied to the frames captured live from an image sensor.
=== Testbed ===
The kernel and the root file system of the tested platform were built with the L5.4.24_2.1.0 release of the Yocto Board Support Package (BSP) for the i.MX 8 family of devices. They were built with support for [https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ eIQ]: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".
The following table details the relevant specs of the testbed.
{| class="wikitable" style="margin: auto;"
== Model deployment and inference applications ==
=== Version 1 (C++) ===
The C++ application previously used and described [https://wiki.dave.eu/index.php/ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_2#Model_deployment_and_inference_application here] was adapted to work with the new NXP Linux BSP release. It now uses OpenCV 4.2.0 to pre-process the input image and TensorFlow Lite (TFL) 2.1 as the inference engine. It still supports all the 3 TFL models previously tested on the [[:Category:Mito8M|Mito8M SoM]]:
* 32-bit floating-point model;
* half-quantized model (post-training 8-bit quantization of the weights only);
* fully-quantized model (TensorFlow v1 quantization-aware training and 8-bit quantization of the weights and activations).
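For illustration purposes only, the following Python sketch reproduces the same processing flow implemented by the C++ application: OpenCV pre-processing followed by TensorFlow Lite inference on the CPU. It assumes the <code>tflite_runtime</code> Python package is available; the file names are the ones used in the dumps below, while the input size and normalization are assumptions that have to match the actual model.
<pre>
import cv2
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the TFL model and allocate the tensors
interpreter = tflite.Interpreter(model_path="my_converted_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Pre-processing with OpenCV: read the image, convert BGR to RGB and
# resize it to the input size expected by the model (e.g. 224x224)
img = cv2.imread("testdata/red-apple1.jpg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (int(inp['shape'][2]), int(inp['shape'][1])))
if inp['dtype'] == np.float32:
    # 32-bit floating-point model: [0, 1] normalization is assumed here
    img = img.astype(np.float32) / 255.0
data = np.expand_dims(img, axis=0)

# Run the inference and print the most likely class
interpreter.set_tensor(inp['index'], data)
interpreter.invoke()
scores = interpreter.get_tensor(out['index'])[0]
labels = [line.strip() for line in open("labels.txt")]
print(labels[int(np.argmax(scores))])
</pre>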
=== Version 2A (C++) ===
The version 1 application was then modified to accelerate the inference using the NPU (ML module) of the [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-8m-plus-arm-cortex-a53-machine-learning-vision-multimedia-and-industrial-iot:IMX8MPLUS i.MX8M Plus SoC]. This is possible because "''the TensorFlow Lite library uses the Android NN API driver implementation from the GPU/ML module driver for running inference using the GPU/ML module''".
Neither the floating-point nor the half-quantized models work with the NPU, however. Moreover, "''the GPU/ML module driver does not support per-channel quantization yet. Therefore post-training quantization of models with TensorFlow v2 cannot be used if the model is supposed to run on the GPU/ML module (inference on CPU does not have this limitation). TensorFlow v1 quantization-aware training and model conversion is recommended in this case''". Therefore, only the fully-quantized model was tested with this version of the application.
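As a reference only, the following sketch outlines what such a TensorFlow v1 conversion step can look like when starting from a frozen graph produced by quantization-aware training. The graph file and tensor names are placeholders, and the quantization statistics have to match the pre-processing actually used during training.
<pre>
import tensorflow as tf

# Placeholders: frozen graph and tensor names produced by TFv1 quantization-aware training
graph_def_file = "frozen_fruits_model_qat.pb"
input_arrays = ["input"]
output_arrays = ["predictions"]

converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file, input_arrays, output_arrays)
# Request 8-bit (per-tensor) quantized inference: weights and activations are
# already annotated by the fake-quantization nodes inserted during training
converter.inference_type = tf.uint8
converter.quantized_input_stats = {input_arrays[0]: (127.5, 127.5)}  # (mean, std_dev)

with open("my_fruits_model_qatlegacy.tflite", "wb") as f:
    f.write(converter.convert())
</pre>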
=== Version 2B (Python) ===
The version 2A application was then ported to Python. This Python version is functionally equivalent to version 2A, which is written in C++. The goal of version 2B is to make a comparison in terms of performance with respect to version 2A. Generally, Python has the advantage of being easier to work with, but at the cost of being slower to execute. However, in this case, '''regarding the inference computation''', the performance is '''pretty much the same between the two versions'''. This is because the Python APIs act only as a wrapper around the core TensorFlow library, which is written in C++ (and other "fast" languages). As detailed [[#Results comparison|in this section]], the overall time is significantly different because it takes the pre/post-processing computations into account as well. These computations don't leverage the NPU accelerator and thus are more affected by the slower Python code. Nevertheless, if the model used is much more complex, as usually occurs in real-world cases, this overhead could still be tolerable because it might become negligible. In conclusion, the use of Python should not be discarded a priori because of performance concerns. Depending on the specific use case, it can be a valid option to consider.
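The snippet below is a simplified sketch of how the times reported by version 2B (warm-up, filling, average inference) can be measured with the TensorFlow Lite Python API; it is not the actual application code, and it leaves out how the NNAPI delegate is applied on the target.
<pre>
import time
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="my_fruits_model_qatlegacy.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
data = np.zeros(inp['shape'], dtype=inp['dtype'])  # pre-processed image goes here

# First inference (warm-up), measured separately because of the model
# graph initialization needed by the GPU/ML module
start = time.monotonic()
interpreter.invoke()
print("Warm-up time: %.2f ms" % ((time.monotonic() - start) * 1000))

# Filling time: copying the pre-processed image into the input tensor
start = time.monotonic()
interpreter.set_tensor(inp['index'], data)
print("Filling time: %.2f ms" % ((time.monotonic() - start) * 1000))

# Average of a few inference runs
times = []
for _ in range(3):
    start = time.monotonic()
    interpreter.invoke()
    times.append((time.monotonic() - start) * 1000)
print("Average inference time: %.2f ms" % (sum(times) / len(times)))
</pre>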
=== Version 3 (C++) ===
A new C++ application was written to apply the inference to the frames captured from the image sensor ([https://cdn.sparkfun.com/datasheets/Sensors/LightImaging/OV5640_datasheet.pdf OV5640]) of a [https://www.nxp.com/part/MINISASTOCSI#/ camera module], instead of images retrieved from files. This version uses OpenCV 4.2.0 to control the camera and to pre-process the frames. Like version 2A, inference runs on the NPU, so only the fully-quantized model was tested with the version 3 application.
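Although version 3 is written in C++, the following Python sketch illustrates the overall structure of its main loop: frames are grabbed from the camera with OpenCV, pre-processed, classified, and displayed with the predicted label overlaid. The camera device index and the overlay details are assumptions.
<pre>
import cv2
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="my_fruits_model_qatlegacy.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
labels = [line.strip() for line in open("labels.txt")]

cap = cv2.VideoCapture(0)  # OV5640 camera module, assumed to be /dev/video0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Pre-processing: BGR to RGB conversion and resize to the model input size
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (int(inp['shape'][2]), int(inp['shape'][1])))
    interpreter.set_tensor(inp['index'], np.expand_dims(rgb, axis=0))
    interpreter.invoke()
    scores = interpreter.get_tensor(out['index'])[0]
    top = int(np.argmax(scores))
    # Overlay the predicted class on the frame and display it
    cv2.putText(frame, labels[top], (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("fruit classifier", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
</pre>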
== Running the applications ==
* All the files required to run the test (the executable, the image files, etc.) are stored on a [https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux tmpfs RAM disk] in order to make the file system/storage medium overhead negligible (see the example below).
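For reference, such a RAM disk can be created with a tmpfs mount similar to the following (mount point and size are arbitrary):
<pre class="board-terminal">
root@imx8mpevk:~# mkdir -p /mnt/ramdisk
root@imx8mpevk:~# mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk
</pre>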
=== Version 1 (no NPU acceleration) ===
The following sections detail the execution of the first version of the classifier on the embedded platform. The number of threads was also tweaked in order to test different configurations. During the execution, the well-known <code>[https://en.wikipedia.org/wiki/Htop htop]</code> utility was used to monitor the system. This tool is very convenient to get useful information such as core allocation, processor load, and number of running threads.

==== Floating-point model ====
The following dump refers to the execution of the application when using the floating-point model.
<pre class="board-terminal">
root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 2 my_converted_model.tflite labels.txt testdata/red-apple1.jpg
</pre>
[[File:ML-TN-001 4 float 2threads.png|thumb|center|600px|Thread parameter set to 2]]
==== Half-quantized model ====
The following dump refers to the execution of the application in combination with the half-quantized model.
<pre class="board-terminal">
root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 2 my_fruits_model_1.12_quant.tflite labels.txt testdata/red-apple1.jpg
2.47029e-18 Hand
</pre>
The following screenshot shows the system status while executing the application. In this case, the thread parameter was unspecified.
[[File:ML-TN-001 4 weightsquant default.png|thumb|center|600px|Thread parameter unspecified]]
==== Fully-quantized model ====
The following dump refers to the execution of the application when using the fully-quantized model.
<pre class="board-terminal">
root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg
</pre>
[[File:ML-TN-001 4 fullquant 4threads.png|thumb|center|600px|Thread parameter set to 4]]
=== Version 2A (C++) ===
The execution of version 2A of the classifier on the embedded platform is detailed below. During the execution, <code>htop</code> was used to monitor the system. Note that "''the first execution of model inference using the NN API always takes many times longer, because of model graph initialization needed by the GPU/ML module''", as stated by NXP documentation. Therefore, the time needed for the first inference (warm-up) is measured separately.
<pre class="board-terminal">
root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg
INFO: Created TensorFlow Lite delegate for NNAPI.
</pre>
The following screenshot shows the system status while executing the application.
 
[[File:ML-TN-001 4 acceleration.png|thumb|center|600px]]
It is worth remembering that, when using the NPU accelerator, it is not possible to select the number of threads.

==== Profiling model execution on NPU ====
For the sake of completeness, the eIQ profiler log is provided as well in the following box. According to NXP documentation, "''the log captures detailed information of the execution clock cycles and DDR data transmission in each layer''". Note that the time needed for inference is longer than usual because of the profiler overhead. The input command and the messages printed by the application are in bold to separate them from the log.
'''root@imx8mpevk:/mnt/ramdisk/image_classifier_eIQ_plus# build/image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg '''
'''INFO: Created TensorFlow Lite delegate for NNAPI.'''
prev_ptrs = 0xffffa369c040
Exit VX Thread: 0xa3ee5fb0
=== Version 2B ===
The execution of version 2B of the classifier on the embedded platform is detailed below. As before, <code>htop</code> was used to monitor the system.

<pre class="board-terminal">
root@imx8mpevk:/home/mathias/devel/image_classifier_eIQ_plus# python3 image_classifier.py -m my_fruits_model_qatlegacy.tflite -l labels.txt -i testdata/red-apple1.jpg
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate.
Warm-up time: 3474.22 ms
Original image size: (600, 600)
Cropped image size: (600, 600)
Resized image size: (224, 224)
Filling time: 0.72 ms
Inference time 1: 1.44 ms
Inference time 2: 1.38 ms
Inference time 3: 1.39 ms
Average inference time: 1.40 ms
Total prediction time: 2.12 ms
Results:
1.000 Red Apple
0.000 Orange
0.000 Hand
</pre>
Note that the inference time is close to the C++ one, but the filling time (needed to fill the input tensor with the image) is slower. This is because Python doesn't allow some low-level operations with pointers, unlike C++.

The following screenshot shows the system status while executing the application.

[[File:ML-TN-001 4 acceleration python.png|center|thumb|600x600px]]

=== Version 3 ===
The following image shows the execution of the third version of the classifier on the embedded platform. The image sensor is pointed at a red apple, which is correctly classified with 98% confidence. Note that with this camera, the frame rate is capped at 30 fps, but it could be much higher because the inference on the NPU only takes a few milliseconds, as shown before.

[[File:ML-TN-001 4 camera photo.jpg|thumb|center|600px|Version 3 of the application running on the i.MX8M Plus EVK]]

During the execution, <code>htop</code> was used to monitor the system. The following screenshot shows the system status while executing the application.

[[File:ML-TN-001 4 camera htop.png|thumb|center|600px|<code>htop</code> screenshot during the execution of the classifier version 3]]

== Results ==
=== Version 1 ===
The following table lists the prediction times for a single image depending on the model and the thread parameter.

{| class="wikitable" style="margin: auto;"
|+Prediction times
!Model
!Threads
!Prediction time
[ms]
|-
| rowspan="3" |'''Floating-point'''
|unspecified (4)
|89
|-
|1
|160
|-
|2
|130
|-
|'''Half-quantized'''
|unspecified (4)
|180
|-
| rowspan="2" |'''Fully-quantized'''
|unspecified (1)
|85
|-
|4
|29
|}
In conclusion, to maximize the performance in terms of execution time, the model has to be fully-quantized and the number of threads has to be specified explicitly.
=== Versions 2A and 3 ===
In this case, only the fully-quantized model could be tested and the thread number has no effect.
{| class="wikitable" style="margin: auto;"
|+Prediction times
!Model
!Prediction time
[ms]
|-
|'''Fully-quantized'''
|1.5
|}
 
=== Version 2B ===
 
{| class="wikitable" style="margin: auto;"
|+
Prediction times
!Model
!Prediction time
[ms]
|-
|'''Fully-quantized'''
|2.1
|}
== Results comparison ==
The following table compares the results achieved to the ones measured on the platforms previously tested, namely the [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 2|i.MX8M-based Mito8M SoM]] and the [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 3|Xilinx ZCU104 Evaluation Kit]].
{| class="wikitable" style="margin: auto;"
|+Prediction times
!Platform
!BSP
!TensorFlow Lite
!ARM cores
(# / Type / Max freq. [GHz])
!Acceleration
!Model
!Threads
!Prediction time
[ms]
!Notes
|-
| rowspan="6" |'''NXP i.MX8M-based Mito8M SoM'''
| rowspan="6" |L4.14.98_2.0.0
| rowspan="6" |1.12
| rowspan="6" |4 / Cortex-A53 / 1.3
| rowspan="6" |no
| rowspan="3" |Floating-point
|unspecified (4)
|220
|
|-
|1
|220
|
|-
|2
|390
|
|-
|Half-quantized
|unspecified (4)
|330
|
|-
| rowspan="2" |Fully-quantized
|unspecified (1)
|200
|
|-
|4
|84
|
|-
| rowspan="5" |'''Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit'''
| rowspan="5" |
| rowspan="5" |NA
| rowspan="5" |
|2 cores DPU
(through DNNDK API)
|Fully-quantized US+ (*)
|1
|3.7
|
|-
| rowspan="4" |2 cores DPU
(through VART API)
| rowspan="4" |Fully-quantized US+ (*)
|1
|4.1
|
|-
|2
|2.3
|
|-
|4
|1.2
|
|-
|6
|1.2
|
|-
| rowspan="8" |'''NXP i.MX8M Plus EVK'''
| rowspan="8" |L5.4.24_2.1.0
| rowspan="8" |2.1
| rowspan="8" |4 / Cortex-A53 / 1.8
| rowspan="6" |no
(version 1)
| rowspan="3" |Floating-point
|unspecified (4)
|89
|
|-
|1
|160
|
|-
|2
|130
|
|-
|Half-quantized
|unspecified (4)
|180
|
|-
| rowspan="2" |Fully-quantized
|unspecified (1)
|85
|
|-
|4
|29
|Interestingly, this time is significantly smaller than the one measured on the i.MX8M (84 ms). This is probably due to improvements at the TFL inference engine level, besides the increased maximum ARM frequency.
|-
|NPU
(version 2A: C++)
|Fully-quantized
|NA
|1.5
|
|-
|NPU
(version 2B: Python)
|Fully-quantized
|NA
|2.1
|See also section [[#Version 2B (Python)|''Version 2B (Python)'']].
|}
(*) This fully-quantized model differs from the one tested with the other two platforms because of the different [https://wiki.dave.eu/index.php/ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_3#Building_the_application workflow needed to deploy it]. In particular, it is not in the TFL format. In any case, both fully-quantized models were obtained from TensorFlow models sharing the same structure, as is every other model listed.