{{InfoBoxTop}}
{{AppliesToMachineLearning}}
{{AppliesTo Machine Learning TN}}
{{AppliesToMito8M}}
{{AppliesTo MITO 8M TN}}
{{InfoBoxBottom}}
==Introduction==
This Technical Note (TN for short) belongs to the series introduced [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1|here]].
Specifically, it illustrates the execution of an inference application (a fruit classifier) making use of the model described in [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|this section]] on the [[:Category:Mito8M|Mito8M SoM]], a system-on-module based on the NXP [https://www.nxp.com/products/processors-and-microcontrollers/arm-processors/i-mx-applications-processors/i-mx-8-processors/i-mx-8m-family-armcortex-a53-cortex-m4-audio-voice-video:i.MX8M i.MX8M SoC].
=== Test bed ===
The kernel and the root file system of the tested platform were built with the L4.14.98_2.0.0 release of the Yocto Board Support Package for the i.MX 8 family of devices. They were built with support for [https://www.nxp.com/design/software/development-software/eiq-ml-development-environment:EIQ eIQ]: "a collection of software and development tools for NXP microprocessors and microcontrollers to do inference of neural network models on embedded systems".
The following table details the relevant specs of the test bed.

{| class="wikitable" style="margin: auto;"
|-
|'''NXP Linux BSP release'''
|L4.14.98_2.0.0
|-
|'''Inference engine'''
|TensorFlow Lite 1.12
|-
|'''Maximum ARM cores frequency [MHz]'''
|1300
|-
|'''SDRAM memory frequency (LPDDR4) [MHz]'''
|1600
|-
|'''[https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt Governor]'''
|ondemand
|}
==Model deployment and inference application==
To run the model on the target, a new C++ application was written. After debugging this application on a host PC, it was migrated to the edge device, where it was built natively. The root file system for eIQ, in fact, provides the native C++ compiler as well.
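The following is a minimal sketch of what such an application can look like when written against the TensorFlow Lite C++ API. It is only an illustration, not the actual application: image decoding, label mapping, and most error handling are omitted, and the header paths follow recent TensorFlow releases (in TensorFlow 1.12 they lived under <code>tensorflow/contrib/lite/</code>).

<pre>
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main(int argc, char* argv[]) {
  if (argc < 2) return 1;

  // Load the converted .tflite model from disk.
  auto model = tflite::FlatBufferModel::BuildFromFile(argv[1]);
  if (model == nullptr) return 1;

  // Build an interpreter using the built-in operator implementations.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (interpreter == nullptr || interpreter->AllocateTensors() != kTfLiteOk)
    return 1;

  // Fill the input tensor with the preprocessed image pixels
  // (e.g. decoded and resized with OpenCV) and run the inference.
  float* input = interpreter->typed_input_tensor<float>(0);
  // ... copy the image data into input ...
  if (interpreter->Invoke() != kTfLiteOk) return 1;

  // Read back the class scores and map the best one to its label.
  float* scores = interpreter->typed_output_tensor<float>(0);
  // ... argmax over scores, print the matching line of labels.txt ...
  (void)input;
  (void)scores;
  return 0;
}
</pre>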
The trained model was converted to the TensorFlow Lite format with different quantization settings. So, in the end, three converted models were obtained: a regular 32-bit floating-point one, an 8-bit half-quantized one (only the weights are quantized, not the activations), and a fully-quantized one.
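For reference, TensorFlow Lite quantization relies on an affine mapping between real values and 8-bit integers, real_value = scale * (quantized_value - zero_point). The toy functions below illustrate the idea; they are not part of the application.

<pre>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Toy illustration of TFLite-style affine quantization:
// real_value = scale * (quantized_value - zero_point).
std::uint8_t quantize(float x, float scale, std::int32_t zero_point) {
  const long q = std::lround(x / scale) + zero_point;
  return static_cast<std::uint8_t>(std::clamp<long>(q, 0, 255));
}

float dequantize(std::uint8_t q, float scale, std::int32_t zero_point) {
  return scale * (static_cast<std::int32_t>(q) - zero_point);
}
</pre>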
 
Note that all the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk, in order to make the file system/storage medium overhead negligible.
The following sections detail the execution of the classifier on the embedded platform. The [https://www.tensorflow.org/lite/performance/best_practices#tweak_the_number_of_threads number of threads] was also tweaked in order to test different configurations. During the execution, the well-known [https://en.wikipedia.org/wiki/Htop <code>htop</code>] utility was used to monitor the system. This tool is very convenient to get some useful information such as core allocation, processor load, and the number of running threads.
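Regarding the thread parameter, a plausible implementation (a hypothetical helper, not taken from the actual application) simply forwards the value to the interpreter when it is specified, and otherwise leaves the TensorFlow Lite default untouched, letting the runtime decide how many worker threads to spawn, which is what <code>htop</code> then shows.

<pre>
#include "tensorflow/lite/interpreter.h"

// Hypothetical helper: forward the CLI thread parameter to the TFLite
// interpreter. A non-positive value means "unspecified", i.e. the
// default threading behavior of the runtime is kept.
void configure_threads(tflite::Interpreter& interpreter, int num_threads) {
  if (num_threads > 0) {
    // Affects the kernels that support multi-threaded execution;
    // it must be called before invoking the interpreter.
    interpreter.SetNumThreads(num_threads);
  }
}
</pre>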
=== Floating-point model ===
<pre class="board-terminal">
root@imx8qmmek:~/devel/image_classifier_eIQ# ./image_classifier_cv 2 my_converted_model.tflite labels.txt testdata/red-apple1.jpg
</pre>
==== Tweaking the number of threads ====
The following screenshots show the system status while executing the application with different values of the thread parameter.
[[File:ML-TN-001 2 float default.png|thumb|center|600px|Thread parameter unspecified]]
 
 
[[File:ML-TN-001 2 float 1thread.png|thumb|center|600px|Thread parameter set to 1]]
 
 
[[File:ML-TN-001 2 float 2threads.png|thumb|center|600px|Thread parameter set to 2]]
 
=== Half-quantized model ===
<pre class="board-terminal">
root@imx8qmmek:~/devel/image_classifier_eIQ# ./image_classifier_cv 2 my_fruits_model_1.12_quant.tflite labels.txt testdata/red-apple1.jpg
</pre>
The following screenshot shows the system status while executing the application. In this case, the thread parameter was unspecified.
[[File:ML-TN-001 2 weightsquant default.png|thumb|center|600px|Thread parameter unspecified]]
 
=== Fully-quantized model ===
<pre class="board-terminal">
root@imx8qmmek:~/devel/image_classifier_eIQ# ./image_classifier_cv 3 my_fruits_model_qatlegacy.tflite labels.txt testdata/red-apple1.jpg
1 Red Apple
</pre>
 
==== Tweaking the number of threads ====
The following screenshots show the system status while executing the application with different values of the thread parameter.
 
[[File:ML-TN-001 2 fullquant default.png|thumb|center|600px|Thread parameter unspecified]]
 
 
[[File:ML-TN-001 2 fullquant 4threads.png|thumb|center|600px|Thread parameter set to 4]]
== Results ==
The following table lists the total prediction times for a single image, depending on the model and the thread parameter.

{| class="wikitable" style="margin:auto;"
|+ Inference times
! Model !! Threads parameter !! Inference time [ms] !! Notes
|-
| rowspan="3" | '''Floating-point''' || unspecified || 220 ||
|-
| 1 || 220 ||
|-
| 2 || 390 ||
|-
| '''Half-quantized''' || unspecified || 330 ||
|-
| rowspan="2" | '''Fully-quantized''' || unspecified || 200 || Four threads are created beside the main process (supposedly, this quantity is set according to the number of physical cores available). Nevertheless, they seem to be constantly in sleep state.
|-
| 4 || 84 || Interestingly, 7 actual processes are created beside the main one. Four of them, however, seem to be constantly in sleep state.
|}

The total prediction time '''takes into account the time needed to fill the input tensor with the image and the average inference time'''. Furthermore, it is averaged over several predictions. The same tests were repeated using a network file system (NFS) over an Ethernet connection and using an ext4 file system stored on a microSD card: no significant variations in the prediction times were observed.
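As an illustration of how these figures can be measured, the sketch below (a hypothetical helper, shown for the floating-point model; the quantized variants use <code>uint8</code> input tensors) times the input-tensor fill and the invocation together, and averages over several consecutive predictions.

<pre>
#include <algorithm>
#include <chrono>
#include <vector>

#include "tensorflow/lite/interpreter.h"

// Illustrative helper: measure the total prediction time (input tensor
// fill + inference), averaged over `runs` consecutive predictions.
double average_prediction_ms(tflite::Interpreter& interpreter,
                             const std::vector<float>& image, int runs) {
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < runs; ++i) {
    float* input = interpreter.typed_input_tensor<float>(0);
    std::copy(image.begin(), image.end(), input);  // fill the input tensor
    interpreter.Invoke();                          // run the inference
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count() / runs;
}
</pre>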
In conclusion, to maximize the performance in terms of execution time, the model has to be fully quantized and the number of threads has to be specified explicitly.