DAVE Developer's Wiki β

Specifically, the following versions of the application were tested:
* Version 1: This version is the same described in [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 2|this article]]. As such, inference is implemented in software and is applied to images retrieved from files.
* Version 2: This version is functionally equivalent to version 1, but it leverages the Neural Processing Unit (NPU) to hardware-accelerate the inference.
* Version 3: This is like version 2, but the inference is applied to the frames captured from an image sensor.
Neither the floating-point nor the half-quantized models work on the NPU. Moreover, "the GPU/ML module driver does not support per-channel quantization yet. Therefore post-training quantization of models with TensorFlow v2 cannot be used if the model is supposed to run on the GPU/ML module (inference on CPU does not have this limitation). TensorFlow v1 quantization-aware training and model conversion is recommended in this case".
Therefore, only the fully-quantized model was tested with the version 2 application.
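To illustrate why the per-channel limitation quoted above matters, the following sketch contrasts per-tensor and per-channel weight quantization. The helper function and the weight values are purely illustrative assumptions, not code or data from the article: per-channel quantization (what TensorFlow v2 post-training quantization emits for convolution weights) gives each output channel its own scale, while the GPU/ML module driver expects a single per-tensor scale, which sacrifices precision on channels with small weight ranges.

```python
def quantize(values, scale, zero_point=0):
    """Affine-quantize floats to int8 with a single scale (illustrative helper)."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

# Two hypothetical output channels with very different weight ranges.
channel_a = [0.01, -0.02, 0.016]   # small-magnitude weights
channel_b = [1.0, -2.0, 1.6]       # large-magnitude weights

# Per-tensor: one scale must cover both channels, driven by the largest
# magnitude, so channel_a collapses to a handful of levels.
per_tensor = quantize(channel_a, 2.0 / 128)

# Per-channel: channel_a gets its own scale and keeps most of its precision.
per_channel = quantize(channel_a, 0.02 / 128)

print("per-tensor:", per_tensor)    # small weights crushed to 0/±1 levels
print("per-channel:", per_channel)  # full int8 range used for the small channel
```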
=== <big>Version 2</big> ===
The following sections detail the execution of the second version of the classifier on the embedded platform. During the execution, <code>htop</code> was used to monitor the system. Note that "the first execution of model inference using the NN API always takes many times longer, because of model graph initialization needed by the GPU/ML module". Therefore, the time needed for the first inference (warm-up) is measured separately.<pre class="board-terminal">
TBD
</pre>The following screenshot shows the system status while executing the application.
[IMAGE]
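Since the first NN API inference includes the graph initialization on the GPU/ML module, it should be timed apart from the steady-state runs. A minimal sketch of such a measurement helper follows; it is generic over any inference callable, and the commented interpreter setup (model path, delegate path) is an assumption for illustration, not taken from the article.

```python
import time

def measure(invoke, n_steady=10):
    """Time the first (warm-up) call separately from n_steady further calls.

    Returns (warm_up_seconds, best_steady_seconds).
    """
    start = time.monotonic()
    invoke()                              # includes one-time graph initialization
    warm_up = time.monotonic() - start

    steady = []
    for _ in range(n_steady):
        start = time.monotonic()
        invoke()
        steady.append(time.monotonic() - start)
    return warm_up, min(steady)

# With a real TFLite interpreter the callable would be, for example:
#   interpreter = tflite.Interpreter(
#       model_path="model_full_quant.tflite",          # assumed name
#       experimental_delegates=[...])                  # NPU delegate
#   interpreter.allocate_tensors()
#   warm_up, steady = measure(interpreter.invoke)
```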
==== <big>Profiling model execution on NPU</big> ====
The following block shows the profiler log. "The log captures detailed information of the execution clock cycles and DDR data transmission in each layer". Note that the time needed for inference is longer than usual, as the overhead of the profiler is added. TBD
<pre class="board-terminal">
$ export CNN_PERF=1 NN_EXT_SHOW_PERF=1 VIV_VX_DEBUG_LEVEL=1 VIV_VX_PROFILE=1
</pre>