{{InfoBoxTop}}
{{AppliesToMachineLearning}}
{{AppliesTo Machine Learning TN}}
{{InfoBoxBottom}}
 
[[File:TBD.png|thumb|center|200px|Work in progress]]
__FORCETOC__
{| class="wikitable" style="margin: auto;"
|+History
!Version
!Date
!Notes
|-
|1.0.0
|October 2020
|First public release
|}
==Introduction==
This Technical Note (TN for short) belongs to the series introduced [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1|here]]. Specifically, it illustrates the execution of an inference application (fruit classifier) that makes use of the model described in [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|this section]] when executed on the [https://www.xilinx.com/products/boards-and-kits/zcu104.html Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit]. The same application was also tested on the NXP i.MX8M-based Mito8M SoM; for more details, please refer to [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_2|this article]].

===Test bed===
The following table details the test bed used for this Technical Note.

{| class="wikitable" style="margin: auto;"
|+Host and target configurations
!System
!Component
!Name
!Version
!Notes
|-
| rowspan="3" |'''Host'''
|Operating system
|GNU/Linux Ubuntu
|18.04
|
|-
|Software development platform
|Vitis
|1.2
|
|-
|Machine learning framework
|TensorFlow
|1.15.2
|
|-
| rowspan="4" |'''Target'''
|Hardware platform
|ZCU104
|1.0
|
|-
|Linux BSP
|Petalinux
|2020.1
|
|-
|Software binary image (microSD card)
|xilinx-zcu104-dpu-v2020.1-v1.2.0
|v2020.1-v1.2.0
|
|-
|Neural network hardware accelerator
|DPU
|3.3
|For more details, please refer to the following sections.
|}

The target was configured in order to leverage the hardware acceleration provided by the [https://www.xilinx.com/products/intellectual-property/dpu.html Xilinx Deep Learning Processor Unit (DPU)], which is an IP instantiated in the Programmable Logic (PL) as depicted in the following block diagram.

[[File:ML-TN-001_001-MPSoC-PL1.png|thumb|center|600px|Top-level architecture of the system implemented in the SoC]]

In particular, this is the DPU-related subsystem:

[[File:ML-TN-001-MPSoC-PL2.png|thumb|center|600px|DPU-related subsystem]]

Interestingly, to some extent, the DPU IP can be customized in order to find the optimal trade-off between performance and resource allocation. For instance, the actual number of DPU cores can be selected. The default configuration of the DPU used for the initial testing is depicted in the following images. In this case, two DPU cores are instantiated (DPU_0 and DPU_1).

{| class="wikitable" style="margin: auto;"
|+
!DPU default configuration
|-
|[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px]]
|-
|[[File:ML-TN-001-MPSoC-PL4.png|thumb|center|600px]]
|-
|[[File:ML-TN-001-MPSoC-PL5.png|thumb|center|600px]]
|}
==Building the application==
The starting point for the application is the model described [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|here]]. Incidentally, the '''same''' model structure was used as the starting point for [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_2|this other test]] as well (*). This makes the comparison of the two tests straightforward, even though they were run on SoCs that differ significantly from the architectural standpoint.
 
 
 
(*) The two models share the same structure but, as they are trained independently, their weights differ.
===Training the model===
Model training is performed with the help of the Docker container provided by Vitis AI.
 
The model is trained for a total of 100 epochs, with early stopping to prevent overfitting on the training data and checkpointing of the weights on the best validation loss (<code>val_loss</code>). After that, a new model is created by disabling all the layers that are only useful during training, such as dropout and batch normalization layers (in this case, the model does not use batch normalization layers).
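For reference, the following is a minimal sketch of such a training setup with the Keras API of TensorFlow 1.15. The <code>model</code>, <code>train_ds</code>, and <code>val_ds</code> objects, the file name, and all hyperparameters other than the number of epochs are assumptions, not the exact code used for this TN.

<pre>
import tensorflow as tf

# Assumptions: `model` is the fruit-classifier Keras model, `train_ds` and
# `val_ds` are the training and validation datasets (e.g. generators).
callbacks = [
    # Stop training when the validation loss stops improving.
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10),
    # Save the weights corresponding to the best validation loss seen so far.
    tf.keras.callbacks.ModelCheckpoint('float_model.h5', monitor='val_loss',
                                       save_best_only=True),
]

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=100,
                    callbacks=callbacks)
</pre>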
 
 
[[File:Train Accuracy.png|thumb|center|500px|Plot of model's accuracy during training phase]]
 
 
[[File:Train Loss.png|thumb|center|500px|Plot of model's loss during training phase]]
 
===Pruning the model===
{{ImportantMessage|text=This operation is performed at TensorFlow level. As such, it does not make use of the Xilinx pruning tool, which is referred to in [https://www.xilinx.com/support/documentation/ip_documentation/dpu/v3_2/pg338-dpu.pdf this document], for example.}}
 
 
Weight pruning means eliminating unnecessary values in the weight tensors, practically setting the neural network parameters' values to zero in order to remove the unnecessary connections between the layers of a neural network. This is done during the training process to allow the neural network to adapt to the changes. An immediate benefit of this technique is disk compression: sparse tensors are amenable to compression. Hence, by applying simple file compression to the pruned TensorFlow checkpoint, it is possible to reduce the size of the model for storage and/or transmission.
 
The following list shows the weight sparsity of the model before applying pruning. Notably, there is no sparsity at all in the weights of the model.
<pre>
predictions/bias:0 -- Param: 6 -- Zeros: 00.00%
</pre>
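A listing like the one above can be produced with a small helper that inspects each weight tensor of the loaded model; the sketch below is a hypothetical implementation (the function name and the use of the Keras backend session are assumptions).

<pre>
import numpy as np
import tensorflow as tf

# Hypothetical helper: print, for each weight tensor of a Keras model, the
# number of parameters and the percentage of zero-valued entries.
def print_sparsity(model):
    for weight in model.weights:
        values = tf.keras.backend.get_value(weight)
        zeros = np.sum(values == 0)
        print('{} -- Param: {} -- Zeros: {:05.2f}%'.format(
            weight.name, values.size, 100.0 * zeros / values.size))
</pre>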
 
 
The size in bytes of the compressed model before applying pruning:
<pre>
Size of gzipped loaded model: 17801431.00 bytes
</pre>
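Figures of this kind can be obtained, for instance, by gzipping the saved model file and measuring the result; a minimal sketch follows (the model file name is an assumption).

<pre>
import gzip
import os
import shutil
import tempfile

# Hypothetical helper: compress a saved model file with gzip and return the
# size in bytes of the compressed file.
def get_gzipped_model_size(model_file):
    _, zipped_file = tempfile.mkstemp('.gz')
    with open(model_file, 'rb') as f_in, gzip.open(zipped_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    size = os.path.getsize(zipped_file)
    os.remove(zipped_file)
    return size

print('Size of gzipped loaded model: %.2f bytes'
      % get_gzipped_model_size('float_model.h5'))
</pre>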
 
 
The accuracy of the non-pruned model over the test dataset:
<pre>
1/1 [==============================] - 0s 214ms/step - loss: 1.3166 - acc: 0.7083
</pre>
The model is loaded and trained once again, resuming its previous state, after applying a pruning schedule. As training proceeds, the pruning routine will be scheduled to execute, eliminating (i.e. setting to zero) the weights with the lowest magnitude values (i.e. those closest to zero) until the current sparsity target is reached. Every time the pruning routine is scheduled to execute, the current sparsity target is recalculated, starting from 0% until it reaches the final target sparsity at the end of the pruning schedule. After the end step, the training continues, in order to regain the lost accuracy, knowing that the actual level of sparsity will not change.
In this particular case, a good compromise between compression and accuracy drop is to prune only the two dense layers of the model, which have a high number of parameters, with a pruning schedule that starts at epoch 0 and ends at 1/3 of the total number of epochs (i.e. 100 epochs), with an initial sparsity of 50%, a final sparsity of 80%, and a pruning frequency of 5 steps (i.e. the model is pruned every 5 steps during the training phase). A sketch of such a schedule is shown below.
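The sketch below shows how a schedule of this kind could be set up with the TensorFlow Model Optimization toolkit. The file name, the steps-per-epoch value, the assumption of a Sequential topology, and the per-layer wrapping strategy are illustrative assumptions, not the exact code used for this TN.

<pre>
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Pruning schedule described above: start at step 0, end at 1/3 of the
# training, ramping sparsity from 50% to 80%, pruning every 5 steps.
# The number of steps per epoch depends on the dataset/batch size (assumption).
steps_per_epoch = 100
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,
        final_sparsity=0.80,
        begin_step=0,
        end_step=(100 // 3) * steps_per_epoch,
        frequency=5)
}

# Resume from the previously trained model and wrap only the dense layers
# (assuming a Sequential topology).
model = tf.keras.models.load_model('float_model.h5')
pruned_layers = []
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.Dense):
        pruned_layers.append(
            tfmot.sparsity.keras.prune_low_magnitude(layer, **pruning_params))
    else:
        pruned_layers.append(layer)
pruned_model = tf.keras.Sequential(pruned_layers)

pruned_model.compile(optimizer='adam', loss='categorical_crossentropy',
                     metrics=['accuracy'])
# The UpdatePruningStep callback is what actually zeroes the weights while
# training resumes:
# pruned_model.fit(train_ds, validation_data=val_ds, epochs=100,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
</pre>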
 
 
[[File:Prune Accuracy.png|thumb|center|500px|Plot of model's accuracy during pruning phase]]
 
 
[[File:Prune Loss.png|thumb|center|500px|Plot of model's loss during pruning phase]]
 
 
The weight sparsity of the model, after applying pruning:
<pre>
</pre>
 
The size in bytes of the compressed model after pruning:
<pre>
Size of gzipped loaded model: 5795289.00 bytes
</pre>
The difference between the two versions of the same compressed model (before and after pruning) in terms of disk occupation is remarkable: the pruned model is smaller by a factor of about 3.
 
 
The accuracy of the pruned model over the test dataset:
<pre>
</pre>

===Freezing the computational graph===
Freezing the model means producing a single file containing both the graph and the checkpoint variables, saving these parameters as constants within the graph structure. This eliminates additional information saved in the checkpoint files, such as the gradients at each point, which are included so that the model can be reloaded and training resumed from a previously saved point. As this is not needed when serving a model purely for inference, it is discarded during freezing.
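A minimal sketch of how this step can be carried out with the TF1 utilities is shown below. The file names and the Keras backend session handling are assumptions; the output node name matches the <code>predictions/Softmax</code> node listed later in this section.

<pre>
import tensorflow as tf
from tensorflow.python.framework import graph_util

# Disable training-specific behaviour before exporting the graph.
tf.keras.backend.set_learning_phase(0)
model = tf.keras.models.load_model('float_model.h5')   # file name is an assumption
sess = tf.keras.backend.get_session()

# Replace all variables with constants holding their current values.
frozen_graph_def = graph_util.convert_variables_to_constants(
    sess,
    sess.graph.as_graph_def(),
    output_node_names=['predictions/Softmax'])

with tf.io.gfile.GFile('frozen_graph.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())
</pre>

The log reported below shows the output of the freezing step for this model.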
<pre>
INFO:tensorflow:Froze 12 variables.
</pre>


After freezing, the computational graph is described as follows:
'''Baseline model'''
<pre>
describe : frozen_graph.pb
total nodes : 56
</pre>
 
 
A much more detailed description of the computational graph, showing all the nodes and the corresponding operations, is provided as follows:
<pre>
Op: Softmax -- Name: predictions/Softmax
</pre>
 
 
===Transforming the computational graph===
The structure of the current computational graph can be optimized using the Graph Transform tool, which is provided within the TensorFlow framework. The tool allows the application of a series of transformations that reduce the complexity of the input graph, erasing all the nodes and operations that are not useful for the purpose of inference. The list of transformations used is the following:
 
<pre>
transformations_list = ['remove_nodes(op=Identity, op=CheckNumerics)',
'merge_duplicate_nodes',
'strip_unused_nodes',
'fold_constants(ignore_errors=true)',
'fold_batch_norms']
</pre>
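These transformations can be applied through the Graph Transform tool's Python wrapper; a minimal sketch is shown below. The input node name is an assumption, while the output node name is taken from the graph description above.

<pre>
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

transformations_list = ['remove_nodes(op=Identity, op=CheckNumerics)',
                        'merge_duplicate_nodes',
                        'strip_unused_nodes',
                        'fold_constants(ignore_errors=true)',
                        'fold_batch_norms']

# Load the frozen graph produced by the previous step.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# 'input_1' is an assumed input node name; the output node is the softmax
# node reported in the graph description above.
optimized_graph_def = TransformGraph(graph_def,
                                     ['input_1'],
                                     ['predictions/Softmax'],
                                     transformations_list)

with tf.io.gfile.GFile('optimized_graph.pb', 'wb') as f:
    f.write(optimized_graph_def.SerializeToString())
</pre>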
 
 
After performing the optimization, the new description of the computational graph is as follows:
<pre>
total nodes : 42
</pre>
 
 
A much more detailed description of the optimized computational graph, showing all the nodes and the corresponding operations, is provided as follows:
<pre>
Op: BiasAdd -- Name: predictions/BiasAdd
</pre>
 
 
The accuracy of the '''baseline model''' over the test dataset after applying all transformations:
<pre>
Graph accuracy with test dataset: 0.7083
</pre>
 
 
The accuracy of the '''pruned model''' over the test dataset after applying all transformations:
<pre>
</pre>
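Accuracy figures of this kind can be computed by loading the transformed graph and running the test set through it; the following is a minimal sketch (the node names, the file name, and the <code>x_test</code>/<code>y_test</code> arrays are assumptions).

<pre>
import numpy as np
import tensorflow as tf

# x_test/y_test are assumed to be NumPy arrays holding the pre-processed test
# images and their one-hot labels.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('optimized_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name='')
    images = graph.get_tensor_by_name('input_1:0')
    probabilities = graph.get_tensor_by_name('predictions/Softmax:0')
    with tf.compat.v1.Session(graph=graph) as sess:
        predictions = sess.run(probabilities, feed_dict={images: x_test})
        accuracy = np.mean(np.argmax(predictions, axis=1)
                           == np.argmax(y_test, axis=1))
        print('Graph accuracy with test dataset: %.4f' % accuracy)
</pre>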
===Quantizing the computational graph===
The process of inference is expensive in terms of computation and requires a high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications. Generally, 32-bit floating-point weights and activation values are used when training neural networks but, with the Vitis AI quantizer, the complexity of the computation can be reduced without losing prediction accuracy. This is achieved by converting the 32-bit floating-point values to 8-bit integer format. The resulting fixed-point network model requires less memory bandwidth, thus providing faster speed and higher power efficiency than the floating-point model.

In the quantize calibration process, only a small set of images is required to analyze the distribution of activations. Since no backpropagation is performed, there is no need to provide any labels either. Depending on the size of the neural network, the running time of quantize calibration varies from a few seconds to several minutes.

After calibration, the quantized model is transformed into a DPU-deployable model (named <code>deploy_model.pb</code> for vai_q_tensorflow) which follows the data format of the DPU. This model can be compiled by the Vitis AI compiler and deployed to the DPU. This quantized model cannot be used by the standard TensorFlow framework to evaluate the loss of accuracy. Hence, in order to do so, a second file is produced (named <code>quantize_eval_model.pb</code> for vai_q_tensorflow).

For the current application, 100 images are sampled from the train dataset and augmented, resulting in a total number of 1000 images used for calibration. The graph is then calibrated providing a batch of 10 images for 100 iterations.

The following log of vai_q_tensorflow shows the result of the whole quantization process:

<pre>
Vai_q_tensorflow v1.2.0 build for Tensorflow 1.15.2
2020-10-08 13:26:59.752125: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
100% (100 of 100) |######################| Elapsed Time: 0:00:33 Time: 0:00:33
INFO: Checking Float Graph...
INFO: Float Graph Check Done.
INFO: Calibrating for 100 iterations...
INFO: Calibration Done.
INFO: Generating Deploy Model...
INFO: Deploy Model Generated.
********************* Quantization Summary *********************
INFO: Output:
  quantize_eval_model: ./build/quantize/baseline/quantize_eval_model.pb
  deploy_model: ./build/quantize/baseline/deploy_model.pb
</pre>

The accuracy of the '''baseline model''' over the test dataset after applying quantization:
<pre>
graph accuracy with test dataset: 0.7083
</pre>
The accuracy of the '''pruned model''' over the test dataset after applying quantization:
<pre>
graph accuracy with test dataset: 0.70836667
</pre>
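The calibration images are fed to vai_q_tensorflow by means of a user-supplied input function, referenced through the quantizer's <code>--input_fn</code> option. A minimal sketch of such a function is shown below; the module layout, the placeholder name, the image list file, and the pre-processing are assumptions.

<pre>
# calib_input.py -- hypothetical calibration input function for vai_q_tensorflow.
# The quantizer calls calib_input(iteration) once per calibration iteration; it
# must return a dict mapping the graph's input placeholder name to a batch of
# pre-processed images.
import cv2
import numpy as np

CALIB_BATCH_SIZE = 10

with open('calibration_images.txt') as f:
    image_list = [line.strip() for line in f]

def calib_input(iteration):
    images = []
    for i in range(CALIB_BATCH_SIZE):
        path = image_list[(iteration * CALIB_BATCH_SIZE + i) % len(image_list)]
        img = cv2.imread(path)
        # Resize to the model's 224x224x3 input; normalization is an assumption.
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
        images.append(img)
    return {'input_1': np.array(images)}
</pre>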
===Compiling the model===
The Vitis AI compiler operates in a multi-stage process:
# The compiler parses the topology of the optimized and quantized input model and produces a new computation graph consisting of a data flow and a control flow.
# It then optimizes the data and control flow through processes such as fusing the batch normalization layers into the preceding convolution layers, efficient instruction scheduling by exploiting inherent parallelism, and exploiting data reuse.
# Finally, it generates the code to be run.

It must be noted that, due to the limited number of operations supported by the DPU, the Vitis AI compiler automatically partitions the input network model into several kernels when there are operations not supported by the DPU. For this particular case, two kernels are produced, because the softmax activation layer is not currently supported by the DPU.

The following log of vai_c_tensorflow shows the result of the compilation for the '''baseline model''':
<pre>
Kernel topology "custom_cnn_kernel_graph.jpg" for network "custom_cnn"
</pre>
The following log of vai_c_tensorflow shows the result of the compilation for the '''pruned model''':
<pre>
Kernel topology "pruned_custom_cnn_kernel_graph.jpg" for network "pruned_custom_cnn"
</pre>
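Since the softmax layer is not part of the DPU kernels, the applications compute the softmax and the top-k ranking on the CPU, starting from the logits returned by the DPU. The following Python sketch illustrates the idea (the actual applications are written in C++, and the logits values below are made up for illustration only).

<pre>
import numpy as np

# The DPU kernel stops at the last fully-connected layer, so softmax and the
# top-k ranking are computed on the CPU from the raw logits.
def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return e / np.sum(e)

def top_k(probabilities, labels, k=6):
    order = np.argsort(probabilities)[::-1][:k]
    return [(labels[i], probabilities[i]) for i in order]

labels = ['red_apple', 'green_apple', 'orange', 'banana', 'avocado', 'hand']
logits = np.array([9.2, -1.3, 1.5, -5.0, -2.1, 0.4])   # example values only
for name, prob in top_k(softmax(logits), labels):
    print('%s : %g' % (name, prob))
</pre>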
==Running the application==
In order to have reproducible and reliable results, some measures were taken:
* The inference was repeated several times and the average execution time was computed.
* All the files required to run the test (the executable, the image files, etc.) are stored on a [https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux tmpfs RAM disk] in order to make file system/storage medium overhead negligible.

Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above:
* The first application uses the old DNNDK low-level APIs for loading the DPU kernel, creating the DPU task, and preparing the input and output tensors for the inference. Besides the use of the DSight visual tool, two profiling strategies are available depending on the DPU mode chosen when compiling the kernel (normal or profile):
** A coarse-grained profiling, which shows the execution time for all the main tasks executed on the CPU and on the DPU.
** A fine-grained profiling, which shows detailed information about all the nodes of the model, such as the workload, the memory occupation, and the runtime.
* The second application is a multi-threaded application instead, which uses the VART high-level APIs for retrieving the computational subgraph from the DPU kernel and for performing the inference. In this case, it is possible to split the entire workload over multiple concurrent threads, assigning each one a batch of images.

Both applications make use of the OpenCV library for cropping and resizing the input images, in order to match the model's input tensor shape, and display the results of the inference (i.e. the probability for each class) for each image.

Before illustrating the results produced by running the C++ applications, it is interesting to check some information about the DPU and the DPU kernel ELF file. This can be done with the DExplorer and DDump tools.

===DExplorer===
DExplorer provides DPU running mode configuration, DNNDK version checking, DPU status checking, and DPU core signature checking, as illustrated here:

<pre>
root@xilinx-zcu104-2020_1:~# dexplorer -v -w
Vitis AI for Edge DPU version 1.2
Copyright 2019 Xilinx Inc.

DExplorer version 3.0
Build Label: Jun 19 2020 05:21:20

DSight version 2.1
Build Label: Jun 19 2020 05:21:20

DDump version 2.0
Build Label: Jun 19 2020 05:21:20

N2Cube Core library version 4.2
Build Label: Jun 19 2020 05:21:16

[DPU IP Spec]
IP Timestamp         : 2020-06-18 12:00:00
DPU Core Count       : 2

[DPU Core Configuration List]
DPU Core             : #0
DPU Enabled          : Yes
DPU Arch             : B4096
DPU Target Version   : v1.4.1
DPU Freqency         : 300 MHz
Ram Usage            : High
DepthwiseConv        : Enabled
DepthwiseConv+Relu6  : Enabled
Conv+Leakyrelu       : Enabled
Conv+Relu6           : Enabled
Channel Augmentation : Enabled
Average Pool         : Enabled

DPU Core             : #1
DPU Enabled          : Yes
DPU Arch             : B4096
DPU Target Version   : v1.4.1
DPU Freqency         : 300 MHz
Ram Usage            : High
DepthwiseConv        : Enabled
DepthwiseConv+Relu6  : Enabled
Conv+Leakyrelu       : Enabled
Conv+Relu6           : Enabled
Channel Augmentation : Enabled
Average Pool         : Enabled
</pre>

===DDump===
It is possible to dump some information encapsulated inside the DPU ELF file, such as the DPU kernel name, general information, and the DPU architecture information. These are useful for analysis and debugging purposes.
To retrieve this information, use the ''DDump'' tool as illustrated here:

<pre>
root@xilinx-zcu104-2020_1:~/VART_2# ddump -f bin/dpu_custom_cnn_0.elf -a
DPU Kernel List from file bin/dpu_custom_cnn_0.elf
ID: Name
 0: custom_cnn_0

DPU Kernel name: custom_cnn_0
----------------------------------------------------------------
-> DPU Kernel general info
Mode: NORMAL
Code Size: 0.02MB
Param Size: 4.60MB
Workload MACs: 498.209MOP
IO Memory Space: 0.52MB
Mean Value: 0, 0, 0
Node Count: 6
Tensor Count: 7
Tensor In(H*W*C)
 Tensor ID-0: 224*224*3
Tensor Out(H*W*C)
 Tensor ID-6: 1*1*6

-> DPU architecture info
DPU ABI Ver: v2.1
DPU Configuration Parameters
DPU Target Ver: 1.4.1
DPU Arch Type: B4096
RAM Usage: high
DepthwiseConv: Enabled
DepthwiseConv+Relu6: Enabled
Conv+Leakyrelu: Enabled
Conv+Relu6: Enabled
Channel Augmentation: Enabled
Average Pool: Enabled

-> DNNC compiler info
DNNC Ver: VAI_C Compiler for Edge, Version v5.01
DPU Target : v1.4.1
Build Label: Jun 23 2020 03:34:14
Copyright @2020 Xilinx Inc. All Rights Reserved.
</pre>

===DNNDK-based application===
====Coarse-grained profiling using the DNNDK low-level API====
The results of the coarse-grained profiling achieved using the baseline's DPU kernel (i.e. <code>custom_cnn_0</code>), compiled with the mode option set to '''normal''', are indicated in the following box.

<pre>
---------------------------------------------------------------
red_apple_1.jpg

[Time] LoadImage 0.0161657 ms
[Time] PreprocessImage 0.00521947 ms
[Time] SetInputImageInHWCFP32 0.00115946 ms
[DPU Time] dpuSetInputTensorInHWCFP32 1.67499 ms
[DPU Time] dpuRunTask 1.99908 ms
[DPU Time] dpuGetOutputTensorInHWCFP32 0.00864 ms
[DPU Time] dpuRunSoftmax 0.00476 ms
[DPU tot time] 3.68747 ms
[DPU throughput] 271.189 FPS
[Time] CpuArgmax<float> 1.3e-07 ms
[Time] CpuSoftmax 1.01e-06 ms
[Time] RunCustomCNN 0.0109225 ms
[Time] TopK 8.04e-06 ms
1) red_apple : 0.999665
2) orange : 0.00033535
3) hand : 8.76131e-08
4) avocado : 9.23435e-09
5) banana : 1.38833e-11
6) green_apple : 3.44132e-14
_______________________________________________________________
</pre>

Within the scope of this TN, the most relevant figure is ''DPU tot time'', which indicates the time spent to execute the inference (~3.7 ms). This leads to a throughput of about 271 fps.

====Fine-grained profiling using the DNNDK low-level API====
The following frame reports the results of the fine-grained profiling achieved using the baseline's DPU kernel (i.e. <code>dbg_custom_cnn_0</code>), compiled with the mode option set to '''profile'''.
<pre>
---------------------------------------------------------------
red_apple_1.jpg

[Time] LoadImage 0.0163798 ms
[Time] PreprocessImage 0.00518644 ms
[Time] SetInputImageInHWCFP32 0.00132577 ms
[DNNDK] Performance profile - DPU Kernel "dbg_custom_cnn_0" DPU Task "dbg_custom_cnn_0-1"
=====================================================================================================
  ID            NodeName  Workload(MOP)  Mem(MB)  RunTime(ms)  Perf(GOPS)  Utilization    MB/S
   1     conv2d_1_Conv2D         85.163     0.53        0.363       234.6        19.1%  1453.7
   2     conv2d_2_Conv2D        218.991     0.48        0.205      1068.2        86.9%  2330.1
   3     conv2d_3_Conv2D         99.680     0.15        0.108       923.0        75.1%  1396.5
   4     conv2d_4_Conv2D         84.935     0.13        0.089       954.3        77.7%  1487.2
   5      dense_1_MatMul          9.437     4.52        1.106         8.5         0.7%  4089.5
   6  predictions_MatMul          0.003     0.00        0.013         0.2         0.0%   145.8

         Total Nodes In Avg:
                     All         498.209     6.10        1.884       264.4        21.5%  3235.6
=====================================================================================================
[Time] CpuArgmax<float> 1.3e-07 ms
[Time] CpuSoftmax 9e-07 ms
[Time] RunCustomCNN 0.0110171 ms
[Time] TopK 8.34e-06 ms
1) red_apple : 0.999665
2) orange : 0.00033535
3) hand : 8.76131e-08
4) avocado : 9.23435e-09
5) banana : 1.38833e-11
6) green_apple : 3.44132e-14
_______________________________________________________________
</pre>

====Profiling analysis with DSight====
DSight is the DNNDK performance profiling tool, used for visual performance analysis of neural network models. By running the DNNDK application with profile as the DPU running mode configuration, a <code>.prof</code> log file is produced. This file can be parsed and processed with DSight, obtaining an HTML web page that provides a chart showing the DPU cores' utilization and scheduling efficiency over time, as illustrated in the following picture:

[[File:Xilinx DSight.png|thumb|center|1000px|DSight visual performance analysis]]

===VART-based application===
As stated previously, this version of the application is functionally equivalent to the DNNDK-based one, but it makes use of the newer [https://github.com/Xilinx/Vitis-AI/blob/master/VART/README.md Vitis AI Runtime (VART) API]. The following dump shows the output of the application when processing the image file <code>red_apple_1.jpg</code>.

<pre>
image name : red_apple_1.jpg
ground truth label : red_apple
predicted label : red_apple
1) red_apple : 0.999665
2) orange : 0.00033535
3) hand : 8.76131e-08
4) avocado : 9.23435e-09
5) banana : 1.38833e-11
6) green_apple : 3.44132e-14
________________________________________________

execution time : 0.0583705 s
tot correct : 1
tot wrong : 0
</pre>

====Profiling with Vitis AI Profiler====
Vitis-AI Profiler is a powerful, application-level tool that helps to optimize the whole AI application. Its main purpose is to detect bottlenecks by profiling the pre-processing and post-processing functions together with the DPU kernels' running status. The tool consists of two components: <code>vaitrace</code>, which runs on the target device and is responsible for data collection, and <code>vaiprofiler</code>, which runs on a PC or local server and is responsible for the analysis and visualization of the collected data.
Note that it is preferable to save the configuration for <code>vaitrace</code> into a file as follows:

<pre>
{
    "options": {
        "runmode": "normal",
        "cmd": "./bin/customCNNclassification -w split -i ./images -c ./custom_images -r 10 -t 1 -M ./bin/dpu_custom_cnn_0.elf",
        "output": "./trace_customCNN_vart.xat",
        "timeout": 10
    },
    "trace": {
        "enable_trace_list": ["vitis-ai-library", "vart", "opencv", "custom"]
    },
    "trace_custom": ["ListDirectory", "GetImageFileNames", "TopK", "CpuArgmax", "CpuSoftmax", "GetPredictedLabels", "GetGroundTruthLabels", "SliceVector"]
}
</pre>

The developed application is profiled several times, each time with a different number of threads. For all the profiling traces, the DPU throughput is provided along with some additional information concerning the latency of the DPUs and the usage of both CPU and DPU cores. The inference is repeated 10 times on the same image.

=====One thread=====
In the figure below, the VART-based application uses 1 thread. The trace shows that the throughput is stable, around '''245''' fps. The throughput is similar to the one achieved by the DNNDK-based application, but slightly lower. This is probably due to the fact that the VART APIs introduce slightly more overhead.

[[File:Vaiprofiler 1 thread 10 runs.png|thumb|center|800px|Profiling VART based application, 1 thread only]]

{| class="wikitable" style="margin: auto;"
|+Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1514.05 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 15.90 %
|-
| CPU-01
| 23.74 %
|-
| CPU-02
| 1.12 %
|-
| CPU-03
| 1.15 %
|-
| DPU-01
| 18.72 %
|}

As expected, only one of the two DPU cores is actually leveraged.

=====Two threads=====
In the figure below, the VART-based application uses 2 threads. The trace shows that the throughput is stable, around '''442''' fps.

[[File:Vaiprofiler 2 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 2 threads]]

{| class="wikitable" style="margin: auto;"
|+Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_0 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 2085.12 us
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1648.66 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 2.84 %
|-
| CPU-01
| 10.56 %
|-
| CPU-02
| 30.00 %
|-
| CPU-03
| 19.14 %
|-
| DPU-00
| 19.02 %
|-
| DPU-01
| 13.24 %
|}

As expected, the profiling information indicates that both DPUs are used. To a first approximation, the throughput is doubled with respect to the single-thread application, in accordance with the fact that the DPU cores work in parallel and the CPU cores are not saturated.

=====Four threads=====
In the figure below, the VART-based application uses 4 threads. The trace shows that the throughput is stable, around '''818''' fps.

[[File:Vaiprofiler 4 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 4 threads]]

{| class="wikitable" style="margin: auto;"
|+Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_0 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 2111.89 us
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1679.56 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 20.05 %
|-
| CPU-01
| 18.56 %
|-
| CPU-02
| 19.26 %
|-
| CPU-03
| 22.21 %
|-
| DPU-00
| 23.95 %
|-
| DPU-01
| 16.96 %
|}

Interestingly, having four threads (i.e. the same number of CPU cores) allows the throughput to be further increased by a factor of almost 2, while keeping the DPU cores' occupation low. It should not be forgotten, in fact, that part of the algorithm makes use of the CPU computational power as well.

=====Six threads=====
In the figure below, the VART-based application uses 6 threads. The trace shows that the throughput is stable, around '''830''' fps.

[[File:Vaiprofiler 6 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 6 threads]]

{| class="wikitable" style="margin: auto;"
|+Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_0 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 2305.08 us
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1856.95 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 20.36 %
|-
| CPU-01
| 19.88 %
|-
| CPU-02
| 22.71 %
|-
| CPU-03
| 19.21 %
|-
| DPU-00
| 22.87 %
|-
| DPU-01
| 20.84 %
|}

==Results==
In the following table, the throughputs achieved by the different versions of the application are summarized.

{| class="wikitable" style="margin: auto;"
|+
!API
!Number of threads
!Throughput [fps]
|-
|DNNDK
|1
|271
|-
| rowspan="4" |VART
|1
|245
|-
|2
|442
|-
|4
|818
|-
|6
|830
|}

It is worth mentioning that:
* When the number of threads is greater than 1, the latency of DPU_0 is higher than that of DPU_1, although the two cores are equivalent in terms of hardware configuration. To date, this fact is still unexplained.
* Increasing the number of threads of the VART-based application beyond 6 does not further increase the achieved throughput.