
DAVE Developer's Wiki β

{{InfoBoxTop}}
{{AppliesToMachineLearning}}
{{AppliesTo Machine Learning TN}}
{{InfoBoxBottom}}
|For more details, please refer to the following sections.
|}
 
The target was configured in order to leverage the hardware acceleration provided by the [https://www.xilinx.com/products/intellectual-property/dpu.html Xilinx Deep Learning Processor Unit (DPU)], which is an IP instantiated in the Programmable Logic (PL) as depicted in the following block diagram.
==Building the application==
The starting point for the application is the model, in the form of a TensorFlow protobuf file (.pb), described [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|here]]. Incidentally, the '''same''' model structure was used as the starting point for [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_2|this other test]] as well(*). This makes the comparison of the two tests straightforward, even though they were run on SoCs that differ significantly from the architectural standpoint.
(*) The two models share the same structure but, as they are trained independently, their weights differ.
===Training the model===
Model training is performed with the help of the Docker container provided by Vitis AI.
In order to have reproducible and reliable results, some measures were taken:
* The inference was repeated several times and the average execution time was computed
* All the files required to run the test (the executable, the image files, etc.) are stored on a [https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux tmpfs RAM disk] in order to make the file system/storage medium overhead negligible.
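The first measure above (repeating the inference and averaging the execution time) can be sketched as follows. This is only a minimal illustration and not the code of the actual test applications: <code>mean_time_ms</code> and the <code>work</code> callable are hypothetical names, where <code>work</code> stands in for one complete inference.

<pre>
#include <chrono>
#include <cstddef>
#include <functional>

// Runs `work` `runs` times back to back and returns the mean wall-clock
// time of a single run, in milliseconds.
// `work` is a hypothetical stand-in for one inference; it is not part of
// the DNNDK or VART APIs.
double mean_time_ms(const std::function<void()>& work, std::size_t runs) {
    if (runs == 0) {
        return 0.0;  // nothing was measured
    }
    using clock = std::chrono::steady_clock;  // monotonic, suitable for intervals
    const auto start = clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(runs);
}
</pre>

Timing the whole loop rather than each iteration keeps the overhead of reading the clock out of the per-run figure.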
Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above:
</pre>
Within the scope of this TN, the most relevant time is ''[DPU tot time]'', which indicates the time spent executing the inference (~3.7 ms). This leads to a throughput of about 271 fps.
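The relation between the per-frame time and the throughput is a plain reciprocal, as in the sketch below (<code>fps_from_ms</code> is a hypothetical helper name). Note that the rounded figure of 3.7 ms yields roughly 270 fps; the reported value of about 271 fps was presumably computed from the unrounded time.

<pre>
// Converts a per-frame execution time in milliseconds into a throughput
// in frames per second. Returns 0 for non-positive input.
double fps_from_ms(double ms_per_frame) {
    return ms_per_frame > 0.0 ? 1000.0 / ms_per_frame : 0.0;
}
</pre>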
====Fine grained profiling using DNNDK low level API====
[[File:Vaiprofiler 1 thread 10 runs.png|thumb|center|800px|Profiling VART based application, 1 thread only]]
 
{| class="wikitable" style="margin: auto;"
As expected, only one of the two DPU cores is actually leveraged.
=====Two threads=====
In the figure below, the VART-based application uses 2 threads. The trace shows that the throughput is stable, around '''442''' fps.
[[File:Vaiprofiler 2 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 2 threads]]
 
{| class="wikitable" style="margin: auto;"
[[File:Vaiprofiler 4 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 4 threads]]
 
{| class="wikitable" style="margin: auto;"
|}
Interestingly, using four threads (i.e. as many threads as there are CPU cores) further increases the throughput by a factor of almost 2, while keeping the occupation of the DPU cores low. It should not be forgotten, in fact, that part of the algorithm makes use of the CPU computational power as well.
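As a rough illustration of how an aggregate throughput figure can be obtained when several threads run inferences concurrently, consider the sketch below. It does not use the VART API and is not the code of the test application: <code>measure_throughput</code>, <code>ThroughputResult</code> and the <code>infer</code> callable are hypothetical names, and <code>infer</code> is assumed to be safe to call from several threads at once.

<pre>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct ThroughputResult {
    std::size_t frames;  // total number of frames processed by all threads
    double seconds;      // wall-clock time from first spawn to last join
    double fps;          // frames / seconds (0 if no time elapsed)
};

// Spawns `num_threads` workers; each one calls `infer` `frames_per_thread`
// times. `infer` is a hypothetical stand-in for one complete inference
// (CPU pre/post-processing plus the DPU job) and must be thread-safe.
ThroughputResult measure_throughput(const std::function<void()>& infer,
                                    unsigned num_threads,
                                    std::size_t frames_per_thread) {
    std::atomic<std::size_t> done{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&infer, &done, frames_per_thread] {
            for (std::size_t i = 0; i < frames_per_thread; ++i) {
                infer();
                done.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // wait for every worker before stopping the clock
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    const std::size_t frames = done.load();
    const double s = elapsed.count();
    return {frames, s, s > 0.0 ? static_cast<double>(frames) / s : 0.0};
}
</pre>

The throughput is computed over the wall-clock time of the whole run, so it reflects the overlap between CPU-side work in some threads and DPU execution in others.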
=====Six threads=====
[[File:Vaiprofiler 6 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 6 threads]]
 
{| class="wikitable" style="margin: auto;"
|}
==Results==
 
In the following table, the throughputs achieved by different versions of the application are summarized.
 
 
{| class="wikitable" style="margin: auto;"
|+
!API
!Number of threads
!Throughput [fps]
|-
|DNNDK
|1
|271
|-
| rowspan="4" |VART
|1
|245
|-
|2
|442
|-
|4
|818
|-
|6
|830
|}
It is worth mentioning that:
* When the number of threads is greater than 1, the latency of DPU_0 is higher than the latency of DPU_1, although they are equivalent in terms of hardware configuration. To date, this fact is still unexplained.
* Increasing the number of threads of the VART-based application beyond 6 does not further increase the achieved throughput.
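To put the scaling in numbers, the speedup of each VART configuration relative to the single-thread case can be derived from the throughputs in the table above. The sketch below is only an illustration: <code>Sample</code>, <code>vart_results</code> and <code>speedups</code> are hypothetical names, and the data are copied from the table.

<pre>
#include <vector>

struct Sample {
    unsigned threads;  // number of application threads
    double fps;        // measured throughput
};

// VART throughputs as reported in the results table.
std::vector<Sample> vart_results() {
    return {{1, 245.0}, {2, 442.0}, {4, 818.0}, {6, 830.0}};
}

// Speedup of every sample relative to the first one (the baseline).
// Returns an empty vector if there are no samples or the baseline is not positive.
std::vector<double> speedups(const std::vector<Sample>& samples) {
    std::vector<double> out;
    if (samples.empty() || samples.front().fps <= 0.0) {
        return out;
    }
    out.reserve(samples.size());
    for (const Sample& s : samples) {
        out.push_back(s.fps / samples.front().fps);
    }
    return out;
}
</pre>

With these figures, the speedup is about 1.8 with 2 threads, about 3.3 with 4 threads and about 3.4 with 6 threads, which is consistent with the throughput levelling off beyond 4 threads.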