{{InfoBoxTop}}
{{AppliesTo Machine Learning TN}}
{{InfoBoxBottom}}
For more details, please refer to the following sections.
 
The target was configured in order to leverage the hardware acceleration provided by the [https://www.xilinx.com/products/intellectual-property/dpu.html Xilinx Deep Learning Processor Unit (DPU)], which is an IP instantiated in the Programmable Logic (PL) as depicted in the following block diagram.
==Building the application==
The starting point for the application is the model described [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1#Reference_application_.231:_fruit_classifier|here]]. Incidentally, the '''same''' model structure was used as the starting point for [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_2|this other test]] as well (*). This makes the comparison of the two tests straightforward, even though they were run on SoCs that differ significantly from an architectural standpoint.
 
 
 
(*) The two models share the same structure but, as they are trained independently, their weights differ.
===Training the model===
Model training is performed with the help of the Docker container provided by Vitis AI.
The model is trained for a total of 100 epochs, with early stopping to prevent overfitting on the training data and checkpointing of the weights on the best <code>val_loss</code>. After that, a new model is created by disabling all the layers that are useful only during training, such as dropout and batch normalization layers (in this case, no batch normalization layers are used).
[[File:Train Loss.png|thumb|center|500px|Plot of model's loss during training phase]]
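For reference, the training setup described above can be sketched with the Keras API as follows. This is a minimal, illustrative sketch: the model architecture, the data, and the file names are placeholders, not the actual ones used for this test.
<pre>
# Sketch of the training procedure: early stopping on validation loss and
# checkpointing of the weights on the best val_loss.
# The model below is a generic stand-in CNN, NOT the actual classifier.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Stop training when val_loss stops improving
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Keep the weights corresponding to the best val_loss
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                       save_best_only=True,
                                       save_weights_only=True),
]

# Placeholder random data standing in for the actual training set
x = np.random.rand(100, 32, 32, 3).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 10, 100), 10)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=callbacks)
</pre>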
===Pruning the model===
{{ImportantMessage|text=This operation is performed at the TensorFlow level. As such, it does not make use of the Xilinx pruning tool, which is referred to in [https://www.xilinx.com/support/documentation/ip_documentation/dpu/v3_2/pg338-dpu.pdf this document], for example.}}
Weight pruning means eliminating unnecessary values in the weight tensors, in practice setting some of the neural network parameters to zero in order to remove unnecessary connections between the layers of the network. This is done during the training process so that the neural network can adapt to the changes. An immediate benefit of this work is disk compression: sparse tensors are amenable to compression. Hence, by applying simple file compression to the pruned TensorFlow checkpoint, it is possible to reduce the size of the model for storage and/or transmission.
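As an illustration of the technique (not necessarily the exact recipe used for this test), magnitude-based weight pruning can be applied with the TensorFlow Model Optimization Toolkit. The sketch below reuses the placeholder <code>model</code>, <code>x</code>, and <code>y</code> of the previous sketch; the sparsity target and the schedule are illustrative.
<pre>
# Sketch of magnitude-based weight pruning with the TensorFlow Model
# Optimization Toolkit (tensorflow_model_optimization package).
import tensorflow_model_optimization as tfmot

# Wrap the model so that its smallest weights are progressively zeroed
# out while training continues (the network adapts to the changes).
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)

pruned_model.compile(optimizer="adam", loss="categorical_crossentropy")
pruned_model.fit(x, y, epochs=5,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before saving: the resulting checkpoint
# contains sparse tensors that compress well with ordinary file
# compression (e.g. gzip).
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("pruned_model.h5")
</pre>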
In this particular case, two kernels are produced because the softmax activation layer is not currently supported by the DPU.
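Since the softmax layer runs outside the DPU, the application has to apply it on the CPU to the raw output of the DPU kernel. A minimal numpy version of this step could look as follows:
<pre>
import numpy as np

def softmax(logits):
    # Numerically stable softmax, computed on the CPU because the DPU
    # does not support this layer
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Example usage (dpu_output is the raw output vector of the DPU kernel):
# probs = softmax(dpu_output)
</pre>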
The following log of vai_c_tensorflow shows the result of the compilation for the '''baseline model''':
<pre>
...
</pre>
The following log of vai_c_tensorflow shows the result of the compilation for the '''pruned model''':
<pre>
...
</pre>
In order to have reproducible and reliable results, some measures were taken:
* The inference was repeated several times and the average execution time was computed (see the sketch after this list)
* All the files required to run the test—the executable, the image files, etc.—are stored on a [https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux tmpfs RAM disk] in order to make file system/storage medium overhead negligible.
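The first measure can be summarized by the following minimal Python sketch; <code>run_inference</code> is a hypothetical placeholder for the actual inference call (the real applications are written in C++):
<pre>
import time

def average_inference_time(run_inference, n_runs=100):
    # Repeat the inference n_runs times and return the average
    # wall-clock execution time, in seconds
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    return (time.perf_counter() - start) / n_runs

# Example usage:
# avg_s = average_inference_time(lambda: dpu_run(image))
# throughput_fps = 1.0 / avg_s
</pre>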
Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above: one based on the DNNDK low-level API and one based on the newer VART API.
===DExplorer===
The '''DExplorer''' tool provides DPU running mode configuration, DNNDK version checking, DPU status checking, and DPU core signature checking, as illustrated here:
<pre>
...
</pre>
===DDump===
It is possible to dump some of the information encapsulated inside the DPU ELF file, such as the DPU kernel name and general information, and the DPU architecture information. These are useful for analysis and debugging purposes. To retrieve this information, use the '''DDump''' tool as illustrated here:
<pre>
...
</pre>
====Coarse-grained profiling using the DNNDK low-level API====
The results of the coarse-grained profiling achieved using the baseline's DPU kernel (i.e. <code>custom_cnn_0</code>), compiled with the option <code>mode</code> set to '''normal''', are indicated in the following box.
<pre>
...
</pre>
 
Within the scope of this TN, the most relevant time is ''DPU tot time'', which indicates the time spent executing the inference (~3.7 ms). This leads to a throughput of about 271 fps (the reciprocal of the inference time: 1 / 3.7 ms ≈ 270 fps).
====Fine-grained profiling using the DNNDK low-level API====
The following frame reports the results of the fine-grained profiling achieved using the baseline's DPU kernel (i.e. <code>dbg_custom_cnn_0</code>), compiled with the option <code>mode</code> set to '''profile'''.
<pre>
...
</pre>
====Profiling analysis with DSight====
DSight is the DNNDK performance profiling tool, used for visual performance analysis of neural network models. By running the DNNDK application with ''profile'' as the DPU running mode configuration, a <code>.prof</code> log file is produced. This file can be parsed and processed with DSight, obtaining an HTML web page with charts showing the DPU cores' utilization and scheduling efficiency over time, as illustrated in the following picture:
[[File:Xilinx DSight.png|thumb|center|1000px|DSight visual performance analysis]]
===VART-based application===
As stated previously, this version of the application is functionally equivalent to the DNNDK-based one, but it makes use of the newer [https://github.com/Xilinx/Vitis-AI/blob/master/VART/README.md Vitis AI Runtime (VART) API].
The following dump shows the output of the application when processing the image file <code>red_apple_1.jpg</code>.
<pre>
...
</pre>
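For reference, an equivalent single-image inference can be sketched with the VART Python API (the actual application is written in C++). This is a sketch under stated assumptions: the model file name <code>custom_cnn.xmodel</code> is hypothetical, the flow matches the Vitis AI 1.3-style Python API, and the buffer dtype may differ depending on the Vitis AI version.
<pre>
import numpy as np
import vart
import xir

def softmax(logits):
    # CPU-side softmax (same helper as sketched earlier)
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

# Deserialize the compiled model and pick its (only) DPU subgraph
graph = xir.Graph.deserialize("custom_cnn.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = [s for s in subgraphs
                if s.has_attr("device")
                and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu_subgraph, "run")

# Allocate input/output buffers matching the model tensors (the dtype
# may be int8 instead of float32, depending on the Vitis AI version)
in_dims = tuple(runner.get_input_tensors()[0].dims)
out_dims = tuple(runner.get_output_tensors()[0].dims)
input_buf = np.zeros(in_dims, dtype=np.float32)
output_buf = np.empty(out_dims, dtype=np.float32)
# input_buf[0] = preprocessed image (e.g. resized and normalized)

# Submit the job to the DPU and wait for its completion
job_id = runner.execute_async([input_buf], [output_buf])
runner.wait(job_id)

# Softmax on the CPU (not supported by the DPU), then top-1 class
probs = softmax(output_buf[0].flatten())
print("predicted class:", np.argmax(probs))
</pre>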
====Profiling with Vitis AI Profiler====
Vitis AI Profiler is a powerful, application-level tool that helps optimize the whole AI application. Its main purpose is to detect the bottlenecks of the whole AI application by profiling the pre-processing and post-processing functions together with the DPU kernels' running status.
The tool consists of two components: <code>vaitrace</code>, which runs on the target device and is responsible for data collection, and <code>vaiprofiler</code>, which runs on a PC or local server and is responsible for the analysis and visualization of the collected data.
The developed application is profiled several times, each time with a different number of threads. For all the profiling traces, the DPU throughput is provided along with some additional information concerning the latency of the DPUs and the usage of both the CPU and DPU cores. The inference is repeated 10 times on the same image.
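The measurement loop can be sketched as follows, again in Python for brevity (the actual application is written in C++). One runner is created per thread, and the model file name, thread count, and run count are illustrative assumptions:
<pre>
import threading
import time
import numpy as np
import vart
import xir

N_THREADS = 2   # number of worker threads (1, 2, 4, 6 in the tests below)
N_RUNS = 10     # inferences per thread, on the same image

graph = xir.Graph.deserialize("custom_cnn.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = [s for s in subgraphs
                if s.has_attr("device")
                and s.get_attr("device").upper() == "DPU"][0]

def worker():
    # One runner per thread; the runtime schedules the jobs on the
    # available DPU cores
    runner = vart.Runner.create_runner(dpu_subgraph, "run")
    inp = np.zeros(tuple(runner.get_input_tensors()[0].dims),
                   dtype=np.float32)  # placeholder (all-zero) input
    out = np.empty(tuple(runner.get_output_tensors()[0].dims),
                   dtype=np.float32)
    for _ in range(N_RUNS):
        job_id = runner.execute_async([inp], [out])
        runner.wait(job_id)

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print("throughput: %.1f fps" % (N_THREADS * N_RUNS / elapsed))
</pre>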
 
=====One thread=====
In the figure below, the VART-based application uses 1 thread. The trace shows that the throughput is stable, around '''245''' fps. The throughput is similar to the one achieved by the DNNDK-based application, but slightly lower. This is probably because the VART APIs introduce a slightly higher overhead.
 
[[File:Vaiprofiler 1 thread 10 runs.png|thumb|center|800px|Profiling VART based application, 1 thread only]]
 
{| class="wikitable" style="margin: auto;"
|+
Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1514.05 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 15.90 %
|-
| CPU-01
| 23.74 %
|-
| CPU-02
| 1.12 %
|-
| CPU-03
| 1.15 %
|-
| DPU-01
| 18.72 %
|}
 
As expected, only one of the two DPU cores is actually leveraged.
=====Two threads=====
In the figure below, the VART-based application uses 2 threads. The trace shows that the throughput is stable, around '''442''' fps.
 
[[File:Vaiprofiler 2 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 2 threads]]
 
{| class="wikitable" style="margin: auto;"
|+
Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_0 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 2085.12 us
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1648.66 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 2.84 %
|-
| CPU-01
| 10.56 %
|-
| CPU-02
| 30.00 %
|-
| CPU-03
| 19.14 %
|-
| DPU-00
| 19.02 %
|-
| DPU-01
| 13.24 %
|}
 
As expected, the profiling information indicates that both DPUs are used. To a first approximation, the throughput is doubled with respect to the single-thread application (442 fps vs. 2 × 245 = 490 fps), in accordance with the fact that the DPU cores work in parallel and the CPU cores are not saturated.
 
=====Four threads=====
In the figure below, the VART-based application uses 4 threads. The trace shows that the throughput is stable, around '''818''' fps.
[[File:Vaiprofiler 4 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 4 threads]]
{| class="wikitable" style="margin: auto;"
|+
Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_0 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 2111.89 us
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1679.56 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 20.05 %
|-
| CPU-01
| 18.56 %
|-
| CPU-02
| 19.26 %
|-
| CPU-03
| 22.21 %
|-
| DPU-00
| 23.95 %
|-
| DPU-01
| 16.96 %
|}
 
Interestingly, using four threads (i.e. as many as the CPU cores) further increases the throughput by a factor of almost 2, while keeping the DPU cores' occupation low. It should not be forgotten, in fact, that part of the algorithm makes use of the CPU computational power as well.
=====Six threads=====
In the figure below, the VART-based application uses 6 threads. The trace shows that the throughput is stable, around '''830''' fps.
[[File:Vaiprofiler 6 threads 10 runs.png|thumb|center|800px|Profiling VART based application, 6 threads]]
 
{| class="wikitable" style="margin: auto;"
|+
Trace information
|-
! Item
! Value
|- style="font-weight:bold;"
| DPU_0 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 2305.08 us
|- style="font-weight:bold;"
| DPU_1 Latency
| style="font-weight:normal;" |
|-
| custom_cnn_0
| 1856.95 us
|- style="font-weight:bold;"
| Utilization
| style="font-weight:normal;" |
|-
| CPU-00
| 20.36 %
|-
| CPU-01
| 19.88 %
|-
| CPU-02
| 22.71 %
|-
| CPU-03
| 19.21 %
|-
| DPU-00
| 22.87 %
|-
| DPU-01
| 20.84 %
|}
 
==Results==
 
In the following table, the throughputs achieved by different versions of the application are summarized.
 
 
{| class="wikitable" style="margin: auto;"
|+
!API
!Number of threads
!Throughput [fps]
|-
|DNNDK
|1
|271
|-
| rowspan="4" |VART
|1
|245
|-
|2
|442
|-
|4
|818
|-
|6
|830
|}
It is worth mentioning that:
* When the number of threads is greater than 1, the latency of DPU_0 is higher than the latency of DPU_1, although the two cores are equivalent in terms of hardware configuration. To date, this fact is still unexplained.
* Increasing the number of threads of the VART-based application beyond 6 does not further increase the achieved throughput.