{| class="wikitable"
!
!
!Name
!Version
!Notes
|-
| rowspan="2" |Host
|Operating system
|GNU/Linux Ubuntu
|18.04
|
|-
|Software development platform
|Vitis
|1.2
|
|-
| rowspan="3" |Target
|Hardware platform
|ZCU104
|TBD
|
|-
|Operating system
|Petalinux
|2020.1
|
|-
|Neural network hardware accelerator
|DPU
|3.3
|For more details, please refer to the following sections.
|}
The target was configured in order to leverage the hardware acceleration provided by the [https://www.xilinx.com/products/intellectual-property/dpu.html Xilinx Deep Learning Processor Unit (DPU)], which is an IP instantiated in the Programmable Logic (PL) as depicted in the following block diagram.
[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px|Top-level architecture of the system implemented in the SoC]]
In particular, this is the DPU-related subsystem:
[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px|DPU-related subsystem]]
Interestingly, to some extent, the DPU can be customized in order to find the optimal trade-off between performance and resource allocation. For instance, the actual number of DPU cores can be selected. The default configuration of the DPU used for the initial testing is shown in the following images.
{| class="wikitable" style="margin: auto;"
|+DPU default configuration
|-
|[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px]]
|-
|[[File:ML-TN-001-MPSoC-PL4.png|thumb|center|600px]]
|-
|[[File:ML-TN-001-MPSoC-PL5.png|thumb|center|600px]]
|}
* All the files required to run the test—the executable, the image files, etc.—are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
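For instance, such a RAM disk can be created as follows (mount point, size, and file names are illustrative):
<pre>
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=256M tmpfs /mnt/ramdisk
cp test_app images/* /mnt/ramdisk
</pre>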
Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above.
* The first application uses the legacy DNNDK low-level APIs for loading the DPU kernel, creating the DPU task, and preparing the input/output tensors for the inference. Besides the use of the DSight visual tool, two profiling strategies are available depending on the mode chosen when compiling the DPU kernel (normal or profile):
** A coarse-grained profiling, which shows the execution time of the main tasks executed on the CPU and on the DPU
** A fine-grained profiling, which shows detailed information about all the nodes of the model, such as the workload, the memory occupation, and the runtime.
* The second application is a multi-threaded application, which instead uses the VART high-level APIs for retrieving the computational subgraph from the DPU kernel and for performing the inference. In this case, it is possible to split the entire workload over multiple concurrent threads, assigning each one a batch of images.
Both applications make use of the OpenCV library for cropping and resizing the input images, in order to match the model's input tensor shape, and for displaying the results of the inference (i.e. the probability for each class) for each image.
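As a reference, the following is a minimal sketch of the DNNDK-based flow. The kernel name <code>custom_cnn_0</code> comes from the compilation step described above, while the input/output node names, the image file, and the input size are illustrative assumptions; error handling is omitted.
<pre>
#include <dnndk/dnndk.h>
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    dpuOpen();                                             // attach to the DPU driver
    DPUKernel *kernel = dpuLoadKernel("custom_cnn_0");     // kernel name from the compilation step
    DPUTask *task = dpuCreateTask(kernel, T_MODE_NORMAL);  // T_MODE_PROF requires a kernel compiled in profile mode

    cv::Mat img = cv::imread("test.png");                  // hypothetical input image
    cv::resize(img, img, cv::Size(32, 32));                // model input shape is an assumption
    dpuSetInputImage2(task, "input_node", img);            // node name is an assumption

    dpuRunTask(task);
    printf("DPU task time: %lld us\n", dpuGetTaskProfile(task));  // coarse-grained DPU timing

    // Softmax over the int8 output tensor to get the per-class probabilities
    int8_t *out  = dpuGetOutputTensorAddress(task, "output_node");  // node name is an assumption
    float  scale = dpuGetOutputTensorScale(task, "output_node");
    int    size  = dpuGetOutputTensorSize(task, "output_node");
    std::vector<float> probs(size);
    dpuRunSoftmax(out, probs.data(), size, 1, scale);

    dpuDestroyTask(task);
    dpuDestroyKernel(kernel);
    dpuClose();
    return 0;
}
</pre>
And this is a corresponding sketch of the VART-based multi-threaded flow, modeled after the Xilinx VART sample code; the <code>.xmodel</code> file name, the number of threads, and the CPU-side tensor buffer are assumptions, and the details may differ between Vitis AI releases:
<pre>
#include <vart/runner.hpp>
#include <vart/tensor_buffer.hpp>
#include <xir/graph/graph.hpp>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Minimal CPU-side tensor buffer, modeled after the one in the VART samples
class CpuFlatTensorBuffer : public vart::TensorBuffer {
 public:
  CpuFlatTensorBuffer(void* data, const xir::Tensor* tensor)
      : TensorBuffer(tensor), data_(data) {}
  std::pair<std::uint64_t, std::size_t> data(
      const std::vector<std::int32_t> idx = {}) override {
    return {reinterpret_cast<std::uint64_t>(data_),
            static_cast<std::size_t>(get_tensor()->get_element_num())};
  }
 private:
  void* data_;
};

int main() {
  // Deserialize the compiled model and locate the DPU subgraph (file name is an assumption)
  auto graph = xir::Graph::deserialize("custom_cnn.xmodel");
  const xir::Subgraph* dpu = nullptr;
  for (auto* s : graph->get_root_subgraph()->children_topological_sort())
    if (s->has_attr("device") && s->get_attr<std::string>("device") == "DPU")
      dpu = s;

  const int num_threads = 4;  // one DPU runner per worker thread
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([dpu]() {
      auto runner = vart::Runner::create_runner(dpu, "run");
      auto* in_t  = runner->get_input_tensors()[0];
      auto* out_t = runner->get_output_tensors()[0];
      std::vector<std::int8_t> in(in_t->get_element_num());   // pre-processed image batch goes here
      std::vector<std::int8_t> out(out_t->get_element_num());
      CpuFlatTensorBuffer ib(in.data(), in_t), ob(out.data(), out_t);
      std::vector<vart::TensorBuffer*> inputs{&ib}, outputs{&ob};
      auto job = runner->execute_async(inputs, outputs);      // submit the job to the DPU
      runner->wait(job.first, -1);                            // block until completion
      // ... softmax over 'out' to obtain the per-class probabilities ...
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
</pre>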
Before illustrating the results obtained by running the C++ applications, it is interesting to check some information about the DPU and the DPU kernel ELF file. This can be done with the DExplorer and DDump tools.
===DExplorer===
The '''DExplorer''' tool provides DPU running mode configuration, DNNDK version checking, DPU status checking, and DPU core signature checking. It can be used as illustrated here:
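The following are typical invocations, per the DNNDK User Guide (output omitted):
<pre>
dexplorer -v          # display version information of the DNNDK components
dexplorer -w          # display the DPU signature information
dexplorer -m profile  # set the DPU running mode
</pre>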
===DDump===
It is possible to dump some information encapsulated inside the DPU ELF file, such as the DPU kernel name and general information, and the DPU architecture information. These are useful for analysis and debugging purposes. To retrieve this information, use the '''DDump''' tool as illustrated here:
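A typical invocation, assuming the compiled kernel file is named <code>dpu_custom_cnn.elf</code>, looks like the following (output omitted; see the DNNDK User Guide for the full list of options):
<pre>
ddump -f dpu_custom_cnn.elf -a   # dump all the information contained in the DPU ELF file
</pre>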
===DNNDK-based application===
====Coarse-grained profiling using the DNNDK low-level API====
The following are the results of the coarse-grained profiling achieved using the baseline's DPU kernel (i.e. custom_cnn_0), compiled with the mode option set to '''normal'''.
====Fine-grained profiling using the DNNDK low-level API====
The following are the results of the fine-grained profiling achieved using the baseline's DPU kernel (i.e. dbg_custom_cnn_0), compiled with the mode option set to '''profile'''.
====Profiling analysis with DSight====
DSight is the DNNDK performance profiling tool, which serves as a visual performance analysis tool for neural network models. By running the DNNDK application with profile as the DPU running mode configuration, a <code>.prof</code> log file is produced. This file can be parsed and processed with DSight, obtaining an HTML page that provides a chart showing the DPU cores' utilization and scheduling efficiency, as illustrated in the following picture:
[[File:Xilinx DSight.png|thumb|center|1000px|DSight visual performance analysis]]
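A typical flow, per the DNNDK User Guide, is the following (the application name and the PID in the log file name are illustrative):
<pre>
dexplorer -m profile            # set the DPU running mode to profile
./dnndk_app                     # run the application; dpu_trace_[PID].prof is generated
dsight -p dpu_trace_1234.prof   # parse the log and generate the HTML page
</pre>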
===Vitis AI Profiler===
Vitis-AI Profiler is an application-level tool that can help optimize the whole AI application. Its main purpose is to help detect bottlenecks of the whole AI application by profiling the pre-processing and post-processing functions together with the DPU kernels' running status. The tool consists of two components: <code>vaitrace</code>, which runs on the target device and is responsible for data collection, and <code>vaiprofiler</code>, which runs on a PC or local server and is responsible for the analysis and visualization of the collected data.
Note that it is preferable to save the settings for <code>vaitrace</code> in a configuration file, as follows:
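As a reference, this is a minimal sketch of such a configuration file, modeled after the Vitis AI Profiler documentation; the exact schema may differ between releases, and the command and output file names are illustrative:
<pre>
{
    "options": {
        "runmode": "normal",
        "cmd": "./vart_app",
        "output": "./trace_vart_app.xat",
        "timeout": 3
    },
    "trace": {
        "enable_trace_list": ["vart", "opencv", "custom"]
    }
}
</pre>
The trace is then collected by launching <code>vaitrace</code> with this configuration file on the target, and the resulting <code>.xat</code> file can be loaded into <code>vaiprofiler</code> for analysis and visualization.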
 
Profiling the CPU and the DPU using the VART APIs, with two single-thread tasks; the inference is first performed on 1 test image and then on 1 custom image:
[[File:Vaiprofiler.png|thumb|center|1000px|Vitis AI Profiler]]
 
Profiling the CPU and the DPU using the VART APIs, with two single-thread tasks; the inference is first performed on 180 test images and then on 24 custom images:
TBD ADD TRACE HERE WITH THROUGHPUT 
Profiling the CPU and the DPU using the VART APIs, with two tasks, each one with 4 threads; the inference is first performed on 180 test images and then on 24 custom images:
TBD ADD TRACE HERE WITH THROUGHPUT