{| class="wikitable"
!
!
!Name
!Version
!Notes
|-
| rowspan="2" |Host
|Operating system
|GNU/Linux Ubuntu
|18.04
|
|-
|Software development platform
|Vitis
|1.2
|
|-
| rowspan="3" |Target
|Hardware platform
|ZCU104
|TBD
|
|-
|Operating system
|Petalinux
|2020.1
|
|-
|Neural network hardware accelerator
|DPU
|3.3
|For more details, please refer to the following sections.
|}
The target was configured in order to leverage the hardware acceleration provided by the [https://www.xilinx.com/products/intellectual-property/dpu.html Xilinx Deep Learning Processor Unit (DPU)], which is an IP instantiated in the Programmable Logic (PL) as depicted in the following block diagram.
[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px|Top-level architecture of the system implemented in the SoC]]
In particular, this is the DPU-related subsystem:
[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px|DPU-related subsystem]]
Interestingly, to some extent, the DPU can be customized in order to find the optimal trade-off between performance and resource allocation. For instance, the actual number of DPU cores can be selected. The default configuration of the DPU used for the initial testing is shown in the following images.
{| class="wikitable" style="margin: auto;"
|+DPU default configuration
|-
|[[File:ML-TN-001-MPSoC-PL3.png|thumb|center|600px]]
|-
|[[File:ML-TN-001-MPSoC-PL4.png|thumb|center|600px]]
|-
|[[File:ML-TN-001-MPSoC-PL5.png|thumb|center|600px]]
|}
* All the files required to run the test—the executable, the image files, etc.—are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
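For instance, such a RAM disk can be created as follows (mount point, size, and file names are illustrative):
<pre>
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=256M tmpfs /mnt/ramdisk
cp test_app images/* /mnt/ramdisk
</pre>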
Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above.
* The first application uses the legacy DNNDK low-level APIs for loading the DPU kernel, creating the DPU task, and preparing the input/output tensors for the inference. Besides the use of the DSight visual tool, two profiling strategies are available depending on the mode chosen when compiling the DPU kernel (normal or profile):
** A coarse-grained profiling, which shows the execution time of the main tasks executed on the CPU and on the DPU
** A fine-grained profiling, which shows detailed information about all the nodes of the model, such as the workload, the memory occupation, and the runtime.
* The second application is a multi-threaded application, which instead uses the VART high-level APIs for retrieving the computational subgraph from the DPU kernel and for performing the inference. In this case, it is possible to split the entire workload over multiple concurrent threads, assigning each one a batch of images.
Both applications make use of the OpenCV library for cropping and resizing the input images, in order to match the model's input tensor shape, and for displaying the results of the inference (i.e. the probability for each class) for each image.
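As a reference, the following is a minimal sketch of the DNNDK-based flow. The kernel name <code>custom_cnn_0</code> comes from the compilation step described above, while the input/output node names, the image file, and the input size are illustrative assumptions; error handling is omitted.
<pre>
#include <dnndk/dnndk.h>
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <vector>

int main() {
    dpuOpen();                                             // attach to the DPU driver
    DPUKernel *kernel = dpuLoadKernel("custom_cnn_0");     // kernel name from the compilation step
    DPUTask *task = dpuCreateTask(kernel, T_MODE_NORMAL);  // T_MODE_PROF requires a kernel compiled in profile mode

    cv::Mat img = cv::imread("test.png");                  // hypothetical input image
    cv::resize(img, img, cv::Size(32, 32));                // model input shape is an assumption
    dpuSetInputImage2(task, "input_node", img);            // node name is an assumption

    dpuRunTask(task);
    printf("DPU task time: %lld us\n", dpuGetTaskProfile(task));  // coarse-grained DPU timing

    // Softmax over the int8 output tensor to get the per-class probabilities
    int8_t *out  = dpuGetOutputTensorAddress(task, "output_node");  // node name is an assumption
    float  scale = dpuGetOutputTensorScale(task, "output_node");
    int    size  = dpuGetOutputTensorSize(task, "output_node");
    std::vector<float> probs(size);
    dpuRunSoftmax(out, probs.data(), size, 1, scale);

    dpuDestroyTask(task);
    dpuDestroyKernel(kernel);
    dpuClose();
    return 0;
}
</pre>
And this is a corresponding sketch of the VART-based multi-threaded flow, modeled after the Xilinx VART sample code; the <code>.xmodel</code> file name, the number of threads, and the CPU-side tensor buffer are assumptions, and the details may differ between Vitis AI releases:
<pre>
#include <vart/runner.hpp>
#include <vart/tensor_buffer.hpp>
#include <xir/graph/graph.hpp>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Minimal CPU-side tensor buffer, modeled after the one in the VART samples
class CpuFlatTensorBuffer : public vart::TensorBuffer {
 public:
  CpuFlatTensorBuffer(void* data, const xir::Tensor* tensor)
      : TensorBuffer(tensor), data_(data) {}
  std::pair<std::uint64_t, std::size_t> data(
      const std::vector<std::int32_t> idx = {}) override {
    return {reinterpret_cast<std::uint64_t>(data_),
            static_cast<std::size_t>(get_tensor()->get_element_num())};
  }
 private:
  void* data_;
};

int main() {
  // Deserialize the compiled model and locate the DPU subgraph (file name is an assumption)
  auto graph = xir::Graph::deserialize("custom_cnn.xmodel");
  const xir::Subgraph* dpu = nullptr;
  for (auto* s : graph->get_root_subgraph()->children_topological_sort())
    if (s->has_attr("device") && s->get_attr<std::string>("device") == "DPU")
      dpu = s;

  const int num_threads = 4;  // one DPU runner per worker thread
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([dpu]() {
      auto runner = vart::Runner::create_runner(dpu, "run");
      auto* in_t  = runner->get_input_tensors()[0];
      auto* out_t = runner->get_output_tensors()[0];
      std::vector<std::int8_t> in(in_t->get_element_num());   // pre-processed image batch goes here
      std::vector<std::int8_t> out(out_t->get_element_num());
      CpuFlatTensorBuffer ib(in.data(), in_t), ob(out.data(), out_t);
      std::vector<vart::TensorBuffer*> inputs{&ib}, outputs{&ob};
      auto job = runner->execute_async(inputs, outputs);      // submit the job to the DPU
      runner->wait(job.first, -1);                            // block until completion
      // ... softmax over 'out' to obtain the per-class probabilities ...
    });
  }
  for (auto& w : workers) w.join();
  return 0;
}
</pre>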
Before illustrating the results obtained by running the C++ applications, it is interesting to check some information about the DPU and the DPU kernel ELF file. This can be done with the DExplorer and DDump tools.
===DExplorer===
The '''DExplorer''' tool provides DPU running mode configuration, DNNDK version checking, DPU status checking, and DPU core signature checking. It can be used as illustrated here:
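The following are typical invocations, per the DNNDK User Guide (output omitted):
<pre>
dexplorer -v          # display version information of the DNNDK components
dexplorer -w          # display the DPU signature information
dexplorer -m profile  # set the DPU running mode
</pre>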
===DDump===
It is possible to dump some information encapsulated inside the DPU ELF file, such as the DPU kernel name and general information, and the DPU architecture information. These are useful for analysis and debugging purposes. To retrieve this information, use the '''DDump''' tool as illustrated here:
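A typical invocation, assuming the compiled kernel file is named <code>dpu_custom_cnn.elf</code>, looks like the following (output omitted; see the DNNDK User Guide for the full list of options):
<pre>
ddump -f dpu_custom_cnn.elf -a   # dump all the information contained in the DPU ELF file
</pre>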
===DNNDK-based application===
====Coarse-grained profiling using the DNNDK low-level API====
The following are the results of the coarse-grained profiling achieved using the baseline's DPU kernel (i.e. custom_cnn_0), compiled with the mode option set to '''normal'''.
====Fine-grained profiling using the DNNDK low-level API====
The following are the results of the fine-grained profiling achieved using the baseline's DPU kernel (i.e. dbg_custom_cnn_0), compiled with the mode option set to '''profile'''.
====Profiling analysis with DSight====
DSight is the DNNDK performance profiling tool, which serves as a visual performance analysis tool for neural network models. By running the DNNDK application with profile as the DPU running mode configuration, a <code>.prof</code> log file is produced. This file can be parsed and processed with DSight, obtaining an HTML page that provides a chart showing the DPU cores' utilization and scheduling efficiency, as illustrated in the following picture:
[[File:Xilinx DSight.png|thumb|center|1000px|DSight visual performance analysis]]
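A typical flow, per the DNNDK User Guide, is the following (the application name and the PID in the log file name are illustrative):
<pre>
dexplorer -m profile            # set the DPU running mode to profile
./dnndk_app                     # run the application; dpu_trace_[PID].prof is generated
dsight -p dpu_trace_1234.prof   # parse the log and generate the HTML page
</pre>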
===Vitis AI Profiler===
Vitis-AI Profiler is an application-level tool that can help optimize the whole AI application. Its main purpose is to help detect bottlenecks of the whole AI application by profiling the pre-processing and post-processing functions together with the DPU kernels' running status. The tool consists of two components: <code>vaitrace</code>, which runs on the target device and is responsible for data collection, and <code>vaiprofiler</code>, which runs on a PC or local server and is responsible for the analysis and visualization of the collected data.
Note that it is preferable to save the settings for <code>vaitrace</code> in a configuration file, as follows:
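As a reference, this is a minimal sketch of such a configuration file, modeled after the Vitis AI Profiler documentation; the exact schema may differ between releases, and the command and output file names are illustrative:
<pre>
{
    "options": {
        "runmode": "normal",
        "cmd": "./vart_app",
        "output": "./trace_vart_app.xat",
        "timeout": 3
    },
    "trace": {
        "enable_trace_list": ["vart", "opencv", "custom"]
    }
}
</pre>
The trace is then collected by launching <code>vaitrace</code> with this configuration file on the target, and the resulting <code>.xat</code> file can be loaded into <code>vaiprofiler</code> for analysis and visualization.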
 
Profiling the CPU and the DPU using the VART APIs, with two single-thread tasks; the inference is first performed on 1 test image and then on 1 custom image:
[[File:Vaiprofiler.png|thumb|center|1000px|Vitis AI Profiler]]
 
Profiling the CPU and the DPU using the VART APIs, with two single-thread tasks; the inference is first performed on 180 test images and then on 24 custom images:
TBD ADD TRACE HERE WITH THROUGHPUT 
Profiling the CPU and the DPU using the VART APIs, with two tasks, each one with 4 threads; the inference is first performed on 180 test images and then on 24 custom images:
TBD ADD TRACE HERE WITH THROUGHPUT