===Train the model===
The model is trained for a total of 100 epochs, with early stopping to prevent overfitting on the training data and checkpointing of the weights on the best val_loss. After that, a new model is created by disabling all the layers that are only useful during training, such as dropout and batch normalization layers (in this particular case, no batch normalization layers are used).
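As a reference, a minimal sketch of this training setup using the Keras callbacks API might look like the following; the toy dataset, network, file name, and patience value are illustrative assumptions, not the actual project's code.
<pre>
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real dataset and network (illustrative only).
x_train = np.random.rand(512, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(512,))
x_val = np.random.rand(128, 32, 32, 3).astype("float32")
y_val = np.random.randint(0, 10, size=(128,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),          # only active during training
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Stop training when val_loss stops improving, to limit overfitting.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Checkpoint the weights corresponding to the best val_loss seen so far.
    tf.keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                       save_best_only=True, save_weights_only=True),
]

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=callbacks)
</pre>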
[[File:Train Accuracy.png|thumb|center|200px|Training accuracy]]
[[File:Train Loss.png|thumb|center|200px|Training loss]]
===Prune the model===
In this particular case, a good compromise between compression and accuracy drop is to prune only the two dense layers of the model, which have a high number of parameters. The pruning schedule starts at epoch 0 and ends at 1/3 of the total number of epochs (i.e. 100 epochs), going from an initial sparsity of 50% to a final sparsity of 80%, with a pruning frequency of 5 steps (i.e. the model is pruned every 5 steps during the training phase).
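The schedule described above maps directly onto the PolynomialDecay pruning schedule of the TensorFlow Model Optimization toolkit; assuming that toolkit is used, a minimal sketch might look like this (the steps-per-epoch value and the layer-selection helper are illustrative):
<pre>
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Illustrative values: with 100 training epochs, the schedule ends at 1/3 of the total steps.
steps_per_epoch = 100                    # assumed, depends on dataset and batch size
end_step = (100 // 3) * steps_per_epoch

pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,           # start at 50% sparsity ...
        final_sparsity=0.80,             # ... and end at 80% sparsity
        begin_step=0,
        end_step=end_step,
        frequency=5),                    # prune every 5 training steps
}

def prune_dense_only(layer):
    # Wrap only the dense layers, which hold a high number of parameters.
    if isinstance(layer, tf.keras.layers.Dense):
        return tfmot.sparsity.keras.prune_low_magnitude(layer, **pruning_params)
    return layer

# 'model' is the trained Keras model (e.g. the one from the previous sketch).
pruned_model = tf.keras.models.clone_model(model, clone_function=prune_dense_only)

# During the pruning fine-tuning phase, the UpdatePruningStep callback must be used.
pruning_callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
</pre>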
 
[[File:Pruned Accuracy.png|thumb|center|200px|Accuracy of the pruned model]]
[[File:Pruned Loss.png|thumb|center|200px|Loss of the pruned model]]
The weight sparsity of the model after applying pruning:
===Quantize the computational graph===
The inference process is expensive in terms of computation and requires high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications. Generally, 32-bit floating-point weights and activation values are used when training neural networks but, with the Vitis AI quantizer, the complexity of the computation can be reduced without losing prediction accuracy by converting the 32-bit floating-point values to the 8-bit integer format. The resulting fixed-point network model requires less memory bandwidth, providing faster speed and higher power efficiency than the floating-point model.
In the quantize calibration process, only a small set of images is required to analyze the distribution of activations. Since no backpropagation is performed, there is no need to provide any labels either. Depending on the size of the neural network, the running time of quantize calibration varies from a few seconds to several minutes.
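Assuming the TensorFlow 2 Keras flow of the Vitis AI quantizer (vitis_quantize, available in the Vitis AI environment), the post-training quantize calibration step might be sketched as follows; the calibration data and file names are illustrative:
<pre>
import numpy as np
import tensorflow as tf
from tensorflow_model_optimization.quantization.keras import vitis_quantize

# A small, unlabeled calibration set is enough: no backpropagation is performed.
calib_images = np.random.rand(100, 32, 32, 3).astype("float32")   # illustrative data

float_model = tf.keras.models.load_model("best_model.h5")          # assumed file name

quantizer = vitis_quantize.VitisQuantizer(float_model)
quantized_model = quantizer.quantize_model(calib_dataset=calib_images)

# The 8-bit fixed-point model is then saved and handed to the Vitis AI compiler.
quantized_model.save("quantized_model.h5")
</pre>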
* All the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
Two new C++ applications were developed for the trained, optimized, and compiled neural network model, as illustrated in the steps above. The first application uses the old DNNDK low-level APIs for loading the DPU kernel, creating the DPU task, and preparing the input/output tensors for the inference. Two profiling strategies are available, depending on the mode chosen when compiling the kernel (normal or profile): a coarse-grained profiling, which shows the execution time of all the main tasks executed on the CPU and on the DPU, and a fine-grained profiling, which shows detailed information about all the nodes of the model, such as the workload, the memory occupation, and the runtime. The second application is a multi-threaded application that uses the VART high-level APIs for retrieving the computational subgraph from the DPU kernel and for performing the inference; in this case, it is possible to split the entire workload over multiple concurrent threads, assigning a batch of images to each one. Both applications use the OpenCV library to crop and resize the input images, in order to match the model's input tensor shape, and to display the results of the inference (i.e. the probability of each class) for each image.
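The applications described above are written in C++; purely as an illustration of the same pattern (per-thread batches plus OpenCV pre-processing to the model's input shape), a Python sketch might look like the following, where run_dpu_inference is a hypothetical placeholder for the actual DNNDK/VART inference calls:
<pre>
import threading
import cv2
import numpy as np

MODEL_INPUT_SHAPE = (32, 32)     # assumed input tensor height/width

def preprocess(img):
    # Center-crop and resize the image so that it matches the model's input tensor shape.
    h, w = img.shape[:2]
    side = min(h, w)
    img = img[(h - side) // 2:(h + side) // 2, (w - side) // 2:(w + side) // 2]
    return cv2.resize(img, MODEL_INPUT_SHAPE).astype("float32") / 255.0

def run_dpu_inference(batch):
    # Hypothetical placeholder: in the real application this is where the
    # DNNDK/VART runner executes the DPU subgraph on the prepared tensors.
    return [np.zeros(10, dtype="float32") for _ in batch]

def worker(images, results, index):
    batch = [preprocess(img) for img in images]
    results[index] = run_dpu_inference(batch)

# Illustrative workload: random images standing in for the real test set.
images = [np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8) for _ in range(8)]

num_threads = 4
chunks = [images[i::num_threads] for i in range(num_threads)]   # one batch per thread
results = [None] * num_threads

threads = [threading.Thread(target=worker, args=(chunks[i], results, i))
           for i in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</pre>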
Before illustrating the results obtained by running the C++ applications, it is interesting to check some information about the DPU and the DPU kernel ELF file. This can be done with the DExplorer and DDump tools.
===Coarse-grained profiling using DNNDK low-level API===
 
The results of the coarse-grained profiling were obtained using the baseline's DPU kernel (i.e. custom_cnn_0), compiled with the mode option set to '''normal'''.
===Fine-grained profiling using DNNDK low-level API===
 
The results of the fine-grained profiling were obtained using the baseline's DPU kernel (i.e. dbg_custom_cnn_0), compiled with the mode option set to '''profile'''.
DSight is the DNNDK performance profiling tool, used for visual performance analysis of neural network models. By running the DNNDK application with profile as the DPU running mode, a .prof log file is produced. This file can be parsed and processed with DSight to obtain an HTML page with charts showing the DPU cores' utilization and scheduling efficiency.
[[File:Xilinx DSight.png|thumb|center|1000px|DSight visual performance analysis]]
===Profiling VART high-level APIs with Vitis AI Profiler===
Vitis AI Profiler is an application-level tool that can help optimize the whole AI application. Its main purpose is to detect bottlenecks of the whole AI application by profiling the pre-processing and post-processing functions together with the DPU kernels' running status. The tool consists of two components: vaitrace, which runs on the target device and is responsible for data collection, and vaiprofiler, which runs on a PC or local server and is responsible for the analysis and visualization of the collected data.
First, all the necessary information for vaitrace has to be saved into a configuration file, as follows:
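A purely illustrative example of such a configuration file, assuming the JSON format accepted by vaitrace, is shown below; the command line, output file, timeout, and trace list are hypothetical values, not the ones used in this project:
<pre>
{
    "options": {
        "runmode": "normal",
        "cmd": "/home/root/app/vart_classifier 4",
        "output": "./trace_vart_classifier.xat",
        "timeout": 10
    },
    "trace": {
        "enable_trace_list": ["vart", "opencv", "custom"]
    },
    "trace_custom": []
}
</pre>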