ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 3


Applies to: Machine Learning


History[edit | edit source]

Version   Date           Notes
1.0.0     October 2020   First public release

Introduction[edit | edit source]

This Technical Note (TN for short) belongs to the series introduced here. Specifically, it illustrates the execution of this inference application (fruit classifier) on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit.

The same application was tested on the NXP i.MX8M-based Mito8M SoM as well. For more details, please refer to this article.

Test bed[edit | edit source]

The following table details the test bed used for this Technical Note.

Host and target configurations

System   Component                              Name                               Version          Notes
Host     Operating system                       GNU/Linux                          Ubuntu 18.04
         Software development platform          Vitis                              1.2
         Machine learning framework             TensorFlow                         1.15.2
Target   Hardware platform                      ZCU104                             1.0
         Linux BSP                              Petalinux                          2020.1
         Software binary image (microSD card)   xilinx-zcu104-dpu-v2020.1-v1.2.0   v2020.1-v1.2.0
         Neural network hardware accelerator    DPU                                3.3              For more details, please refer to the following sections.

The target was configured in order to leverage the hardware acceleration provided by the Xilinx Deep Learning Processor Unit (DPU), which is an IP instantiated in the Programmable Logic (PL) as depicted in the following block diagram.


Top-level architecture of the system implemented in the SoC


In particular, this is the DPU-related subsystem:


DPU-related subsystem


Interestingly, to some extent the DPU IP can be customized in order to find the optimal trade-off between performance and resource utilization. For instance, the number of DPU cores can be selected. The default DPU configuration used for the initial testing is depicted in the following images. As shown previously, in this case two DPU cores are instantiated (DPU_0 and DPU_1).

DPU default configuration

Building the application[edit | edit source]

The starting point for the application is the model—in the form of a TensorFlow protobuf file (.pb)—described here. Incidentally, this is the same protobuf file used as the starting point for this other test as well. This makes the comparison of the two tests straightforward, even though they were run on SoCs that differ significantly from an architectural standpoint.

Training the model[edit | edit source]

Model training is performed with the help of the Docker container provided by Vitis AI.

The model is trained for a total of 100 epochs, with early stopping to prevent overfitting on the training data and checkpointing of the weights on the best val_loss. After that, a new model is created by removing the layers that are only useful during training, such as dropout and batch normalization layers (in this case, no batch normalization layers are used).
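
As a reference, the training setup can be sketched with the Keras API as follows (a minimal sketch: model, train_ds, val_ds, and the patience value are illustrative placeholders, not taken from the actual project):

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Hypothetical names: 'model', 'train_ds' and 'val_ds' stand for the fruit
# classifier and its training/validation data, defined elsewhere.
callbacks = [
    # Stop training when the validation loss stops improving.
    EarlyStopping(monitor='val_loss', patience=10, verbose=1),
    # Keep the weights corresponding to the best validation loss.
    ModelCheckpoint('best_weights.h5', monitor='val_loss',
                    save_best_only=True, save_weights_only=True, verbose=1),
]

history = model.fit(train_ds,
                    validation_data=val_ds,
                    epochs=100,
                    callbacks=callbacks)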


Plot of model's accuracy during training phase


Plot of model's loss during training phase

Pruning the model[edit | edit source]


This operation is performed at the TensorFlow level. As such, it does not make use of the Xilinx pruning tool, which is referred to in this document, for example.


Weight pruning means eliminating unnecessary values in the weight tensors, i.e. setting neural network parameters to zero in order to remove unnecessary connections between the layers of the network. This is done during the training process so that the neural network can adapt to the changes. An immediate benefit of this work is disk compression: sparse tensors are amenable to compression. Hence, by applying simple file compression to the pruned TensorFlow checkpoint, it is possible to reduce the size of the model for storage and/or transmission.
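
The sparsity figures and the compressed sizes reported in the remainder of this section can be obtained, for example, with simple helpers along these lines (a minimal sketch; the file path handling is illustrative):

import gzip
import os
import numpy as np
import tensorflow as tf

def print_sparsity(model):
    # Print the percentage of zero-valued entries for each weight tensor.
    for w in model.weights:
        values = tf.keras.backend.get_value(w)
        zeros = 100.0 * np.sum(values == 0) / values.size
        print('{:20s} -- Param: {:8d} -- Zeros: {:05.2f}%'
              .format(w.name, values.size, zeros))

def gzipped_size(path):
    # Return the size in bytes of the model file after gzip compression.
    with open(path, 'rb') as f_in, gzip.open(path + '.gz', 'wb') as f_out:
        f_out.write(f_in.read())
    return os.path.getsize(path + '.gz')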

The following list shows the weight sparsity of the model before applying pruning. Note that there is essentially no sparsity in the model's weights.

conv2d_1/kernel:0    -- Param:      864 -- Zeros: 00.00%
conv2d_1/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_2/kernel:0    -- Param:     9216 -- Zeros: 00.00%
conv2d_2/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_3/kernel:0    -- Param:    18432 -- Zeros: 00.00%
conv2d_3/bias:0      -- Param:       64 -- Zeros: 00.00%
conv2d_4/kernel:0    -- Param:    73728 -- Zeros: 00.00%
conv2d_4/bias:0      -- Param:      128 -- Zeros: 00.00%
dense_1/kernel:0     -- Param:  4718592 -- Zeros: 00.00%
dense_1/bias:0       -- Param:      256 -- Zeros: 00.39%
predictions/kernel:0 -- Param:     1536 -- Zeros: 00.00%
predictions/bias:0   -- Param:        6 -- Zeros: 00.00%


The size in bytes of the compressed model before applying pruning:

Size of gzipped loaded model: 17801431.00 bytes


The accuracy of the non-pruned model over the test dataset:

Test set
1/1 [==============================] - 0s 214ms/step - loss: 1.3166 - acc: 0.7083

The model is loaded and trained once again, resuming its previous state, after applying a pruning schedule. As training proceeds, the pruning routine will be scheduled to execute, eliminating (i.e. setting to zero) the weights with the lowest magnitude values (i.e. those closest to zero) until the current sparsity target is reached. Every time the pruning routine is scheduled to execute, the current sparsity target is recalculated, starting from 0% until it reaches the final target sparsity at the end of the pruning schedule. After the end step, the training continues, in order to regain the lost accuracy, knowing that the actual level of sparsity will not change.

In this particular case, a good compromise between compression and accuracy drop is to prune only the two dense layers of the model, which account for most of the parameters, with a pruning schedule that starts at epoch 0 and ends at 1/3 of the total number of epochs (i.e. 100 epochs), ramping from an initial sparsity of 50% to a final sparsity of 80%, with a pruning frequency of 5 steps (i.e. the model is pruned every 5 training steps).
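
A minimal sketch of how such a schedule can be expressed with the TensorFlow Model Optimization toolkit is shown below; the number of steps per epoch is an assumption that must be derived from the actual dataset and batch size, and 'model' stands for the trained baseline model:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

steps_per_epoch = 100                     # assumption: depends on dataset/batch size
end_step = (100 // 3) * steps_per_epoch   # pruning ends at 1/3 of the 100 epochs

pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.50,
    final_sparsity=0.80,
    begin_step=0,
    end_step=end_step,
    frequency=5)                          # prune every 5 training steps

def apply_pruning_to_dense(layer):
    # Prune only the dense layers, which hold most of the parameters.
    if isinstance(layer, tf.keras.layers.Dense):
        return tfmot.sparsity.keras.prune_low_magnitude(
            layer, pruning_schedule=pruning_schedule)
    return layer

pruned_model = tf.keras.models.clone_model(
    model, clone_function=apply_pruning_to_dense)
# Training the pruned model also requires the
# tfmot.sparsity.keras.UpdatePruningStep() callback.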


Plot of model's accuracy during pruning phase


Plot of model's loss during pruning phase


The weight sparsity of the model after applying pruning:

conv2d_1/kernel:0    -- Param:      864 -- Zeros: 00.00%
conv2d_1/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_2/kernel:0    -- Param:     9216 -- Zeros: 00.00%
conv2d_2/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_3/kernel:0    -- Param:    18432 -- Zeros: 00.00%
conv2d_3/bias:0      -- Param:       64 -- Zeros: 00.00%
conv2d_4/kernel:0    -- Param:    73728 -- Zeros: 00.00%
conv2d_4/bias:0      -- Param:      128 -- Zeros: 00.00%
dense_1/kernel:0     -- Param:  4718592 -- Zeros: 80.00%
dense_1/bias:0       -- Param:      256 -- Zeros: 00.00%
predictions/kernel:0 -- Param:     1536 -- Zeros: 80.01%
predictions/bias:0   -- Param:        6 -- Zeros: 00.00%


The size in bytes of the compressed model after applying pruning:

Size of gzipped loaded model: 5795289.00 bytes

The difference in disk usage between the two compressed versions of the same model (before and after pruning) is remarkable: almost a factor of 3.


The accuracy of the pruned model over the test dataset:

Test set
1/1 [==============================] - 0s 29ms/step - loss: 1.4578 - acc: 0.6667

Freezing the computational graph[edit | edit source]

Freezing the model means producing a single file containing both the graph definition and the checkpoint variables, with the variables saved as constants within the graph structure. This eliminates additional information stored in the checkpoint files, such as the gradients at each point, which are included so that the model can be reloaded and training resumed from a previously saved point. As this is not needed when serving a model purely for inference, it is discarded during freezing.
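
In TensorFlow 1.x, this step essentially boils down to converting the graph variables into constants, along the lines of the following sketch (the output node name is the one reported by the graph description in the next section):

import tensorflow as tf

# The Keras session holds the trained model; 'predictions/Softmax' is the
# final output node of the network.
sess = tf.compat.v1.keras.backend.get_session()
frozen_graph_def = tf.compat.v1.graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_node_names=['predictions/Softmax'])

with tf.io.gfile.GFile('frozen_graph.pb', 'wb') as f:
    f.write(frozen_graph_def.SerializeToString())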

INFO:tensorflow:Froze 12 variables.
I1002 09:08:49.716494 140705992206144 graph_util_impl.py:334] Froze 12 variables.
INFO:tensorflow:Converted 12 variables to const ops.
I1002 09:08:49.776397 140705992206144 graph_util_impl.py:394] Converted 12 variables to const ops.

Transforming the computational graph[edit | edit source]

After freezing, the computational graph is described as follows:

describe             : frozen_graph.pb
input feature nodes  : ['images_in']
unused nodes         : []
output nodes         : ['predictions/kernel', 'predictions/bias', 'predictions/MatMul/ReadVariableOp', 'predictions/MatMul', 'predictions/BiasAdd/ReadVariableOp', 'predictions/BiasAdd', 'predictions/Softmax']
quantization nodes   : []
constant count       : 16
variable count       : 0
identity count       : 13
total nodes          : 56


A much more detailed description of the computational graph, showing all the nodes and the corresponding operations, is provided below:

Op: Placeholder          -- Name: images_in                     
Op: Const                -- Name: conv2d_1/kernel               
Op: Const                -- Name: conv2d_1/bias                 
Op: Identity             -- Name: conv2d_1/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_1/Conv2D               
Op: Identity             -- Name: conv2d_1/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_1/BiasAdd              
Op: Relu                 -- Name: conv2d_1/Relu                 
Op: MaxPool              -- Name: maxpool_1/MaxPool             
Op: Const                -- Name: conv2d_2/kernel               
Op: Const                -- Name: conv2d_2/bias                 
Op: Identity             -- Name: conv2d_2/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_2/Conv2D               
Op: Identity             -- Name: conv2d_2/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_2/BiasAdd              
Op: Relu                 -- Name: conv2d_2/Relu                 
Op: MaxPool              -- Name: maxpool_2/MaxPool             
Op: Const                -- Name: conv2d_3/kernel               
Op: Const                -- Name: conv2d_3/bias                 
Op: Identity             -- Name: conv2d_3/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_3/Conv2D               
Op: Identity             -- Name: conv2d_3/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_3/BiasAdd              
Op: Relu                 -- Name: conv2d_3/Relu                 
Op: MaxPool              -- Name: maxpool_3/MaxPool             
Op: Const                -- Name: conv2d_4/kernel               
Op: Const                -- Name: conv2d_4/bias                 
Op: Identity             -- Name: conv2d_4/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_4/Conv2D               
Op: Identity             -- Name: conv2d_4/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_4/BiasAdd              
Op: Relu                 -- Name: conv2d_4/Relu                 
Op: MaxPool              -- Name: maxpool_4/MaxPool             
Op: Shape                -- Name: flatten/Shape                 
Op: Const                -- Name: flatten/strided_slice/stack   
Op: Const                -- Name: flatten/strided_slice/stack_1 
Op: Const                -- Name: flatten/strided_slice/stack_2 
Op: StridedSlice         -- Name: flatten/strided_slice         
Op: Const                -- Name: flatten/Reshape/shape/1       
Op: Pack                 -- Name: flatten/Reshape/shape         
Op: Reshape              -- Name: flatten/Reshape               
Op: Const                -- Name: dense_1/kernel                
Op: Const                -- Name: dense_1/bias                  
Op: Identity             -- Name: dense_1/MatMul/ReadVariableOp 
Op: MatMul               -- Name: dense_1/MatMul                
Op: Identity             -- Name: dense_1/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: dense_1/BiasAdd               
Op: Relu                 -- Name: dense_1/Relu                  
Op: Identity             -- Name: dropout_1/Identity            
Op: Const                -- Name: predictions/kernel            
Op: Const                -- Name: predictions/bias              
Op: Identity             -- Name: predictions/MatMul/ReadVariableOp
Op: MatMul               -- Name: predictions/MatMul            
Op: Identity             -- Name: predictions/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: predictions/BiasAdd           
Op: Softmax              -- Name: predictions/Softmax


The structure of the computational graph can be optimized using the Graph Transform tool, which is provided within the TensorFlow framework. The tool applies a series of transformations that reduce the complexity of the input graph, removing all the nodes and operations that are not useful for inference. The list of applied transformations is the following:

transformations_list = ['remove_nodes(op=Identity, op=CheckNumerics)', 
                        'merge_duplicate_nodes',
                        'strip_unused_nodes',
                        'fold_constants(ignore_errors=true)',
                        'fold_batch_norms']
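
These transformations can be applied programmatically through the TransformGraph API, roughly as follows (input and output node names are taken from the graph description above):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# Load the frozen graph produced in the previous step.
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile('frozen_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

transformed_graph_def = TransformGraph(graph_def,
                                       inputs=['images_in'],
                                       outputs=['predictions/Softmax'],
                                       transforms=transformations_list)

with tf.io.gfile.GFile('baseline_transf_graph.pb', 'wb') as f:
    f.write(transformed_graph_def.SerializeToString())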


After performing the optimization, the description of the computational graph is the following:

describe             : baseline_transf_graph.pb
input feature nodes  : ['images_in']
unused nodes         : []
output nodes         : ['predictions/MatMul', 'predictions/kernel', 'predictions/bias', 'predictions/Softmax', 'predictions/BiasAdd']
quantization nodes   : []
constant count       : 15
variable count       : 0
identity count       : 0
total nodes          : 42


A much more detailed description of the optimized computational graph, showing all the nodes and the corresponding operations, is provided as follows:

Op: Conv2D               -- Name: conv2d_1/Conv2D               
Op: BiasAdd              -- Name: conv2d_2/BiasAdd              
Op: Relu                 -- Name: conv2d_4/Relu                 
Op: Conv2D               -- Name: conv2d_3/Conv2D               
Op: Const                -- Name: conv2d_2/kernel               
Op: MaxPool              -- Name: maxpool_4/MaxPool             
Op: Const                -- Name: conv2d_1/kernel               
Op: Const                -- Name: conv2d_3/kernel               
Op: Placeholder          -- Name: images_in                     
Op: Pack                 -- Name: flatten/Reshape/shape         
Op: Const                -- Name: conv2d_3/bias                 
Op: Const                -- Name: conv2d_4/kernel               
Op: Reshape              -- Name: flatten/Reshape               
Op: Shape                -- Name: flatten/Shape                 
Op: Conv2D               -- Name: conv2d_4/Conv2D               
Op: Const                -- Name: conv2d_2/bias                 
Op: MaxPool              -- Name: maxpool_2/MaxPool             
Op: Relu                 -- Name: conv2d_1/Relu                 
Op: MatMul               -- Name: predictions/MatMul            
Op: BiasAdd              -- Name: dense_1/BiasAdd               
Op: MaxPool              -- Name: maxpool_1/MaxPool             
Op: Const                -- Name: flatten/strided_slice/stack   
Op: Const                -- Name: dense_1/kernel                
Op: BiasAdd              -- Name: conv2d_1/BiasAdd              
Op: Const                -- Name: flatten/Reshape/shape/1       
Op: Const                -- Name: predictions/kernel            
Op: BiasAdd              -- Name: conv2d_4/BiasAdd              
Op: Const                -- Name: conv2d_1/bias                 
Op: Relu                 -- Name: conv2d_2/Relu                 
Op: Const                -- Name: flatten/strided_slice/stack_1 
Op: Const                -- Name: dense_1/bias                  
Op: Const                -- Name: predictions/bias              
Op: Conv2D               -- Name: conv2d_2/Conv2D               
Op: MaxPool              -- Name: maxpool_3/MaxPool             
Op: Const                -- Name: conv2d_4/bias                 
Op: Relu                 -- Name: dense_1/Relu                  
Op: Relu                 -- Name: conv2d_3/Relu                 
Op: Softmax              -- Name: predictions/Softmax           
Op: BiasAdd              -- Name: conv2d_3/BiasAdd              
Op: MatMul               -- Name: dense_1/MatMul                
Op: StridedSlice         -- Name: flatten/strided_slice         
Op: BiasAdd              -- Name: predictions/BiasAdd


The accuracy of the baseline model over the test dataset after applying all transformations:

Graph accuracy with test dataset: 0.7083


The accuracy of the pruned model over the test dataset after applying all transformations:

Graph accuracy with test dataset: 0.6667
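
These accuracy figures are computed by loading the transformed .pb file and running it with the standard TensorFlow runtime, along the lines of this sketch (x_test and y_test are illustrative placeholders for the test images and their one-hot labels):

import numpy as np
import tensorflow as tf

def graph_accuracy(pb_path, x_test, y_test):
    # Load the transformed frozen graph and evaluate it on the test set.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, 'rb') as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name='')
        images = graph.get_tensor_by_name('images_in:0')
        probs = graph.get_tensor_by_name('predictions/Softmax:0')
        with tf.compat.v1.Session(graph=graph) as sess:
            predictions = sess.run(probs, feed_dict={images: x_test})
    return np.mean(np.argmax(predictions, axis=1) == np.argmax(y_test, axis=1))

# Example usage (x_test / y_test are placeholders for the test set):
# print('Graph accuracy with test dataset: {:.4f}'
#       .format(graph_accuracy('baseline_transf_graph.pb', x_test, y_test)))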

Quantizing the computational graph[edit | edit source]

The inference process is expensive in terms of computation and requires high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications. Generally, neural networks are trained with 32-bit floating-point weights and activation values but, with the Vitis AI quantizer, the complexity of the computation can be reduced without losing prediction accuracy. This is achieved by converting the 32-bit floating-point values to an 8-bit integer format. The resulting fixed-point network model requires less memory bandwidth, providing higher speed and better power efficiency than the floating-point model.
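
Conceptually, the conversion maps each floating-point value onto an 8-bit integer through a scale factor (typically a power of two in fixed-point representations such as the DPU's); the following toy example, which is not tied to the actual Vitis AI implementation, illustrates the idea:

import numpy as np

def quantize_int8(x, fraction_bits):
    # Fixed-point quantization with a power-of-two scale: q = round(x * 2^f).
    scale = 2.0 ** fraction_bits
    return np.clip(np.round(x * scale), -128, 127).astype(np.int8)

def dequantize(q, fraction_bits):
    return q.astype(np.float32) / (2.0 ** fraction_bits)

w = np.array([0.731, -0.052, 0.004], dtype=np.float32)
q = quantize_int8(w, fraction_bits=7)       # 7 fractional bits cover [-1, 1)
print(q, dequantize(q, fraction_bits=7))    # quantized codes and reconstruction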

In the quantize calibration process, only a small set of images is required to analyze the distribution of activations. Since no backpropagation is performed, there is no need to provide labels either. Depending on the size of the neural network, the running time of quantize calibration varies from a few seconds to several minutes.
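
The calibration images are fed to vai_q_tensorflow through a user-provided input function that, at each calibration iteration, returns a dictionary mapping the input node to a batch of images. A minimal sketch follows; the image location and the normalization are assumptions:

import glob
import cv2
import numpy as np

calib_batch_size = 10
# Assumed location of the 1000 augmented calibration images.
calib_images = sorted(glob.glob('./calib_images/*.jpg'))

def calib_input(iter):
    # Called by vai_q_tensorflow once per calibration iteration; returns a
    # batch of 10 images for the 'images_in' placeholder.
    batch = []
    for path in calib_images[iter * calib_batch_size:(iter + 1) * calib_batch_size]:
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0  # assumed normalization
        batch.append(img)
    return {'images_in': np.array(batch)}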

After calibration, the quantized model is transformed into a DPU-deployable model (named deploy_model.pb by vai_q_tensorflow), which follows the DPU data format. This model can be compiled by the Vitis AI compiler and deployed to the DPU. It cannot, however, be used by the standard TensorFlow framework to evaluate the accuracy loss; for that purpose, a second file is produced (named quantize_eval_model.pb by vai_q_tensorflow).

For the current application, 100 images are sampled from the train dataset and augmented, resulting in a total of 1000 images used for calibration. The graph is calibrated by providing a batch of 10 images for 100 iterations. The following log of vai_q_tensorflow shows the result of the whole quantization process:

Vai_q_tensorflow v1.2.0 build for Tensorflow 1.15.2
2020-10-08 13:26:59.752125: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
100% (100 of 100) |######################| Elapsed Time: 0:00:33 Time:  0:00:33
INFO: Checking Float Graph...
INFO: Float Graph Check Done.
INFO: Calibrating for 100 iterations...
INFO: Calibration Done.
INFO: Generating Deploy Model...
INFO: Deploy Model Generated.
********************* Quantization Summary *********************      
INFO: Output:       
  quantize_eval_model: ./build/quantize/baseline/quantize_eval_model.pb       
  deploy_model: ./build/quantize/baseline/deploy_model.pb


The accuracy of the baseline model over the test dataset after applying quantization:

graph accuracy with test dataset: 0.7083


The accuracy of the pruned model over the test dataset after applying quantization:

graph accuracy with test dataset: 0.6667

Compiling the model[edit | edit source]

The Vitis AI compiler operates in a multi-stage process:

  1. The compiler parses the topology of the optimized and quantized input model and produces a new computation graph consisting of a data flow and a control flow.
  2. It then optimizes the data and control flow through processes such as fusing the batch normalization layers into the preceding convolution layers, scheduling instructions efficiently by exploiting inherent parallelism, and exploiting data reuse.
  3. Finally, it generates the code to be run. Note that, due to the limited set of operations supported by the DPU, the Vitis AI compiler automatically partitions the input network model into several kernels when unsupported operations are present.

In this particular case, two kernels are produced, because the softmax activation layer is not currently supported by the DPU.

The following log of vai_c_tensorflow shows the result of the compilation for the baseline model:

Kernel topology "custom_cnn_kernel_graph.jpg" for network "custom_cnn"
kernel list info for network "custom_cnn"
                               Kernel ID : Name
                                       0 : custom_cnn_0
                                       1 : custom_cnn_1

                             Kernel Name : custom_cnn_0
--------------------------------------------------------------------------------
                             Kernel Type : DPUKernel
                               Code Size : 0.02MB
                              Param Size : 4.60MB
                           Workload MACs : 498.21MOPS
                         IO Memory Space : 0.52MB
                              Mean Value : 0, 0, 0, 
                      Total Tensor Count : 7
                Boundary Input Tensor(s)   (H*W*C)
                          images_in:0(0) : 224*224*3

               Boundary Output Tensor(s)   (H*W*C)
                 predictions_MatMul:0(0) : 1*1*6

                        Total Node Count : 6
                           Input Node(s)   (H*W*C)
                      conv2d_1_Conv2D(0) : 224*224*3

                          Output Node(s)   (H*W*C)
                   predictions_MatMul(0) : 1*1*6




                             Kernel Name : custom_cnn_1
--------------------------------------------------------------------------------
                             Kernel Type : CPUKernel
                Boundary Input Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

               Boundary Output Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

                           Input Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6

                          Output Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6


The following log of vai_c_tensorflow shows the result of the compilation for the pruned model:

Kernel topology "pruned_custom_cnn_kernel_graph.jpg" for network "pruned_custom_cnn"
kernel list info for network "pruned_custom_cnn"
                               Kernel ID : Name
                                       0 : pruned_custom_cnn_0
                                       1 : pruned_custom_cnn_1

                             Kernel Name : pruned_custom_cnn_0
--------------------------------------------------------------------------------
                             Kernel Type : DPUKernel
                               Code Size : 0.02MB
                              Param Size : 4.60MB
                           Workload MACs : 498.21MOPS
                         IO Memory Space : 0.52MB
                              Mean Value : 0, 0, 0, 
                      Total Tensor Count : 7
                Boundary Input Tensor(s)   (H*W*C)
                          images_in:0(0) : 224*224*3

               Boundary Output Tensor(s)   (H*W*C)
                 predictions_MatMul:0(0) : 1*1*6

                        Total Node Count : 6
                           Input Node(s)   (H*W*C)
                      conv2d_1_Conv2D(0) : 224*224*3

                          Output Node(s)   (H*W*C)
                   predictions_MatMul(0) : 1*1*6




                             Kernel Name : pruned_custom_cnn_1
--------------------------------------------------------------------------------
                             Kernel Type : CPUKernel
                Boundary Input Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

               Boundary Output Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

                           Input Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6

                          Output Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6

Running the application[edit | edit source]

In order to have reproducible and reliable results, some measures were taken:

  • The inference was repeated several times and the average execution time was computed
  • All the files required to run the test—the executable, the image files, etc.—were stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.

Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above:

  • The first application uses the old DNNDK low-level APIs for loading the DPU kernel, creating the DPU task, and preparing the input-output tensors for the inference. Besides the use of the DSight visual tool, two possible profiling strategies are available depending on the chosen DPU mode when compiling the kernel (normal or profile):
    • A coarse-grained profiling, which shows the execution time for all the main tasks executed on the CPU and on the DPU
    • A fine-grained profiling, which shows detailed information about all the nodes of the model, such as the workload, the memory occupation, and the runtime.
  • The second application is instead a multi-threaded application, which uses the VART high-level APIs for retrieving the computational subgraph from the DPU kernel and for performing the inference. In this case, it is possible to split the entire workload across multiple concurrent threads, assigning a batch of images to each one.

Both applications make use of the OpenCV library to crop and resize the input images so that they match the model's input tensor shape, and to display the inference results (i.e. the probability of each class) for each image.
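
The preprocessing and the workload splitting are conceptually equivalent to the following Python sketch (the actual applications are written in C++; run_inference_on_batch is a placeholder for the DNNDK/VART inference calls):

import cv2
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def preprocess(path, size=224):
    # Center-crop the image to a square and resize it to the model input shape.
    img = cv2.imread(path)
    h, w = img.shape[:2]
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2
    img = img[y0:y0 + side, x0:x0 + side]
    return cv2.resize(img, (size, size)).astype(np.float32)

def run_inference_on_batch(batch):
    # Placeholder for the DNNDK/VART inference performed by each thread.
    raise NotImplementedError

def classify(image_paths, num_threads):
    # Split the images into one batch per thread and run the batches in parallel.
    batches = [list(b) for b in np.array_split(image_paths, num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(run_inference_on_batch,
                             [[preprocess(p) for p in b] for b in batches]))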

Before illustrating the results obtained by running the C++ applications, it is interesting to check some information about the DPU and the DPU kernel ELF file. This can be done with the DExplorer and DDump tools.

DExplorer[edit | edit source]

DExplorer provides DPU running mode configuration, DNNDK version checking, DPU status checking, and DPU core signature checking, as illustrated here:

root@xilinx-zcu104-2020_1:~# dexplorer -v -w   
Vitis AI for Edge DPU version 1.2
Copyright 2019 Xilinx Inc.

DExplorer version 3.0
Build Label: Jun 19 2020 05:21:20

DSight version 2.1
Build Label: Jun 19 2020 05:21:20

DDump version 2.0
Build Label: Jun 19 2020 05:21:20

N2Cube Core library version 4.2
Build Label: Jun 19 2020 05:21:16
[DPU IP Spec]
IP  Timestamp            : 2020-06-18 12:00:00
DPU Core Count           : 2

[DPU Core Configuration List]
DPU Core                 : #0
DPU Enabled              : Yes
DPU Arch                 : B4096
DPU Target Version       : v1.4.1
DPU Freqency             : 300 MHz
Ram Usage                : High
DepthwiseConv            : Enabled
DepthwiseConv+Relu6      : Enabled
Conv+Leakyrelu           : Enabled
Conv+Relu6               : Enabled
Channel Augmentation     : Enabled
Average Pool             : Enabled

DPU Core                 : #1
DPU Enabled              : Yes
DPU Arch                 : B4096
DPU Target Version       : v1.4.1
DPU Freqency             : 300 MHz
Ram Usage                : High
DepthwiseConv            : Enabled
DepthwiseConv+Relu6      : Enabled
Conv+Leakyrelu           : Enabled
Conv+Relu6               : Enabled
Channel Augmentation     : Enabled
Average Pool             : Enabled

DDump[edit | edit source]

DDump can dump information encapsulated in the DPU ELF file, such as the DPU kernel name, general information, and the DPU architecture information. This is useful for analysis and debugging purposes. To retrieve this information, use the DDump tool as illustrated here:

root@xilinx-zcu104-2020_1:~/VART_2# ddump -f bin/dpu_custom_cnn_0.elf -a
DPU Kernel List from file bin/dpu_custom_cnn_0.elf
                      ID:  Name
                       0:  custom_cnn_0

DPU Kernel name: custom_cnn_0 
----------------------------------------------------------------
 -> DPU Kernel general info
                    Mode:  NORMAL
               Code Size:  0.02MB
              Param Size:  4.60MB
           Workload MACs:  498.209MOP
         IO Memory Space:  0.52MB
              Mean Value:  0, 0, 0
              Node Count:  6
            Tensor Count:  7
         Tensor In(H*W*C)
             Tensor ID-0:  224*224*3
        Tensor Out(H*W*C)
             Tensor ID-6:  1*1*6

 -> DPU architecture info
             DPU ABI Ver:  v2.1
DPU Configuration Parameters
          DPU Target Ver:  1.4.1
           DPU Arch Type:  B4096
               RAM Usage:  high
           DepthwiseConv:  Enabled
     DepthwiseConv+Relu6:  Enabled
          Conv+Leakyrelu:  Enabled
              Conv+Relu6:  Enabled
    Channel Augmentation:  Enabled
            Average Pool:  Enabled

 -> DNNC compiler info
   DNNC Ver: VAI_C Compiler for Edge, Version v5.01
DPU Target : v1.4.1
Build Label: Jun 23 2020 03:34:14
Copyright @2020 Xilinx Inc. All Rights Reserved.

DNNDK-based application[edit | edit source]

Coarse grained profiling using DNNDK low level API[edit | edit source]

The results of the coarse-grained profiling obtained using the baseline model's DPU kernel (i.e. custom_cnn_0), compiled with the mode option set to normal, are reported in the following box.

---------------------------------------------------------------
                         red_apple_1.jpg

[Time] LoadImage                                   0.0161657  ms
[Time] PreprocessImage                             0.00521947 ms
[Time] SetInputImageInHWCFP32                      0.00115946 ms
[DPU Time] dpuSetInputTensorInHWCFP32              1.67499    ms
[DPU Time] dpuRunTask                              1.99908    ms
[DPU Time] dpuGetOutputTensorInHWCFP32             0.00864    ms
[DPU Time] dpuRunSoftmax                           0.00476    ms
[DPU tot time]                                     3.68747    ms
[DPU throughput]                                   271.189   FPS

[Time] CpuArgmax<float>                            1.3e-07    ms
[Time] CpuSoftmax                                  1.01e-06   ms
[Time] RunCustomCNN                                0.0109225  ms
[Time] TopK                                        8.04e-06   ms
1) red_apple          : 0.999665
2) orange             : 0.00033535
3) hand               : 8.76131e-08
4) avocado            : 9.23435e-09
5) banana             : 1.38833e-11
6) green_apple        : 3.44132e-14
_______________________________________________________________

Within the scope of this TN, the most relevant figure is [DPU tot time], which indicates the time spent executing the inference (~3.7 ms).

Fine grained profiling using DNNDK low level API[edit | edit source]

The following box reports the results of the fine-grained profiling obtained using the baseline model's DPU kernel (i.e. dbg_custom_cnn_0), compiled with the mode option set to profile.

---------------------------------------------------------------
                         red_apple_1.jpg

[Time] LoadImage                                   0.0163798  ms
[Time] PreprocessImage                             0.00518644 ms
[Time] SetInputImageInHWCFP32                      0.00132577 ms
[DNNDK] Performance profile - DPU Kernel "dbg_custom_cnn_0" DPU Task "dbg_custom_cnn_0-1"
=====================================================================================================
  ID                       NodeName Workload(MOP) Mem(MB) RunTime(ms) Perf(GOPS) Utilization    MB/S
   1                conv2d_1_Conv2D        85.163    0.53       0.363      234.6        19.1%  1453.7
   2                conv2d_2_Conv2D       218.991    0.48       0.205     1068.2        86.9%  2330.1
   3                conv2d_3_Conv2D        99.680    0.15       0.108      923.0        75.1%  1396.5
   4                conv2d_4_Conv2D        84.935    0.13       0.089      954.3        77.7%  1487.2
   5                 dense_1_MatMul         9.437    4.52       1.106        8.5         0.7%  4089.5
   6             predictions_MatMul         0.003    0.00       0.013        0.2         0.0%   145.8

                Total Nodes In Avg:
                                All       498.209    6.10       1.884      264.4        21.5%  3235.6
=====================================================================================================
[Time] CpuArgmax<float>                            1.3e-07    ms
[Time] CpuSoftmax                                  9e-07      ms
[Time] RunCustomCNN                                0.0110171  ms
[Time] TopK                                        8.34e-06   ms
1) red_apple          : 0.999665
2) orange             : 0.00033535
3) hand               : 8.76131e-08
4) avocado            : 9.23435e-09
5) banana             : 1.38833e-11
6) green_apple        : 3.44132e-14
_______________________________________________________________

Profiling analysis with DSight[edit | edit source]

DSight is the DNNDK visual performance analysis tool for neural network models. By running the DNNDK application with profile as the DPU running mode configuration, a .prof log file is produced. This file can be parsed and processed with DSight, which generates an HTML page with charts showing the DPU cores' utilization and scheduling efficiency over time, as illustrated in the following picture:

DSight visual performance analysis

VART-based application[edit | edit source]

As stated previously, this version of the application is functionally equivalent to the DNNDK-based one, but it makes use of the newer Vitis AI Runtime (VART) API.

The following dump shows the output of the application when processing the image file red_apple_1.jpg.

image name            : red_apple_1.jpg
ground truth label    : red_apple
predicted label       : red_apple
1) red_apple          : 0.999665
2) orange             : 0.00033535
3) hand               : 8.76131e-08
4) avocado            : 9.23435e-09
5) banana             : 1.38833e-11
6) green_apple        : 3.44132e-14
________________________________________________

execution time        : 0.0583705 s
tot correct           : 1
tot wrong             : 0

Profiling with Vitis AI Profiler[edit | edit source]

Vitis AI Profiler is a powerful, application-level tool that helps optimize the whole AI application. Its main purpose is to detect bottlenecks in the application by profiling the pre-processing and post-processing functions together with the DPU kernels' running status.

The tool consists of two components: vaitrace, which runs on the target device and is responsible for data collection, and vaiprofiler, which runs on a PC or local server and is responsible for the analysis and visualization of the collected data.

Note that it is preferable to store the vaitrace settings in a configuration file, as follows:

{
	"options": {
		"runmode": "normal",
		"cmd": "./bin/customCNNclassification -w split -i ./images -c ./custom_images -r 10 -t 1 -M ./bin/dpu_custom_cnn_0.elf",
		"output": "./trace_customCNN_vart.xat",
		"timeout": 10
	},
	"trace": {
		"enable_trace_list": ["vitis-ai-library", "vart", "opencv", "custom"]
	},
	"trace_custom": ["ListDirectory", "GetImageFileNames", "TopK", "CpuArgmax", "CpuSoftmax", "GetPredictedLabels", "GetGroundTruthLabels", "SliceVector"]
}

The application is profiled several times, each time with a different number of threads. For each profiling trace, the DPU throughput is reported along with additional information concerning the DPU latency and the utilization of both CPU and DPU cores. The inference is repeated 10 times on the same image.

One thread[edit | edit source]

In the figure below, the VART-based application uses 1 thread. The trace shows that the throughput is stable at around 245 fps. This is similar to, but slightly lower than, the throughput achieved by the DNNDK-based application, probably because the VART APIs introduce a slightly larger overhead.

Profiling VART based application, 1 thread only

Trace information
Item                           Value
DPU_1 latency (custom_cnn_0)   1514.05 us
CPU-00 utilization             15.90 %
CPU-01 utilization             23.74 %
CPU-02 utilization             1.12 %
CPU-03 utilization             1.15 %
DPU-01 utilization             18.72 %

As expected, only one of the two DPU cores is actually leveraged.

Two threads[edit | edit source]

In the figure below, the VART-based application uses 2 threads. The trace shows that the throughput is stable at around 442 fps.

Profiling VART based application, 2 threads

Trace information
Item                           Value
DPU_0 latency (custom_cnn_0)   2085.12 us
DPU_1 latency (custom_cnn_0)   1648.66 us
CPU-00 utilization             2.84 %
CPU-01 utilization             10.56 %
CPU-02 utilization             30.00 %
CPU-03 utilization             19.14 %
DPU-00 utilization             19.02 %
DPU-01 utilization             13.24 %

As expected, the profiling information indicates that both DPUs are used. To a first approximation, the throughput doubles with respect to the single-threaded application, consistent with the fact that the DPU cores work in parallel and the CPU cores are not saturated.

Four threads[edit | edit source]

In the figure below, the VART-based application uses 4 threads. The trace shows that the throughput is stable at around 818 fps.

Profiling VART based application, 4 threads

Trace information
Item                           Value
DPU_0 latency (custom_cnn_0)   2111.89 us
DPU_1 latency (custom_cnn_0)   1679.56 us
CPU-00 utilization             20.05 %
CPU-01 utilization             18.56 %
CPU-02 utilization             19.26 %
CPU-03 utilization             22.21 %
DPU-00 utilization             23.95 %
DPU-01 utilization             16.96 %

Interestingly, having four threads (i.e. the same number as the CPU cores) allows the throughput to be increased further by a factor of almost 2, while keeping the DPU cores' occupation low.

Six threads[edit | edit source]

In the figure below, the VART-based application uses 6 threads. The trace shows that the throughput is stable at around 830 fps.

Profiling VART based application, 6 threads

Trace information
Item                           Value
DPU_0 latency (custom_cnn_0)   2305.08 us
DPU_1 latency (custom_cnn_0)   1856.95 us
CPU-00 utilization             20.36 %
CPU-01 utilization             19.88 %
CPU-02 utilization             22.71 %
CPU-03 utilization             19.21 %
DPU-00 utilization             22.87 %
DPU-01 utilization             20.84 %

Results[edit | edit source]

The following table summarizes the throughput achieved in all the test configurations (values taken from the profiling results reported in the previous sections).

API      Number of threads   Throughput [fps]
DNNDK    1                   ~271
VART     1                   ~245
VART     2                   ~442
VART     4                   ~818
VART     6                   ~830

Note that the latency of DPU_0 is consistently higher than the latency of DPU_1.