ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 3

From DAVE Developer's Wiki

Revision as of 17:32, 9 October 2020

Info Box
Applies to Machine Learning
Work in progress


History

{| class="wikitable"
!Version
!Date
!Notes
|-
|1.0.0
|September 2020
|First public release
|}

Introduction

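This Technical Note (TN for short) belongs to the series introduced in ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 1. Specifically, it illustrates the execution of the fruit classifier inference application (reference application #1 of Part 1) on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit (https://www.xilinx.com/products/boards-and-kits/zcu104.html). The results achieved are also compared with those produced by the other platforms discussed in the articles of this series.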

Test bed

{| class="wikitable"
!
!Component
!Name
!Version
|-
| rowspan="2" |Host
|
|
|
|-
|
|
|
|-
| rowspan="2" |Target
|Hardware platform
|ZCU104
|
|-
|Linux BSP
|Petalinux
|2020.1
|}


Building the application

Train the model

Prune the model

Weight pruning means eliminating unnecessary values in the weight tensors: in practice, selected parameters of the neural network are set to zero to remove unnecessary connections between its layers. This is done during the training process so that the network can adapt to the changes. An immediate benefit is better compressibility, because sparse tensors compress well. Hence, by applying simple file compression to the pruned TensorFlow checkpoint, it is possible to reduce the size of the model for storage and/or transmission.

The weight sparsity of the model before pruning is listed below. There is essentially no sparsity in the weights: the only zero is a single entry among the 256 parameters of dense_1/bias (0.39%).

conv2d_1/kernel:0    -- Param:      864 -- Zeros: 00.00%
conv2d_1/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_2/kernel:0    -- Param:     9216 -- Zeros: 00.00%
conv2d_2/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_3/kernel:0    -- Param:    18432 -- Zeros: 00.00%
conv2d_3/bias:0      -- Param:       64 -- Zeros: 00.00%
conv2d_4/kernel:0    -- Param:    73728 -- Zeros: 00.00%
conv2d_4/bias:0      -- Param:      128 -- Zeros: 00.00%
dense_1/kernel:0     -- Param:  4718592 -- Zeros: 00.00%
dense_1/bias:0       -- Param:      256 -- Zeros: 00.39%
predictions/kernel:0 -- Param:     1536 -- Zeros: 00.00%
predictions/bias:0   -- Param:        6 -- Zeros: 00.00%
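A report like the one above could be produced by counting the zero entries of each weight tensor. The following is a minimal sketch using only the Python standard library; the `weights` dictionary is a hypothetical stand-in for the named tensors of the real model, and `flatten` and `sparsity_report` are illustrative helper names, not part of TensorFlow.

```python
def flatten(values):
    """Yield the scalar entries of an arbitrarily nested list."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield v


def sparsity_report(weights):
    """Return one formatted line per tensor: parameter count and zero percentage."""
    lines = []
    for name, tensor in weights.items():
        flat = list(flatten(tensor))
        zeros = sum(1 for v in flat if v == 0)
        pct = 100.0 * zeros / len(flat)
        lines.append(f"{name:<20} -- Param: {len(flat):>8} -- Zeros: {pct:05.2f}%")
    return lines


# Hypothetical tensors: one zero among 256 values, and six non-zero values.
weights = {
    "dense_1/bias:0": [0.0] + [0.1] * 255,
    "predictions/bias:0": [0.3, -0.2, 0.1, 0.4, -0.5, 0.6],
}
report = sparsity_report(weights)
for line in report:
    print(line)
```

With a real model, the dictionary would be filled from the model's weight tensors instead of hand-written lists.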

The size in bytes of the gzip-compressed model before pruning:

Size of gzipped loaded model: 17801431.00 bytes
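A figure like the one above could be obtained by gzip-compressing the saved model file and reading the size of the result. The following is a minimal sketch using only the Python standard library; `gzipped_size` and `write_fake_weights` are hypothetical helper names, and the byte files are synthetic stand-ins for real checkpoints, so the printed sizes say nothing about the actual model.

```python
import gzip
import os
import random
import tempfile


def gzipped_size(path):
    """Gzip the file at `path` into a temporary file and return its size in bytes."""
    fd, zipped = tempfile.mkstemp(suffix=".gz")
    os.close(fd)
    try:
        with open(path, "rb") as src, gzip.open(zipped, "wb") as dst:
            dst.write(src.read())
        return os.path.getsize(zipped)
    finally:
        os.remove(zipped)


def write_fake_weights(path, n_bytes, zero_fraction, seed=0):
    """Write pseudo-random non-zero bytes, forcing a given fraction of them to zero."""
    rng = random.Random(seed)
    data = bytes(
        0 if rng.random() < zero_fraction else rng.randrange(1, 256)
        for _ in range(n_bytes)
    )
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    dense = os.path.join(tmp, "dense.bin")
    sparse = os.path.join(tmp, "sparse.bin")
    write_fake_weights(dense, 20000, zero_fraction=0.0)
    write_fake_weights(sparse, 20000, zero_fraction=0.8)
    dense_size = gzipped_size(dense)
    sparse_size = gzipped_size(sparse)
    print(f"Size of gzipped dense file:  {dense_size} bytes")
    print(f"Size of gzipped sparse file: {sparse_size} bytes")
```

The file with 80% zeros compresses to a smaller size than the dense one, which is the effect that pruning relies on.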

The accuracy of the non-pruned model over the test dataset:

Test set
1/1 [==============================] - 0s 214ms/step - loss: 1.3166 - acc: 0.7083

The model is then loaded and trained again, resuming from its previous state, with a pruning schedule applied. As training proceeds, the pruning routine runs at the scheduled steps, eliminating (i.e. setting to zero) the weights with the lowest magnitude (i.e. those closest to zero) until the current sparsity target is reached. Every time the pruning routine runs, the current sparsity target is recalculated, ramping from the initial sparsity up to the final target sparsity at the end of the pruning schedule. After the end step, training continues in order to regain the lost accuracy, while the level of sparsity no longer changes.
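The core of a single pruning pass, as described above, can be sketched as follows. This is an illustrative stand-alone version on a flat list of weights, not the routine used to produce the results in this note; `prune_by_magnitude` is a hypothetical name.

```python
def prune_by_magnitude(weights, target_sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero."""
    n_prune = int(round(target_sparsity * len(weights)))
    # Indices ordered from the smallest to the largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2, -0.02, 0.6, 0.1, -0.4]
pruned = prune_by_magnitude(weights, 0.5)
print(pruned)
# [0.9, 0.0, 0.3, 0.0, -0.7, 0.0, 0.0, 0.6, 0.0, -0.4]
```

In a real training loop the zeroed positions are also kept at zero in later steps, while the surviving weights continue to be trained.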

In this particular case, a good compromise between compression and accuracy drop is to prune only the two dense layers of the model, which hold the vast majority of the parameters. The pruning schedule starts at epoch 0 and ends at 1/3 of the total number of epochs (i.e. 100 epochs), with an initial sparsity of 50%, a final sparsity of 80%, and a pruning frequency of 5 steps (i.e. the model is pruned every 5 steps during the training phase).
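The way the sparsity target ramps from 50% to 80% can be sketched as below. The note does not state which schedule was used; this sketch assumes a polynomial-decay ramp of the kind offered by the TensorFlow Model Optimization toolkit, and the exponent `power=3` and the step count `end_step=1000` are assumptions chosen only for illustration.

```python
def target_sparsity(step, begin_step, end_step, initial=0.5, final=0.8, power=3):
    """Polynomial ramp from `initial` to `final` sparsity between begin and end step."""
    if step <= begin_step:
        return initial
    if step >= end_step:
        return final
    progress = (step - begin_step) / (end_step - begin_step)
    return final + (initial - final) * (1.0 - progress) ** power


def is_pruning_step(step, begin_step, end_step, frequency=5):
    """True when the pruning routine would run at this training step."""
    return begin_step <= step <= end_step and (step - begin_step) % frequency == 0


begin_step, end_step = 0, 1000  # hypothetical number of training steps
for step in (0, 250, 500, 750, 1000):
    print(step, is_pruning_step(step, begin_step, end_step),
          round(target_sparsity(step, begin_step, end_step), 4))
```

With this shape, most of the pruning happens early in the schedule, leaving more training steps for the network to recover accuracy at high sparsity.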

The weight sparsity of the model after pruning:

conv2d_1/kernel:0    -- Param:      864 -- Zeros: 00.00%
conv2d_1/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_2/kernel:0    -- Param:     9216 -- Zeros: 00.00%
conv2d_2/bias:0      -- Param:       32 -- Zeros: 00.00%
conv2d_3/kernel:0    -- Param:    18432 -- Zeros: 00.00%
conv2d_3/bias:0      -- Param:       64 -- Zeros: 00.00%
conv2d_4/kernel:0    -- Param:    73728 -- Zeros: 00.00%
conv2d_4/bias:0      -- Param:      128 -- Zeros: 00.00%
dense_1/kernel:0     -- Param:  4718592 -- Zeros: 80.00%
dense_1/bias:0       -- Param:      256 -- Zeros: 00.00%
predictions/kernel:0 -- Param:     1536 -- Zeros: 80.01%
predictions/bias:0   -- Param:        6 -- Zeros: 00.00%

The size in bytes of the gzip-compressed model after pruning is shown below. The difference in disk occupation between the two compressed versions of the same model (before and after pruning) is remarkable: slightly more than a factor of 3 (17801431 vs. 5795289 bytes).

Size of gzipped loaded model: 5795289.00 bytes

The accuracy of the pruned model over the test dataset:

Test set
1/1 [==============================] - 0s 29ms/step - loss: 1.4578 - acc: 0.6667

Freezing the computational graph

Freezing the model means producing a single file containing information about the graph and the checkpoint variables, with the variables saved as constants within the graph structure. This eliminates additional information stored in the checkpoint files, such as the gradients at each point, which is included so that the model can be reloaded and training resumed from a previously saved point. As this information is not needed when serving a model purely for inference, it is discarded during freezing.

INFO:tensorflow:Froze 12 variables.
I1002 09:08:49.716494 140705992206144 graph_util_impl.py:334] Froze 12 variables.
INFO:tensorflow:Converted 12 variables to const ops.
I1002 09:08:49.776397 140705992206144 graph_util_impl.py:394] Converted 12 variables to const ops.
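Conceptually, freezing replaces every variable node with a constant node holding the value stored in the checkpoint, and drops everything else the checkpoint contains. The following toy sketch illustrates the idea on a graph represented as a list of dictionaries; it is not TensorFlow code, and the node names, the `freeze` helper and the `optimizer_slot` entry are hypothetical.

```python
def freeze(graph, checkpoint):
    """Replace each Variable node with a Const node holding its checkpoint value."""
    frozen = []
    converted = 0
    for node in graph:
        if node["op"] == "Variable":
            frozen.append({
                "op": "Const",
                "name": node["name"],
                "inputs": [],
                "value": checkpoint[node["name"]],
            })
            converted += 1
        else:
            frozen.append(dict(node))
    return frozen, converted


graph = [
    {"op": "Placeholder", "name": "images_in", "inputs": []},
    {"op": "Variable", "name": "dense/kernel", "inputs": []},
    {"op": "MatMul", "name": "dense/MatMul", "inputs": ["images_in", "dense/kernel"]},
]
checkpoint = {
    "dense/kernel": [[0.1, 0.2], [0.3, 0.4]],
    # Training-only state: never referenced by the graph, so it is not carried over.
    "dense/kernel/optimizer_slot": [[0.0, 0.0], [0.0, 0.0]],
}
frozen_graph, n_converted = freeze(graph, checkpoint)
print(f"Converted {n_converted} variables to const ops.")
```

The frozen graph is self-contained: it can be evaluated without access to the checkpoint files.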

Transform the computational graph

After freezing, the computational graph is described as follows:

describe             : frozen_graph.pb
input feature nodes  : ['images_in']
unused nodes         : []
output nodes         : ['predictions/kernel', 'predictions/bias', 'predictions/MatMul/ReadVariableOp', 'predictions/MatMul', 'predictions/BiasAdd/ReadVariableOp', 'predictions/BiasAdd', 'predictions/Softmax']
quantization nodes   : []
constant count       : 16
variable count       : 0
identity count       : 13
total nodes          : 56

A more detailed description of the computational graph, showing all the nodes and the corresponding operations, is provided as follows:

Op: Placeholder          -- Name: images_in                     
Op: Const                -- Name: conv2d_1/kernel               
Op: Const                -- Name: conv2d_1/bias                 
Op: Identity             -- Name: conv2d_1/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_1/Conv2D               
Op: Identity             -- Name: conv2d_1/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_1/BiasAdd              
Op: Relu                 -- Name: conv2d_1/Relu                 
Op: MaxPool              -- Name: maxpool_1/MaxPool             
Op: Const                -- Name: conv2d_2/kernel               
Op: Const                -- Name: conv2d_2/bias                 
Op: Identity             -- Name: conv2d_2/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_2/Conv2D               
Op: Identity             -- Name: conv2d_2/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_2/BiasAdd              
Op: Relu                 -- Name: conv2d_2/Relu                 
Op: MaxPool              -- Name: maxpool_2/MaxPool             
Op: Const                -- Name: conv2d_3/kernel               
Op: Const                -- Name: conv2d_3/bias                 
Op: Identity             -- Name: conv2d_3/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_3/Conv2D               
Op: Identity             -- Name: conv2d_3/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_3/BiasAdd              
Op: Relu                 -- Name: conv2d_3/Relu                 
Op: MaxPool              -- Name: maxpool_3/MaxPool             
Op: Const                -- Name: conv2d_4/kernel               
Op: Const                -- Name: conv2d_4/bias                 
Op: Identity             -- Name: conv2d_4/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_4/Conv2D               
Op: Identity             -- Name: conv2d_4/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: conv2d_4/BiasAdd              
Op: Relu                 -- Name: conv2d_4/Relu                 
Op: MaxPool              -- Name: maxpool_4/MaxPool             
Op: Shape                -- Name: flatten/Shape                 
Op: Const                -- Name: flatten/strided_slice/stack   
Op: Const                -- Name: flatten/strided_slice/stack_1 
Op: Const                -- Name: flatten/strided_slice/stack_2 
Op: StridedSlice         -- Name: flatten/strided_slice         
Op: Const                -- Name: flatten/Reshape/shape/1       
Op: Pack                 -- Name: flatten/Reshape/shape         
Op: Reshape              -- Name: flatten/Reshape               
Op: Const                -- Name: dense_1/kernel                
Op: Const                -- Name: dense_1/bias                  
Op: Identity             -- Name: dense_1/MatMul/ReadVariableOp 
Op: MatMul               -- Name: dense_1/MatMul                
Op: Identity             -- Name: dense_1/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: dense_1/BiasAdd               
Op: Relu                 -- Name: dense_1/Relu                  
Op: Identity             -- Name: dropout_1/Identity            
Op: Const                -- Name: predictions/kernel            
Op: Const                -- Name: predictions/bias              
Op: Identity             -- Name: predictions/MatMul/ReadVariableOp
Op: MatMul               -- Name: predictions/MatMul            
Op: Identity             -- Name: predictions/BiasAdd/ReadVariableOp
Op: BiasAdd              -- Name: predictions/BiasAdd           
Op: Softmax              -- Name: predictions/Softmax
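The summary counts reported earlier (constant count, identity count, total nodes) can be cross-checked against a listing in this format. The following is a minimal sketch that parses such lines and tallies the operation types; `parse_listing` is a hypothetical helper, and only the first five lines of the listing are used here as sample input.

```python
from collections import Counter


def parse_listing(text):
    """Parse lines of the form 'Op: <op> -- Name: <name>' into (op, name) pairs."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("Op:"):
            continue
        op_part, name_part = line.split("--", 1)
        op = op_part.split(":", 1)[1].strip()
        name = name_part.split(":", 1)[1].strip()
        nodes.append((op, name))
    return nodes


listing = """
Op: Placeholder          -- Name: images_in
Op: Const                -- Name: conv2d_1/kernel
Op: Const                -- Name: conv2d_1/bias
Op: Identity             -- Name: conv2d_1/Conv2D/ReadVariableOp
Op: Conv2D               -- Name: conv2d_1/Conv2D
"""
nodes = parse_listing(listing)
counts = Counter(op for op, _ in nodes)
print("constant count :", counts["Const"])
print("identity count :", counts["Identity"])
print("total nodes    :", len(nodes))
```

Applied to the full listing above, the same tally gives 16 constants, 13 identity nodes and 56 nodes in total, matching the summary.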

The structure of the current computational graph can be optimized using the Graph Transform tool, which is provided within the TensorFlow framework. The tool applies a series of transformations that reduce the complexity of the input graph, removing all the nodes and operations that are not needed for inference. The transformations used are the following:

transformations_list = ['remove_nodes(op=Identity, op=CheckNumerics)', 
                        'merge_duplicate_nodes',
                        'strip_unused_nodes',
                        'fold_constants(ignore_errors=true)',
                        'fold_batch_norms']
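To give an idea of what the first transformation in the list does, the toy sketch below removes Identity nodes from a small graph and reconnects their consumers directly to the producing node. It is a simplified illustration, not the Graph Transform tool itself; the graph representation, the `remove_identity_nodes` helper and the connections between nodes are assumptions, since the listings in this note show node names but not their inputs.

```python
def remove_identity_nodes(graph):
    """Drop Identity nodes and reconnect their consumers to the Identity's input."""
    # Map each Identity node to the node that feeds it.
    forward = {n["name"]: n["inputs"][0] for n in graph if n["op"] == "Identity"}

    def resolve(name):
        # Follow chains of Identity nodes until a real producer is found.
        while name in forward:
            name = forward[name]
        return name

    return [
        {**n, "inputs": [resolve(i) for i in n["inputs"]]}
        for n in graph
        if n["op"] != "Identity"
    ]


graph = [
    {"op": "Placeholder", "name": "images_in", "inputs": []},
    {"op": "Const", "name": "conv2d_1/kernel", "inputs": []},
    {"op": "Identity", "name": "conv2d_1/Conv2D/ReadVariableOp",
     "inputs": ["conv2d_1/kernel"]},
    {"op": "Conv2D", "name": "conv2d_1/Conv2D",
     "inputs": ["images_in", "conv2d_1/Conv2D/ReadVariableOp"]},
]
optimized = remove_identity_nodes(graph)
for node in optimized:
    print(node["op"], node["name"], node["inputs"])
```

The convolution now reads its kernel constant directly, and the graph has one node fewer.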

After performing the optimization, the new description of the computational graph is provided:

describe             : baseline_transf_graph.pb
input feature nodes  : ['images_in']
unused nodes         : []
output nodes         : ['predictions/MatMul', 'predictions/kernel', 'predictions/bias', 'predictions/Softmax', 'predictions/BiasAdd']
quantization nodes   : []
constant count       : 15
variable count       : 0
identity count       : 0
total nodes          : 42

A more detailed description of the optimized computational graph, showing all the nodes and the corresponding operations, is provided as follows:

Op: Conv2D               -- Name: conv2d_1/Conv2D               
Op: BiasAdd              -- Name: conv2d_2/BiasAdd              
Op: Relu                 -- Name: conv2d_4/Relu                 
Op: Conv2D               -- Name: conv2d_3/Conv2D               
Op: Const                -- Name: conv2d_2/kernel               
Op: MaxPool              -- Name: maxpool_4/MaxPool             
Op: Const                -- Name: conv2d_1/kernel               
Op: Const                -- Name: conv2d_3/kernel               
Op: Placeholder          -- Name: images_in                     
Op: Pack                 -- Name: flatten/Reshape/shape         
Op: Const                -- Name: conv2d_3/bias                 
Op: Const                -- Name: conv2d_4/kernel               
Op: Reshape              -- Name: flatten/Reshape               
Op: Shape                -- Name: flatten/Shape                 
Op: Conv2D               -- Name: conv2d_4/Conv2D               
Op: Const                -- Name: conv2d_2/bias                 
Op: MaxPool              -- Name: maxpool_2/MaxPool             
Op: Relu                 -- Name: conv2d_1/Relu                 
Op: MatMul               -- Name: predictions/MatMul            
Op: BiasAdd              -- Name: dense_1/BiasAdd               
Op: MaxPool              -- Name: maxpool_1/MaxPool             
Op: Const                -- Name: flatten/strided_slice/stack   
Op: Const                -- Name: dense_1/kernel                
Op: BiasAdd              -- Name: conv2d_1/BiasAdd              
Op: Const                -- Name: flatten/Reshape/shape/1       
Op: Const                -- Name: predictions/kernel            
Op: BiasAdd              -- Name: conv2d_4/BiasAdd              
Op: Const                -- Name: conv2d_1/bias                 
Op: Relu                 -- Name: conv2d_2/Relu                 
Op: Const                -- Name: flatten/strided_slice/stack_1 
Op: Const                -- Name: dense_1/bias                  
Op: Const                -- Name: predictions/bias              
Op: Conv2D               -- Name: conv2d_2/Conv2D               
Op: MaxPool              -- Name: maxpool_3/MaxPool             
Op: Const                -- Name: conv2d_4/bias                 
Op: Relu                 -- Name: dense_1/Relu                  
Op: Relu                 -- Name: conv2d_3/Relu                 
Op: Softmax              -- Name: predictions/Softmax           
Op: BiasAdd              -- Name: conv2d_3/BiasAdd              
Op: MatMul               -- Name: dense_1/MatMul                
Op: StridedSlice         -- Name: flatten/strided_slice         
Op: BiasAdd              -- Name: predictions/BiasAdd

The accuracy of the baseline model over the test dataset after applying all transformations:

Graph accuracy with test dataset: 0.7083

The accuracy of the pruned model over the test dataset after applying all transformations:

Graph accuracy with test dataset: 0.6667

Quantize the computational graph

The process of inference is computationally expensive and requires high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications. Neural networks are generally trained with 32-bit floating-point weights and activation values. With the Vitis AI quantizer, the complexity of the computation can be reduced with little or no loss of prediction accuracy by converting the 32-bit floating-point values to an 8-bit integer format. The resulting fixed-point network model requires less memory bandwidth, providing faster speed and higher power efficiency than the floating-point model.
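The conversion from floating point to 8-bit fixed point can be illustrated with the sketch below. It assumes a symmetric signed 8-bit format with a power-of-two scale, which is a common fixed-point scheme; the actual algorithm used by the Vitis AI quantizer is not described in this note, and the function names and sample weights are hypothetical.

```python
import math


def choose_fraction_bits(values, bit_width=8):
    """Pick the largest power-of-two scale that keeps max |value| in the signed range."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return bit_width - 1
    q_max = 2 ** (bit_width - 1) - 1  # 127 for 8 bits
    return math.floor(math.log2(q_max / max_abs))


def quantize(values, frac_bits, bit_width=8):
    """Scale by 2**frac_bits, round to the nearest integer and clamp to the range."""
    q_min, q_max = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    scale = 2.0 ** frac_bits
    return [max(q_min, min(q_max, round(v * scale))) for v in values]


def dequantize(q_values, frac_bits):
    """Map the integers back to approximate floating-point values."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in q_values]


weights = [0.75, -0.5, 0.1, -0.02, 0.33]
frac_bits = choose_fraction_bits(weights)
q = quantize(weights, frac_bits)
restored = dequantize(q, frac_bits)
print(frac_bits, q)
print(restored)
```

Each restored value differs from the original by at most half a quantization step, which is why accuracy is often preserved when the value range is chosen well.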

In the quantize calibration process, only a small set of images is required to analyze the distribution of activations. Since no backpropagation is performed, there is no need to provide any labels either. Depending on the size of the neural network, the running time of quantize calibration varies from a few seconds to several minutes.

After calibration, the quantized model is transformed into a DPU deployable model (named deploy_model.pb by vai_q_tensorflow), which follows the data format of the DPU. This model can be compiled by the Vitis AI compiler and deployed to the DPU. It cannot be used by the standard TensorFlow framework to evaluate the loss of accuracy; for that purpose, a second file is produced (named quantize_eval_model.pb by vai_q_tensorflow).

For the current application, 100 images are sampled from the training dataset and augmented, resulting in a total of 1000 images used for calibration. The graph is then calibrated with batches of 10 images for 100 iterations. The following log of vai_q_tensorflow shows the result of the whole quantization process:

Vai_q_tensorflow v1.2.0 build for Tensorflow 1.15.2
2020-10-08 13:26:59.752125: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
100% (100 of 100) |######################| Elapsed Time: 0:00:33 Time:  0:00:33
INFO: Checking Float Graph...
INFO: Float Graph Check Done.
INFO: Calibrating for 100 iterations...
INFO: Calibration Done.
INFO: Generating Deploy Model...
INFO: Deploy Model Generated.
********************* Quantization Summary *********************      
INFO: Output:       
  quantize_eval_model: ./build/quantize/baseline/quantize_eval_model.pb       
  deploy_model: ./build/quantize/baseline/deploy_model.pb
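The calibration batching described above (1000 images, batches of 10, 100 iterations) could be served by a small input function that returns one batch per iteration, keyed by the name of the input node. The sketch below assumes that the quantizer calls a user-provided function with the iteration index and expects a dictionary in return; the exact interface should be checked against the Vitis AI documentation. The image list is a placeholder made of strings, whereas a real function would return preprocessed image arrays.

```python
CALIB_BATCH_SIZE = 10
CALIB_ITERATIONS = 100
INPUT_NODE = "images_in"  # input node name taken from the graph description

# Hypothetical stand-in for the 1000 augmented calibration images.
calibration_images = [f"image_{i:04d}" for i in range(1000)]


def calib_input(iteration):
    """Return the batch for the given calibration iteration, keyed by input node."""
    start = iteration * CALIB_BATCH_SIZE
    batch = calibration_images[start:start + CALIB_BATCH_SIZE]
    return {INPUT_NODE: batch}


first = calib_input(0)
last = calib_input(CALIB_ITERATIONS - 1)
print(first[INPUT_NODE][0], "...", last[INPUT_NODE][-1])
```

With these sizes every calibration image is used exactly once over the 100 iterations.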

The accuracy of the baseline model over the test dataset after applying quantization:

graph accuracy with test dataset: 0.7083

The accuracy of the pruned model over the test dataset after applying quantization:

graph accuracy with test dataset: 0.6667

Compiling the model

The Vitis AI compiler operates in a multi-stage process:

  1. The compiler parses the topology of the optimized and quantized input model and produces a new computation graph consisting of a data flow and a control flow.
  2. It then optimizes the data and control flow through processes such as fusing the batch normalization layers into the preceding convolution layers, scheduling instructions efficiently by exploiting inherent parallelism, and exploiting data reuse.
  3. Finally, it generates the code to be run. Because the DPU supports a limited number of operations, the Vitis AI compiler automatically partitions the input network model into several kernels when it contains operations that the DPU does not support.

For this particular case, two kernels are produced, due to the fact that the softmax activation layer is not currently supported by the DPU.
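The partitioning step can be pictured with the following toy sketch, which groups consecutive operations into kernels according to whether they belong to a supported set. It is not the Vitis AI compiler's algorithm; the `DPU_SUPPORTED` set is an assumption made for illustration (the only fact taken from this note is that softmax is not supported), and the operation list is an abridged version of the network.

```python
# Hypothetical set of operation types that the accelerator can execute.
DPU_SUPPORTED = {"Conv2D", "BiasAdd", "Relu", "MaxPool", "Reshape", "MatMul"}


def partition(ops, supported):
    """Group consecutive ops into kernels by whether the accelerator supports them."""
    kernels = []
    for op, name in ops:
        kind = "DPUKernel" if op in supported else "CPUKernel"
        if kernels and kernels[-1]["type"] == kind:
            kernels[-1]["nodes"].append(name)
        else:
            kernels.append({"type": kind, "nodes": [name]})
    return kernels


ops = [
    ("Conv2D", "conv2d_1/Conv2D"),
    ("BiasAdd", "conv2d_1/BiasAdd"),
    ("Relu", "conv2d_1/Relu"),
    ("MaxPool", "maxpool_1/MaxPool"),
    ("MatMul", "predictions/MatMul"),
    ("BiasAdd", "predictions/BiasAdd"),
    ("Softmax", "predictions/Softmax"),
]
kernels = partition(ops, DPU_SUPPORTED)
for index, kernel in enumerate(kernels):
    print(index, kernel["type"], kernel["nodes"])
```

The unsupported softmax at the end of the network ends up alone in a second kernel assigned to the CPU.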

The following log of vai_c_tensorflow shows the result of the compilation for the baseline model:

Kernel topology "custom_cnn_kernel_graph.jpg" for network "custom_cnn"
kernel list info for network "custom_cnn"
                               Kernel ID : Name
                                       0 : custom_cnn_0
                                       1 : custom_cnn_1

                             Kernel Name : custom_cnn_0
--------------------------------------------------------------------------------
                             Kernel Type : DPUKernel
                               Code Size : 0.02MB
                              Param Size : 4.60MB
                           Workload MACs : 498.21MOPS
                         IO Memory Space : 0.52MB
                              Mean Value : 0, 0, 0, 
                      Total Tensor Count : 7
                Boundary Input Tensor(s)   (H*W*C)
                          images_in:0(0) : 224*224*3

               Boundary Output Tensor(s)   (H*W*C)
                 predictions_MatMul:0(0) : 1*1*6

                        Total Node Count : 6
                           Input Node(s)   (H*W*C)
                      conv2d_1_Conv2D(0) : 224*224*3

                          Output Node(s)   (H*W*C)
                   predictions_MatMul(0) : 1*1*6




                             Kernel Name : custom_cnn_1
--------------------------------------------------------------------------------
                             Kernel Type : CPUKernel
                Boundary Input Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

               Boundary Output Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

                           Input Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6

                          Output Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6

The following log of vai_c_tensorflow shows the result of the compilation for the pruned model:

Kernel topology "pruned_custom_cnn_kernel_graph.jpg" for network "pruned_custom_cnn"
kernel list info for network "pruned_custom_cnn"
                               Kernel ID : Name
                                       0 : pruned_custom_cnn_0
                                       1 : pruned_custom_cnn_1

                             Kernel Name : pruned_custom_cnn_0
--------------------------------------------------------------------------------
                             Kernel Type : DPUKernel
                               Code Size : 0.02MB
                              Param Size : 4.60MB
                           Workload MACs : 498.21MOPS
                         IO Memory Space : 0.52MB
                              Mean Value : 0, 0, 0, 
                      Total Tensor Count : 7
                Boundary Input Tensor(s)   (H*W*C)
                          images_in:0(0) : 224*224*3

               Boundary Output Tensor(s)   (H*W*C)
                 predictions_MatMul:0(0) : 1*1*6

                        Total Node Count : 6
                           Input Node(s)   (H*W*C)
                      conv2d_1_Conv2D(0) : 224*224*3

                          Output Node(s)   (H*W*C)
                   predictions_MatMul(0) : 1*1*6




                             Kernel Name : pruned_custom_cnn_1
--------------------------------------------------------------------------------
                             Kernel Type : CPUKernel
                Boundary Input Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

               Boundary Output Tensor(s)   (H*W*C)
                predictions_Softmax:0(0) : 1*1*6

                           Input Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6

                          Output Node(s)   (H*W*C)
                     predictions_Softmax : 1*1*6
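Both logs report a Param Size of 4.60MB for the DPU kernel. This figure is consistent with the parameter counts listed in the pruning section, as the short calculation below shows. It assumes one byte per parameter (8-bit quantization) and that MB in the log means 2^20 bytes; both are assumptions, not statements from the compiler documentation.

```python
# Parameter counts taken from the weight listing in the pruning section.
param_counts = {
    "conv2d_1/kernel": 864,
    "conv2d_1/bias": 32,
    "conv2d_2/kernel": 9216,
    "conv2d_2/bias": 32,
    "conv2d_3/kernel": 18432,
    "conv2d_3/bias": 64,
    "conv2d_4/kernel": 73728,
    "conv2d_4/bias": 128,
    "dense_1/kernel": 4718592,
    "dense_1/bias": 256,
    "predictions/kernel": 1536,
    "predictions/bias": 6,
}

total_params = sum(param_counts.values())
size_mb = total_params / 2 ** 20  # assumed: 1 byte per parameter, 1 MB = 2**20 bytes
print(f"{total_params} parameters -> {size_mb:.2f}MB")
```

Under these assumptions the total is 4822886 parameters, or about 4.60MB, for the baseline and the pruned model alike, since zeroed weights still count as parameters.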

Running the application