Op: Softmax -- Name: predictions/Softmax
</pre>
 
The structure of the current computational graph can be optimized using the Graph Transform tool, which is provided within the TensorFlow framework. The tool applies a series of transformations that reduce the complexity of the input graph, removing all the nodes and operations that are not needed for inference. The list of transformations used is the following:
'fold_batch_norms']
</pre>
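For reference, the same optimization can be scripted through the Graph Transform tool's Python interface in TensorFlow 1.x. The sketch below is illustrative only: the file names, the input/output node names, and the transforms other than <code>fold_batch_norms</code> are placeholders, not the exact configuration used here.
<pre>
# Illustrative sketch of applying the Graph Transform tool from Python (TF 1.x).
# File names, node names, and the extra transforms are placeholders.
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

with tf.io.gfile.GFile('frozen_graph.pb', 'rb') as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

transforms = [
    'remove_nodes(op=Identity)',          # placeholder example
    'fold_constants(ignore_errors=true)', # placeholder example
    'fold_batch_norms',                   # transform listed above
]

optimized_graph_def = TransformGraph(graph_def,
                                     ['input_1'],              # assumed input node
                                     ['predictions/Softmax'],  # output node from the listing
                                     transforms)

with tf.io.gfile.GFile('optimized_graph.pb', 'wb') as f:
    f.write(optimized_graph_def.SerializeToString())
</pre>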
 
After the optimization, the new description of the computational graph is the following:
</pre>
A much more detailed description of the optimized computational graph, showing all the nodes and the corresponding operations, is provided as follows:
<pre>
Op: BiasAdd -- Name: predictions/BiasAdd
</pre>
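A node/op listing in the format shown above can be generated with a few lines of TensorFlow, for example (the file name is a placeholder):
<pre>
# Print every node of a frozen GraphDef as "Op: <op> -- Name: <name>".
import tensorflow as tf

with tf.io.gfile.GFile('optimized_graph.pb', 'rb') as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    print('Op: {} -- Name: {}'.format(node.op, node.name))
</pre>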
 
The accuracy of the '''baseline model''' over the test dataset after applying all transformations:
Graph accuracy with test dataset: 0.7083
</pre>
 
The accuracy of the '''pruned model''' over the test dataset after applying all transformations:
</pre>
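These accuracy figures are obtained by running the frozen graphs with the standard TensorFlow framework over the test dataset. A minimal evaluation sketch is shown below; the tensor names, batch size, and dataset loading are assumptions, not the exact evaluation script used here.
<pre>
# Sketch: top-1 accuracy of a frozen graph over a test set (TF 1.x).
# 'images' is a numpy array of preprocessed images, 'labels' the class indices.
import numpy as np
import tensorflow as tf

def evaluate(graph_path, images, labels, batch_size=50):
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(graph_path, 'rb') as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name='')
        x = graph.get_tensor_by_name('input_1:0')              # assumed input tensor
        y = graph.get_tensor_by_name('predictions/Softmax:0')  # output from the listings above

    correct = 0
    with tf.compat.v1.Session(graph=graph) as sess:
        for i in range(0, len(images), batch_size):
            preds = sess.run(y, feed_dict={x: images[i:i + batch_size]})
            correct += np.sum(np.argmax(preds, axis=1) == labels[i:i + batch_size])
    return correct / len(images)
</pre>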
===Quantizing the computational graph===
The process of inference is expensive in terms of computation and requires a high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications. Generally, neural networks are trained with 32-bit floating-point weights and activation values, but with the Vitis AI quantizer the complexity of the computation can be reduced without losing prediction accuracy. This is achieved by converting the 32-bit floating-point values to an 8-bit integer format. The resulting fixed-point network model requires less memory bandwidth, providing faster speed and higher power efficiency than the floating-point model.
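As a purely illustrative example of the float-to-integer mapping (the actual algorithm used by the Vitis AI quantizer is more sophisticated and chooses its scaling factors during calibration), a power-of-two fixed-point conversion of a few weights could look like this:
<pre>
# Illustration only: map float32 values to 8-bit fixed point with a
# power-of-two scale. Not the exact Vitis AI quantization algorithm.
import numpy as np

def to_int8_fixed_point(x, fraction_bits):
    scale = 2.0 ** fraction_bits
    q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return q, q.astype(np.float32) / scale   # int8 values and what they represent

weights = np.array([0.42, -1.37, 0.05, 0.99], dtype=np.float32)
q, represented = to_int8_fixed_point(weights, fraction_bits=6)
print(q)            # [ 27 -88   3  63]
print(represented)  # [ 0.421875 -1.375     0.046875  0.984375]
</pre>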
In the quantize calibration process, only a small set of images is required to analyze the distribution of activations. Since no backpropagation is performed, there is no need to provide labels either. Depending on the size of the neural network, the running time of quantize calibration varies from a few seconds to several minutes.
After calibration, the quantized model is transformed into a DPU-deployable model (named <code>deploy_model.pb</code> for vai_q_tensorflow) which follows the data format of the DPU. This model can be compiled by the Vitis AI compiler and deployed to the DPU. This quantized model cannot be used by the standard TensorFlow framework to evaluate the loss of accuracy; hence, a second file is produced for that purpose (named <code>quantize_eval_model.pb</code> for vai_q_tensorflow).
For the current application, 100 images are sampled from the training dataset and augmented, resulting in a total of 1000 images used for calibration. The graph is then calibrated with batches of 10 images for 100 iterations. The following log of vai_q_tensorflow shows the result of the whole quantization process:
deploy_model: ./build/quantize/baseline/deploy_model.pb
</pre>
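The calibration batches are supplied to vai_q_tensorflow through a user-provided input function (the <code>--input_fn</code> option), which is called once per calibration iteration and must return a dictionary mapping input node names to numpy arrays. A minimal sketch is shown below; the input node name, image folder, image size, and preprocessing are assumptions.
<pre>
# Sketch of an input_fn for vai_q_tensorflow calibration.
# Folder, node name, image size, and preprocessing are placeholders.
import os
import cv2
import numpy as np

CALIB_DIR = './build/calib_images'   # assumed location of the 1000 calibration images
BATCH_SIZE = 10                      # 10 images per iteration, 100 iterations

image_list = sorted(os.listdir(CALIB_DIR))

def calib_input(iter):
    images = []
    for i in range(BATCH_SIZE):
        path = os.path.join(CALIB_DIR, image_list[iter * BATCH_SIZE + i])
        img = cv2.imread(path)
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0  # assumed preprocessing
        images.append(img)
    return {'input_1': np.array(images)}  # assumed input node name
</pre>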
 
The accuracy of the '''baseline model''' over the test dataset after applying quantization:
graph accuracy with test dataset: 0.7083
</pre>
 
The accuracy of the '''pruned model''' over the test dataset after applying quantization:
predictions_Softmax : 1*1*6
</pre>
 
The following log of vai_c_tensorflow shows the result of the compilation for the '''pruned model''':
* All the files required to run the test (the executable, the image files, etc.) are stored on a tmpfs RAM disk in order to make the file system/storage medium overhead negligible.
Two new C++ applications were developed for the trained, optimized, and compiled neural network model as illustrated in the steps above. The first application uses the old DNNDK low-level APIs for loading the DPU kernel, creating the DPU task, and preparing the input and output tensors for the inference. Two profiling strategies are available, depending on the DPU mode chosen when compiling the kernel (normal or profile): a coarse-grained profiling, which shows the execution time of all the main tasks executed on the CPU and on the DPU, and a fine-grained profiling, which shows detailed information about all the nodes of the model, such as the workload, the memory occupation, and the runtime. The second application, instead, is a multi-threaded application that uses the VART high-level APIs for retrieving the computational subgraph from the DPU kernel and for performing the inference. In this case, it is possible to split the entire workload over multiple concurrent threads, assigning each one a batch of images. Both applications use the OpenCV library to crop and resize the input images so that they match the model's input tensor shape, and to display the results of the inference (i.e., the probability of each class) for each image.
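The C++ sources are not reproduced here; as a rough illustration of the pre- and post-processing they perform, the equivalent steps written in Python with OpenCV could look like the following sketch (the crop strategy, input size, normalization, and class names are assumptions; only the six-class output size comes from the compiled model).
<pre>
# Illustration of the pre/post-processing performed by the applications:
# central crop + resize to the model input shape, and printing the class
# probabilities. Sizes, normalization, and class names are placeholders.
import cv2
import numpy as np

CLASS_NAMES = ['class_{}'.format(i) for i in range(6)]   # output tensor is 1*1*6

def preprocess(image_path, input_size=(224, 224)):
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    side = min(h, w)                                      # central square crop
    y0, x0 = (h - side) // 2, (w - side) // 2
    img = img[y0:y0 + side, x0:x0 + side]
    return cv2.resize(img, input_size).astype(np.float32) / 255.0

def show_result(probabilities):
    for name, p in zip(CLASS_NAMES, probabilities):
        print('{}: {:.4f}'.format(name, p))
</pre>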
Before illustrating the results obtained by running the C++ applications, it is worth checking some information about the DPU and the DPU kernel ELF file. This can be done with the DExplorer and DDump tools.
===DExplorer===