==Introduction==
This Technical Note (TN for short) belongs to the series introduced [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 1|here]]. Specifically, it illustrates the training, quantization, and deployment of several well-known convolutional neural network models for PCB component classification on the Xilinx ZCU104 platform.
==Test Bed==
The following table details the test bed used for this Technical Note.
{| class="wikitable" style="margin: auto;"
|+ Host and target configurations
!System
!Component
!Name
!Version
|-
| rowspan="3" |'''Host'''
|Operating system
|GNU/Linux Ubuntu
|18.04
|-
|Software development platform
|Vitis
|1.2
|-
|Machine learning framework
|TensorFlow
|1.15.2
|-
| rowspan="4" |'''Target'''
|Hardware platform
|ZCU104
|1.0
|-
|Linux BSP
|Petalinux
|2020.1
|-
|Software binary image (microSD card)
|xilinx-zcu104-dpu-v2020.1-v1.2.0
|v2020.1-v1.2.0
|-
|Neural network hardware accelerator
|DPU
|3.3
|}
For more details about the Xilinx hardware configuration and the usage of the Vitis-AI software platform, please refer to [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 3|this article]].
==FICS-PCB dataset overview==
Over the years, computer vision and ML have considerably advanced the field of Automated Visual Inspection for Printed Circuit Board (PCB-AVI) assurance. It is well known that developing a robust model for any ML-based application requires a dataset, ideally as large as possible, with many examples for better generalization. Although a few large datasets for PCB-AVI are publicly available, they lack the variances that simulate real-world scenarios, such as illumination and scale variations, which are necessary for developing robust PCB-AVI approaches. To represent such non-ideal conditions, the FICS-PCB dataset was proposed for evaluating and improving PCB-AVI methods. It consists of PCB images featuring multiple component types and various image conditions, to facilitate performance evaluation in the challenging scenarios likely to be encountered in practice.
The dataset consists of 9,912 images of 31 PCB samples and contains a total of 77,347 labeled components distributed across six classes: ''IC'', ''capacitor'', ''diode'', ''inductor'', ''resistor'', and ''transistor''. The components were imaged with two sensor types: a digital microscope and a Digital Single-Lens Reflex (DSLR) camera. To include samples representing variations in illumination, the authors collected images at three intensities of the microscope's built-in ring light, i.e. 20, 40, and 60, where 60 is the brightest. In addition, variations in scale were included using three different magnifications, i.e. 1×, 1.5×, and 2×.
 
[[File:FICS-PCB samples.png|center|thumb|500x500px|FICS-PCB dataset, examples of six types of components]]
Just by looking at the figure below, it is clear that this dataset is highly unbalanced, with a large number of samples only for two classes, i.e. ''capacitor'' and ''resistor''. This is no surprise, since these two component types are mounted on PCBs far more often than the others. Unfortunately, in this situation it is not a good idea to use the dataset as it is: the models would be trained on image batches mainly composed of the most common components, hence learning only a restricted set of features. As a consequence, they would probably become very good at classifying the ''capacitor'' and ''resistor'' classes and rather bad at classifying the remaining ones. The missing data must therefore be compensated for with image augmentation and oversampling. Before proceeding further, note that the DSLR subset contains far fewer examples than the Microscope subset. As the two subsets were acquired with two different kinds of instruments, their characteristics, for example the resolution, differ significantly. To keep the images homogeneous, it is preferable to retain only one subset, specifically the most numerous one.
[[File:Samples per class in Microscope and DSLR subsets.png|center|thumb|500x500px|FICS-PCB dataset, component count per class in DSLR and Microscope subsets]]
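The class-balancing step can be sketched as plain random over/undersampling with a fixed per-class target. This is a simplified stand-in for the augmentation-based pipeline actually used in this TN; the class names and counts in the example are illustrative only.

```python
import random

def oversample(examples_by_class, target_per_class, seed=0):
    """Balance a dataset by bringing every class to the same number of
    examples: minority classes are re-sampled with replacement, majority
    classes are randomly undersampled."""
    rng = random.Random(seed)
    balanced = {}
    for label, examples in examples_by_class.items():
        if len(examples) >= target_per_class:
            # Majority class: keep a random subset (no replacement).
            balanced[label] = rng.sample(examples, target_per_class)
        else:
            # Minority class: duplicate random examples until the
            # target count is reached (with replacement).
            extra = rng.choices(examples, k=target_per_class - len(examples))
            balanced[label] = examples + extra
    return balanced

# Toy example with made-up counts (the real per-class totals are the
# ones shown in the figure above).
toy = {"resistor": list(range(500)), "diode": list(range(40))}
balanced = oversample(toy, target_per_class=100)
```

In the actual pipeline the duplicated samples are additionally passed through random augmentations, so the copies are not pixel-identical.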
The dataset used here was created by randomly sampling 150,000 component images from the Microscope subset of the FICS-PCB dataset, yielding 25,000 images per class. 72% of the images were used for training, 24% as a validation subset during training, and the remaining 4% as a test set, providing exactly 108,000 training images, 36,000 validation images, and 6,000 test images, equally distributed among the six classes. Each image was preprocessed and padded with a constant value to adapt its scale to the input tensor size of the models; the aspect ratio was not modified during this process. To increase variety among the examples, random contrast, brightness, saturation, and rotation were applied as well. Classes such as ''diode'', ''inductor'', ''IC'', and ''transistor'' were oversampled.
[[File:Dataset processing and augmentation.png|center|thumb|500x500px|FICS-PCB dataset, an example of image augmentation as compensation for lack of data in IC, diode, inductor, and transistor classes]]
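The constant-padding step that adapts each image to the model input size without altering its aspect ratio can be sketched as follows. This is a NumPy illustration, not the TN's actual preprocessing code; the nearest-neighbour resize is an assumption.

```python
import numpy as np

def scale_and_pad(image, target, pad_value=0):
    """Scale an image so that its longer side equals `target`
    (nearest-neighbour resize, aspect ratio preserved), then pad the
    shorter side with a constant value to a target x target square."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    # Nearest-neighbour index maps implementing the resize step.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]
    # Constant padding, centring the resized image in the square.
    out = np.full((target, target) + image.shape[2:], pad_value,
                  dtype=image.dtype)
    top = (target - new_h) // 2
    left = (target - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out
```

A tall 60×30 crop padded to 100×100 keeps its 2:1 aspect ratio, with the left and right 25-pixel bands filled by the constant value.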
==Training configuration and hyperparameters setup==
The training was performed in the cloud using Google Colab. All the models were trained with the same configuration for 1000 epochs, feeding a mini-batch of 32 images at each step of an epoch. The learning rate was initially set to 0.0001 with an exponential decay schedule, and the dropout rate was set to 0.4 for all models. Patience for early stopping was set to 100 epochs. The training images were further augmented with random zoom, shift, and rotation to improve model robustness on the validation and test subsets and to reduce the risk of overfitting.
[[File:Image augmentation for training samples.png|center|thumb|500x500px|FICS-PCB dataset, an example of image augmentation on training images to increase the robustness of the models]]
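The exponential learning-rate decay can be expressed in a few lines. Only the initial value (0.0001) comes from the text; `decay_rate` and `decay_steps` below are illustrative assumptions, since the TN does not state them.

```python
def exponential_decay_lr(step, initial_lr=1e-4, decay_rate=0.96,
                         decay_steps=10_000):
    """lr(step) = initial_lr * decay_rate ** (step / decay_steps),
    the same formula used by Keras' ExponentialDecay schedule."""
    return initial_lr * decay_rate ** (step / decay_steps)
```

With these assumed parameters, the rate shrinks by a factor of 0.96 every 10,000 steps, so it decreases smoothly over the 1000-epoch run rather than in abrupt jumps.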
 
==Proposed models==
RESNET + INCEPTION INFO
TRAINING SPECS (NO TENSORFLOW PRUNING)
METRICS
===ResNet50===
The model, during the training phase, shows an increasing accuracy trend on both the train and validation subsets. This is a sign that the model is learning correctly, since it is not underfitting the training data. Furthermore, looking at the loss trend over the 1000 training epochs, the model is clearly not overfitting the training data either. By checkpointing the model each time the validation loss improves, the best result is found at '''''epoch 993''''' with an '''''accuracy of 93.59%''''' and a '''''loss of 0.1912''''' on the validation data.
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Resnet50 train and validation accuracy.png|thumb|500x500px|Train and validation accuracy trend over 1000 training epochs for ResNet50 model]]
|}
The model, before quantization with the ''vai_q_tensorflow'' tool, has an overall '''''accuracy of 94.85%''''' and a weighted average '''''F1-score of 94.86%''''' on the test subset, showing a good generalization capability on unseen samples. The classes with the highest F1-score, all above 96.00%, are ''resistor'' (98.08%), ''inductor'' (97.10%), and ''capacitor'' (96.88%). Conversely, the class on which the model performs worst is ''diode'' (91.75% F1-score), which is attributable to its low precision (88.55%).
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Resnet50 host confusion matrix.png|center|thumb|500x500px|Confusion matrix of ResNet50 model on host machine before quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Host machine, classification report
|}
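The per-class figures quoted throughout this section follow from the standard definition of the F1-score as the harmonic mean of precision and recall, with the overall score computed as a support-weighted average, a minimal sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_f1(per_class_f1, per_class_support):
    """Support-weighted average of per-class F1-scores, as reported
    in the classification reports of this section."""
    total = sum(per_class_support)
    return sum(f * s for f, s in zip(per_class_f1, per_class_support)) / total
```

The harmonic mean explains why a single weak metric drags a class down: a low precision caps the F1-score even when recall is high, as seen for the ''diode'' class.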
After quantization with the ''vai_q_tensorflow'' tool and deployment on the target device, the model has an overall '''''accuracy of 93.27%''''' and a weighted average '''''F1-score of 93.29%''''' on the test subset. The model still performs well on the ''resistor'' (98.08% F1-score), ''inductor'' (97.10% F1-score), and ''capacitor'' (96.88% F1-score) classes. The worst results are found in the ''transistor'' class (89.78% F1-score), whose precision and recall are both below 90.00% (89.96% precision, 89.60% recall), and in the ''diode'' class (88.59% F1-score), whose precision is very low (83.77%).
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Resnet50 target confusion matrix.png|center|thumb|500x500px|Confusion matrix of ResNet50 model on target device after quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Target device, classification report
|}
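The accuracy drop observed after deployment comes from mapping the float model to 8-bit fixed-point arithmetic. A minimal sketch of the underlying idea, symmetric per-tensor int8 scaling, is shown below; ''vai_q_tensorflow'' uses a calibrated fixed-point scheme, so this illustrates the concept rather than the tool's exact algorithm.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: a single scale maps the
    float range onto [-128, 127]."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 values back to floats; the difference from the
    original tensor is the quantization error (at most scale / 2)."""
    return q.astype(np.float32) * scale
```

The rounding error introduced here on every weight and activation is what accumulates through the network and slightly degrades accuracy on the target.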
To perform inference over the images with 1 thread, only one DPU core is used, leading to almost 55% utilization of the DPU-01 core. Increasing the number of threads, e.g. to 4, engages more cores and raises the utilization, very close to 100% on the DPU-00 core and close to 90% on the DPU-01 core. Concerning DPU latency, with 1 thread the average latency per image is about 12 ms (11526.41 μs). Increasing the concurrency raises the latency on both cores: about 13 ms (13318.01 μs) on DPU-00 and 12 ms (12019.21 μs) on DPU-01 with 2 threads, and about 14 ms (14200.19 μs) on DPU-00 and 13 ms (12776.24 μs) on DPU-01 with 4 concurrent threads.
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Resnet50 cores utilization.png|thumb|500x500px|Utilization of CPU and DPU cores of ResNet50 model for 1, 2, and 4 threads]]
|
|}
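From the measured average latencies, a rough upper bound on the achievable throughput can be derived, assuming the DPU cores run fully in parallel and pre/post-processing stays off the critical path:

```python
def throughput_fps(avg_latency_us, n_cores=1):
    """Upper-bound throughput (images/s) from the average per-image
    DPU latency, assuming cores work fully in parallel and data
    movement is off the critical path."""
    return n_cores * 1e6 / avg_latency_us

# With the single-thread ResNet50 latency of 11526.41 us measured
# above, one DPU core classifies at most ~86.8 images per second.
single_core_fps = throughput_fps(11526.41)
```

This explains the multi-thread behaviour seen in the plots: per-image latency worsens slightly with concurrency, but aggregate throughput grows because both cores are kept busy.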
 
===ResNet101===
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Resnet101 train and validation accuracy.png|thumb|500x500px|Train and validation accuracy trend over 1000 training epochs for ResNet101 model]]
|}
The model, before quantization with the ''vai_q_tensorflow'' tool, has an overall '''''accuracy of 97.10%''''' and a weighted average '''''F1-score of 97.11%''''' on the test subset, showing a very high generalization capability on unseen samples. All the classes have an F1-score above 96.00%; it is particularly high for the ''resistor'' class (98.65%) and the ''inductor'' class (98.50%). The only exception is the ''diode'' class (95.40% F1-score), mainly because of its low recall (94.40%).
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Resnet101 host confusion matrix.png|center|thumb|500x500px|Confusion matrix of ResNet101 model on host machine before quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Host machine, classification report
|}
After quantization with the ''vai_q_tensorflow'' tool and deployment on the target device, the model has an overall '''''accuracy of 93.95%''''' and a weighted average '''''F1-score of 93.91%''''' on the test subset. The model still performs very well on the ''capacitor'' class, keeping its F1-score above 96.00% (97.03%). For the remaining classes, however, there is a substantial drop in this metric. The classes with the worst results are ''diode'' (92.09% F1-score) and ''IC'' (92.06% F1-score), as both show a low recall (90.30% for the former, 88.20% for the latter). Overall, the performance of the model is still good, similar to that obtained with the ResNet50 model.
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Resnet101 target confusion matrix.png|center|thumb|500x500px|Confusion matrix of ResNet101 model on target device after quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Target device, classification report
|}
To perform inference over the images with 1 thread, only one DPU core is used, leading to almost 70% utilization of the DPU-01 core. Increasing the number of threads, e.g. to 4, engages more cores and raises the utilization, very close to 100% on the DPU-00 core and close to 95% on the DPU-01 core. Concerning DPU latency, with 1 thread the average latency per image is about 21 ms (21339.73 μs). Increasing the concurrency raises the latency on both cores: about 24 ms (24313.61 μs) on DPU-00 and 22 ms (22231.22 μs) on DPU-01 with 2 threads, and about 25 ms (25385.51 μs) on DPU-00 and 23 ms (23025.89 μs) on DPU-01 with 4 concurrent threads.
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Resnet101 cores utilization.png|thumb|500x500px|Utilization of CPU and DPU cores of ResNet101 model for 1, 2, and 4 threads]]
|
|}
 
===ResNet152===
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Resnet152 train and validation accuracy.png|thumb|500x500px|Train and validation accuracy trend over 1000 training epochs for ResNet152 model]]
|}
The model, before quantization with the ''vai_q_tensorflow'' tool, has an overall '''''accuracy of 96.46%''''' and a weighted average '''''F1-score of 96.48%''''' on the test subset, showing a good generalization capability on unseen samples. The classes with the highest F1-score, above 96.00%, are ''resistor'' (98.58%), ''inductor'' (98.03%), and ''capacitor'' (96.99%), a result quite similar to the ResNet50 model. The worst performance is shown by the ''transistor'' class, with "only" an F1-score around 94.00% (94.18%), mainly because of its low precision (92.89%).
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Resnet152 host confusion matrix.png|center|thumb|500x500px|Confusion matrix of ResNet152 model on host machine before quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Host machine, classification report
|}
After quantization with the ''vai_q_tensorflow'' tool and deployment on the target device, the model has an overall '''''accuracy of 93.40%''''' and a weighted average '''''F1-score of 93.36%''''' on the test subset. The model still performs very well on the ''capacitor'' class, keeping an F1-score above 96.00% (96.62%). For the remaining classes, however, there is a substantial drop in this metric. The classes with the worst results are ''diode'' (91.65% F1-score), whose recall is very low (87.30%), ''IC'' (91.09% F1-score), with low precision and recall (91.18% precision, 91.00% recall), and ''transistor'' (90.62% F1-score), likewise with low precision and recall (90.35% precision, 90.62% recall). Overall, the performance of the model is still good, similar to that of the two previous models, especially ResNet101.
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Resnet152 target confusion matrix.png|center|thumb|500x500px|Confusion matrix of ResNet152 model on target device after quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Target device, classification report
|}
To perform inference over the images with 1 thread, only one DPU core is used, leading to almost 80% utilization of the DPU-01 core. Increasing the number of threads, e.g. to 4, engages more cores and raises the utilization, very close to 100% on both the DPU-00 and DPU-01 cores. Concerning DPU latency, with 1 thread the average latency per image is about 29 ms (28867.86 μs). Increasing the concurrency raises the latency on both cores: about 33 ms (32702.59 μs) on DPU-00 and 30 ms (30046.64 μs) on DPU-01 with 2 threads, and about 34 ms (33826.30 μs) on DPU-00 and 31 ms (30834.46 μs) on DPU-01 with 4 concurrent threads.
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Resnet152 cores utilization.png|thumb|500x500px|Utilization of CPU and DPU cores of ResNet152 model for 1, 2, and 4 threads]]
|
|}
 
===InceptionV4===
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:InceptionV4 train and validation accuracy.png|thumb|500x500px|Train and validation accuracy trend over 1000 training epochs for InceptionV4 model]]
|}
The model, before quantization with the ''vai_q_tensorflow'' tool, has an overall '''''accuracy of 92.68%''''' and a weighted average '''''F1-score of 92.69%''''' on the test subset, showing a good generalization capability on unseen samples, although lower than the three ResNet models. The classes with the highest F1-score, above 96.00%, are ''resistor'' (97.56%), ''capacitor'' (96.81%), and ''inductor'' (96.38%). However, on the three remaining classes the model performs poorly compared with the previous models, with an F1-score below 90.00% for the ''diode'' class (87.94%) and the ''transistor'' class (87.27%), because of low precision and recall for the former (88.38% precision, 87.50% recall) and low precision for the latter (83.67%).
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:InceptionV4 host confusion matrix.png|center|thumb|500x500px|Confusion matrix of InceptionV4 model on host machine before quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Host machine, classification report
|}
After quantization with the ''vai_q_tensorflow'' tool and deployment on the target device, the model has an overall '''''accuracy of 88.87%''''' and a weighted average '''''F1-score of 88.91%''''' on the test subset. The model still performs well on the ''resistor'' class (97.65% F1-score), but for the remaining classes there is a substantial drop in this metric. The classes with the worst results are ''diode'' (85.15% F1-score), ''IC'' (83.27% F1-score), and ''transistor'' (81.97% F1-score). Overall, the performance of the model is still acceptable, but decidedly lower than that of the ResNet models analyzed previously.
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:InceptionV4 target confusion matrix.png|center|thumb|500x500px|Confusion matrix of InceptionV4 model on target device after quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Target device, classification report
|}
To perform inference over the images with 1 thread, only one DPU core is used, leading to almost 70% utilization of the DPU-01 core. Increasing the number of threads, e.g. to 4, engages more cores and raises the utilization, very close to 100% on both the DPU-00 and DPU-01 cores. Concerning DPU latency, with 1 thread the average latency per image is about 30 ms (30127.38 μs). Increasing the concurrency raises the latency on both cores: about 34 ms (34105.45 μs) on DPU-00 and 31 ms (30981.59 μs) on DPU-01 with 2 threads, and about 35 ms (35273.61 μs) on DPU-00 and 32 ms (31761.21 μs) on DPU-01 with 4 concurrent threads.
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Inception v4 cores utilization.png|thumb|500x500px|Utilization of CPU and DPU cores of InceptionV4 model for 1, 2, and 4 threads]]
|
|}
 
===Inception ResNet V1===
<!--Start of table definition-->
{| style="background:transparent; color:black" align="center" cellpadding="10px" cellspacing="0px"
|- align="center"
|
|[[File:Inception ResNet V1 train and validation accuracy.png|thumb|500x500px|Train and validation accuracy trend over 1000 training epochs for Inception ResNet V1 model]]
|}
The model, before quantization with the ''vai_q_tensorflow'' tool, has an overall '''''accuracy of 97.66%''''' and a weighted average '''''F1-score of 97.36%''''' on the test subset, showing a very high generalization capability on unseen samples. All the classes have an F1-score above 96.00%, with a particularly high value for the ''resistor'' class (98.50%).
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px" style="vertical-align: middle;" |[[File:Inception ResNet V1 host confusion matrix.png|center|thumb|500x500px|Confusion matrix of Inception ResNet V1 model on host machine before quantization]]
| width="200px" style="vertical-align: middle;" |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Host machine, classification report
|}
After quantization with the ''vai_q_tensorflow'' tool and deployment on the target device, the model has an overall '''''accuracy of 93.34%''''' and a weighted average '''''F1-score of 93.34%''''' on the test subset. The model still performs very well on three classes, i.e. ''resistor'' (97.12% F1-score), ''inductor'' (97.00% F1-score), and ''capacitor'' (96.59% F1-score), keeping their F1-scores above 96.00%. For the remaining classes, however, the value of this metric drops substantially. The classes with the worst results are ''IC'' (89.41% F1-score), because of its low precision (84.12%), and ''transistor'' (87.75% F1-score), because of its very low recall (82.80%). Overall, the performance of the model is still good, similar to that obtained with the ResNet models.
{| align="center" style="background: transparent; margin: auto; width: 60%;"
|-
| width="200px " style = " vertical-align: center; " |[[File:Inception ResNet V1 target confusion matrix.png|center|thumb|500x500px|Confusion matrix of Inception ResNet V1 model on target device after quantization]]| width="200px " style = " vertical-align: center; " |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Target device, classification report
|}
|}
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 60% utilization of the DPU-01 core. By increasing the number of threads, e.g. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 18 ms (17651.31 μs). By increasing the concurrency, the latency for both cores is higher: about 21 ms (20511.79 μs) for the DPU-00 core and 18 ms (18466.97 μs) for the DPU-01 core when using 2 threads, and about 22 ms (21654.99 μs) for the DPU-00 core and 20 ms (19503.17 μs) for the DPU-01 core with 4 concurrent threads.
<!--Start of table definition-->
{| style="background:transparent; color:black" border="0" height="550" align="center" cellpadding="10px" cellspacing="0px" valign="bottom"
|- align="center"
|
|[[File:Inception resnet v1 cores utilization.png|thumb|500x500px|Utilization of CPU and DPU cores of Inception ResNet V1 model for 1, 2, and 4 threads]]
|
|}
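The relation between the per-image latencies quoted above (reported in microseconds) and the per-core throughput can be checked with a quick conversion. A sketch, using the single-thread average latency measured for Inception ResNet V1:

```python
# Convert a per-image DPU latency (in microseconds) to milliseconds
# and to an estimated per-core throughput in frames per second.
def latency_to_fps(latency_us):
    latency_ms = latency_us / 1000.0
    fps = 1_000_000.0 / latency_us
    return latency_ms, fps

# Single-thread average latency for Inception ResNet V1 (from the text)
ms, fps = latency_to_fps(17651.31)
print(f"{ms:.2f} ms -> {fps:.1f} FPS per core")
```

Note this is the DPU compute latency only; end-to-end throughput is lower once CPU pre- and post-processing are included.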
 
===Inception ResNet V2===
<!--Start of table definition-->
{| style="background:transparent; color:black" border="0" height="550" align="center" cellpadding="10px" cellspacing="0px" valign="bottom"
|- align="center"
|
|[[File:Inception ResNet V2 train and validation accuracy.png|thumb|500x500px|Train and validation accuracy trend over 1000 training epochs for Inception ResNet V2 model]]
|}
Before quantization with the ''vai_q_tensorflow'' tool, the model achieves an overall '''''accuracy of 97.53%''''' and an overall weighted-average '''''F1-score of 97.53%''''' on the test subset of the dataset, showing a very high generalization capability on unseen samples. Five classes have an F1-score above 96.00%, with particularly high values for the ''inductor'' (98.66% F1-score) and ''resistor'' (98.55% F1-score) classes. The worst result is displayed by the ''transistor'' class, whose F1-score falls below 96.00% but remains very close to it (95.86%), mainly due to low precision (93.36%).
{| align="center" style = "background: transparent; margin: auto; width: 60%;"
|-
| width="200px " style = " vertical-align: center; " |[[File:Inception ResNet V2 host confusion matrix.png|center|thumb|500x500px|Confusion matrix of Inception ResNet V2 model on host machine before quantization]]| width="200px " style = " vertical-align: center; " |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Host machine, classification report
|}
|}
After quantization with the ''vai_q_tensorflow'' tool and deployment on the target device, the model achieves an overall '''''accuracy of 93.34%''''' and an overall weighted-average '''''F1-score of 93.34%''''' on the test subset of the dataset. The model still performs very well in the ''resistor'' (98.07% F1-score) and ''capacitor'' (96.23% F1-score) classes, keeping an F1-score above 96.00%. For the remaining classes, however, the metric is reduced. In particular, the worst results are found in the ''IC'' class (90.80% F1-score), which has low precision and recall (91.73% precision, 89.90% recall), and in the ''transistor'' class, which has low precision (87.88%).
{| align="center" style = "background: transparent; margin: auto; width: 60%;"
|-
| width="200px " style = " vertical-align: center; " |[[File:Inception ResNet V2 target confusion matrix.png|center|thumb|500x500px|Confusion matrix of Inception ResNet V2 model on target device after quantization]]| width="200px " style = " vertical-align: center; " |
{| class="wikitable" style="margin: auto; text-align: center;"
|+ Target device, classification report
|}
|}
 
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 65% utilization of the DPU-01 core. By increasing the number of threads, e.g. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 25 ms (25185.03 μs). By increasing the concurrency, the latency for both cores is higher: about 29 ms (28858.88 μs) for the DPU-00 core and 26 ms (26336.11 μs) for the DPU-01 core when using 2 threads, and about 30 ms (30229.27 μs) for the DPU-00 core and 27 ms (27452.70 μs) for the DPU-01 core with 4 concurrent threads.
 
<!--Start of table definition-->
{| style="background:transparent; color:black" border="0" height="550" align="center" cellpadding="10px" cellspacing="0px" valign="bottom"
|- align="center"
|
|[[File:Inception resnet v2 cores utilization.png|thumb|500x500px|Utilization of CPU and DPU cores of Inception ResNet V2 model for 1, 2, and 4 threads]]
|
|}
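As a rough cross-check of the figures above, an upper bound on the aggregate throughput with both DPU cores busy can be estimated from the per-core latencies. This sketch ignores scheduling and CPU pre/post-processing overhead, so the measured FPS will be somewhat lower:

```python
# Estimate aggregate throughput as the sum of per-core throughputs,
# each derived from that core's average per-image latency.
def aggregate_fps(latencies_us):
    return sum(1_000_000.0 / lat for lat in latencies_us)

# Inception ResNet V2, 2 threads: DPU-00 and DPU-01 average latencies
est = aggregate_fps([28858.88, 26336.11])
print(f"~{est:.0f} FPS upper bound")
```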
 
==Comparison==
Considering first the accuracy of the models before quantization, the ones with the highest capability of correctly classifying the test samples are, in descending order, Inception ResNet V2, Inception ResNet V1, and ResNet101. These three models show an accuracy above 97%. In contrast, the models that display two of the lowest accuracy values are ResNet50 and Inception V4. After quantization, the situation changes radically: ResNet101 moves to the top of the list, followed by ResNet50, while Inception ResNet V1 and Inception ResNet V2 stand at the bottom, with an accuracy drop of 6.65% for the former and 5.55% for the latter. Moreover, the worst model among those analyzed is Inception V4, with an accuracy below 90%.
 
[[File:Pre and post quantization accuracy.png|center|thumb|500x500px|Models pre and post quantization accuracy with vai_q_tensorflow tool]]
 
As mentioned before, two other aspects should be taken into account when comparing the models: the DPU Kernel parameters size and the total tensor count. Recall that these two values can be easily retrieved from the Vitis-AI compiler log file when compiling a model, or by executing the ''ddump'' command on the target device.
*'''Parameters size''': indicates the amount of memory occupied by the DPU Kernel parameters (weights and biases), expressed in MB, kB, or bytes. The greater the number of parameters of the model implemented on the host, the greater the amount of memory occupied on the target device.
*'''Total tensor count''': is the total number of DPU tensors for a DPU Kernel. This value depends on the number of layers stacked between the input and output layers of the model: the greater the number of stacked layers, the higher the number of tensors, leading to a more complex computation on the DPU. This is directly responsible for increasing the time required for a single inference on a single image.
<!--Start of table definition-->
{| style="background:transparent; color:black" border="0" height="550" align="center" cellpadding="10px" cellspacing="0px" valign="bottom"
|- align="center"
|
|[[File:DPU Kernel parameters size.png|thumb|500x500px|Deployed models DPU Kernel parameters size]]
|
|}
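Since the DPU runs INT8-quantized models, the DPU Kernel ''parameters size'' can be estimated directly from the parameter count: roughly one byte per weight or bias, against four bytes for the original float32 model on the host. A sketch with a hypothetical parameter count (not taken from any of the compiled models):

```python
# Estimate the memory footprint of a model's parameters.
# INT8 quantization stores each weight/bias in 1 byte, versus
# 4 bytes per parameter for the float32 model on the host.
def params_size_mb(n_params, bytes_per_param=1):
    return n_params * bytes_per_param / (1024 * 1024)

# Hypothetical model with 25 million parameters
print(f"float32: {params_size_mb(25_000_000, 4):.1f} MB")
print(f"INT8:    {params_size_mb(25_000_000, 1):.1f} MB")
```

The actual figure reported by the Vitis-AI compiler can differ slightly, since the DPU Kernel may fold or add parameters during compilation.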
 
The two figures below show the DPU core latency for 1, 2, and 4 threads. It is interesting to note that the core latency of Inception ResNet V1 is lower than that of ResNet152, even though they have a similar ''total tensor count'' and different values of DPU Kernel ''parameters size'' (actually greater for ResNet152). Vice versa, ResNet101 and Inception V4 have a similar DPU Kernel ''parameters size'' and different values of ''total tensor count'', and in this case the core latency is higher for the latter. The same observation can be made for the ResNet50 and Inception ResNet V1 models, leading to the following statements:
*with the same ''total tensor count'', the latency decreases if the ''parameters size'' lowers;
*with the same DPU Kernel ''parameters size'', the latency decreases if the ''total tensor count'' lowers.
These considerations suggest that the best models among the implemented ones are ResNet50, ResNet101, and Inception ResNet V1.
<!--Start of table definition-->
{| style="background:transparent; color:black" border="0" height="550" align="center" cellpadding="10px" cellspacing="0px" valign="bottom"
|- align="center"
|
|[[File:DPU-00 core latency for 1-2-4 threads.png|thumb|500x500px|Deployed models DPU-00 core latency for 1, 2, and 4 threads]]
|
|}
 
Finally, it is possible to evaluate the DPU throughput in relation to the number of threads used by the benchmark application. As the figure below shows, all the models have similar FPS values with 1 thread, but as the level of concurrency increases the difference becomes more and more evident.
 
 
[[File:DPU throughput for 1-2-4 threads.png|center|thumb|500x500px|Deployed models DPU throughput for 1, 2, and 4 threads]]
 
In conclusion, summing up all the considerations made so far, the solution offering the best compromise between accuracy and inference latency is the ResNet50 model, followed by the ResNet101 and Inception ResNet V1 models.