ML-TN-003 — AI at the edge: visual inspection of assembled PCBs for defect detection — Part 2

==History==
{| class="wikitable"
!Version
!Date
!Notes
|-
|1.0.0
|June 2021
|First public release
|}
 
==Introduction==
This Technical Note (TN for short) belongs to the series introduced [[ML-TN-003 — AI at the edge: visual inspection of assembled PCBs for defect detection — Part 1|here]]. Specifically, it illustrates the first issue to be considered when implementing a device capable of spotting mounting defects on an assembled PCB: detecting and recognizing the electronic components populating the board. In ML terminology, we are dealing again with a classification problem.

From an engineering perspective, one of the main goals of this work is still the evaluation of the performance of Xilinx's DPU on inference tasks, in terms of latency and throughput, as Xilinx devices are in principle good candidates for implementing an automatic visual inspection machine as described in the [[ML-TN-003 — AI at the edge: visual inspection of assembled PCBs for defect detection — Part 1|opening article]] of this series. It is worth remembering that the characterization of the Xilinx DPU on a classification task is covered in [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_3|this TN]] as well. Nevertheless, coping with another classification task also allowed us to evaluate how different deep CNN models perform on the same problem.
  
 
==Test Bed==
The following table details the test bed used for this Technical Note.

{| class="wikitable" style="margin: auto;"
|+ Host and target configurations
!System
!Component
!Name
!Version
|-
| rowspan="3" |'''Host'''
|Operating system
|GNU/Linux Ubuntu
|18.04
|-
|Software development platform
|Vitis
|1.2
|-
|Machine learning framework
|TensorFlow
|1.15.2
|-
| rowspan="4" |'''Target'''
|Hardware platform
|ZCU104
|1.0
|-
|Linux BSP
|Petalinux
|2020.1
|-
|Software binary image (microSD card)
|xilinx-zcu104-dpu-v2020.1-v1.2.0
|v2020.1-v1.2.0
|-
|Neural network hardware accelerator
|DPU
|3.3
|}

For more details about Xilinx's hardware configuration and the usage of the Vitis-AI software platform, please refer to [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 3|this article]].
  
 
==FICS-PCB dataset overview==
Over the years, computer vision and ML disciplines have considerably advanced the field of Automated Visual Inspection for Printed Circuit Board (PCB-AVI) assurance. It is well known that, to develop a robust model for any ML-based application, a dataset with as many examples as possible is required for better generalization. Although a few large datasets for PCB-AVI are publicly available, they lack the variances that simulate real-world scenarios, such as illumination and scale variations, which are necessary for developing robust PCB-AVI approaches. Therefore, to represent such non-ideal conditions, the FICS-PCB dataset was proposed for evaluating and improving methods for PCB-AVI. This dataset consists of PCB images featuring multiple types of components and various image conditions to facilitate performance evaluation in challenging scenarios that are likely to be encountered in practice.

The dataset consists of 9,912 images of 31 PCB samples and contains a total of 77,347 labeled components distributed in six classes: ''IC'', ''capacitor'', ''diode'', ''inductor'', ''resistor'', and ''transistor''. These components were collected using two image sensor types, a digital microscope and a Digital Single-Lens Reflex (DSLR) camera. To ensure that the dataset also includes samples that represent variations in illumination, the authors collected images using three different intensities of the built-in ring light of the microscope, i.e. 20, 40, and 60, where 60 represents the brightest illumination. In addition, variations in scale were included using three different magnifications, i.e. 1×, 1.5×, and 2×.

[[File:FICS-PCB samples.png|center|thumb|500x500px|FICS-PCB dataset, examples of six types of components]]

This dataset is highly unbalanced, having a lot of samples for only two classes, i.e. ''capacitor'' and ''resistor''. In this situation, it is not a good idea to use the dataset as it is, simply because the models would be trained on image batches mainly composed of the most common components, hence learning only a restricted number of features. As a consequence, the models would probably be very good at classifying the ''capacitor'' and ''resistor'' classes and pretty bad at classifying the remaining ones. Therefore, the missing data should be compensated for with oversampling.
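As an illustration, a naive random-oversampling step could look like the following minimal sketch; the function and variable names are hypothetical and not part of the toolchain used in this TN:

<syntaxhighlight lang="python">
# Minimal random-oversampling sketch: minority classes are re-sampled with
# replacement until every class matches the size of the largest one.
# file_paths/labels are hypothetical lists describing the component crops.
import random
from collections import defaultdict

def oversample(file_paths, labels, seed=42):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in zip(file_paths, labels):
        by_class[label].append(path)
    target = max(len(paths) for paths in by_class.values())
    balanced = []
    for label, paths in by_class.items():
        # Duplicate randomly chosen samples of the under-represented classes.
        extra = [rng.choice(paths) for _ in range(target - len(paths))]
        balanced.extend((p, label) for p in paths + extra)
    rng.shuffle(balanced)
    return balanced
</syntaxhighlight>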
Before proceeding further, please note that the number of DSLR subset examples is by far lower than the number of Microscope subset samples. As the two subsets were acquired using two different kinds of instruments, their characteristics (the resolution, for example) differ significantly. In order to have images with homogeneous characteristics, it is preferable to keep only one of the two subsets, specifically the most numerous one.

[[File:Samples per class in Microscope and DSLR subsets.png|center|thumb|500x500px|FICS-PCB dataset, component count per class in DSLR and Microscope subsets]]
  
The dataset used in this work was created by randomly sampling 150,000 component images from the Microscope subset of the FICS-PCB dataset, providing a total of 25,000 images per class. 72% of the images were used for training, 24% were used as a validation subset during training, and the remaining 4% were used as a test set, resulting in exactly 108,000 training images, 36,000 validation images, and 6,000 test images, equally distributed among the six classes of the dataset. Each image was preprocessed and padded with a constant value in order to adapt its scale to the input tensor size of the models; during this process, the aspect ratio of the image was not modified. To increase variety among the examples, random contrast, brightness, saturation, and rotation were applied too. The ''diode'', ''inductor'', ''IC'', and ''transistor'' classes were oversampled.

[[File:Dataset processing and augmentation.png|center|thumb|500x500px|FICS-PCB dataset: An example of image augmentation as compensation for lack of data in IC, diode, inductor, and transistor classes]]
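The preprocessing and augmentation steps described above could be sketched with TensorFlow's tf.image module as follows. This is only a minimal illustration: the input size and the jitter ranges are assumptions, as the exact values are not reported here.

<syntaxhighlight lang="python">
# Minimal preprocessing/augmentation sketch (TensorFlow 1.15).
# INPUT_SIZE and all jitter ranges below are assumptions.
import tensorflow as tf

INPUT_SIZE = 224  # assumed input tensor size of the models

def preprocess(image):
    # Resize so that the component fits INPUT_SIZE x INPUT_SIZE and pad the
    # remainder with a constant value (zeros); the aspect ratio is preserved.
    return tf.image.resize_with_pad(image, INPUT_SIZE, INPUT_SIZE)

def augment(image):
    # Random photometric jitter plus a random multiple-of-90-degree rotation
    # to increase variety among the (over)sampled examples.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.rot90(image, k=k)
</syntaxhighlight>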
==Training configuration and hyperparameters setup==
The training was done in the cloud using [https://colab.research.google.com/ Google Colab]. All the models were trained with the same configuration for 1000 epochs, feeding a mini-batch of 32 images at each step of an epoch. Adam was chosen as the optimizer, with an initial learning rate of 0.0001 and an exponential decay schedule with a decay rate of 0.96. The dropout rate was set to 0.4 for all models. The patience for early stopping was set to 100 epochs. The training images were further augmented with random zoom, shift, and rotation in order to improve model robustness on the validation and test subsets and to prevent the risk of overfitting.
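A minimal sketch of this common training configuration, written against the tf.keras API of TensorFlow 1.15, could look like the following. The decay_steps value, the augmentation ranges, and the checkpoint file name are assumptions; model, x_train, y_train, x_val, and y_val are placeholders.

<syntaxhighlight lang="python">
# Sketch of the common training configuration described above (TF 1.15).
import tensorflow as tf

# Exponential decay of the learning rate, starting from 1e-4.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=10000,        # assumption: not stated in this TN
    decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Train-time augmentation: random zoom, shift, and rotation.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15, width_shift_range=0.1,
    height_shift_range=0.1, zoom_range=0.1)

callbacks = [
    # Stop the training if the validation loss does not improve for 100 epochs.
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100),
    # Save a checkpoint every time the validation loss improves.
    tf.keras.callbacks.ModelCheckpoint('best_model.h5', monitor='val_loss',
                                       save_best_only=True),
]

# model.compile(optimizer=optimizer, loss='categorical_crossentropy',
#               metrics=['accuracy'])
# model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
#                     validation_data=(x_val, y_val), epochs=1000,
#                     callbacks=callbacks)
</syntaxhighlight>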
 
 
[[File:Image augmentation for training samples.png|center|thumb|500x500px|FICS-PCB dataset: An example of image augmentation on training images to increase the robustness of the models]]
 
==Proposed models==
Six well-known deep CNN architectures belonging to the ResNet and Inception families were trained on the dataset: ResNet50, ResNet101, ResNet152, Inception V4, Inception ResNet V1, and Inception ResNet V2. All of them were trained with the common configuration described in the previous section; no TensorFlow pruning was applied. For each model, the following subsections report the classification metrics (accuracy, precision, recall, and F1-score) measured on the test subset before and after quantization, together with the DPU utilization and latency measured on the target device.
 
 
 
===ResNet50===
During the training phase, the model shows an increasing trend in accuracy over both the training subset and the validation subset. This is a sign that the model is learning correctly, since it is not underfitting the training data. Furthermore, by looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the training data. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at '''''epoch 993''''' with an '''''accuracy of 93.59%''''' and a '''''loss of 0.1912''''' on the validation data.
  
Before performing the quantization with the ''vai_q_tensorflow'' tool, the model has an overall '''''accuracy of 94.85%''''' and an overall weighted average '''''F1-score of 94.86%''''' over the test subset of the dataset, showing a good generalization capability on unseen samples. The classes with the highest F1-score, above 96.00%, are ''resistor'' (98.08% F1-score), ''inductor'' (97.10% F1-score), and ''capacitor'' (96.88% F1-score). On the contrary, the class in which the model performs poorly w.r.t. the others is the ''diode'' class (91.75% F1-score), which is attributable to a low precision (88.55%).
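For reference, per-class and weighted-average metrics like the ones quoted in this section can be computed with scikit-learn. The snippet below is a generic sketch, assuming y_true and y_pred hold the ground-truth and predicted labels of the 6,000 test images:

<syntaxhighlight lang="python">
# Generic sketch of how the per-class and weighted-average metrics quoted
# in this section can be computed with scikit-learn.
from sklearn.metrics import accuracy_score, classification_report

CLASSES = ['IC', 'capacitor', 'diode', 'inductor', 'resistor', 'transistor']

def print_metrics(y_true, y_pred):
    print('Accuracy: {:.2%}'.format(accuracy_score(y_true, y_pred)))
    # Per-class precision/recall/F1, plus the weighted averages.
    print(classification_report(y_true, y_pred,
                                target_names=CLASSES, digits=4))
</syntaxhighlight>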
  
After performing the quantization with the ''vai_q_tensorflow'' tool and deploying the model on the target device, the model has an overall '''''accuracy of 93.27%''''' and an overall weighted average '''''F1-score of 93.29%''''' on the test subset of the dataset. The model still performs well in correctly classifying samples belonging to the ''resistor'' class (98.08% F1-score), the ''inductor'' class (97.10% F1-score), and the ''capacitor'' class (96.88% F1-score). The worst results in the classification task can be found:

* in the ''transistor'' class (89.78% F1-score), because both the measured precision and recall metrics are below 90.00% (89.96% precision and 89.60% recall)
* in the ''diode'' class (88.59% F1-score), because the precision metric is very low (83.77% precision).
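For the sake of clarity, ''vai_q_tensorflow'' performs post-training quantization driven by a user-supplied Python calibration function. The following is a hypothetical example of such a function; the node names, file names, and iteration count are illustrative and must match the actual frozen graph:

<syntaxhighlight lang="python">
# calib.py: hypothetical calibration input function for vai_q_tensorflow.
# The quantizer calls calib_input() once per calibration iteration, e.g.:
#   vai_q_tensorflow quantize --input_frozen_graph frozen_resnet50.pb \
#       --input_nodes input_1 --input_shapes ?,224,224,3 \
#       --output_nodes predictions/Softmax \
#       --input_fn calib.calib_input --calib_iter 100
import numpy as np

CALIB_BATCH_SIZE = 32
# Assumption: preprocessed calibration images stored in a NumPy archive.
images = np.load('calib_images.npy')

def calib_input(iter):
    # Return one batch of images per iteration, keyed by the input node name.
    batch = images[iter * CALIB_BATCH_SIZE:(iter + 1) * CALIB_BATCH_SIZE]
    return {'input_1': batch}
</syntaxhighlight>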
  
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 55% utilization of the DPU-01 core. By increasing the number of threads, i.e. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and close to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 12 ms (11526.41 μs). By increasing the concurrency, the latency for both cores is higher: about 13 ms (13318.01 μs) for the DPU-00 core and 12 ms (12019.21 μs) for the DPU-01 core when using 2 threads, and about 14 ms (14200.19 μs) for the DPU-00 core and 13 ms (12776.24 μs) for the DPU-01 core with 4 concurrent threads.
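As a rough sanity check of the figures above, the aggregate throughput can be estimated from the per-core latencies, under the simplifying assumption that the two DPU cores work on independent images and are kept constantly busy:

<syntaxhighlight lang="python">
# Back-of-the-envelope throughput estimate from the per-core latencies
# reported above for ResNet50 with 4 concurrent threads. It assumes the
# two DPU cores process independent images and are never starved.
latency_us = {'DPU-00': 14200.19, 'DPU-01': 12776.24}

fps = sum(1e6 / lat for lat in latency_us.values())
print(f'Estimated aggregate throughput: {fps:.1f} FPS')  # ~148.7 FPS
</syntaxhighlight>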
 
 
 
 
===ResNet101===
  
Before performing the quantization with the ''vai_q_tensorflow'' tool, the model has an overall '''''accuracy of 97.10%''''' and an overall weighted average '''''F1-score of 97.11%''''' over the test subset of the dataset, showing a very high generalization capability on unseen samples. All the classes but one have an F1-score above 96.00%; in particular, it is very high in the ''resistor'' class (98.65% F1-score) and in the ''inductor'' class (98.50% F1-score). The only exception is the ''diode'' class (95.40% F1-score), mainly because of a low recall (94.40%).
  
After performing the quantization with the ''vai_q_tensorflow'' tool and deploying the model on the target device, the model has an overall '''''accuracy of 93.95%''''' and an overall weighted average '''''F1-score of 93.91%''''' on the test subset of the dataset. The model still performs very well in correctly classifying samples belonging to the ''capacitor'' class, keeping the F1-score above 96.00% (97.03%). On the other hand, for the remaining classes there is a substantial reduction in the value of this metric. The classes that exhibit the worst results are the ''diode'' class (92.09% F1-score) and the ''IC'' class (92.06% F1-score), because both show a low recall (90.30% for the former, 88.20% for the latter). In general, the performance of the model is still good, similar to the one obtained with the ResNet50 model.
  
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 70% utilization of the DPU-01 core. By increasing the number of threads, i.e. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and close to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 21 ms (21339.73 μs). By increasing the concurrency, the latency for both cores is higher: about 24 ms (24313.61 μs) for the DPU-00 core and 22 ms (22231.22 μs) for the DPU-01 core when using 2 threads, and about 25 ms (25385.51 μs) for the DPU-00 core and 23 ms (23025.89 μs) for the DPU-01 core with 4 concurrent threads.

===ResNet152===
  
Before performing the quantization with the ''vai_q_tensorflow'' tool, the model has an overall '''''accuracy of 96.46%''''' and an overall weighted average '''''F1-score of 96.48%''''' over the test subset of the dataset, showing a good generalization capability on unseen samples. The classes with the highest F1-score, above 96.00%, are the ''resistor'' class (98.58%), the ''inductor'' class (98.03%), and the ''capacitor'' class (96.99%). The worst performance is the one displayed by the ''transistor'' class, with an F1-score of "only" about 94.00% (94.18%), mainly because the model exhibits a low precision in this class (92.89%).
  
After performing the quantization with the ''vai_q_tensorflow'' tool and deploying the model on the target device, the model has an overall '''''accuracy of 93.40%''''' and an overall weighted average '''''F1-score of 93.36%''''' on the test subset of the dataset. The model still performs very well in correctly classifying samples belonging to the ''capacitor'' class, keeping an F1-score above 96.00% (96.62%). On the other hand, for the remaining classes there is a substantial reduction in the value of this metric. The classes that exhibit the worst results are the ''diode'' class (91.65% F1-score), because the recall is very low (87.30%), the ''IC'' class (91.09% F1-score), with low values for both precision and recall (91.18% and 91.00%, respectively), and the ''transistor'' class (90.62% F1-score), with low precision and recall (90.35% and 90.62%) in the same way as the previous case. In general, the performance of the model is still good, similar to the one obtained with the two previous models, especially ResNet101.
  
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 80% utilization of the DPU-01 core. By increasing the number of threads, i.e. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on both the DPU-00 and DPU-01 cores. Concerning the DPU latency, with 1 thread the average latency for one image is about 28 ms (28867.86 μs). By increasing the concurrency, the latency for both cores is higher: about 33 ms (32702.59 μs) for the DPU-00 core and 30 ms (30046.64 μs) for the DPU-01 core when using 2 threads, and about 34 ms (33826.30 μs) for the DPU-00 core and 30 ms (30834.46 μs) for the DPU-01 core with 4 concurrent threads.

===Inception V4===
  
Before performing the quantization with the ''vai_q_tensorflow'' tool, the model has an overall '''''accuracy of 92.68%''''' and an overall weighted average '''''F1-score of 92.69%''''' over the test subset of the dataset, showing a good generalization capability on unseen samples, although lower than that of the three ResNet models. The classes with the highest F1-score, above 96.00%, are ''resistor'' (97.56%), ''capacitor'' (96.81%), and ''inductor'' (96.38%). However, the model performance on the three remaining classes is poor compared with the previous models, showing an F1-score below 90.00% in the ''diode'' class (87.94%) and in the ''transistor'' class (87.27%), due to low precision and recall for the former (88.38% precision, 87.50% recall) and a low precision for the latter (83.67%).
  
After performing the quantization with the ''vai_q_tensorflow'' tool and deploying the model on the target device, the model has an overall '''''accuracy of 88.87%''''' and an overall weighted average '''''F1-score of 88.91%''''' on the test subset of the dataset. The model still performs well in correctly classifying samples belonging to the ''resistor'' class (97.65% F1-score). On the other hand, for the remaining classes there is a substantial reduction in the value of this metric. The classes that exhibit the worst results are the ''diode'' class (85.15% F1-score), the ''IC'' class (83.27% F1-score), and the ''transistor'' class (81.97% F1-score). In general, the performance of the model is still good, but it is definitely lower than that of the ResNet models analyzed previously.
  
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 70% utilization of the DPU-01 core. By increasing the number of threads, i.e. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on both the DPU-00 and DPU-01 cores. Concerning the DPU latency, with 1 thread the average latency for one image is about 30 ms (30127.38 μs). By increasing the concurrency, the latency for both cores is higher: about 34 ms (34105.45 μs) for the DPU-00 core and 31 ms (30981.59 μs) for the DPU-01 core when using 2 threads, and about 35 ms (35273.61 μs) for the DPU-00 core and 32 ms (31761.21 μs) for the DPU-01 core with 4 concurrent threads.

===Inception ResNet V1===
  
Before performing the quantization with the ''vai_q_tensorflow'' tool, the model has an overall '''''accuracy of 97.66%''''' and an overall weighted average '''''F1-score of 97.36%''''' over the test subset of the dataset, showing a very high generalization capability on unseen samples. All the classes have an F1-score above 96.00%, which is actually very high for the ''resistor'' class (98.50%).
  
After performing the quantization with the ''vai_q_tensorflow'' tool and deploying the model on the target device, the model has an overall '''''accuracy of 93.34%''''' and an overall weighted average '''''F1-score of 93.34%''''' on the test subset of the dataset. The model still performs very well in correctly classifying samples belonging to the ''resistor'' class (97.12% F1-score), the ''inductor'' class (97.00% F1-score), and the ''capacitor'' class (96.59% F1-score), keeping an F1-score above 96.00%. However, for the remaining classes the value of the metric is substantially reduced. The classes that exhibit the worst results are the ''IC'' class (89.41% F1-score), because of a low precision (84.12%), and the ''transistor'' class (87.75% F1-score), because of a very low recall (82.80%). In general, the performance of the model is still good, similar to the one obtained with the ResNet models.
  
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 60% utilization of the DPU-01 core. By increasing the number of threads, i.e. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 18 ms (17651.31 μs). By increasing the concurrency, the latency for both cores is higher: about 21 ms (20511.79 μs) for the DPU-00 core and 18 ms (18466.97 μs) for the DPU-01 core when using 2 threads, and about 22 ms (21654.99 μs) for the DPU-00 core and 20 ms (19503.17 μs) for the DPU-01 core with 4 concurrent threads.

===Inception ResNet V2===
  
Before performing the quantization with the ''vai_q_tensorflow'' tool, the model has an overall '''''accuracy of 97.53%''''' and an overall weighted average '''''F1-score of 97.53%''''' over the test subset of the dataset, showing a very high generalization capability on unseen samples. Five classes have an F1-score above 96.00%, which is actually very high for the ''inductor'' class (98.66%) and the ''resistor'' class (98.55%). The worst result is the one displayed by the ''transistor'' class, with an F1-score below 96.00% but still very close to it (95.86%), mainly due to a low precision (93.36%).
  
After performing the quantization with the ''vai_q_tensorflow'' tool and deploying the model on the target device, the model has an overall '''''accuracy of 93.34%''''' and an overall weighted average '''''F1-score of 93.34%''''' on the test subset of the dataset. The model still performs very well in correctly classifying samples belonging to the ''resistor'' class (98.07% F1-score) and the ''capacitor'' class (96.23% F1-score), keeping an F1-score above 96.00%. However, for the remaining classes the value of the metric is reduced. In particular, the worst results can be found in the ''IC'' class (90.80% F1-score), with low values for both precision and recall (91.73% and 89.90%, respectively), and in the ''transistor'' class, with a low precision (87.88%).
 
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 65% utilization of the DPU-01 core. By increasing the number of threads, i.e. with 4 threads, more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 25 ms (25185.03 μs). By increasing the concurrency, the latency for both cores is higher: about 29 ms (28858.88 μs) for the DPU-00 core and 26 ms (26336.11 μs) for the DPU-01 core when using 2 threads, and about 30 ms (30229.27 μs) for the DPU-00 core and 27 ms (27452.70 μs) for the DPU-01 core with 4 concurrent threads.
  
 
==Comparison==
After reviewing all the created models, showing their performances in terms of accuracy and other classification metrics such as precision, recall, and F1-score, and after evaluating the DPU usage and latency for a single inference over the test samples, a comparison between them can be made. The aim is to understand whether, among the proposed models, there is one that can be considered the best for solving the problem.

Since the original dataset was augmented to compensate for the lack of data, hence resulting in a balanced dataset with the same number of samples for each of the six classes, metrics such as precision, recall, and F1-score can be omitted and only the accuracy can be taken into account. Note that the accuracy of a model can actually be enhanced by further tweaking the training hyperparameters or simply by training the model for a higher number of epochs. Thus, the value of this metric can actually be higher (or even lower, in case of overfitting) than the one obtained for this particular configuration (all the models were trained using the same configuration).
  
 
For the purpose of this evaluation, it should be noted that considering only the accuracy might not be the best idea, because there are other elements, related to the complexity of the models, that make the choice harder. There are also features that depend exclusively on the chosen network architecture, such as the number of layers or the total number of trainable parameters (which translates into memory occupation), and that become fixed properties of the DPU Kernel after model compilation.
  
Therefore, to proceed with the evaluation, the following features must be taken into account for a better understanding of the whole situation:

* accuracy before and after quantization
* DPU Kernel parameters size and total tensor count
* DPU core latency
* DPU throughput.
 
By initially considering the accuracy of the models before the quantization, it is possible to see that the ones with the highest capability of correctly classifying the test samples are, in descending order, the Inception ResNet V2, the Inception ResNet V1, and the ResNet101. These three models show an accuracy above 97%. In contrast, the models that display two of the lowest accuracy values are the ResNet50 and the Inception V4. After the quantization, the situation changes radically: at the top of the list stands the ResNet101, followed by the ResNet50 model, while the Inception ResNet V1 and the Inception ResNet V2 fall near the bottom, with an accuracy drop of 6.65% for the former and 5.55% for the latter. Moreover, the worst model among those analyzed is the Inception V4, with an accuracy below 90%.
 
  
 
[[File:Pre and post quantization accuracy.png|center|thumb|500x500px|Models pre and post quantization accuracy with vai_q_tensorflow tool]]
 
  
 
As mentioned before, two other aspects should be taken into account when comparing the models: the DPU Kernel parameters size and the total tensor count. Recall that these two values can easily be retrieved by looking at the Vitis-AI compiler log file when compiling a model, or by executing the ''ddump'' command on the target device.
  
*'''Parameters size''': indicates, in units of MB, kB, or bytes, the amount of memory occupied by the DPU Kernel, including weights and biases. It is straightforward to check that the greater the number of parameters of the model implemented on the host, the greater the amount of memory occupied on the target device (see the sketch below).
*'''Total tensor count''': the total number of DPU tensors of a DPU Kernel. This value depends on the number of stacked layers between the input and output layers of the model; obviously, the greater the number of stacked layers, the higher the number of tensors, leading to a more complex computation on the DPU. This is directly responsible for increasing the amount of time required for a single inference on a single image.
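As a back-of-the-envelope example of the relationship between the number of parameters and the DPU Kernel parameters size, consider that the DPU works with 8-bit quantized weights. The parameter count below is the commonly published ImageNet figure for ResNet50 and is used purely as an illustration; the models in this TN have a 6-class head, so their actual counts differ slightly.

<syntaxhighlight lang="python">
# Rough estimate of the memory occupied by a quantized DPU Kernel,
# assuming one byte per parameter (INT8 weights and biases).
RESNET50_PARAMS = 25_600_000      # published ImageNet figure, illustrative

size_mb = RESNET50_PARAMS / 1e6   # one byte per INT8 parameter
print(f'~{size_mb:.1f} MB')       # roughly 26 MB
</syntaxhighlight>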
 
 
 
 
  
The two figures below show the DPU core latency for 1, 2, and 4 threads. It is interesting to note that the core latency of the Inception ResNet V1 is lower than that of the ResNet152, even though the two models have a similar ''total tensor count'' and different DPU Kernel ''parameters sizes'' (actually greater for the ResNet152). Vice versa, the ResNet101 and the Inception V4 have a similar DPU Kernel ''parameters size'' and different values of ''total tensor count'', and in this case the core latency is higher for the latter. The same observation can be made for the ResNet50 and Inception ResNet V1 models, leading to the following statements:

*with the same ''total tensor count'', the latency increases along with the DPU Kernel ''parameters size'';
*with the same DPU Kernel ''parameters size'', the latency decreases as the ''total tensor count'' decreases.
  
These considerations suggest that the best models among the implemented ones are ResNet50, ResNet101, and Inception ResNet V1.
  
Finally, it is possible to evaluate the DPU throughput in relation to the number of threads used by the benchmark application. In the figure below, it is really interesting to observe that all the models have similar FPS values in the 1-thread case, but the difference becomes more and more evident as the level of concurrency increases.

[[File:DPU throughput for 1-2-4 threads.png|center|thumb|500x500px|Deployed models DPU throughput for 1, 2, and 4 threads]]
  
 
In conclusion, summing up all the considerations made above, it is clearly evident that the solution with the best compromise between accuracy and inference latency is the ResNet50 model, followed by the ResNet101 and Inception ResNet V1 models.
 
  
 
==Useful links==


The dataset consists of 9,912 images of 31 PCB samples and contains a total of 77,347 labeled components distributed in six classes: IC, capacitor, diode, inductor, resistor, and transistor. These components were collected using two image sensor types, a digital microscope and a Digital Single-Lens Reflex (DSLR) camera. To ensure that the dataset also includes samples that represent variations in illumination, the authors collected images using three different intensities from the built-in ring light of the microscope i.e. 20, 40, and 60, where 60 represents the brightest illumination. In addition, variations in scale were included using three different magnifications i.e. 1×, 1.5×, and 2×.

FICS-PCB dataset, examples of six types of components

This dataset is highly unbalanced: only two classes, capacitor and resistor, have a large number of samples. In this situation, it is not a good idea to use the dataset as it is, simply because the models would be trained on image batches mainly composed of the most common components, hence learning only a restricted number of features. As a consequence, the models would probably be very good at classifying the capacitor and resistor classes and pretty bad at classifying the remaining ones. Therefore, the missing data should be compensated for with oversampling, as sketched below.
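
As a toy illustration of such an oversampling plan (the per-class counts below are made-up placeholders, not the exact FICS-PCB figures), one could compute how many augmented copies each minority class needs to reach a common target size:

<pre>
# Illustrative oversampling plan: replicate (and augment) minority-class
# images until every class reaches the same target count.
# NOTE: the per-class counts are made-up placeholders.
from collections import Counter

counts = Counter(capacitor=30000, resistor=28000, ic=6000,
                 diode=5000, inductor=4000, transistor=3500)
TARGET = 25000  # per-class size actually used later in this TN

for cls, n in sorted(counts.items()):
    extra = max(0, TARGET - n)
    print("%-10s: %6d samples -> %6d augmented copies needed" % (cls, n, extra))
</pre>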

Before proceeding further, please note that the number of DSLR subset examples is far lower than the number of Microscope subset samples. As the two subsets were acquired using two different kinds of instruments, their characteristics (the resolution, for example) differ significantly. In order to have images that are homogeneous with respect to these characteristics, it is preferable to keep only one of the two subsets, specifically the most numerous one.

FICS-PCB dataset, component count per class in DSLR and Microscope subsets

The dataset used for this work was created by randomly sampling 150,000 component images from the Microscope subset of the FICS-PCB dataset, providing a total of 25,000 images per class. 72% of the images were used for training, 24% were used as a validation subset during training, and the remaining 4% were used as a test set, providing exactly 108,000 training images, 36,000 validation images, and 6,000 test images, equally distributed among the six classes of the dataset. Each image was preprocessed and padded with a constant value in order to adapt its scale to the input tensor size of the models; during this process, the aspect ratio of the image was not modified. To increase variety among the examples, random contrast, brightness, saturation, and rotation were applied too. The diode, inductor, IC, and transistor classes were oversampled.
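
For reference, the following minimal TensorFlow sketch shows how such constant-value padding (preserving the aspect ratio) and the random photometric/geometric augmentations could be implemented; the input size and the parameter ranges are assumptions, as the exact values are not reported here:

<pre>
import tensorflow as tf  # TensorFlow 1.15, as per the test bed

INPUT_SIZE = 224  # assumption: depends on the model's input tensor size

def preprocess(image):
    # Resize so the image fits into INPUT_SIZE x INPUT_SIZE, then pad the
    # remaining area with a constant value (zeros); the aspect ratio is
    # not modified.
    image = tf.image.convert_image_dtype(image, tf.float32)
    return tf.image.resize_with_pad(image, INPUT_SIZE, INPUT_SIZE)

def augment(image):
    # Random contrast, brightness, saturation, and rotation, as described
    # above (ranges are illustrative).
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_saturation(image, 0.8, 1.2)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.rot90(image, k=k)
</pre>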

FICS-PCB dataset: An example of image augmentation as compensation for lack of data in IC, diode, inductor, and transistor classes

==Training configuration and hyperparameters setup==

The training was done in the cloud using Google Colab. All the models were trained with the same configuration for 1000 epochs, feeding a mini-batch of 32 images at each step of an epoch. Adam was chosen as the optimizer, with an initial learning rate of 0.0001 and an exponential decay learning rate schedule with a decay rate of 0.96. The dropout rate was set at 0.4 for all models, and the patience for early stopping was set at 100 epochs. The training images were further augmented with random zoom, shift, and rotation in order to improve model robustness on the validation and test subsets and prevent the risk of overfitting.
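
In tf.keras terms, this configuration could look roughly like the sketch below; the decay step count, the input size, and the way the dropout layer is attached to the backbone are assumptions, since these details are not spelled out in the text:

<pre>
import tensorflow as tf

# Backbone: one of the evaluated architectures, e.g. ResNet50 (trained
# from scratch here; the actual weight initialization is an assumption).
base = tf.keras.applications.ResNet50(weights=None, include_top=False,
                                      pooling="avg",
                                      input_shape=(224, 224, 3))
x = tf.keras.layers.Dropout(0.4)(base.output)            # dropout rate 0.4
out = tf.keras.layers.Dense(6, activation="softmax")(x)  # six classes
model = tf.keras.Model(base.input, out)

# Adam with initial LR 1e-4 and exponential decay (rate 0.96);
# the decay_steps value is illustrative.
lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=10000, decay_rate=0.96)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop if the validation loss does not improve for 100 epochs...
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=100),
    # ...and checkpoint the model each time it does improve.
    tf.keras.callbacks.ModelCheckpoint("best.h5", monitor="val_loss",
                                       save_best_only=True),
]
# train_ds / val_ds: tf.data pipelines yielding (image, label) batches of 32.
# model.fit(train_ds, validation_data=val_ds, epochs=1000, callbacks=callbacks)
</pre>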

FICS-PCB dataset: An example of image augmentation on training images to increase the robustness of the models

==Proposed models==

===ResNet50===

During the training phase, the model shows an increasing accuracy trend over both the train and the validation subset samples. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at epoch 993, with an accuracy of 93.59% and a loss of 0.1912 on the validation data.

Train and validation accuracy trend over 1000 training epochs for ResNet50 model
Train and validation loss trend over 1000 training epochs for ResNet50 model

Before quantization with the vai_q_tensorflow tool, the model achieves an overall accuracy of 94.85% and an overall weighted average F1-score of 94.86% on the test subset, showing a good generalization capability on unseen samples. The classes with the highest F1-score, above 96.00%, are resistor (98.08%), inductor (97.10%), and capacitor (96.88%). On the contrary, the class on which the model performs worst is the diode class (91.75% F1-score), mainly because of its low precision (88.55%).

Confusion matrix of ResNet50 model on host machine before quantization
{| class="wikitable" style="margin: auto;"
|+Host machine, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.95740||0.89900||0.92728||1000
|-
|capacitor||0.97278||0.96500||0.96888||1000
|-
|diode||0.88558||0.95200||0.91759||1000
|-
|inductor||0.97006||0.97200||0.97103||1000
|-
|resistor||0.98882||0.97300||0.98085||1000
|-
|transistor||0.92262||0.93000||0.92629||1000
|-
|'''Weighted avg'''||0.94954||0.94850||0.94865||6000
|}
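
For reference, per-class reports like the one above are straightforward to generate with scikit-learn; a minimal sketch follows, with randomly generated placeholder predictions standing in for the real model outputs on the 6,000-image test set:

<pre>
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["IC", "capacitor", "diode", "inductor", "resistor", "transistor"]

# Placeholder ground truth and predictions; in practice these come from
# running the trained model on the test subset.
rng = np.random.RandomState(0)
y_true = np.repeat(np.arange(6), 1000)
y_pred = np.where(rng.rand(6000) < 0.95, y_true, rng.randint(0, 6, 6000))

print(classification_report(y_true, y_pred, target_names=class_names, digits=5))
print(confusion_matrix(y_true, y_pred))
</pre>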

After quantization with the vai_q_tensorflow tool and deployment on the target device, the model has an overall accuracy of 93.27% and an overall weighted average F1-score of 93.29% on the test subset. The model still performs well at correctly classifying samples belonging to the resistor class (97.41% F1-score), capacitor class (97.36% F1-score), and inductor class (96.10% F1-score). The worst classification results can be found:

*in the transistor class (89.78% F1-score), because both the measured precision and recall are below 90.00% (89.96% precision, 89.60% recall);
*in the diode class (88.59% F1-score), because the precision is very low (83.77%).
Confusion matrix of ResNet50 model on target device after quantization
{| class="wikitable" style="margin: auto;"
|+Target device, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.96384||0.85300||0.90504||1000
|-
|capacitor||0.99068||0.95700||0.97355||1000
|-
|diode||0.83779||0.94000||0.88596||1000
|-
|inductor||0.94839||0.97400||0.96103||1000
|-
|resistor||0.97211||0.97600||0.97405||1000
|-
|transistor||0.89960||0.89600||0.89780||1000
|-
|'''Weighted avg'''||0.93540||0.93267||0.93290||6000
|}
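
The on-target figures above were obtained after post-training quantization. In Vitis-AI 1.x, vai_q_tensorflow calibrates the frozen graph through a user-supplied input function; the sketch below shows what such a function could look like (the file name, input node name, and batch size are assumptions):

<pre>
# graph_input_fn.py -- calibration input function for vai_q_tensorflow.
# The quantizer calls calib_input(iter) once per calibration iteration
# and feeds the returned dict to the named input node of the graph.
import numpy as np

CALIB_BATCH = 32                       # assumption
IMAGES = np.load("calib_images.npy")   # assumption: preprocessed images

def calib_input(iter):
    batch = IMAGES[iter * CALIB_BATCH:(iter + 1) * CALIB_BATCH]
    return {"input_1": batch}          # key must match --input_nodes

# Illustrative invocation (node names and shapes are assumptions):
# vai_q_tensorflow quantize \
#     --input_frozen_graph resnet50.pb \
#     --input_nodes input_1 --input_shapes ?,224,224,3 \
#     --output_nodes dense/Softmax \
#     --input_fn graph_input_fn.calib_input \
#     --calib_iter 100
</pre>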

With 1 thread, only one DPU core is used to perform the inference over the images, with about 55% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and close to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 12 ms (11526.41 μs). Increasing the concurrency raises the latency on both cores: about 13 ms (13318.01 μs) on the DPU-00 core and 12 ms (12019.21 μs) on the DPU-01 core with 2 threads, and about 14 ms (14200.19 μs) on the DPU-00 core and 13 ms (12776.24 μs) on the DPU-01 core with 4 threads.

Utilization of CPU and DPU cores of ResNet50 model for 1, 2, and 4 threads
DPU latency of ResNet50 model for 1, 2, and 4 threads
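
The latency and throughput figures reported in this TN were collected with a multi-threaded benchmark application. The skeleton below illustrates the measurement scheme only; run_inference() is a stand-in for the actual DPU runner call (which releases Python's GIL while the accelerator works), not the real API used here:

<pre>
# Skeleton of a multi-threaded inference benchmark (illustrative only).
import threading
import time

N_IMAGES = 1000   # assumption
N_THREADS = 4     # 1, 2, or 4 in the measurements above

def run_inference(image):
    # Placeholder standing in for the DPU runner call; sleeps ~12 ms,
    # roughly the single-thread latency measured for ResNet50.
    time.sleep(0.012)

def worker(images):
    for img in images:
        run_inference(img)

images = [None] * N_IMAGES  # placeholder input batch
chunk = N_IMAGES // N_THREADS
threads = [threading.Thread(target=worker,
                            args=(images[i * chunk:(i + 1) * chunk],))
           for i in range(N_THREADS)]

start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Throughput: %.1f FPS" % (N_IMAGES / (time.time() - start)))
</pre>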

===ResNet101===

During the training phase, the model shows an increasing accuracy trend over both the train and the validation subset samples. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at epoch 944, with an accuracy of 98.12% and a loss of 0.0781 on the validation data.

Train and validation accuracy trend over 1000 training epochs for ResNet101 model
Train and validation loss trend over 1000 training epochs for ResNet101 model

Before quantization with the vai_q_tensorflow tool, the model achieves an overall accuracy of 97.10% and an overall weighted average F1-score of 97.11% on the test subset, showing a very high generalization capability on unseen samples. All the classes but one have an F1-score above 96.00%; it is particularly high in the resistor class (98.65%) and in the inductor class (98.50%). The only exception is the diode class (95.40% F1-score), mainly because of its low recall (94.40%).

Confusion matrix of ResNet101 model on host machine before quantization
{| class="wikitable" style="margin: auto;"
|+Host machine, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.96375||0.95700||0.96036||1000
|-
|capacitor||0.96373||0.98300||0.97327||1000
|-
|diode||0.96425||0.94400||0.95402||1000
|-
|inductor||0.98500||0.98500||0.98500||1000
|-
|resistor||0.98504||0.98800||0.98652||1000
|-
|transistor||0.96517||0.97000||0.96758||1000
|-
|'''Weighted avg'''||0.97116||0.97117||0.97112||6000
|}

After quantization with the vai_q_tensorflow tool and deployment on the target device, the model has an overall accuracy of 93.95% and an overall weighted average F1-score of 93.91% on the test subset. The model still performs very well at correctly classifying samples belonging to the capacitor class, keeping its F1-score above 96.00% (97.03%). On the other hand, for the remaining classes, there is a substantial reduction in the value of this metric. The classes that exhibit the worst results are the diode class (92.09% F1-score) and the IC class (92.06% F1-score), because both show a low recall (90.30% for the former, 88.20% for the latter). In general, the performance of the model is still good, similar to the one obtained with the ResNet50 model.

Confusion matrix of ResNet101 model on target device after quantization
{| class="wikitable" style="margin: auto;"
|+Target device, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.96288||0.88200||0.92067||1000
|-
|capacitor||0.95898||0.98200||0.97036||1000
|-
|diode||0.93965||0.90300||0.92096||1000
|-
|inductor||0.93719||0.95500||0.94601||1000
|-
|resistor||0.90428||0.99200||0.94611||1000
|-
|transistor||0.93896||0.92300||0.93091||1000
|-
|'''Weighted avg'''||0.94033||0.93950||0.93917||6000
|}

With 1 thread, only one DPU core is used to perform the inference over the images, with about 70% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and close to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 21 ms (21339.73 μs). Increasing the concurrency raises the latency on both cores: about 24 ms (24313.61 μs) on the DPU-00 core and 22 ms (22231.22 μs) on the DPU-01 core with 2 threads, and about 25 ms (25385.51 μs) on the DPU-00 core and 23 ms (23025.89 μs) on the DPU-01 core with 4 threads.

Utilization of CPU and DPU cores of ResNet101 model for 1, 2, and 4 threads
DPU latency of ResNet101 model for 1, 2, and 4 threads

===ResNet152===

During the training phase, the model shows an increasing accuracy trend over both the train and the validation subset samples. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at epoch 969, with an accuracy of 97.66% and a loss of 0.0721 on the validation data.

Train and validation accuracy trend over 1000 training epochs for ResNet152 model
Train and validation loss trend over 1000 training epochs for ResNet152 model

Before quantization with the vai_q_tensorflow tool, the model achieves an overall accuracy of 96.46% and an overall weighted average F1-score of 96.48% on the test subset, showing a good generalization capability on unseen samples. The classes with the highest F1-score, above 96.00%, are the resistor class (98.58%), the inductor class (98.03%), and the capacitor class (96.99%). The worst performance is displayed by the transistor class, with an F1-score of "only" about 94.00% (94.18%), mainly because the model exhibits a low precision on this class (92.89%).

Confusion matrix of ResNet152 model on host machine before quantization
{| class="wikitable" style="margin: auto;"
|+Host machine, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.94553||0.97200||0.95858||1000
|-
|capacitor||0.95538||0.98500||0.96997||1000
|-
|diode||0.98298||0.92400||0.95258||1000
|-
|inductor||0.98584||0.97500||0.98039||1000
|-
|resistor||0.99390||0.97800||0.98589||1000
|-
|transistor||0.92899||0.95500||0.94181||1000
|-
|'''Weighted avg'''||0.96544||0.96483||0.96487||6000
|}

After quantization with the vai_q_tensorflow tool and deployment on the target device, the model has an overall accuracy of 93.40% and an overall weighted average F1-score of 93.36% on the test subset. The model still performs very well at correctly classifying samples belonging to the capacitor class, keeping an F1-score above 96.00% (96.62%). On the other hand, for the remaining classes, there is a substantial reduction in the value of this metric. The classes that exhibit the worst results are the diode class (91.65% F1-score), whose recall is very low (87.30%), the IC class (91.09% F1-score), with low values of both precision and recall (91.18% precision, 91.00% recall), and the transistor class (90.62% F1-score), likewise with low precision and recall (90.35% precision, 90.90% recall). In general, the performance of the model is still good, similar to that of the two previous models, especially the ResNet101 model.

Confusion matrix of ResNet152 model on target device after quantization
{| class="wikitable" style="margin: auto;"
|+Target device, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.91182||0.91000||0.91091||1000
|-
|capacitor||0.94460||0.98900||0.96629||1000
|-
|diode||0.96464||0.87300||0.91654||1000
|-
|inductor||0.94124||0.94500||0.94311||1000
|-
|resistor||0.94038||0.97800||0.95882||1000
|-
|transistor||0.90358||0.90900||0.90628||1000
|-
|'''Weighted avg'''||0.93438||0.93400||0.93366||6000
|}

With 1 thread, only one DPU core is used to perform the inference over the images, with about 80% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), more cores are used and the utilization gets higher, very close to 100% on both the DPU-00 and DPU-01 cores. Concerning the DPU latency, with 1 thread the average latency for one image is about 29 ms (28867.86 μs). Increasing the concurrency raises the latency on both cores: about 33 ms (32702.59 μs) on the DPU-00 core and 30 ms (30046.64 μs) on the DPU-01 core with 2 threads, and about 34 ms (33826.30 μs) on the DPU-00 core and 31 ms (30834.46 μs) on the DPU-01 core with 4 threads.

Utilization of CPU and DPU cores of ResNet152 model for 1, 2, and 4 threads
DPU latency of ResNet152 model for 1, 2, and 4 threads

===InceptionV4===

During the training phase, the model shows an increasing accuracy trend over both the train and the validation subset samples. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at epoch 957, with an accuracy of 95.00% and a loss of 0.1729 on the validation data.

Train and validation accuracy trend over 1000 training epochs for InceptionV4 model
Train and validation loss trend over 1000 training epochs for InceptionV4 model

Before quantization with the vai_q_tensorflow tool, the model achieves an overall accuracy of 92.68% and an overall weighted average F1-score of 92.69% on the test subset, showing a good generalization capability on unseen samples, although lower than that of the three ResNet models. The classes with the highest F1-score, above 96.00%, are resistor (97.56%), capacitor (96.81%), and inductor (96.38%). However, the performance on the three remaining classes is poor compared with the previous models, with an F1-score below 90.00% for the diode class (87.94%) and the transistor class (87.27%): the former shows low precision and recall (88.38% precision, 87.50% recall), while the latter shows a very low precision (83.67%).

Confusion matrix of InceptionV4 model on host machine before quantization
{| class="wikitable" style="margin: auto;"
|+Host machine, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.94524||0.86300||0.90225||1000
|-
|capacitor||0.98051||0.95600||0.96810||1000
|-
|diode||0.88384||0.87500||0.87940||1000
|-
|inductor||0.95575||0.97200||0.96381||1000
|-
|resistor||0.96847||0.98300||0.97568||1000
|-
|transistor||0.83670||0.91200||0.87273||1000
|-
|'''Weighted avg'''||0.92842||0.92683||0.92699||6000
|}

After quantization with the vai_q_tensorflow tool and deployment on the target device, the model has an overall accuracy of 88.87% and an overall weighted average F1-score of 88.91% on the test subset. The model still performs well at correctly classifying samples belonging to the resistor class (97.65% F1-score). On the other hand, for the remaining classes, there is a substantial reduction in the value of this metric. The classes that exhibit the worst results are the diode class (85.15% F1-score), the IC class (83.27% F1-score), and the transistor class (81.97% F1-score). In general, the performance of the model is still good, but it is definitely lower than the one obtained with the previously analyzed ResNet models.

Confusion matrix of InceptionV4 model on target device after quantization
{| class="wikitable" style="margin: auto;"
|+Target device, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.78158||0.89100||0.83271||1000
|-
|capacitor||0.99220||0.89000||0.93832||1000
|-
|diode||0.88553||0.82000||0.85151||1000
|-
|inductor||0.88973||0.94400||0.91606||1000
|-
|resistor||0.97319||0.98000||0.97658||1000
|-
|transistor||0.83282||0.80700||0.81971||1000
|-
|'''Weighted avg'''||0.89251||0.88867||0.88915||6000
|}

With 1 thread, only one DPU core is used to perform the inference over the images, with about 70% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), more cores are used and the utilization gets higher, very close to 100% on both the DPU-00 and DPU-01 cores. Concerning the DPU latency, with 1 thread the average latency for one image is about 30 ms (30127.38 μs). Increasing the concurrency raises the latency on both cores: about 34 ms (34105.45 μs) on the DPU-00 core and 31 ms (30981.59 μs) on the DPU-01 core with 2 threads, and about 35 ms (35273.61 μs) on the DPU-00 core and 32 ms (31761.21 μs) on the DPU-01 core with 4 threads.

Utilization of CPU and DPU cores of InceptionV4 model for 1, 2, and 4 threads
DPU latency of InceptionV4 model for 1, 2, and 4 threads

===Inception ResNet V1===

During the training phase, the model shows an increasing accuracy trend over both the train and the validation subset samples. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at epoch 959, with an accuracy of 97.97% and a loss of 0.0751 on the validation data.

Train and validation accuracy trend over 1000 training epochs for Inception ResNet V1 model
Train and validation loss trend over 1000 training epochs for Inception ResNet V1 model

Before quantization with the vai_q_tensorflow tool, the model achieves an overall accuracy of 97.37% and an overall weighted average F1-score of 97.37% on the test subset, showing a very high generalization capability on unseen samples. All the classes have an F1-score above 96.00%; it is particularly high for the resistor class (98.50%).

Confusion matrix of Inception ResNet V1 model on host machine before quantization
{| class="wikitable" style="margin: auto;"
|+Host machine, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.98274||0.96800||0.97531||1000
|-
|capacitor||0.97571||0.96400||0.96982||1000
|-
|diode||0.94889||0.98400||0.96613||1000
|-
|inductor||0.98085||0.97300||0.97691||1000
|-
|resistor||0.98211||0.98800||0.98504||1000
|-
|transistor||0.97278||0.96500||0.96888||1000
|-
|'''Weighted avg'''||0.97385||0.97367||0.97368||6000
|}

After quantization with the vai_q_tensorflow tool and deployment on the target device, the model has an overall accuracy of 93.34% and an overall weighted average F1-score of 93.34% on the test subset. The model still performs very well at correctly classifying samples belonging to the resistor class (97.12% F1-score), the inductor class (97.00% F1-score), and the capacitor class (96.59% F1-score), keeping an F1-score above 96.00%. However, for the remaining classes, the value of the metric is substantially reduced. The classes that exhibit the worst results are the IC class (89.41% F1-score), because of its low precision (84.12%), and the transistor class (87.75% F1-score), because of its very low recall (82.80%). In general, the performance of the model is still good, similar to the one obtained with the ResNet models.

Confusion matrix of Inception ResNet V1 model on target device after quantization
{| class="wikitable" style="margin: auto;"
|+Target device, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.84127||0.95400||0.89410||1000
|-
|capacitor||0.99787||0.93600||0.96594||1000
|-
|diode||0.94346||0.90100||0.92174||1000
|-
|inductor||0.95275||0.98800||0.97005||1000
|-
|resistor||0.94852||0.99500||0.97121||1000
|-
|transistor||0.93348||0.82800||0.87758||1000
|-
|'''Weighted avg'''||0.93622||0.93367||0.93344||6000
|}

With 1 thread, only one DPU core is used to perform the inference over the images, with about 60% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 18 ms (17651.31 μs). Increasing the concurrency raises the latency on both cores: about 21 ms (20511.79 μs) on the DPU-00 core and 18 ms (18466.97 μs) on the DPU-01 core with 2 threads, and about 22 ms (21654.99 μs) on the DPU-00 core and 20 ms (19503.17 μs) on the DPU-01 core with 4 threads.

Utilization of CPU and DPU cores of Inception ResNet V1 model for 1, 2, and 4 threads
DPU latency of Inception ResNet V1 model for 1, 2, and 4 threads

===Inception ResNet V2===

During the training phase, the model shows an increasing accuracy trend over both the train and the validation subset samples. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss during the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with checkpoints each time there is an improvement in the validation loss, the best result is found at epoch 974, with an accuracy of 97.50% and a loss of 0.0724 on the validation data.

Train and validation accuracy trend over 1000 training epochs for Inception ResNet V2 model
Train and validation loss trend over 1000 training epochs for Inception ResNet V2 model

Before quantization with the vai_q_tensorflow tool, the model achieves an overall accuracy of 97.53% and an overall weighted average F1-score of 97.53% on the test subset, showing a very high generalization capability on unseen samples. Five classes have an F1-score above 96.00%; it is particularly high for the inductor class (98.66%) and the resistor class (98.55%). The worst result is displayed by the transistor class, with an F1-score below 96.00% but still very close to it (95.86%), mainly due to a low precision (93.36%).

Confusion matrix of Inception ResNet V2 model on host machine before quantization
{| class="wikitable" style="margin: auto;"
|+Host machine, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.97872||0.96600||0.97232||1000
|-
|capacitor||0.99177||0.96400||0.97769||1000
|-
|diode||0.98963||0.95400||0.97149||1000
|-
|inductor||0.97931||0.99400||0.98660||1000
|-
|resistor||0.98213||0.98900||0.98555||1000
|-
|transistor||0.93365||0.98500||0.95864||1000
|-
|'''Weighted avg'''||0.97587||0.97533||0.97538||6000
|}

After quantization with the vai_q_tensorflow tool and deployment on the target device, the model has an overall accuracy of 94.28% and an overall weighted average F1-score of 94.29% on the test subset. The model still performs very well at correctly classifying samples belonging to the resistor class (98.07% F1-score) and the capacitor class (96.23% F1-score), keeping an F1-score above 96.00%. However, for the remaining classes, the value of the metric is reduced. In particular, the worst results can be found in the IC class (90.80% F1-score), with low values of both precision and recall (91.73% precision, 89.90% recall), and in the transistor class (90.65% F1-score), with a low precision (87.88%).

Confusion matrix of Inception ResNet V2 model on target device after quantization
{| class="wikitable" style="margin: auto;"
|+Target device, classification report
!Class!!Precision!!Recall!!F1-score!!Support
|-
|IC||0.91735||0.89900||0.90808||1000
|-
|capacitor||0.99466||0.93200||0.96231||1000
|-
|diode||0.98793||0.90000||0.94192||1000
|-
|inductor||0.92066||0.99800||0.95777||1000
|-
|resistor||0.96970||0.99200||0.98072||1000
|-
|transistor||0.87887||0.93600||0.90654||1000
|-
|'''Weighted avg'''||0.94486||0.94283||0.94289||6000
|}

With 1 thread, only one DPU core is used to perform the inference over the images, with about 65% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), more cores are used and the utilization gets higher, very close to 100% on the DPU-00 core and to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 25 ms (25185.03 μs). Increasing the concurrency raises the latency on both cores: about 29 ms (28858.88 μs) on the DPU-00 core and 26 ms (26336.11 μs) on the DPU-01 core with 2 threads, and about 30 ms (30229.27 μs) on the DPU-00 core and 27 ms (27452.70 μs) on the DPU-01 core with 4 threads.

Utilization of CPU and DPU cores of Inception ResNet V2 model for 1, 2, and 4 threads
DPU latency of Inception ResNet V2 model for 1, 2, and 4 threads

==Comparison==

After reviewing all the created models, showing their performance in terms of accuracy and other classification metrics such as precision, recall, and F1-score, and after evaluating the DPU usage and latency for a single inference over the test samples, a comparison between them can be made. The aim is to understand whether, among the proposed models, one can be considered the best for solving the problem.

Since the original dataset was augmented to compensate for the lack of data, resulting in a balanced dataset with the same number of samples for each of the six classes, metrics such as precision, recall, and F1-score can be omitted and only the accuracy needs to be taken into account. Note that the accuracy of a model can actually be enhanced by further tweaking the training hyperparameters or simply by training the model for a higher number of epochs. Thus, the value of this metric could actually be higher (or even lower, in case of overfitting) than the one obtained for this particular configuration (all the models were trained using the same configuration).

For the purpose of this evaluation, it should be noted that considering the accuracy alone might not be the best idea, because other elements, related to the complexity of the models, make the choice less straightforward. There are also features that depend exclusively on the chosen network architecture, such as the number of layers or the total number of trainable parameters (which determines memory occupation), that become fixed parameters of the DPU kernel after model compilation.

Therefore, to proceed with the evaluation, the following features must be taken into account for a better understanding of the whole situation:

*accuracy before and after quantization;
*DPU Kernel ''parameters size'' and ''total tensor count'';
*DPU cores latency;
*DPU throughput.

Initially considering the accuracy of the models before quantization, the ones with the highest capability of correctly classifying the test samples are, in descending order, Inception ResNet V2, Inception ResNet V1, and ResNet101. These three models show an accuracy above 97%. In contrast, the models that display two of the lowest accuracy values are ResNet50 and Inception V4. After quantization, the situation changes radically: ResNet101 is at the top of the list, followed by the ResNet50 model, while Inception ResNet V1 and Inception ResNet V2 stand at the bottom, with an accuracy drop of 6.65% for the former and 5.55% for the latter. Moreover, the worst model among those analyzed is Inception V4, with an accuracy below 90%.

Models pre and post quantization accuracy with vai_q_tensorflow tool

As mentioned before, two other aspects should be taken into account when comparing the models: the DPU Kernel ''parameters size'' and the ''total tensor count''. Recall that these two values can be easily retrieved from the Vitis-AI compiler log file when compiling a model, or by executing the ddump command on the target device.

*'''Parameters size''': the amount of memory, expressed in MB, kB, or bytes, occupied by the DPU Kernel, including weights and biases. It is straightforward to check that the greater the number of parameters of the model implemented on the host, the greater the amount of memory occupied on the target device (a rough estimation is sketched after the figures below).
*'''Total tensor count''': the total number of DPU tensors for a DPU Kernel. This value depends on the number of layers stacked between the input and output layers of the model; obviously, the greater the number of stacked layers, the higher the number of tensors, leading to a more complex computation on the DPU. This is directly responsible for increasing the time required for a single inference on a single image.
Deployed models DPU Kernel parameters size
Deployed models DPU Kernel total tensor count
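
As a sanity check on the reported kernel sizes, recall that after INT8 quantization each weight and bias occupies one byte, so the DPU Kernel parameters size is roughly the host-side parameter count expressed in bytes (the exact on-target figure also depends on the compiler). A back-of-the-envelope sketch:

<pre>
import tensorflow as tf

# Host-side parameter count for one of the deployed architectures;
# the input shape and class count follow this TN's setup.
model = tf.keras.applications.ResNet50(weights=None,
                                       input_shape=(224, 224, 3),
                                       classes=6)
n_params = model.count_params()
# One byte per parameter after INT8 quantization (rough estimate).
print("%d parameters -> ~%.1f MB DPU Kernel size" % (n_params, n_params / 2**20))
</pre>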

The two figures below show the DPU cores latency for 1, 2, and 4 threads. It is interesting to note that the core latency of Inception ResNet V1 is lower than that of ResNet152, even though they have a similar ''total tensor count'' and different values of DPU Kernel ''parameters size'' (actually greater for ResNet152). Vice versa, ResNet101 and Inception V4 have a similar DPU Kernel ''parameters size'' and different values of ''total tensor count'', and, in this case, the core latency is higher for the latter. The same observation can be made for the ResNet50 and Inception ResNet V1 models, leading to the following statements:

*with the same ''total tensor count'', the latency increases along with the DPU Kernel ''parameters size'';
*with the same DPU Kernel ''parameters size'', the latency decreases as the ''total tensor count'' lowers.

These considerations suggest that the best models among the implemented ones are ResNet50, ResNet101, and Inception ResNet V1.

Deployed models DPU-00 core latency for 1, 2, and 4 threads
Deployed models DPU-01 core latency for 1, 2, and 4 threads

Finally, it is possible to evaluate the DPU throughput in relation to the number of threads used by the benchmark application. In the figure below, it is interesting to observe that, with 1 thread, all the models have similar FPS values, but the differences become more and more evident as the level of concurrency increases.

[[File:DPU throughput for 1-2-4 threads.png|center|thumb|500x500px|Deployed models DPU throughput for 1, 2, and 4 threads]]

In conclusion, summing up all the considerations made so far, it is evident that the solution offering the best compromise between accuracy and inference latency is the ResNet50 model, followed by the ResNet101 and Inception ResNet V1 models.

==Useful links==