For more details about the Xilinx hardware configuration and the usage of the Vitis-AI software platform, please refer to [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 3]].
 
==FICS-PCB dataset overview==
Over the years, computer vision and ML have considerably advanced the field of Automated Visual Inspection for Printed Circuit Board (PCB-AVI) assurance. It is well known that developing a robust model for any ML-based application requires a dataset, as large as possible, with many examples for better generalization. Although a few large datasets for PCB-AVI are publicly available, they lack the variances that simulate real-world scenarios, such as illumination and scale variations, which are necessary for developing robust PCB-AVI approaches. To represent such non-ideal conditions, the FICS-PCB dataset was proposed for evaluating and improving PCB-AVI methods. This dataset consists of PCB images featuring multiple types of components and various image conditions, facilitating performance evaluation in the challenging scenarios that are likely to be encountered in practice.
The dataset consists of 9,912 images of 31 PCB samples, containing a total of 77,347 labeled components distributed across six classes: ''IC'', ''capacitor'', ''diode'', ''inductor'', ''resistor'', and ''transistor''. The images were collected using two image sensor types: a digital microscope and a Digital Single-Lens Reflex (DSLR) camera. To ensure that the dataset also includes samples representing variations in illumination, the authors collected images using three different intensities of the microscope's built-in ring light, i.e. 20, 40, and 60, where 60 is the brightest. In addition, variations in scale were included using three different magnifications, i.e. 1×, 1.5×, and 2×.
 
[[File:FICS-PCB samples.png|center|thumb|500x500px|FICS-PCB dataset, examples of six types of components]]
Just by looking at the figure below, it is apparent that this dataset is highly unbalanced, having a large number of samples only for two classes, i.e. ''capacitor'' and ''resistor''. This is no surprise, as these two component types are mounted on a PCB far more commonly than the others. Unfortunately, it is not a good idea to use the dataset as it is: the models would be trained on image batches mainly composed of the most common components, hence learning only a restricted set of features. As a consequence, they would probably be very good at classifying the ''capacitor'' and ''resistor'' classes and rather bad at classifying the remaining ones. Therefore, the under-represented classes must be extended with image augmentation (an example is sketched after the figures below).
Before proceeding further, please note that the DSLR subset contains far fewer examples than the Microscope subset. As the two subsets were acquired with two different kinds of instruments, their characteristics (the resolution, for example) differ significantly. In order to work with images that are homogeneous with respect to these characteristics, it is preferable to keep only one of the two subsets, specifically the more numerous one.
 
 
[[File:Samples per class in Microscope and DSLR subsets.png|center|thumb|500x500px|FICS-PCB dataset, component count per class in DSLR and Microscope subsets]]
[[File:Image augmentation for training samples.png|center|thumb|500x500px|FICS-PCB dataset, an example of image augmentation on training images to increase the robustness of the models]]
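The following snippet gives an idea of how such augmentation can be performed. It is a minimal sketch based on the tf.keras <code>ImageDataGenerator</code> API; the specific transformations and their ranges are illustrative assumptions, not the exact settings used in this work.
<syntaxhighlight lang="python">
import numpy as np
import tensorflow as tf

# Random transformations emulating the variations discussed above
# (rotation, shifts, scale, illumination); ranges are illustrative.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.15,
    brightness_range=(0.7, 1.3),
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode="nearest",
)

def oversample(x, n_new):
    """Generate n_new augmented images from the samples x of an
    under-represented class, shaped (N, H, W, 3)."""
    it = datagen.flow(x, batch_size=1, shuffle=True)
    return np.concatenate([next(it) for _ in range(n_new)], axis=0)
</syntaxhighlight>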
 
==Proposed models==
===ResNet50===
The model, during the training phase, shows an increasing accuracy trend over both the train and validation subsets. This is a sign that the model is learning correctly, since it is not underfitting the train data. Furthermore, looking at the trend of the loss over the 1000 training epochs, the model is clearly not overfitting the train data either. By saving the status of the model with a checkpoint each time the validation loss improves, the best result is found at '''''epoch 993''''' with an '''''accuracy of 93.59%''''' and a '''''loss of 0.1912''''' on the validation data.
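A minimal sketch of this checkpointing strategy is shown below, assuming a tf.keras training loop; the file name, the input size, and the <code>train_ds</code>/<code>val_ds</code> input pipelines are placeholders, not the exact training code used in this work.
<syntaxhighlight lang="python">
import tensorflow as tf

# ResNet50 backbone with a 6-way head, one output per component class.
model = tf.keras.applications.ResNet50(weights=None, classes=6,
                                       input_shape=(224, 224, 3))
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Save the model each time the validation loss improves, so that the
# best epoch (993 in this run) survives the whole 1000-epoch training.
ckpt = tf.keras.callbacks.ModelCheckpoint("resnet50_best.h5",
                                          monitor="val_loss",
                                          save_best_only=True,
                                          verbose=1)

# train_ds / val_ds: (image, label) input pipelines (assumed to exist)
model.fit(train_ds, validation_data=val_ds, epochs=1000, callbacks=[ckpt])
</syntaxhighlight>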
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 55% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), both cores are used and the utilization grows, getting very close to 100% on the DPU-00 core and close to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 12 ms (11526.41 μs). By increasing the concurrency, the latency on both cores grows: about 13 ms (13318.01 μs) on DPU-00 and 12 ms (12019.21 μs) on DPU-01 with 2 threads, and about 14 ms (14200.19 μs) on DPU-00 and 13 ms (12776.24 μs) on DPU-01 with 4 concurrent threads.
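As a reference, the snippet below sketches how a single inference can be run on the DPU and timed with the VART Python API; the model file name, the zeroed input, and the pre-processing are placeholders, and the actual benchmark application may differ.
<syntaxhighlight lang="python">
import time
import numpy as np
import vart
import xir

# Load the compiled model and pick the DPU subgraph.
graph = xir.Graph.deserialize("resnet50.xmodel")
dpu = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
       if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(dpu, "run")

in_t = runner.get_input_tensors()[0]
out_t = runner.get_output_tensors()[0]
inp = np.zeros(tuple(in_t.dims), dtype=np.int8)   # pre-processed, quantized image
out = np.zeros(tuple(out_t.dims), dtype=np.int8)

# Time one inference job on the DPU.
t0 = time.perf_counter()
job = runner.execute_async([inp], [out])
runner.wait(job)
print("latency: %.2f us" % ((time.perf_counter() - t0) * 1e6))
</syntaxhighlight>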
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 70% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), both cores are used and the utilization grows, getting very close to 100% on the DPU-00 core and close to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 21 ms (21339.73 μs). By increasing the concurrency, the latency on both cores grows: about 24 ms (24313.61 μs) on DPU-00 and 22 ms (22231.22 μs) on DPU-01 with 2 threads, and about 25 ms (25385.51 μs) on DPU-00 and 23 ms (23025.89 μs) on DPU-01 with 4 concurrent threads.
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 80% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), both cores are used and the utilization grows, getting very close to 100% on both the DPU-00 and DPU-01 cores. Concerning the DPU latency, with 1 thread the average latency for one image is about 28 ms (28867.86 μs). By increasing the concurrency, the latency on both cores grows: about 33 ms (32702.59 μs) on DPU-00 and 30 ms (30046.64 μs) on DPU-01 with 2 threads, and about 34 ms (33826.30 μs) on DPU-00 and 30 ms (30834.46 μs) on DPU-01 with 4 concurrent threads.
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 70% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), both cores are used and the utilization grows, getting very close to 100% on both the DPU-00 and DPU-01 cores. Concerning the DPU latency, with 1 thread the average latency for one image is about 30 ms (30127.38 μs). By increasing the concurrency, the latency on both cores grows: about 34 ms (34105.45 μs) on DPU-00 and 31 ms (30981.59 μs) on DPU-01 with 2 threads, and about 35 ms (35273.61 μs) on DPU-00 and 31 ms (31761.21 μs) on DPU-01 with 4 concurrent threads.
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 60% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), both cores are used and the utilization grows, getting very close to 100% on the DPU-00 core and to 90% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 18 ms (17651.31 μs). By increasing the concurrency, the latency on both cores grows: about 21 ms (20511.79 μs) on DPU-00 and 18 ms (18466.97 μs) on DPU-01 with 2 threads, and about 22 ms (21654.99 μs) on DPU-00 and 20 ms (19503.17 μs) on DPU-01 with 4 concurrent threads.
To perform the inference over the images, only one DPU core is used with 1 thread, leading to almost 65% utilization of the DPU-01 core. By increasing the number of threads (e.g. with 4 threads), both cores are used and the utilization grows, getting very close to 100% on the DPU-00 core and to 95% on the DPU-01 core. Concerning the DPU latency, with 1 thread the average latency for one image is about 25 ms (25185.03 μs). By increasing the concurrency, the latency on both cores grows: about 29 ms (28858.88 μs) on DPU-00 and 26 ms (26336.11 μs) on DPU-01 with 2 threads, and about 30 ms (30229.27 μs) on DPU-00 and 27 ms (27452.70 μs) on DPU-01 with 4 concurrent threads.
 
Considering first the accuracy of the models before quantization, the ones with the highest capability of correctly classifying the test samples are, in descending order, Inception ResNet V2, Inception ResNet V1, and ResNet101. These three models show an accuracy above 97%. In contrast, the models displaying two of the lowest accuracy values are ResNet50 and Inception V4. After quantization, the situation changes radically: ResNet101 stands at the top of the list, followed by ResNet50, while Inception ResNet V1 and Inception ResNet V2 stand at the bottom, with an accuracy drop of 6.65% for the former and 5.55% for the latter. The worst model among those analyzed is Inception V4, with an accuracy below 90%.
 
[[File:Pre and post quantization accuracy.png|center|thumb|500x500px|Models pre and post quantization accuracy with vai_q_tensorflow tool]]
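For reference, the vai_q_tensorflow quantizer is driven by a Python calibration function passed via its <code>--input_fn</code> option; the sketch below illustrates that contract. The input node name, the image folder, and the pre-processing are assumptions for illustration, not the exact settings used in this work.
<syntaxhighlight lang="python">
# Hypothetical invocation (file calib_input.py, function calib_input):
#   vai_q_tensorflow quantize \
#       --input_frozen_graph resnet50_frozen.pb \
#       --input_nodes  input_1 --input_shapes ?,224,224,3 \
#       --output_nodes dense_1/Softmax \
#       --input_fn     calib_input.calib_input \
#       --calib_iter   100
import glob
import numpy as np
import cv2

CALIB_BATCH = 32
images = sorted(glob.glob("calib_images/*.png"))

def calib_input(iter_num):
    """Return one calibration batch per quantizer iteration, as a dict
    mapping input node names to numpy arrays."""
    batch = []
    for path in images[iter_num * CALIB_BATCH:(iter_num + 1) * CALIB_BATCH]:
        img = cv2.imread(path)
        img = cv2.resize(img, (224, 224)).astype(np.float32) / 255.0
        batch.append(img)
    return {"input_1": np.array(batch)}
</syntaxhighlight>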
*'''Parameters size''': the amount of memory, expressed in MB, kB, or bytes, occupied by the DPU kernel, including weights and biases. Unsurprisingly, the greater the number of parameters of the model implemented on the host, the greater the amount of memory occupied on the target device.
*'''Total tensor count''': the total number of DPU tensors for a DPU kernel. This value depends on the number of layers stacked between the input and output layers of the model: the greater the number of stacked layers, the higher the number of tensors, leading to a more complex computation on the DPU. This directly increases the time required for a single inference on a single image.
Comparing these figures across the models, it can be observed that:
*with the same DPU kernel ''parameters size'', the latency decreases as the total tensor count lowers.
These considerations suggest that the best models among the implemented ones are ResNet50, ResNet101, and Inception ResNet V1.
Finally, it is possible to evaluate the DPU throughput in relation to the number of threads used by the benchmark application. In the figure below, it is interesting to observe how all the models have similar FPS values with 1 thread, but the difference becomes more and more evident as the level of concurrency increases.
 
 
[[File:DPU throughput for 1-2-4 threads.png|center|thumb|500x500px|Deployed models DPU throughput for 1, 2, and 4 threads]]
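The sketch below illustrates how such a throughput figure can be measured: one VART runner per thread, each pushing images to the DPU as fast as it accepts them, with FPS computed as the number of processed images over the elapsed time. The model file name and image count are illustrative placeholders, not the actual benchmark application.
<syntaxhighlight lang="python">
import threading
import time
import numpy as np
import vart
import xir

N_IMAGES = 1000

def worker(subgraph, n):
    """Run n back-to-back inferences on a dedicated runner."""
    runner = vart.Runner.create_runner(subgraph, "run")
    in_t = runner.get_input_tensors()[0]
    out_t = runner.get_output_tensors()[0]
    inp = np.zeros(tuple(in_t.dims), dtype=np.int8)
    out = np.zeros(tuple(out_t.dims), dtype=np.int8)
    for _ in range(n):
        job = runner.execute_async([inp], [out])
        runner.wait(job)

graph = xir.Graph.deserialize("resnet50.xmodel")
dpu = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
       if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

# Repeat the measurement for 1, 2, and 4 concurrent threads.
for n_threads in (1, 2, 4):
    threads = [threading.Thread(target=worker,
                                args=(dpu, N_IMAGES // n_threads))
               for _ in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    fps = N_IMAGES / (time.perf_counter() - t0)
    print("%d thread(s): %.1f FPS" % (n_threads, fps))
</syntaxhighlight>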