ML-TN-009 — AI at the edge: IoT real-time endoscopes and Federated Learning

Applies to: Machine Learning


History[edit | edit source]

Version Date Notes
0.1.0 March 2025 First public draft
1.0.0 April 2025 First public release

Abstract[edit | edit source]

This article summarizes the work carried out by Niccolò Brusadin during his internship at DAVE Embedded Systems. He prototyped a Federated Learning system whose edge devices are machines emulating IoT smart endoscopes for automatic early detection of gastrointestinal tract polyps.

Early detection of colorectal polyps is crucial for colorectal cancer prevention. However, current endoscopic techniques have limitations in detection accuracy and efficiency, since most studies have shown that a significant number of polyps can be missed during routine examinations. Deep Learning-based computer-aided detection systems have been developed to assist endoscopists with real-time polyp detection, but training these models on medical data raises concerns about privacy, data security, and regulatory compliance.

This thesis explores the integration of Federated Learning (FL), a machine learning paradigm that allows multiple institutions to collaboratively train a distributed AI model without sharing sensitive patient data. The work develops a Federated Learning system for automatic GI polyp detection built on an IoT platform. In this context, the role of edge devices is played by a small fleet of prototypes of a "smart" endoscope able to process video streams in real time through a YOLOv5 deep learning model. The training infrastructure is created using the NVFlare framework, an open-source FL platform for privacy-preserving AI applications. This approach ensures data privacy while improving accuracy and reliability for polyp detection without sharing personal data.

Experimental assessments of the centralized approach demonstrated the efficacy and high accuracy of the YOLOv5 model for polyp detection, achieving strong performance in terms of mAP. Evaluations of the federated training scenarios indicated performance comparable to the centralized learning environment while maintaining data privacy. Moreover, the applicability of the hybrid inference-federated system was demonstrated, allowing real-time polyp detection. Limitations were also identified, including hardware capabilities and communication latency. These findings suggest future optimizations in the growing area of privacy-preserving AI for medical imaging and create opportunities for using federated AI in practice.

The work is based on the achievements of two previous internships detailed here and here.

If you are interested in having Brusadin's entire thesis, please send a request to this address.

Introduction[edit | edit source]

Colorectal cancer (CRC) is a prominent malignant tumour in the digestive system, primarily affecting the colon or rectum, and it significantly contributes to global cancer mortality, accounting for around 10% of all cases. Early detection is crucial, as survival rates can rise from approximately 63% to 91% when diagnosed early. Colonoscopy is the gold standard for identifying CRC, allowing detailed examination of polyps. However, the endoscopist’s expertise greatly influences its sensitivity, and polyps can be easily missed due to various factors, highlighting the need for advanced digital solutions in detection.

Recent research has turned toward AI-assisted screening, employing Deep Learning techniques for object detection and semantic segmentation in polyp diagnosis. YOLO models, known for their efficacy in medical image analysis, offer real-time polyp detection capabilities, balancing speed and performance. Nevertheless, training AI models in healthcare raises significant data privacy and security concerns due to the centralized nature of data collection.

Federated Learning (FL) addresses these issues by enabling devices to collaboratively train a shared AI model while keeping patient data localized, thus maintaining privacy. This decentralized approach not only protects sensitive information but also enhances model generalization across varied clinical contexts. Various FL frameworks have been developed for healthcare applications, with NVFlare offering high-efficiency simulation tools for research and robust production capabilities for enterprise users.

Purpose of the work and research questions[edit | edit source]

This work aims to develop a Federated Learning system for automatic detection of gastrointestinal polyps, by integrating deep-learning inference tasks with privacy-preserving training. This research addresses the challenges of AI-assisted colonoscopy, focusing on diagnostic accuracy, computational efficiency, and protection of patient data.

The proposed system will employ an IoT-based smart endoscope device, prototyped with an embedded device running a Debian-derived GNU/Linux distribution and integrating a YOLOv5 deep learning model for real-time polyp detection. The FL training infrastructure will be implemented using NVFlare, with two separate nodes jointly training the global model without sharing raw medical data. This approach aims to enhance model generalization across various clinical environments while ensuring compliance with data privacy regulations. The key points of this research are as follows:

  1. Development of a deep-learning model for polyp-detection: a YOLOv5 Deep Learning model will be selected to be used as the training model in the federated learning framework and as the inference model to perform detection tasks on smart endoscopic devices.
  2. Integration of Nvidia NVFlare Federated Learning paradigm: the developed model will be implemented in the NVFlare framework as the starting point for the federated learning scenario.
  3. Investigation of the impact of decentralized training: a comparison between FL training and centralized training will be assessed to determine whether a federated learning environment improves model robustness across different nodes.
  4. Implementation of FL polyp detection in actual practice: the purpose is to investigate the applicability of the NVFlare framework in real-world environments and evaluate its ability to maintain model accuracy, reduce reliance on centralized datasets, and ensure compliance with healthcare privacy regulations.
  5. Design a system that integrates federated training and detection: the goal is to design a server-edge system where devices initially perform inference locally, then select only poorly detected images to be used for federated training. This iterative process focuses on enhancing model accuracy by training the model only on challenging cases and avoiding redundant training on well-classified images.
  6. Model Optimization for the designed workflow: finally, we aim to ensure that the model runs on edge devices with adequate computational resources without compromising detection accuracy.

Based on these objectives, the research aims to answer the following questions:

  1. How can Federated Learning be integrated with a deep-learning YOLOv5 model for polyp detection while preserving patient data privacy?
  2. How does the Federated Learning approach impact model generalization and robustness across different nodes?
  3. What is the trade-off between accuracy and computational efficiency during YOLOv5 model deployment in real-time polyp detection for an IoT-edge device system?
  4. What are the practical challenges in implementing the FL-based AI models in real-world medical environments, and how could they be addressed?
  5. Is the selected embedded platform suitable, in terms of hardware resources, for implementing an actual product?

This work attempts to find answers to the questions above while demonstrating the applicability of the federated framework in polyp detection.

Hardware/software test-bed[edit | edit source]

The following picture illustrates the test-bed used for this work.

ML-TN-009-testbed.png

It includes a host machine acting as both the FL server and one client (client 2), while the other client (client 1) operates on an embedded device, specifically DAVE Embedded Systems' ORCA Single Board Computer, powered by the NXP i.MX8M Plus SoC. This system-on-chip integrates a Neural Processing Unit (NPU), which hardware-accelerates ML workloads during the execution of inference algorithms that make use of the most common Deep Neural Network (DNN) architectures. Basically, the NPU is a dedicated processor optimized for executing the mathematical computations required by DNNs. Typical advantages of leveraging an NPU are:

  • Off-loading the CPU so that it can perform other processing
  • Increasing the throughput of processed samples (in this case, expressed in frames per second) by reducing the inference time of DNN-based algorithms
  • Improving power efficiency.

Regarding the host machine, to ensure a consistent and reproducible environment, a containerized architecture was deployed based on Docker. The Docker image is built on top of a recommended PyTorch image alongside necessary dependencies.

As for client 1, to keep its configuration simple, no containers were used. Instead, a Python virtual environment was created to reproduce the containerized environment of client 2. To facilitate these steps, a Debian-derived distribution called Armbian was installed. This distribution allows pre-built packages to be installed easily with the well-known apt tools. Armbian is very convenient for development and testing tasks, but it is not highly optimized, as it generally does not provide the software modules required to exploit proprietary hardware accelerators such as the i.MX8M Plus's NPU. As it was more important to verify the functionality of the proposed solution than to optimize its performance, it was deemed appropriate to prioritize speed of implementation and testing at the expense of performance in executing the inference algorithms. This is taken into account in this section, where the results achieved are discussed.

Specifications of both machines are detailed in the following table.

Spec / Machine         Host (server + client 2)    Client 1             Notes
Architecture           AMD64                       AARCH64
Processor / SoC        AMD Ryzen 9 5950X           NXP i.MX8M Plus
Hardware accelerator   NVIDIA RTX 3080 Ti GPU      2.3 TOPS NPU (1)     (1) Not exploited.
RAM [GB]               64                          6

As explained here, NVFlare had been chosen as the FL framework due to its capability in developing FL applications on embedded systems and addressing privacy regulations inherent in medical data processing.

NVFlare[edit | edit source]

NVFlare adopts a modular design, focusing on essential collaboration components, including a Controller that manages communication between the FL server and clients. The main workflow involves parameter initialization by the FL server, task delegation to clients, local model training, model updates submission back to the server, and aggregation of model updates using algorithms like FedAvg.

Among the available open-source frameworks, Nvidia NVFlare was selected for developing the Federated Learning environment in this project. A recent study conducted at DAVE Embedded Systems supports this choice, emphasizing the applicability of NVFlare for developing FL applications on embedded systems [7]. The findings of that study confirm that NVFlare facilitates the creation of real-world Federated Learning environments using Linux-powered embedded platforms, thus showing its adaptability. Furthermore, in the context of medical data and real-world clinical scenarios, this software development kit (SDK) addresses the requirements for real-time processing and privacy regulations. With its suite of tools designed for enhanced collaboration and efficiency, NVFlare emerges as the most suitable and reliable framework for building a Federated Learning system aimed at polyp detection on embedded devices.

Overview of NVFlare[edit | edit source]

NVFlare adopts a "less is more" philosophy in its construction and is designed around an Application Programming Interface (API) approach that emphasizes essential functionality while maintaining flexibility and reduced complexity. This design allows developers to easily customize and build Federated Learning workflows. The framework's architecture includes various components, such as Controllers, Task Executors, and Filters, which streamline the process of client coordination and task execution in federated environments.

The Federated Learning Server plays a central role in NVFlare, managing client communications, assigning tasks, aggregating model updates, and overseeing the overall workflow. It interacts with a Job component that specifies the federated learning tasks, while the FL Client represents the distributed nodes executing these tasks. Each Client includes an Executor responsible for locally processing the training tasks assigned by the Controller.

Furthermore, NVFlare ensures secure and efficient deployment of FL applications through an end-to-end operational environment. It provides security credentials and secure communication capabilities essential for real-world applications. Researchers can carry out FL studies and simulations using either admin commands through Notebooks or the NVFlare Console, an interactive command tool, facilitating streamlined operations.

Architecture and Workflow of NVFlare[edit | edit source]

The NVFlare workflow aligns with common FL algorithms, such as FedAvg, and consists of several key operational steps (a minimal aggregation sketch follows the list):

  1. The FL Server initiates a job with parameters, including a global model to be distributed to clients.
  2. The Controller assigns training tasks to clients, requesting model updates based on local data.
  3. Each Client's Executor processes the assigned tasks by training the model locally.
  4. Once training concludes, Executors upload model updates to the FL Server.
  5. The Controller collects and aggregates these updates using the designated federated learning algorithm to refine the global model.
  6. The updated global model is redistributed to clients, continuing this iterative process until the desired model accuracy is achieved.
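
To make the aggregation step concrete, the following is a minimal, framework-agnostic Python sketch of the FedAvg step described above; the weighting by local sample counts and all names are illustrative and do not reflect NVFlare's actual API.

# Minimal FedAvg aggregation sketch (illustrative, not NVFlare's API).
# Each client update is a dict of NumPy arrays plus the number of local samples.
import numpy as np

def fedavg(client_updates):
    """client_updates: list of (weights_dict, n_samples) tuples."""
    total = sum(n for _, n in client_updates)
    keys = client_updates[0][0].keys()
    # Weighted average of every parameter tensor across clients.
    return {k: sum(w[k] * (n / total) for w, n in client_updates) for k in keys}

# Example: two clients with different amounts of local data.
global_weights = fedavg([
    ({"conv1.weight": np.ones((3, 3))}, 750),
    ({"conv1.weight": np.zeros((3, 3))}, 250),
])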

In addition, optional filters can be integrated into task interactions to enhance data privacy through techniques such as differential privacy or homomorphic encryption, which do not hinder the training process.

Structured communication between the Controller and Executor is organized using Shareable Objects, which contain information transmitted between the client and server, and Data Exchange Objects (DXOs) that specify the content of these communications. An important element of NVFlare is the FLComponent class, which serves as the foundation for various components within the system, offering built-in mechanisms for auditing, event handling, logging, and error handling, facilitating organized FL activity.

NVFlare Simulator[edit | edit source]

The NVFlare Simulator is a crucial tool that allows developers and data scientists to expedite the development of FL Components and learning workflows. It enables local testing and debugging of applications on a single machine without needing a realistic project setup, as all clients and servers are simulated within the same process. The Simulator manages client instances and executes multiple federated learning rounds in a controlled environment, allowing components developed here to transition seamlessly into real-world federated scenarios.
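
In practice, such a simulated run is typically launched with the nvflare simulator command-line tool. The snippet below merely wraps that invocation from Python; the job folder, workspace path, and flag values are assumptions to be checked against the installed NVFlare version (see nvflare simulator --help).

# Launch an NVFlare simulation by wrapping the CLI (illustrative paths and values).
import subprocess

subprocess.run(
    [
        "nvflare", "simulator",
        "jobs/yolov5_fedavg",                # hypothetical job folder
        "-w", "/tmp/nvflare/sim_workspace",  # simulation workspace
        "-n", "2",                           # number of simulated clients
        "-t", "2",                           # number of parallel threads
    ],
    check=True,
)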

Real-world deployment and provisioning tools[edit | edit source]

To facilitate the implementation of FL systems in real-world contexts, NVFlare includes a comprehensive provisioning system that secures communications and simplifies deployment processes. The Provisioning tool generates security credentials and configurations for all participants in an FL study. Each participant receives a Startup Kit containing essential configuration files, certificates, and local authorization policies, ensuring a consistent and secure setup across different locations.

In practical applications, NVFlare employs client-server communication channels secured by signed certificates for identity verification and SSL to establish secure connections between clients and servers. The framework utilizes its own Certificate Authority (CA) to generate and sign certificates for each participant, thus ensuring unique identities. The gRPC protocol facilitates efficient and secure communication, verifying credentials via generated tokens before allowing clients to join the training process, thereby reinforcing security and preventing unauthorized access.

In summary, Nvidia NVFlare empowers developers to build highly adaptable and secure Federated Learning environments, especially suited for applications like medical data processing. Its architecture, tools, and focus on privacy and security make it a frontrunner in the Federated Learning framework realm.

Model development[edit | edit source]

Selection[edit | edit source]

The YOLO (You Only Look Once) family stands out for its remarkable efficiency and accuracy among the various object detection algorithms. YOLOv5, developed by the team at Ultralytics, has become a popular choice for real-time object detection. Our study implements YOLOv5, a decision supported by prior research from DAVE Embedded Systems that highlighted its performance in detecting polyps. The name YOLO reflects its unique approach: it examines the entire image at once to identify objects and their locations, unlike traditional methods that use a two-stage detection process. In the YOLO framework, object detection is treated as a regression problem: a single convolutional neural network predicts bounding boxes and class probabilities for the entire image.
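
As an illustration of this single-stage approach, a pretrained YOLOv5 model can be loaded and run on a frame in a few lines through PyTorch Hub; the image path below is a hypothetical placeholder.

# Single-pass detection with a pretrained YOLOv5 model via PyTorch Hub.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # downloads weights on first use
model.conf = 0.25                        # confidence threshold for reported detections

results = model("sample_frame.jpg")      # one forward pass over the whole image
results.print()                          # summary: detections and inference time
detections = results.xyxy[0]             # tensor rows: [x1, y1, x2, y2, confidence, class]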

YOLOv5 models[edit | edit source]

The YOLOv5 architecture consists of five distinct models, ranging from the computationally efficient YOLOv5n to the high-precision YOLOv5x. Each version is tailored for different deployment scenarios and varies in speed, size, and accuracy.

  • YOLOv5n (Nano): it is designed for resource-constrained environments and is the smallest and fastest model in the series. With a compact size of less than 2.5 MB in INT8 format and approximately 4 MB in FP32 format, it is ideal for deployment on edge devices and IoT platforms.
  • YOLOv5s (Small): YOLOv5s consists of approximately 7.2 million parameters. Its balance between efficiency and accuracy makes it suitable for CPU-based inference tasks as well as IoT platforms.
  • YOLOv5m (Medium): this mid-sized model contains 21.2 million parameters, offering a trade-off between speed and accuracy. YOLOv5m is often considered a versatile option for a broad range of object detection applications and datasets.
  • YOLOv5l (Large): with 46.5 million parameters, YOLOv5l is designed for scenarios that require higher precision, particularly in detecting smaller objects within images.
  • YOLOv5x (Extra Large): YOLOv5x boasts 86.7 million parameters, achieving the highest mean Average Precision (mAP) among its counterparts. However, this increased performance comes at the cost of higher computational requirements.

Performance metrics[edit | edit source]

The performance metrics discussed in this thesis are those used by the YOLOv5 model, which is employed in both centralized and federated training. Since federated training is implemented using the NVFlare framework, the evaluation metrics remain the same for both approaches, as YOLOv5 is an integral part of the NVFlare environment. To evaluate the performance of object detection models effectively, several metrics are employed, each providing insight into a different aspect of the model's accuracy and reliability. The metrics used for evaluating YOLOv5 models are listed below, focusing on AP, mAP, and confidence scores; a minimal computation sketch follows the list. These metrics are essential for assessing the effectiveness of object detection models, offering insights into their performance in identifying and localizing objects within images.

  • Intersection over Union (IoU): quantifies the overlap between a predicted bounding box and a ground truth bounding box. It plays a crucial role in evaluating the accuracy of object localization.
  • Precision (P): quantifies the percentage of true positives among all positive predictions, evaluating the model’s ability to avoid false positives.
  • Recall (R): calculates the ratio of correctly identified positive instances by the object detector.
  • F1-score: this score is the harmonic mean of precision and recall, providing a balanced evaluation of a model’s performance by considering both false positives and false negatives.
  • Average Precision (AP): this metric is calculated based on the Precision-Recall (PR) curve. Basically, it calculates the area under the PR curve (AUC), providing a single value that encapsulates the model’s precision and recall performance.
  • Mean Average Precision (mAP): extends the concept of AP by computing the average of the AP values across multiple object classes. This is the primary metric used to evaluate federated learning (FL) training performance and to compare results between centralized and federated scenarios.
  • Confidence score: when dealing with inference tasks, the confidence score is the quantity to consider. It is part of the model's predicted output and reflects how confident the model is that a box contains an object, as well as how accurate it believes the predicted box to be. If no object exists in a cell, the confidence score should be zero; otherwise it represents the IoU between the predicted box and the ground truth box.
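
The sketch below shows how the first few of these quantities can be computed for a single class, given boxes in (x1, y1, x2, y2) format; it is illustrative code, not the YOLOv5 implementation.

# IoU, precision, recall and F1 for one class (illustrative, not YOLOv5's own code).

def iou(box_a, box_b):
    # Intersection over Union of two boxes in (x1, y1, x2, y2) format.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# A prediction counts as a true positive when its IoU with a ground-truth box
# exceeds the chosen threshold (0.5 for mAP@0.5).
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))    # ~0.39
print(precision_recall_f1(tp=80, fp=10, fn=20))   # (~0.889, 0.8, ~0.842)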

Datasets[edit | edit source]

High-quality data that accurately reflects the size, shape, texture, and variability of polyps is essential for achieving accurate and robust detection of polyps during colonoscopy. To train and test machine learning models, several public datasets — including Kvasir-SEG, PolypDB, PolypGEN, Etis-LaribPolypDB, and CVC-ColonDB — were used to design algorithms for polyp detection and segmentation. These datasets provide a representative sample of images and videos captured during colonoscopy procedures in clinical practice. The datasets used in this research were selected following a thorough search of public resources, including an online benchmark table. In essence, the method involved a phased approach utilizing different datasets for pre-training, architecture evaluation, and performance testing of the YOLOv5 model in a federated learning context.

To enhance model generalization and reduce overfitting, image augmentation was applied during training as well. YOLOv5 incorporates various augmentation techniques, including the following (a minimal HSV example follows the list):

  • Mosaic augmentation: an image processing technique that combines four training images into one to encourage object detection models to better handle various object scales and translations.
  • Random affine transformations: include random rotation, scaling, translation, and cropping of images.
  • HSV augmentation: random modifications to the hue, saturation, and value of images.
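
As an example of the last item, a simple HSV jitter can be applied with OpenCV roughly as follows; the gain values mirror typical YOLOv5 hyperparameters (hsv_h, hsv_s, hsv_v) but are illustrative, as is the image path.

# Simple HSV colour jitter, similar in spirit to YOLOv5's hsv_h/hsv_s/hsv_v augmentation.
import cv2
import numpy as np

def hsv_augment(img_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    # Random gains in [1 - gain, 1 + gain] for hue, saturation and value.
    r = np.random.uniform(-1, 1, 3) * [h_gain, s_gain, v_gain] + 1
    hue, sat, val = cv2.split(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV))
    hue = ((hue.astype(np.float32) * r[0]) % 180).astype(np.uint8)           # OpenCV hue range is [0, 180)
    sat = np.clip(sat.astype(np.float32) * r[1], 0, 255).astype(np.uint8)
    val = np.clip(val.astype(np.float32) * r[2], 0, 255).astype(np.uint8)
    return cv2.cvtColor(cv2.merge((hue, sat, val)), cv2.COLOR_HSV2BGR)

augmented = hsv_augment(cv2.imread("sample_frame.jpg"))  # placeholder image path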

Central training[edit | edit source]

Before the central training process of the initial model, a pre-training phase was executed to fine-tune the network's head for use in both central and federated training. After performing data preprocessing, a YOLOv5 model was trained to reproduce the results obtained in this prior project.

Example of a polyp image annotation with bounding boxes in VOC format to be converted into YOLO format.
EndoCV_C2_0197.jpg with predicted bonding boxes. Image from: PolypGEN dataset, Center 2.
Associated class and coordinates of the bounding boxes.

In doing so, comparisons with the previous work and the literature were drawn to ensure model robustness. Moreover, an effort was made to guarantee model reproducibility by enriching the original dataset with additional publicly available datasets. The newly trained model was evaluated on a dedicated test dataset of endoscopic images to demonstrate its improvement. A comparative analysis among various YOLOv5 versions was then conducted to further refine the development process and determine the most suitable model for the Federated Learning implementation.

Project result directory of YOLOv5 containing all training outputs and visualizations of performance metrics.
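
Since the annotations mentioned above come as VOC-style absolute corner coordinates, converting one bounding box to the normalized YOLO format (class, x_center, y_center, width, height) can be sketched as follows; the box values and the single class index are illustrative.

# Convert one VOC-style box (absolute x_min, y_min, x_max, y_max) into a YOLO label
# line: "class x_center y_center width height", all normalized to [0, 1].

def voc_to_yolo(box, img_w, img_h, class_id=0):
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a 200x150 pixel polyp box in a 1350x1080 frame (illustrative values).
print(voc_to_yolo((600, 400, 800, 550), img_w=1350, img_h=1080))
# -> "0 0.518519 0.439815 0.148148 0.138889"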

Integration into the Federated Learning workflow[edit | edit source]

The integration of the model into a Federated Learning workflow was facilitated using the NVFlare open-source framework and the pre-trained YOLOv5 weights. A simulation was carried out to test NVFlare tools and evaluate the performance of the model trained across two distinct clients. This was performed using NVFlare FL Simulator, leveraging the original dataset and the widely adopted FedAvg algorithm, which is well-documented in existing literature and included in NVFlare’s repository examples. Moreover, advanced techniques such as Secure Aggregation and Homomorphic Encryption were rigorously tested to enhance the security and integrity of the training process.

Federated training[edit | edit source]

The NVFlare Provisioning tool was then employed to execute federated training, paying attention to the computational constraints inherent to the embedded device. To simplify the process and make it reproducible, a software framework was devised for smart endoscopes, integrating both inference capabilities and federated training. The starting point of this framework was the YOLOv5 model trained on the enriched dataset. The workflow began with the distribution of the global YOLO model to the individual clients, followed by the conversion of the model into a streamlined TFLite format to optimize the usage of hardware resources. Inference was then executed, and images that met predetermined quality criteria were segregated and excluded from the next federated training. The concept was to train only on images where the model still needs to learn, while discarding those on which it performed well, as these would not bring new information to the next training round.

Model exportation in .tflite format on the embedded board.
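
A minimal sketch of the per-client inference-and-filtering step described above, using the TensorFlow Lite interpreter on the exported model, might look like the following; the model path, the preprocessing, and the output layout are illustrative assumptions.

# Run the exported TFLite model on local images and keep for federated training only
# those whose best detection stays below a confidence threshold (illustrative sketch).
import glob
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="best-fp16.tflite")  # hypothetical model file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
size = int(inp["shape"][1])      # square input, e.g. 320

def best_confidence(image_path):
    img = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    img = tf.image.resize(img, (size, size)) / 255.0
    interpreter.set_tensor(inp["index"], img[tf.newaxis, ...].numpy().astype(np.float32))
    interpreter.invoke()
    pred = interpreter.get_tensor(out["index"])[0]   # assumed layout: (boxes, 5 + classes)
    return float((pred[:, 4] * pred[:, 5:].max(axis=1)).max())  # objectness * class score

hard_cases = [p for p in glob.glob("local_images/*.jpg") if best_confidence(p) < 0.7]
print(f"{len(hard_cases)} images selected for the next federated training round")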

Results[edit | edit source]

The outcomes of the experiments conducted on both the centralized and federated training processes are briefly presented in this chapter. Initially, the focus will be on evaluating the performance of YOLOv5 models within a central environment to determine the most appropriate model and its best parameters for federated learning. This was followed by several tests conducted within the federated learning environment using NVFlare, with the Simulator and Provisioning Tool. Finally, the chapter will illustrate the proposed system's performance, which integrates federated learning with real-time inference to emulate a realistic deployment scenario where smart endoscopic devices collectively improve their detection capability. The results will highlight the impact of model configurations and optimization strategies, providing insights into the effectiveness of the designed FL system in enhancing polyp detection performance on smart IoT platforms.

Centralized approach results[edit | edit source]

Training YOLOv5 models on dataset 1[edit | edit source]

This section presents the results of training the entire YOLOv5 family of models (n, s, m, l, and x) on one of the datasets listed here, starting from the warm-up weights obtained previously and ranging from the smallest to the largest version. The models were trained with two different optimizer configurations, SGD and Adam, both for 100 epochs with an input image size of 640 pixels and a batch size of 16. Model performance was evaluated using the test set of Dataset 1.

Model parameter specifications for training YOLOv5 family of models.
Model performance on Dataset 1 test set using SGD optimizer.
Model performance on Dataset 1 test set using Adam optimizer.

The objective of this initial analysis was to evaluate the performance of the different architectures of the YOLOv5 model family. Since the task was to detect only one class, the performance of the smaller YOLOv5 models was close to that of the largest model, which has over 40 million parameters. Larger models performed better in terms of mAP, especially with variable IoU thresholds, demonstrating greater robustness. However, one of the key goals of this test was also to evaluate the performance of the two optimizers. It was observed that, in general, larger models performed better with the Adam optimizer, showing greater stability during training. On the other hand, for the smaller models, such as versions s and n, the best performance was achieved using the SGD optimizer, in terms of both precision and mAP. Given the good results obtained with the smaller models, the subsequent analyses used the lighter models, YOLOv5s and YOLOv5n, which would later be implemented for Federated Learning.

An additional effort was made to optimize these models by evaluating different batch sizes and numbers of training epochs, through visual inspection of both training and validation loss curves as well as assessment of the model's performance. The following analyses focused on the YOLOv5s model with the SGD optimizer, using 100 epochs and different batch size configurations.

Model performance on Dataset 1 test set using yolov5s model and different batch size configurations.
Training and validation loss curves of yolov5s
Batch size of 4 Batch size of 16 Batch size of 64
ML-TN-009-img-00.png
ML-TN-009-img-01.png
ML-TN-009-img-03.png

The batch size parameter did not significantly influence the training process. Although a more stable loss is observed with larger batch sizes, the focus was on smaller batch sizes, which would be used in federated training on the embedded board. Moreover, as expected, an increased batch size reduced the training execution time, though at the cost of higher computational resource requirements. Similar results were observed for the yolov5n architecture.

Training YOLOv5 models on Dataset 2[edit | edit source]

Once the v5s and v5n architectures were selected, their performance was evaluated using Dataset 2, which contains a larger number of image samples. This was done to build a more robust model and assess its generalization on a separate test dataset. Again, an optimizer comparison was conducted with Adam and SGD optimizers. Both models were trained for 100 epochs. The results for the v5s model are presented below. Similar considerations can be made for the v5n model as well.

Training and validation loss curves of yolov5s trained on Dataset 2
Adam optimizer SGD optimizer SGD optimizer with early stopping at 30 epochs
ML-TN-009-img-04.png
ML-TN-009-img-05.png
ML-TN-009-img-06.png

It can be observed that training the v5s model for 100 epochs with the SGD optimizer leads to overfitting, as the validation loss starts to increase after the 50th epoch. Given the previous good results with the SGD optimizer, early stopping was applied to mitigate this issue. The same scenario was also observed during training of the lighter v5n version. When evaluating the performance on the validation set of Dataset 2, the first case (Adam optimizer) and the third case (SGD optimizer with early stopping at 30 epochs) were taken into consideration for both model architectures.

Model performance of yolov5s with Adam and SGD optimizer on validation set.
Model performance of yolov5n with Adam and SGD optimizer on validation set.

It can be observed that the early stopping technique helped achieve a performance similar to that of training with the Adam optimizer, with the added benefit of fewer iterations and reduced training time.

Validation batch composed of 8 images of the yolov5s model trained on SGD for 30 epochs on the left and validation predictions on the same batch on the right. The predictions are expressed in terms of confidence score.
ML-TN-009-img-07.png
ML-TN-009-img-08.png

Generalizability assessment of training on Dataset 1 vs. Dataset 2[edit | edit source]

Both the yolov5s and yolov5n models trained with the SGD optimizer on Dataset 2 were used to evaluate performance on the two test datasets: the Etis-Larib dataset and Center 4 of the PolypGEN dataset. Their performance was then compared with that of the versions previously trained on Dataset 1.

Model performance of yolov5 models trained on Dataset 2 and tested on Etis-Larib dataset and Center 4 of PolypGEN dataset.
Model performance of yolov5 models trained on Dataset 1 and tested on Etis-Larib dataset and Center 4 of PolypGEN dataset.

Overall, the models trained on the enriched dataset achieved good results even on different datasets, demonstrating the robustness and generalizability of the model. Between the two test sets, the lowest performance was observed on the Center 4 dataset, especially for the n architecture. The following figure shows that the model still struggles to recognize polyps at the edges of the image, particularly when acquisitions are blurred. In general, the v5s version showed a better ability to capture features, leading to improved performance with respect to the n architecture. However, when comparing the test results with those of the model trained on the initial dataset, a significant improvement in the metrics was evident. Specifically, there was an average improvement of 8% in F1-score and 9.27% in mAP, showing that the model trained on the richer dataset generalizes better than the one trained on a single dataset.

Center 4 test batch composed of 8 images of the yolov5s model trained on SGD for 30 epochs on the left and test predictions on the same batch on the right. The predictions are expressed in terms of confidence score.
ML-TN-009-img-10.png
ML-TN-009-img-11.png

Federated approach results[edit | edit source]

Comparison of centralized and Federated Learning using FL Simulator[edit | edit source]

In the first test conducted using the NVFlare Simulator tool, the weights used were those of the YOLOv5s model obtained after the warm-up training phase. Following the federated training workflow, the entire Dataset 1 was divided into two splits using the PolypDataSplitter component, and each split was further divided into a training set and a validation set. Therefore, each site consisted of 500 images, with 75% reserved for training and 25% for model validation, used by the Cross-Site Model Evaluation workflow. The models were evaluated in terms of mAP during the validation phase and compared with centralized training. To assess the ability to perform federated training, multiple configurations of training rounds and local epochs were tested. The study aimed to examine the effect of increasing the number of training rounds on the server side and of local training epochs on the client side. To assess this effect, four different experiments were conducted.
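
The splitting logic performed by the PolypDataSplitter component presumably resembles the following sketch; file names, seed, and proportions are illustrative.

# Split an image list into two client shards, each with a 75/25 train/validation split
# (illustrative sketch, not the actual PolypDataSplitter implementation).
import random

def split_for_clients(image_paths, n_clients=2, train_fraction=0.75, seed=0):
    random.Random(seed).shuffle(image_paths)
    shard_size = len(image_paths) // n_clients
    shards = {}
    for i in range(n_clients):
        shard = image_paths[i * shard_size:(i + 1) * shard_size]
        cut = int(len(shard) * train_fraction)
        shards[f"site-{i + 1}"] = {"train": shard[:cut], "val": shard[cut:]}
    return shards

# Example: 1000 images -> two sites of 500 images, 375 for training and 125 for validation each.
shards = split_for_clients([f"img_{i:04d}.jpg" for i in range(1000)])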

FL Simulator tests conducted varying training rounds on server side and local epochs on client side.
mAP curve: the orange line refers to the centralized approach, the blue line to the federated one, obtained with Cross-Site Model Evaluation after each federated training round. The x-axis represents the number of epochs, and the y-axis the mAP metric with IoU threshold 0.5.
test-1 sim
test-2 sim
test-3 sim
test-4 sim

The performance of the model with federated training is evaluated at each training round, when the model weights are aggregated to form the global model. The mAP provided an indicator to assess the model’s response when trained with federated and centralized approaches. It is evident that the performance curve of the federated approach follows that of the centralized approach, with a mAP above 0.8. The YoloModelLearner was implemented so that if the performance did not show significant improvement at each aggregation round, the model weights remained those aggregated from the previous iteration. This is noticeable towards the later steps, where the curve reaches saturation with a plateau. This indicates that the performance, after the 20th epoch, does not show significant improvement when trained with the federated approach. However, this result is also visible with centralized training. Since the objective was to minimize computational cost for the smart endoscope, this result provided an indication of the total number of epochs required for training the network in a real-world scenario.

Specifically, it can be observed that a low number of local epochs (figure test-2 sim) results in lower performance. If we consider the mAP at the 15th epoch, the curve obtained with 4 local epochs per aggregation round is lower compared to those with a higher number of local epochs. This suggested that the model needs a certain number of local epochs to adjust the weights and improve training. On the other hand, results with more than 8 local epochs appear more promising.

Considering that the goal was to reduce the number of training rounds, which are computationally expensive (since each round also runs the Cross-Site Model Evaluation workflow), a good compromise between training epochs and local rounds was to keep the number of rounds relatively low (up to 3 rounds) and increase the number of local epochs from 8 to 20, as demonstrated in the literature when using the FedAvg algorithm. This will be considered for analysis in federated training with the Provisioning Tool.

Federated Learning in real-world scenarios with Provisioning Tool[edit | edit source]

Federated Training with Provisioning Tool results[edit | edit source]

The "real" federated learning test was conducted with the Provisioning tool, using as global weights the ones obtained by training YOLOv5s on Dataset 2. First, the startup kits and the training datasets were transferred to each client, one on the embedded board and the other on the host machine. Site-1, corresponding to the embedded board, received the dataset from Center 1 of PolypGEN for training and Center 4 as the validation set for Cross-Site Model Evaluation. Site-2, located on the host machine, received the dataset from Center 3 of PolypGEN for training and Etis-Larib as the validation set.

This configuration allowed for the evaluation of the aggregated models and a comparison of the performance of federated training against the tests conducted with a centralized approach.

Additionally, the system's security in generating certificates for connecting the server and clients was analyzed, and an attempt was made to test the use of Homomorphic Encryption.

Client’s configuration of the federated real-world experiment based on yolov5s model weights obtained from Dataset 2.

An initial attempt to train the network with a batch size of 16 and an image size of 640 failed, due to insufficient computational resources on board. Therefore, the batch size was reduced to 4 and the image size to 320, at the cost of a potential drop in performance when detecting smaller polyps. Three tests were conducted to evaluate the performance of federated training. Details are reported in the following table.

Federated, "real"-world tests configuration with different number of rounds and local epochs.

First, test-1 real and test-2 real were compared to investigate whether freezing the backbone layers leads to a local improvement in performance while also reducing training time. To analyze this, the figures related to training loss and validation loss were inspected, along with the mAP curve and validation loss curve.

Training curves for test-1 without freezing on site-1. On the x-axis is represented the number of training epochs.
ML-TN-009-img-30.png
ML-TN-009-img-31.png
Training curves for test-1 with frozen backbone on site-1
ML-TN-009-img-32.png
ML-TN-009-img-33.png
Training curves for test-1 without freezing on site-2
ML-TN-009-img-34.png
ML-TN-009-img-35.png
Training curves for test-1 with frozen backbone on site-2
ML-TN-009-img-36.png
ML-TN-009-img-37.png

The comparison generally shows that training without freezing the backbone produces better results, as it achieves a lower validation loss and a higher mAP curve. This is even more evident when observing the validation loss at site-1, which tends to increase. Therefore, to evaluate the performance at each site after the federated training rounds, test-1 and test-3 will be considered against the performance obtained with centralized training using Dataset 1 and Dataset 2.

Performance metrics on test set of each site conducted in real-environment.

The two tests yielded similar results, although the first one, which had a higher number of epochs and fewer training rounds, showed slightly better performance. In addition to the performance comparison, it is also important to note the impact that the computational constraints at site-1 had on the overall efficiency of federated training. While site-2, using the GPU, kept training time short (4 minutes per round on average), site-1, which relied on the CPU, needed a significantly longer training time. Such contrast emphasizes the need for adequate computational resources, especially when applying federated learning on edge devices based on resource-constrained embedded platforms. Furthermore, while the first test showed slightly better performance, it is essential to evaluate the trade-off between improved accuracy and computational cost in real-world applications, where time efficiency and resource constraints play a crucial role in system deployment.

CPU hardware utilization on executing federated learning on SBC ORCA.
Performance metrics on test set for the centralized approach with yolov5s trained on Dataset 1 and Dataset 2 and federated approach based on test-1.

As mentioned earlier, Cross-Site Model Evaluation was implemented in the federated training to gather performance data from the aggregated models on each site. The same validation sets from each site were also used in the centralized method. It can be observed that federated training, starting from the weights obtained on Dataset 1 and then trained on new data, resulted in a global model that performed better than the centralized approach on Dataset 1 when evaluated on the same test datasets. In fact, it almost reached the performance of the model trained on the richer dataset. This demonstrated the effectiveness of the federated training approach in improving model performance by aggregating model weights trained on new data from different sites.

Inference-federated system results[edit | edit source]

The results of the designed inference-federated system for smart endoscopes, which integrates inference with federated learning, are reported in this section. Based on the assumptions made in the chapter Model development, only Center 1 from the PolypGEN dataset was used for Site-1, while Center 3 was used for Site-2. Both datasets were split into 80% for training and 20% for validation.

Client’s configuration of the inference-federated system.

The proposed “active” learning system leveraged both the s and n network architectures, starting from the weights obtained by training the network on the complete Dataset 2. This system followed a two-step process: an initial inference phase, where the model predicted detections, followed by federated training on images that did not reach a confidence score higher than 0.7. This ensured that the model focused on hard-to-detect samples, improving the overall performance over time. Both tests were conducted on s and n architectures to compare their efficiency and accuracy in polyp detection. The results of the first inference phase are presented below, including the detection speed, measured in frames per second (FPS), to evaluate the computational efficiency of each model configuration.

Inference results on first detection task on both FL clients, before federated training, using yolov5s architecture.

The results showed that the first filtering phase significantly reduced the number of images to be trained on in the FL environment; this reduction helped lower the computational load in the subsequent FL training, particularly on the embedded device. This aspect was crucial for optimizing resource usage and ensuring the feasibility of the learning process on low-power devices, especially considering that in the previous training session a significant amount of time was required to train a model for just 20 epochs. One of the main parameters used to assess hardware performance on the embedded device was the frame rate, in frames per second (FPS), achieved during inference. The XNNPack delegate, invoked by TensorFlow to run inference on the CPU of the embedded board, achieved 2 FPS, which is quite low for real-time inference but not surprising considering that the Linux distribution used for testing did not support the NPU, as mentioned in this section. This suggested that using a hardware acceleration solution, such as a dedicated NPU, could improve inference speed significantly. In this regard, this benchmarking by NXP Semiconductors can help to make a ballpark estimation of the performance boost achievable by enabling NPU acceleration. In essence, it is reasonable to expect an order-of-magnitude improvement by migrating to a Linux distribution that supports the NPU. As such, a frame rate of approximately 20 FPS, which is assumed to be the minimum threshold for a truly usable endoscope in actual utilization, should be achievable.

Following this step, two additional rounds of federated training were performed, each followed by an inference and filtering phase to evaluate improvements in detection, particularly for the samples that had previously performed poorly. To optimize computational efficiency, only six epochs were used for federated training, with a progressively reduced batch size starting from 4, resulting in an average training time of 22 minutes per round. This setup was designed to strike a balance between computational efficiency and effective model updates, given the hardware constraints of the embedded system.

Inference results on detection task on both FL clients, after federated training, using yolov5s architecture.

The iterative process was created to refine the model's ability to recognize polyp patterns with the goal of preventing unnecessary retraining on already well classified data.

Overall, this approach enabled the model to learn more complex patterns after multiple training rounds, particularly in cases where it initially struggled to classify them correctly. Although the improvements were slight, they demonstrate the system's ability to adapt and refine its understanding of complex cases through iterative training. The results obtained using the YOLOv5n architecture are reported below.

Inference results on first detection task on both FL clients, before federated training, using yolov5n architecture.
Inference results on detection task on both FL clients, after federated training, using yolov5n architecture.

The YOLOv5n architecture also showed improvements in performance on images that previously had low accuracy, with each training round taking about 17 minutes. It can be observed that the performance in terms of FPS has slightly improved, though there is still room for further optimization, especially considering the limitations of the embedded device's hardware. In the next chapter, conclusions will be drawn, and the trade-offs between the different configurations, such as the s and n models, will be evaluated in relation to their respective limitations. These considerations were crucial for optimizing both model accuracy and efficiency in real-world deployment scenarios, especially on resource-constrained devices like embedded systems.

Migrating to an NPU-enabled Yocto distribution[edit | edit source]

Once the effectiveness of the final configuration was verified, it was migrated to an NPU-enabled Yocto distribution. Specifically, the one provided by DESK-MX8M-L 4.2.1 was used, which is based on Yocto Kirkstone:

root@desk-mx8mp:~# cat /etc/os-release
ID=dave-virtualization-wayland
NAME="NXP i.MX Release Distro"
VERSION="5.15-kirkstone (kirkstone)"
VERSION_ID=5.15-kirkstone
PRETTY_NAME="NXP i.MX Release Distro 5.15-kirkstone (kirkstone)"
DISTRO_CODENAME="kirkstone"
BUILD_VERSION="desk-mx8m-l-4.2.1"

This step allowed us to test the FL deployment on the ORCA SBC in a near production-grade setup, as Yocto is conceived mainly for use in production. Two factors were addressed: the improvement in throughput thanks to NPU acceleration and the power consumption during the inference phase. As known, the ORCA SBC embodies the ORCA SOM, which is the building block from which a hypothetical smart endoscope would be designed. Therefore, we focused on the SOM from the consumption perspective. To do that, the shunt resistor depicted in the following picture was used.

Shunt resistor used to measure the power consumption of ORCA SOM embodied in ORCA SBC.
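
For reference, the SOM power draw follows from Ohm's law applied to the voltage drop measured across the shunt; the resistor value, supply rail, and measured drop below are purely illustrative placeholders, not measured values.

# Power estimation from a shunt resistor measurement (illustrative numbers only).
R_SHUNT = 0.010      # ohm, hypothetical shunt value
V_SUPPLY = 5.0       # V, hypothetical SOM supply rail
v_shunt = 0.006      # V, hypothetical voltage drop measured across the shunt

current = v_shunt / R_SHUNT   # I = V_shunt / R   -> 0.6 A
power = V_SUPPLY * current    # P = V_supply * I  -> 3.0 W
print(f"Estimated SOM power draw: {power:.2f} W")
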
Throughput improvement and power consumption[edit | edit source]

To make it suitable for optimal execution on the platform, the model was converted with the eIQ Toolkit provided by NXP. Different configurations were then tested, as illustrated in the following sections.

Test #1: FP16, no NPU[edit | edit source]

Please note that the starting model's precision was FP16. To make a fair comparison with the previous tests conducted with Armbian, the precision was initially left untouched and the NPU was not exploited, i.e. a CPU-only inference was run. The log of the inference test associated with this configuration is reported in the following box.

 1 STARTING!
 2 Log parameter values verbosely: [0]
 3 Graph: [val_yolo-fp16.tflite]
 4 Enable op profiling: [1]
 5 Loaded model val_yolo-fp16.tflite
 6 The input model file size (MB): 14.1229
 7 Initialized session in 967.16ms.
 8 Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
 9 count=1 curr=603409
10 
11 Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
12 count=50 first=544762 curr=542001 min=542001 max=546058 avg=544830 std=720
13 
14 Inference timings in us: Init: 967160, First inference: 603409, Warmup (avg): 603409, Inference (avg): 544830
15 Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
16 Memory footprint delta from the start of the tool (MB): init=71.6367 overall=108.352
17 Profiling Info for Benchmark Initialization:
18 ============================== Run Order ==============================
19 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
20 	 ModifyGraphWithDelegate	            0.000	  732.825	  732.825	 99.951%	 99.951%	 67056.000	        1	ModifyGraphWithDelegate/0
21 	         AllocateTensors	          732.645	    0.358	    0.180	  0.049%	100.000%	     0.000	        2	AllocateTensors/0
22 
23 ============================== Top by Computation Time ==============================
24 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
25 	 ModifyGraphWithDelegate	            0.000	  732.825	  732.825	 99.951%	 99.951%	 67056.000	        1	ModifyGraphWithDelegate/0
26 	         AllocateTensors	          732.645	    0.358	    0.180	  0.049%	100.000%	     0.000	        2	AllocateTensors/0
27 
28 Number of nodes executed: 2
29 ============================== Summary by node type ==============================
30 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
31 	 ModifyGraphWithDelegate	        1	   732.825	    99.951%	    99.951%	 67056.000	        1
32 	         AllocateTensors	        1	     0.361	     0.049%	   100.000%	     0.000	        2
33 
34 Timings (microseconds): count=1 curr=733186
35 Memory (bytes): count=0
36 2 nodes observed
37 
38 
39 
40 Operator-wise Profiling Info for Regular Benchmark Runs:
41 ============================== Run Order ==============================
42 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
43 	   TfLiteXNNPackDelegate	            0.029	  373.826	  373.872	 68.631%	 68.631%	     0.000	        1	[model/tfc3_4/tf_conv_34/conv2d_34/BiasAdd/ReadVariableOp1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/conv2d_37/BiasAdd/ReadVariableOp1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/conv2d_38/BiasAdd/ReadVariableOp1, model/tfc3_4/tf_conv_35/conv2d_35/BiasAdd/ReadVariableOp1, model/tfc3_4/tf_conv_36/conv2d_36/BiasAdd/ReadVariableOp1, model/tf_conv_39/conv2d_39/BiasAdd/ReadVariableOp1, model/tfc3_5/tf_conv_40/conv2d_40/BiasAdd/ReadVariableOp1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/conv2d_43/BiasAdd/ReadVariableOp1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/BiasAdd/ReadVariableOp1, model/tfc3_5/tf_conv_41/conv2d_41/BiasAdd/ReadVariableOp1, model/tfc3_5/tf_conv_42/conv2d_42/BiasAdd/ReadVariableOp1, model/tf_conv_45/sequential_11/conv2d_45/BiasAdd/ReadVariableOp1, model/tfc3_6/tf_conv_46/conv2d_46/BiasAdd/ReadVariableOp1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/conv2d_49/BiasAdd/ReadVariableOp1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/conv2d_50/BiasAdd/ReadVariableOp1, model/tfc3_6/tf_conv_47/conv2d_47/BiasAdd/ReadVariableOp1, model/tfc3_6/tf_conv_48/conv2d_48/BiasAdd/ReadVariableOp1, model/tf_conv_51/sequential_13/conv2d_51/BiasAdd/ReadVariableOp1, model/tfc3_7/tf_conv_52/conv2d_52/BiasAdd/ReadVariableOp1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/conv2d_55/BiasAdd/ReadVariableOp1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/conv2d_56/BiasAdd/ReadVariableOp1, model/tfc3_7/tf_conv_53/conv2d_53/BiasAdd/ReadVariableOp1, model/tfc3_7/tf_conv_54/conv2d_54/BiasAdd/ReadVariableOp1, model/tf_detect/tf_conv2d_2/conv2d_59/BiasAdd/ReadVariableOp1, model/tf_detect/tf_conv2d_1/conv2d_58/BiasAdd/ReadVariableOp1, model/tf_detect/tf_conv2d/conv2d_57/BiasAdd/ReadVariableOp1, model/tfc3_4/tf_conv_34/conv2d_34/Conv2D1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/conv2d_37/Conv2D1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/conv2d_38/Conv2D1, model/tfc3_4/tf_conv_35/conv2d_35/Conv2D1, model/tfc3_4/tf_conv_36/conv2d_36/Conv2D1, model/tf_conv_39/conv2d_39/Conv2D1, model/tfc3_5/tf_conv_40/conv2d_40/Conv2D1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/conv2d_43/Conv2D1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/Conv2D1, model/tfc3_5/tf_conv_41/conv2d_41/Conv2D1, model/tfc3_5/tf_conv_42/conv2d_42/Conv2D1, model/tf_conv_45/sequential_11/conv2d_45/Conv2D1, model/tfc3_6/tf_conv_46/conv2d_46/Conv2D1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/conv2d_49/Conv2D1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/conv2d_50/Conv2D1, model/tfc3_6/tf_conv_47/conv2d_47/Conv2D1, model/tfc3_6/tf_conv_48/conv2d_48/Conv2D1, model/tf_conv_51/sequential_13/conv2d_51/Conv2D1, model/tfc3_7/tf_conv_52/conv2d_52/Conv2D1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/conv2d_55/Conv2D1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/conv2d_56/Conv2D1, model/tfc3_7/tf_conv_53/conv2d_53/Conv2D1, model/tfc3_7/tf_conv_54/conv2d_54/Conv2D1, model/tf_detect/tf_conv2d_2/conv2d_59/Conv2D1, model/tf_detect/tf_conv2d_1/conv2d_58/Conv2D1, model/tf_detect/tf_conv2d/conv2d_57/Conv2D1, model/tf_detect/sub1, model/tf_detect/sub_21, model/tf_detect/sub_11, model/tf_detect/mul_11, model/tf_detect/mul_151, model/tf_detect/mul_81, model/tf_detect/strided_slice_161, model/tf_detect/strided_slice_121, model/tf_detect/strided_slice1, 
model/tf_detect/truediv_4;model/tf_detect/Const1, model/tf_detect/mul_16/y1, model/tfc3_1/tf_conv_10/mul_1, model/tfc3_2/tf_conv_18/mul_1, model/tf_conv_33/mul_1]:386
44 	 RESIZE_NEAREST_NEIGHBOR	          373.902	    0.105	    0.103	  0.019%	 68.650%	     0.000	        1	[model/tf_upsample/resize/ResizeNearestNeighbor]:253
45 	   TfLiteXNNPackDelegate	          374.006	   37.059	   37.191	  6.827%	 75.477%	     0.000	        1	[model/tf_conv_39/mul_1]:387
46 	 RESIZE_NEAREST_NEIGHBOR	          411.199	    0.211	    0.210	  0.039%	 75.516%	     0.000	        1	[model/tf_upsample_1/resize/ResizeNearestNeighbor]:274
47 	   TfLiteXNNPackDelegate	          411.410	  131.158	  131.037	 24.054%	 99.570%	     0.000	        1	[model/tf_detect/Reshape_4, model/tf_detect/Reshape_2, model/tf_detect/Reshape]:388
48 	           STRIDED_SLICE	          542.448	    0.020	    0.021	  0.004%	 99.574%	     0.000	        1	[model/tf_detect/strided_slice_19]:336
49 	           STRIDED_SLICE	          542.470	    0.009	    0.010	  0.002%	 99.576%	     0.000	        1	[model/tf_detect/strided_slice_21]:342
50 	           STRIDED_SLICE	          542.480	    0.010	    0.009	  0.002%	 99.578%	     0.000	        1	[model/tf_detect/strided_slice_22]:347
51 	           STRIDED_SLICE	          542.490	    0.034	    0.034	  0.006%	 99.584%	     0.000	        1	[model/tf_detect/strided_slice_11]:353
52 	           STRIDED_SLICE	          542.524	    0.034	    0.032	  0.006%	 99.590%	     0.000	        1	[model/tf_detect/strided_slice_13]:359
53 	           STRIDED_SLICE	          542.557	    0.031	    0.032	  0.006%	 99.596%	     0.000	        1	[model/tf_detect/strided_slice_14]:364
54 	           STRIDED_SLICE	          542.591	    0.128	    0.124	  0.023%	 99.618%	     0.000	        1	[model/tf_detect/strided_slice_31]:370
55 	           STRIDED_SLICE	          542.715	    0.127	    0.124	  0.023%	 99.641%	     0.000	        1	[model/tf_detect/strided_slice_51]:376
56 	           STRIDED_SLICE	          542.840	    0.119	    0.123	  0.023%	 99.664%	     0.000	        1	[model/tf_detect/strided_slice_63]:381
57 	   TfLiteXNNPackDelegate	          542.964	    1.829	    1.831	  0.336%	100.000%	     0.000	        1	[StatefulPartitionedCall:0]:389
58 
59 ============================== Top by Computation Time ==============================
60 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
61 	   TfLiteXNNPackDelegate	            0.029	  373.826	  373.872	 68.631%	 68.631%	     0.000	        1	[model/tfc3_4/tf_conv_34/conv2d_34/BiasAdd/ReadVariableOp1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/conv2d_37/BiasAdd/ReadVariableOp1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/conv2d_38/BiasAdd/ReadVariableOp1, model/tfc3_4/tf_conv_35/conv2d_35/BiasAdd/ReadVariableOp1, model/tfc3_4/tf_conv_36/conv2d_36/BiasAdd/ReadVariableOp1, model/tf_conv_39/conv2d_39/BiasAdd/ReadVariableOp1, model/tfc3_5/tf_conv_40/conv2d_40/BiasAdd/ReadVariableOp1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/conv2d_43/BiasAdd/ReadVariableOp1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/BiasAdd/ReadVariableOp1, model/tfc3_5/tf_conv_41/conv2d_41/BiasAdd/ReadVariableOp1, model/tfc3_5/tf_conv_42/conv2d_42/BiasAdd/ReadVariableOp1, model/tf_conv_45/sequential_11/conv2d_45/BiasAdd/ReadVariableOp1, model/tfc3_6/tf_conv_46/conv2d_46/BiasAdd/ReadVariableOp1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/conv2d_49/BiasAdd/ReadVariableOp1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/conv2d_50/BiasAdd/ReadVariableOp1, model/tfc3_6/tf_conv_47/conv2d_47/BiasAdd/ReadVariableOp1, model/tfc3_6/tf_conv_48/conv2d_48/BiasAdd/ReadVariableOp1, model/tf_conv_51/sequential_13/conv2d_51/BiasAdd/ReadVariableOp1, model/tfc3_7/tf_conv_52/conv2d_52/BiasAdd/ReadVariableOp1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/conv2d_55/BiasAdd/ReadVariableOp1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/conv2d_56/BiasAdd/ReadVariableOp1, model/tfc3_7/tf_conv_53/conv2d_53/BiasAdd/ReadVariableOp1, model/tfc3_7/tf_conv_54/conv2d_54/BiasAdd/ReadVariableOp1, model/tf_detect/tf_conv2d_2/conv2d_59/BiasAdd/ReadVariableOp1, model/tf_detect/tf_conv2d_1/conv2d_58/BiasAdd/ReadVariableOp1, model/tf_detect/tf_conv2d/conv2d_57/BiasAdd/ReadVariableOp1, model/tfc3_4/tf_conv_34/conv2d_34/Conv2D1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/conv2d_37/Conv2D1, model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/conv2d_38/Conv2D1, model/tfc3_4/tf_conv_35/conv2d_35/Conv2D1, model/tfc3_4/tf_conv_36/conv2d_36/Conv2D1, model/tf_conv_39/conv2d_39/Conv2D1, model/tfc3_5/tf_conv_40/conv2d_40/Conv2D1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/conv2d_43/Conv2D1, model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/Conv2D1, model/tfc3_5/tf_conv_41/conv2d_41/Conv2D1, model/tfc3_5/tf_conv_42/conv2d_42/Conv2D1, model/tf_conv_45/sequential_11/conv2d_45/Conv2D1, model/tfc3_6/tf_conv_46/conv2d_46/Conv2D1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/conv2d_49/Conv2D1, model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/conv2d_50/Conv2D1, model/tfc3_6/tf_conv_47/conv2d_47/Conv2D1, model/tfc3_6/tf_conv_48/conv2d_48/Conv2D1, model/tf_conv_51/sequential_13/conv2d_51/Conv2D1, model/tfc3_7/tf_conv_52/conv2d_52/Conv2D1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/conv2d_55/Conv2D1, model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/conv2d_56/Conv2D1, model/tfc3_7/tf_conv_53/conv2d_53/Conv2D1, model/tfc3_7/tf_conv_54/conv2d_54/Conv2D1, model/tf_detect/tf_conv2d_2/conv2d_59/Conv2D1, model/tf_detect/tf_conv2d_1/conv2d_58/Conv2D1, model/tf_detect/tf_conv2d/conv2d_57/Conv2D1, model/tf_detect/sub1, model/tf_detect/sub_21, model/tf_detect/sub_11, model/tf_detect/mul_11, model/tf_detect/mul_151, model/tf_detect/mul_81, model/tf_detect/strided_slice_161, model/tf_detect/strided_slice_121, model/tf_detect/strided_slice1, 
model/tf_detect/truediv_4;model/tf_detect/Const1, model/tf_detect/mul_16/y1, model/tfc3_1/tf_conv_10/mul_1, model/tfc3_2/tf_conv_18/mul_1, model/tf_conv_33/mul_1]:386
62 	   TfLiteXNNPackDelegate	          411.410	  131.158	  131.037	 24.054%	 92.686%	     0.000	        1	[model/tf_detect/Reshape_4, model/tf_detect/Reshape_2, model/tf_detect/Reshape]:388
63 	   TfLiteXNNPackDelegate	          374.006	   37.059	   37.191	  6.827%	 99.513%	     0.000	        1	[model/tf_conv_39/mul_1]:387
64 	   TfLiteXNNPackDelegate	          542.964	    1.829	    1.831	  0.336%	 99.849%	     0.000	        1	[StatefulPartitionedCall:0]:389
65 	 RESIZE_NEAREST_NEIGHBOR	          411.199	    0.211	    0.210	  0.039%	 99.887%	     0.000	        1	[model/tf_upsample_1/resize/ResizeNearestNeighbor]:274
66 	           STRIDED_SLICE	          542.715	    0.127	    0.124	  0.023%	 99.910%	     0.000	        1	[model/tf_detect/strided_slice_51]:376
67 	           STRIDED_SLICE	          542.591	    0.128	    0.124	  0.023%	 99.933%	     0.000	        1	[model/tf_detect/strided_slice_31]:370
68 	           STRIDED_SLICE	          542.840	    0.119	    0.123	  0.023%	 99.956%	     0.000	        1	[model/tf_detect/strided_slice_63]:381
69 	 RESIZE_NEAREST_NEIGHBOR	          373.902	    0.105	    0.103	  0.019%	 99.975%	     0.000	        1	[model/tf_upsample/resize/ResizeNearestNeighbor]:253
70 	           STRIDED_SLICE	          542.490	    0.034	    0.034	  0.006%	 99.981%	     0.000	        1	[model/tf_detect/strided_slice_11]:353
71 
72 Number of nodes executed: 15
73 ============================== Summary by node type ==============================
74 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
75 	   TfLiteXNNPackDelegate	        4	   543.928	    99.850%	    99.850%	     0.000	        4
76 	           STRIDED_SLICE	        9	     0.506	     0.093%	    99.943%	     0.000	        9
77 	 RESIZE_NEAREST_NEIGHBOR	        2	     0.313	     0.057%	   100.000%	     0.000	        2
78 
79 Timings (microseconds): count=50 first=544700 curr=541925 min=541925 max=545953 avg=544755 std=719
80 Memory (bytes): count=0
81 15 nodes observed

The value of interest is the average time spent in the TfLiteXNNPackDelegate nodes: about 544 ms per frame, corresponding to ~1.8 FPS. As expected, this frame rate is close to the one observed when running the same model on the Armbian distribution. This value is the baseline against which the other configurations described in the rest of this section are evaluated.
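For reference, the following Python snippet is a minimal sketch of how a comparable CPU-only measurement could be reproduced with the TFLite Python interpreter instead of the benchmark_model tool. The model file name, thread count, and iteration count are assumptions, and XNNPACK is assumed to be the default CPU delegate of the runtime in use.

 # Minimal latency cross-check with the TFLite Python interpreter (sketch only).
 # Assumptions: model file name, 2 CPU threads, XNNPACK applied by default.
 import time
 import numpy as np
 import tflite_runtime.interpreter as tflite
 
 interpreter = tflite.Interpreter(model_path="val_yolo-fp32.tflite", num_threads=2)
 interpreter.allocate_tensors()
 inp = interpreter.get_input_details()[0]
 
 # Dummy input with the expected shape/dtype, used only to measure latency.
 dummy = np.random.random_sample(inp["shape"]).astype(inp["dtype"])
 
 latencies = []
 for _ in range(20):
     interpreter.set_tensor(inp["index"], dummy)
     t0 = time.perf_counter()
     interpreter.invoke()
     latencies.append(time.perf_counter() - t0)
 
 avg_s = sum(latencies) / len(latencies)
 print(f"avg latency: {avg_s * 1000:.1f} ms (~{1.0 / avg_s:.1f} FPS)")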

Test #2: FP16, NPU[edit | edit source]

For the sake of completeness, the FP16-precision, NPU-enabled configuration was tested as well, although it does not represent an interesting setup for realistic use cases: the average inference time exploded to roughly 10.8 s per frame (see the log below). Since this configuration is not a deployment candidate, the root cause was not investigated further.
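In this configuration the NPU is engaged through the external VX delegate (/usr/lib/libvx_delegate.so, as reported in the log). Purely as an illustrative sketch, the following Python snippet shows how such an external delegate can be loaded through the standard TFLite API; the model file name is taken from the log, while everything else is an assumption.

 # Sketch: run the FP16 model through the VX (NPU) external delegate from Python.
 # Assumption: the BSP ships the delegate at /usr/lib/libvx_delegate.so.
 import numpy as np
 import tflite_runtime.interpreter as tflite
 
 vx = tflite.load_delegate("/usr/lib/libvx_delegate.so")
 interpreter = tflite.Interpreter(
     model_path="val_yolo-fp16.tflite",  # name taken from the benchmark log
     experimental_delegates=[vx],
 )
 interpreter.allocate_tensors()
 
 inp = interpreter.get_input_details()[0]
 interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
 interpreter.invoke()  # the first invocation also triggers graph compilation (warm-up)
 out = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
 print(out.shape)

Note that the first invocation includes the delegate's graph compilation, which is consistent with the very long first-inference time reported by the benchmark (about 17.5 s).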

 1 STARTING!
 2 Log parameter values verbosely: [0]
 3 Graph: [val_yolo-fp16.tflite]
 4 Enable op profiling: [1]
 5 External delegate path: [/usr/lib/libvx_delegate.so]
 6 Loaded model val_yolo-fp16.tflite
 7 Vx delegate: allowed_cache_mode set to 0.
 8 Vx delegate: device num set to 0.
 9 Vx delegate: allowed_builtin_code set to 0.
10 Vx delegate: error_during_init set to 0.
11 Vx delegate: error_during_prepare set to 0.
12 Vx delegate: error_during_invoke set to 0.
13 EXTERNAL delegate created.
14 Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
15 The input model file size (MB): 14.1229
16 Initialized session in 15.618ms.
17 Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
18 count=1 curr=17454681
19 
20 Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
21 count=14 first=10768305 curr=10769175 min=10762125 max=10773582 avg=1.07692e+07 std=3182
22 
23 Inference timings in us: Init: 15618, First inference: 17454681, Warmup (avg): 1.74547e+07, Inference (avg): 1.07692e+07
24 Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
25 Memory footprint delta from the start of the tool (MB): init=9.88672 overall=124.406
26 Profiling Info for Benchmark Initialization:
27 ============================== Run Order ==============================
28 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
29 	 ModifyGraphWithDelegate	            0.000	    1.186	    1.186	 62.487%	 62.487%	     0.000	        1	ModifyGraphWithDelegate/0
30 	         AllocateTensors	            1.244	    0.712	    0.712	 37.513%	100.000%	     0.000	        1	AllocateTensors/0
31 
32 ============================== Top by Computation Time ==============================
33 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
34 	 ModifyGraphWithDelegate	            0.000	    1.186	    1.186	 62.487%	 62.487%	     0.000	        1	ModifyGraphWithDelegate/0
35 	         AllocateTensors	            1.244	    0.712	    0.712	 37.513%	100.000%	     0.000	        1	AllocateTensors/0
36 
37 Number of nodes executed: 2
38 ============================== Summary by node type ==============================
39 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
40 	 ModifyGraphWithDelegate	        1	     1.186	    62.487%	    62.487%	     0.000	        1
41 	         AllocateTensors	        1	     0.712	    37.513%	   100.000%	     0.000	        1
42 
43 Timings (microseconds): count=1 curr=1898
44 Memory (bytes): count=0
45 2 nodes observed
46 
47 
48 
49 Operator-wise Profiling Info for Regular Benchmark Runs:
50 ============================== Run Order ==============================
51 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
52 	             Vx Delegate	            0.030	10768.251	10769.164	100.000%	100.000%	     0.000	        1	[StatefulPartitionedCall:0]:386
53 
54 ============================== Top by Computation Time ==============================
55 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
56 	             Vx Delegate	            0.030	10768.251	10769.164	100.000%	100.000%	     0.000	        1	[StatefulPartitionedCall:0]:386
57 
58 Number of nodes executed: 1
59 ============================== Summary by node type ==============================
60 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
61 	             Vx Delegate	        1	 10769.163	   100.000%	   100.000%	     0.000	        1
62 
63 Timings (microseconds): count=14 first=10768251 curr=10769103 min=10762057 max=10773515 avg=1.07692e+07 std=3181
64 Memory (bytes): count=0
65 1 nodes observed

Test #3: INT8, no NPU[edit | edit source]

For the third round of testing, the precision was reduced to INT8 while the NPU was left disabled. The average inference time, about 510 ms per frame (roughly 2 FPS), is only marginally better than the FP32 baseline, so quantization alone does not bring a significant improvement on the CPU.
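Purely as an illustrative sketch, the snippet below shows one common way to produce an INT8 TFLite model through post-training quantization with the TensorFlow Lite converter; the SavedModel path, input resolution, and calibration data are assumptions and may differ from the export pipeline actually used to generate model_int8.tflite.

 # Sketch: post-training INT8 quantization with the TFLite converter.
 # Assumptions: a SavedModel export of the detector and 640x640 RGB inputs;
 # random data stands in for real calibration images here.
 import numpy as np
 import tensorflow as tf
 
 def representative_data_gen():
     for _ in range(100):
         yield [np.random.rand(1, 640, 640, 3).astype(np.float32)]
 
 converter = tf.lite.TFLiteConverter.from_saved_model("yolov5s_saved_model")
 converter.optimizations = [tf.lite.Optimize.DEFAULT]
 converter.representative_dataset = representative_data_gen
 # Restrict the converter to full-integer kernels where possible.
 converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
 
 with open("model_int8.tflite", "wb") as f:
     f.write(converter.convert())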

  1 STARTING!
  2 Log parameter values verbosely: [0]
  3 Graph: [model_int8.tflite]
  4 Enable op profiling: [1]
  5 Loaded model model_int8.tflite
  6 The input model file size (MB): 7.34413
  7 Initialized session in 7.394ms.
  8 Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
  9 count=1 curr=519739
 10 
 11 Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
 12 count=50 first=511096 curr=510535 min=508383 max=511096 avg=509862 std=630
 13 
 14 Inference timings in us: Init: 7394, First inference: 519739, Warmup (avg): 519739, Inference (avg): 509862
 15 Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
 16 Memory footprint delta from the start of the tool (MB): init=6.24219 overall=18.1562
 17 Profiling Info for Benchmark Initialization:
 18 ============================== Run Order ==============================
 19 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
 20 	         AllocateTensors	            0.000	    3.495	    3.495	100.000%	100.000%	   264.000	        1	AllocateTensors/0
 21 
 22 ============================== Top by Computation Time ==============================
 23 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
 24 	         AllocateTensors	            0.000	    3.495	    3.495	100.000%	100.000%	   264.000	        1	AllocateTensors/0
 25 
 26 Number of nodes executed: 1
 27 ============================== Summary by node type ==============================
 28 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
 29 	         AllocateTensors	        1	     3.495	   100.000%	   100.000%	   264.000	        1
 30 
 31 Timings (microseconds): count=1 curr=3495
 32 Memory (bytes): count=0
 33 1 nodes observed
 34 
 35 
 36 
 37 Operator-wise Profiling Info for Regular Benchmark Runs:
 38 ============================== Run Order ==============================
 39 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
 40 	                 CONV_2D	            0.017	   33.055	   32.181	  6.315%	  6.315%	     0.000	        1	[model/tf_conv/sequential/conv2d/BiasAdd;model/tf_conv/sequential/conv2d/Conv2D;model/tf_conv/sequential/conv2d/BiasAdd/ReadVariableOp/resource;;model/tf_conv/sequential/tf_pad/Pad]:0
 41 	                LOGISTIC	           32.200	    4.010	    3.933	  0.772%	  7.087%	     0.000	        1	[model/tf_conv/Sigmoid]:1
 42 	                     MUL	           36.135	    2.325	    2.284	  0.448%	  7.535%	     0.000	        1	[model/tf_conv/mul_1]:2
 43 	                     PAD	           38.420	    5.494	    5.582	  1.095%	  8.630%	     0.000	        1	[model/tf_conv_1/sequential_1/tf_pad_1/Pad]:3
 44 	                 CONV_2D	           44.003	   28.280	   28.208	  5.535%	 14.165%	     0.000	        1	[model/tf_conv_1/sequential_1/conv2d_1/BiasAdd;model/tf_conv_1/sequential_1/conv2d_1/Conv2D;]:4
 45 	                LOGISTIC	           72.213	    1.970	    1.946	  0.382%	 14.547%	     0.000	        1	[model/tf_conv_1/Sigmoid]:5
 46 	                     MUL	           74.160	    1.127	    1.149	  0.225%	 14.773%	     0.000	        1	[model/tf_conv_1/mul_1]:6
 47 	                 CONV_2D	           75.310	    4.441	    4.481	  0.879%	 15.652%	     0.000	        1	[model/tfc3/tf_conv_2/conv2d_2/BiasAdd;model/tfc3/tf_conv_2/conv2d_2/Conv2D;]:7
 48 	                LOGISTIC	           79.792	    0.965	    0.966	  0.190%	 15.841%	     0.000	        1	[model/tfc3/tf_conv_2/Sigmoid]:8
 49 	                     MUL	           80.759	    0.636	    0.553	  0.108%	 15.950%	     0.000	        1	[model/tfc3/tf_conv_2/mul_1]:9
 50 	                 CONV_2D	           81.313	    3.045	    3.045	  0.598%	 16.547%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_5/conv2d_5/BiasAdd;model/tfc3/sequential_2/tf_bottleneck/tf_conv_5/conv2d_5/Conv2D;]:10
 51 	                LOGISTIC	           84.359	    0.955	    0.939	  0.184%	 16.732%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_5/Sigmoid]:11
 52 	                     MUL	           85.299	    0.602	    0.555	  0.109%	 16.841%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_5/mul_1]:12
 53 	                 CONV_2D	           85.855	   16.922	   16.824	  3.301%	 20.142%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_6/conv2d_6/BiasAdd;model/tfc3/sequential_2/tf_bottleneck/tf_conv_6/conv2d_6/Conv2D;]:13
 54 	                LOGISTIC	          102.680	    0.988	    0.993	  0.195%	 20.337%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_6/Sigmoid]:14
 55 	                     MUL	          103.674	    0.545	    0.558	  0.110%	 20.446%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_6/mul_1]:15
 56 	                     ADD	          104.233	    0.785	    0.773	  0.152%	 20.598%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/add]:16
 57 	                 CONV_2D	          105.007	    4.537	    4.459	  0.875%	 21.473%	     0.000	        1	[model/tfc3/tf_conv_3/conv2d_3/BiasAdd;model/tfc3/tf_conv_3/conv2d_3/Conv2D;]:17
 58 	                LOGISTIC	          109.467	    0.933	    0.958	  0.188%	 21.661%	     0.000	        1	[model/tfc3/tf_conv_3/Sigmoid]:18
 59 	                     MUL	          110.426	    0.545	    0.555	  0.109%	 21.770%	     0.000	        1	[model/tfc3/tf_conv_3/mul_1]:19
 60 	           CONCATENATION	          110.982	    0.419	    0.387	  0.076%	 21.846%	     0.000	        1	[model/tfc3/concat]:20
 61 	                 CONV_2D	          111.370	    8.681	    8.642	  1.696%	 23.542%	     0.000	        1	[model/tfc3/tf_conv_4/conv2d_4/BiasAdd;model/tfc3/tf_conv_4/conv2d_4/Conv2D;]:21
 62 	                LOGISTIC	          120.013	    1.939	    1.944	  0.381%	 23.923%	     0.000	        1	[model/tfc3/tf_conv_4/Sigmoid]:22
 63 	                     MUL	          121.958	    1.169	    1.140	  0.224%	 24.147%	     0.000	        1	[model/tfc3/tf_conv_4/mul_1]:23
 64 	                     PAD	          123.100	    2.747	    2.735	  0.537%	 24.683%	     0.000	        1	[model/tf_conv_7/sequential_3/tf_pad_2/Pad]:24
 65 	                 CONV_2D	          125.836	   24.168	   24.293	  4.767%	 29.450%	     0.000	        1	[model/tf_conv_7/sequential_3/conv2d_7/BiasAdd;model/tf_conv_7/sequential_3/conv2d_7/Conv2D;]:25
 66 	                LOGISTIC	          150.130	    0.968	    0.973	  0.191%	 29.641%	     0.000	        1	[model/tf_conv_7/Sigmoid]:26
 67 	                     MUL	          151.104	    0.578	    0.562	  0.110%	 29.752%	     0.000	        1	[model/tf_conv_7/mul_1]:27
 68 	                 CONV_2D	          151.668	    3.416	    3.376	  0.662%	 30.414%	     0.000	        1	[model/tfc3_1/tf_conv_8/conv2d_8/BiasAdd;model/tfc3_1/tf_conv_8/conv2d_8/Conv2D;]:28
 69 	                LOGISTIC	          155.044	    0.464	    0.475	  0.093%	 30.507%	     0.000	        1	[model/tfc3_1/tf_conv_8/Sigmoid]:29
 70 	                     MUL	          155.520	    0.276	    0.278	  0.055%	 30.562%	     0.000	        1	[model/tfc3_1/tf_conv_8/mul_1]:30
 71 	                 CONV_2D	          155.799	    1.933	    1.940	  0.381%	 30.943%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_11/conv2d_11/BiasAdd;model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_11/conv2d_11/Conv2D;]:31
 72 	                LOGISTIC	          157.740	    0.460	    0.468	  0.092%	 31.034%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_11/Sigmoid]:32
 73 	                     MUL	          158.208	    0.266	    0.272	  0.053%	 31.088%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_11/mul_1]:33
 74 	                 CONV_2D	          158.481	   12.662	   12.634	  2.479%	 33.567%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_12/conv2d_12/BiasAdd;model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_12/conv2d_12/Conv2D;]:34
 75 	                LOGISTIC	          171.117	    0.495	    0.499	  0.098%	 33.665%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_12/Sigmoid]:35
 76 	                     MUL	          171.616	    0.263	    0.273	  0.054%	 33.718%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_12/mul_1]:36
 77 	                     ADD	          171.890	    0.387	    0.389	  0.076%	 33.795%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/add]:37
 78 	                 CONV_2D	          172.279	    2.094	    2.075	  0.407%	 34.202%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_13/conv2d_13/BiasAdd;model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_13/conv2d_13/Conv2D;]:38
 79 	                LOGISTIC	          174.355	    0.468	    0.478	  0.094%	 34.296%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_13/Sigmoid]:39
 80 	                     MUL	          174.833	    0.277	    0.278	  0.055%	 34.350%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_13/mul_1]:40
 81 	                 CONV_2D	          175.112	   12.470	   12.630	  2.478%	 36.829%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_14/conv2d_14/BiasAdd;model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_14/conv2d_14/Conv2D;]:41
 82 	                LOGISTIC	          187.743	    0.484	    0.497	  0.098%	 36.926%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_14/Sigmoid]:42
 83 	                     MUL	          188.241	    0.272	    0.279	  0.055%	 36.981%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_14/mul_1]:43
 84 	                     ADD	          188.521	    0.390	    0.391	  0.077%	 37.058%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/add]:44
 85 	                 CONV_2D	          188.912	    3.329	    3.348	  0.657%	 37.715%	     0.000	        1	[model/tfc3_1/tf_conv_9/conv2d_9/BiasAdd;model/tfc3_1/tf_conv_9/conv2d_9/Conv2D;]:45
 86 	                LOGISTIC	          192.261	    0.471	    0.472	  0.093%	 37.807%	     0.000	        1	[model/tfc3_1/tf_conv_9/Sigmoid]:46
 87 	                     MUL	          192.733	    0.276	    0.278	  0.055%	 37.862%	     0.000	        1	[model/tfc3_1/tf_conv_9/mul_1]:47
 88 	           CONCATENATION	          193.012	    0.129	    0.129	  0.025%	 37.887%	     0.000	        1	[model/tfc3_1/concat]:48
 89 	                 CONV_2D	          193.142	    6.566	    6.531	  1.282%	 39.169%	     0.000	        1	[model/tfc3_1/tf_conv_10/conv2d_10/BiasAdd;model/tfc3_1/tf_conv_10/conv2d_10/Conv2D;]:49
 90 	                LOGISTIC	          199.674	    1.007	    0.991	  0.194%	 39.363%	     0.000	        1	[model/tfc3_1/tf_conv_10/Sigmoid]:50
 91 	                     MUL	          200.666	    0.581	    0.545	  0.107%	 39.470%	     0.000	        1	[model/tfc3_1/tf_conv_10/mul_1]:51
 92 	                     PAD	          201.212	    1.382	    1.395	  0.274%	 39.744%	     0.000	        1	[model/tf_conv_15/sequential_5/tf_pad_3/Pad]:52
 93 	                 CONV_2D	          202.608	   23.515	   23.534	  4.618%	 44.362%	     0.000	        1	[model/tf_conv_15/sequential_5/conv2d_15/BiasAdd;model/tf_conv_15/sequential_5/conv2d_15/Conv2D;]:53
 94 	                LOGISTIC	          226.144	    0.476	    0.489	  0.096%	 44.458%	     0.000	        1	[model/tf_conv_15/Sigmoid]:54
 95 	                     MUL	          226.634	    0.273	    0.279	  0.055%	 44.513%	     0.000	        1	[model/tf_conv_15/mul_1]:55
 96 	                 CONV_2D	          226.913	    2.948	    2.967	  0.582%	 45.095%	     0.000	        1	[model/tfc3_2/tf_conv_16/conv2d_16/BiasAdd;model/tfc3_2/tf_conv_16/conv2d_16/Conv2D;]:56
 97 	                LOGISTIC	          229.881	    0.234	    0.234	  0.046%	 45.141%	     0.000	        1	[model/tfc3_2/tf_conv_16/Sigmoid]:57
 98 	                     MUL	          230.116	    0.142	    0.140	  0.027%	 45.168%	     0.000	        1	[model/tfc3_2/tf_conv_16/mul_1]:58
 99 	                 CONV_2D	          230.256	    1.513	    1.534	  0.301%	 45.469%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_19/conv2d_19/BiasAdd;model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_19/conv2d_19/Conv2D;]:59
100 	                LOGISTIC	          231.791	    0.230	    0.235	  0.046%	 45.515%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_19/Sigmoid]:60
101 	                     MUL	          232.027	    0.129	    0.132	  0.026%	 45.541%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_19/mul_1]:61
102 	                 CONV_2D	          232.159	   11.862	   11.838	  2.323%	 47.864%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_20/conv2d_20/BiasAdd;model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_20/conv2d_20/Conv2D;]:62
103 	                LOGISTIC	          243.998	    0.238	    0.246	  0.048%	 47.912%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_20/Sigmoid]:63
104 	                     MUL	          244.244	    0.135	    0.139	  0.027%	 47.940%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/tf_conv_20/mul_1]:64
105 	                     ADD	          244.384	    0.195	    0.193	  0.038%	 47.978%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_3/add]:65
106 	                 CONV_2D	          244.578	    1.646	    1.553	  0.305%	 48.282%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_21/conv2d_21/BiasAdd;model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_21/conv2d_21/Conv2D;]:66
107 	                LOGISTIC	          246.132	    0.232	    0.234	  0.046%	 48.328%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_21/Sigmoid]:67
108 	                     MUL	          246.366	    0.137	    0.132	  0.026%	 48.354%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_21/mul_1]:68
109 	                 CONV_2D	          246.499	   11.805	   11.832	  2.322%	 50.676%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_22/conv2d_22/BiasAdd;model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_22/conv2d_22/Conv2D;]:69
110 	                LOGISTIC	          258.332	    0.251	    0.248	  0.049%	 50.725%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_22/Sigmoid]:70
111 	                     MUL	          258.580	    0.135	    0.140	  0.027%	 50.752%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/tf_conv_22/mul_1]:71
112 	                     ADD	          258.721	    0.188	    0.191	  0.037%	 50.790%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_4/add]:72
113 	                 CONV_2D	          258.912	    1.547	    1.588	  0.312%	 51.101%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_23/conv2d_23/BiasAdd;model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_23/conv2d_23/Conv2D;]:73
114 	                LOGISTIC	          260.501	    0.230	    0.233	  0.046%	 51.147%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_23/Sigmoid]:74
115 	                     MUL	          260.734	    0.129	    0.132	  0.026%	 51.173%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_23/mul_1]:75
116 	                 CONV_2D	          260.866	   11.770	   11.810	  2.317%	 53.490%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_24/conv2d_24/BiasAdd;model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_24/conv2d_24/Conv2D;]:76
117 	                LOGISTIC	          272.678	    0.244	    0.246	  0.048%	 53.538%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_24/Sigmoid]:77
118 	                     MUL	          272.924	    0.140	    0.138	  0.027%	 53.565%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/tf_conv_24/mul_1]:78
119 	                     ADD	          273.063	    0.212	    0.191	  0.038%	 53.603%	     0.000	        1	[model/tfc3_2/sequential_6/tf_bottleneck_5/add]:79
120 	                 CONV_2D	          273.254	    2.893	    2.910	  0.571%	 54.174%	     0.000	        1	[model/tfc3_2/tf_conv_17/conv2d_17/BiasAdd;model/tfc3_2/tf_conv_17/conv2d_17/Conv2D;]:80
121 	                LOGISTIC	          276.166	    0.231	    0.239	  0.047%	 54.221%	     0.000	        1	[model/tfc3_2/tf_conv_17/Sigmoid]:81
122 	                     MUL	          276.405	    0.138	    0.137	  0.027%	 54.248%	     0.000	        1	[model/tfc3_2/tf_conv_17/mul_1]:82
123 	           CONCATENATION	          276.542	    0.058	    0.071	  0.014%	 54.262%	     0.000	        1	[model/tfc3_2/concat]:83
124 	                 CONV_2D	          276.614	    5.723	    5.691	  1.117%	 55.379%	     0.000	        1	[model/tfc3_2/tf_conv_18/conv2d_18/BiasAdd;model/tfc3_2/tf_conv_18/conv2d_18/Conv2D;]:84
125 	                LOGISTIC	          282.306	    0.474	    0.478	  0.094%	 55.472%	     0.000	        1	[model/tfc3_2/tf_conv_18/Sigmoid]:85
126 	                     MUL	          282.785	    0.262	    0.270	  0.053%	 55.525%	     0.000	        1	[model/tfc3_2/tf_conv_18/mul_1]:86
127 	                     PAD	          283.056	    0.739	    0.748	  0.147%	 55.672%	     0.000	        1	[model/tf_conv_25/sequential_7/tf_pad_4/Pad]:87
128 	                 CONV_2D	          283.805	   23.778	   23.751	  4.661%	 60.333%	     0.000	        1	[model/tf_conv_25/sequential_7/conv2d_25/BiasAdd;model/tf_conv_25/sequential_7/conv2d_25/Conv2D;]:88
129 	                LOGISTIC	          307.558	    0.250	    0.254	  0.050%	 60.383%	     0.000	        1	[model/tf_conv_25/Sigmoid]:89
130 	                     MUL	          307.812	    0.136	    0.139	  0.027%	 60.410%	     0.000	        1	[model/tf_conv_25/mul_1]:90
131 	                 CONV_2D	          307.952	    2.748	    2.738	  0.537%	 60.947%	     0.000	        1	[model/tfc3_3/tf_conv_26/conv2d_26/BiasAdd;model/tfc3_3/tf_conv_26/conv2d_26/Conv2D;]:91
132 	                LOGISTIC	          310.690	    0.120	    0.121	  0.024%	 60.971%	     0.000	        1	[model/tfc3_3/tf_conv_26/Sigmoid]:92
133 	                     MUL	          310.812	    0.071	    0.072	  0.014%	 60.985%	     0.000	        1	[model/tfc3_3/tf_conv_26/mul_1]:93
134 	                 CONV_2D	          310.884	    1.415	    1.437	  0.282%	 61.267%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_29/conv2d_29/BiasAdd;model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_29/conv2d_29/Conv2D;]:94
135 	                LOGISTIC	          312.322	    0.116	    0.120	  0.024%	 61.291%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_29/Sigmoid]:95
136 	                     MUL	          312.443	    0.066	    0.068	  0.013%	 61.304%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_29/mul_1]:96
137 	                 CONV_2D	          312.511	   11.897	   11.904	  2.336%	 63.640%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_30/conv2d_30/BiasAdd;model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_30/conv2d_30/Conv2D;]:97
138 	                LOGISTIC	          324.416	    0.127	    0.127	  0.025%	 63.665%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_30/Sigmoid]:98
139 	                     MUL	          324.544	    0.068	    0.071	  0.014%	 63.679%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/tf_conv_30/mul_1]:99
140 	                     ADD	          324.615	    0.100	    0.096	  0.019%	 63.698%	     0.000	        1	[model/tfc3_3/sequential_8/tf_bottleneck_6/add]:100
141 	                 CONV_2D	          324.711	    2.793	    2.731	  0.536%	 64.233%	     0.000	        1	[model/tfc3_3/tf_conv_27/conv2d_27/BiasAdd;model/tfc3_3/tf_conv_27/conv2d_27/Conv2D;]:101
142 	                LOGISTIC	          327.443	    0.119	    0.120	  0.024%	 64.257%	     0.000	        1	[model/tfc3_3/tf_conv_27/Sigmoid]:102
143 	                     MUL	          327.564	    0.073	    0.073	  0.014%	 64.271%	     0.000	        1	[model/tfc3_3/tf_conv_27/mul_1]:103
144 	           CONCATENATION	          327.637	    0.024	    0.025	  0.005%	 64.276%	     0.000	        1	[model/tfc3_3/concat]:104
145 	                 CONV_2D	          327.663	    5.565	    5.548	  1.089%	 65.365%	     0.000	        1	[model/tfc3_3/tf_conv_28/conv2d_28/BiasAdd;model/tfc3_3/tf_conv_28/conv2d_28/Conv2D;]:105
146 	                LOGISTIC	          333.212	    0.237	    0.241	  0.047%	 65.412%	     0.000	        1	[model/tfc3_3/tf_conv_28/Sigmoid]:106
147 	                     MUL	          333.453	    0.136	    0.138	  0.027%	 65.439%	     0.000	        1	[model/tfc3_3/tf_conv_28/mul_1]:107
148 	                 CONV_2D	          333.592	    2.707	    2.737	  0.537%	 65.976%	     0.000	        1	[model/tfsppf/tf_conv_31/conv2d_31/BiasAdd;model/tfsppf/tf_conv_31/conv2d_31/Conv2D;]:108
149 	                LOGISTIC	          336.329	    0.117	    0.121	  0.024%	 66.000%	     0.000	        1	[model/tfsppf/tf_conv_31/Sigmoid]:109
150 	                     MUL	          336.451	    0.068	    0.071	  0.014%	 66.014%	     0.000	        1	[model/tfsppf/tf_conv_31/mul_1]:110
151 	             MAX_POOL_2D	          336.522	    0.231	    0.230	  0.045%	 66.059%	     0.000	        1	[model/tfsppf/max_pooling2d/MaxPool]:111
152 	             MAX_POOL_2D	          336.752	    0.248	    0.237	  0.046%	 66.105%	     0.000	        1	[model/tfsppf/max_pooling2d/MaxPool_1]:112
153 	             MAX_POOL_2D	          336.990	    0.306	    0.226	  0.044%	 66.150%	     0.000	        1	[model/tfsppf/max_pooling2d/MaxPool_2]:113
154 	           CONCATENATION	          337.217	    0.037	    0.036	  0.007%	 66.157%	     0.000	        1	[model/tfsppf/concat]:114
155 	                 CONV_2D	          337.254	   10.576	   10.589	  2.078%	 68.235%	     0.000	        1	[model/tfsppf/tf_conv_32/conv2d_32/BiasAdd;model/tfsppf/tf_conv_32/conv2d_32/Conv2D;]:115
156 	                LOGISTIC	          347.845	    0.244	    0.247	  0.048%	 68.283%	     0.000	        1	[model/tfsppf/tf_conv_32/Sigmoid]:116
157 	                     MUL	          348.092	    0.134	    0.140	  0.027%	 68.311%	     0.000	        1	[model/tfsppf/tf_conv_32/mul_1]:117
158 	                 CONV_2D	          348.232	    2.850	    2.759	  0.541%	 68.852%	     0.000	        1	[model/tf_conv_33/conv2d_33/BiasAdd;model/tf_conv_33/conv2d_33/Conv2D;]:118
159 	                LOGISTIC	          350.992	    0.122	    0.120	  0.024%	 68.876%	     0.000	        1	[model/tf_conv_33/Sigmoid]:119
160 	                     MUL	          351.113	    0.068	    0.070	  0.014%	 68.889%	     0.000	        1	[model/tf_conv_33/mul_1]:120
161 	                QUANTIZE	          351.183	    0.047	    0.047	  0.009%	 68.899%	     0.000	        1	[model/tf_conv_33/mul_11]:121
162 	 RESIZE_NEAREST_NEIGHBOR	          351.231	    0.033	    0.030	  0.006%	 68.904%	     0.000	        1	[model/tf_upsample/resize/ResizeNearestNeighbor]:122
163 	           CONCATENATION	          351.261	    0.107	    0.116	  0.023%	 68.927%	     0.000	        1	[model/tf_concat/concat]:123
164 	                 CONV_2D	          351.377	    5.357	    5.354	  1.051%	 69.978%	     0.000	        1	[model/tfc3_4/tf_conv_34/conv2d_34/BiasAdd;model/tfc3_4/tf_conv_34/conv2d_34/Conv2D;]:124
165 	                LOGISTIC	          356.733	    0.238	    0.239	  0.047%	 70.025%	     0.000	        1	[model/tfc3_4/tf_conv_34/Sigmoid]:125
166 	                     MUL	          356.973	    0.163	    0.139	  0.027%	 70.052%	     0.000	        1	[model/tfc3_4/tf_conv_34/mul_1]:126
167 	                 CONV_2D	          357.112	    1.516	    1.528	  0.300%	 70.352%	     0.000	        1	[model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/conv2d_37/BiasAdd;model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/conv2d_37/Conv2D;]:127
168 	                LOGISTIC	          358.641	    0.230	    0.234	  0.046%	 70.398%	     0.000	        1	[model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/Sigmoid]:128
169 	                     MUL	          358.876	    0.135	    0.133	  0.026%	 70.424%	     0.000	        1	[model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_37/mul_1]:129
170 	                 CONV_2D	          359.009	   11.788	   11.802	  2.316%	 72.740%	     0.000	        1	[model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/conv2d_38/BiasAdd;model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/conv2d_38/Conv2D;]:130
171 	                LOGISTIC	          370.812	    0.243	    0.246	  0.048%	 72.788%	     0.000	        1	[model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/Sigmoid]:131
172 	                     MUL	          371.058	    0.138	    0.141	  0.028%	 72.816%	     0.000	        1	[model/tfc3_4/sequential_9/tf_bottleneck_7/tf_conv_38/mul_1]:132
173 	                 CONV_2D	          371.200	    5.374	    5.393	  1.058%	 73.874%	     0.000	        1	[model/tfc3_4/tf_conv_35/conv2d_35/BiasAdd;model/tfc3_4/tf_conv_35/conv2d_35/Conv2D;]:133
174 	                LOGISTIC	          376.595	    0.240	    0.241	  0.047%	 73.921%	     0.000	        1	[model/tfc3_4/tf_conv_35/Sigmoid]:134
175 	                     MUL	          376.836	    0.143	    0.141	  0.028%	 73.949%	     0.000	        1	[model/tfc3_4/tf_conv_35/mul_1]:135
176 	           CONCATENATION	          376.978	    0.092	    0.069	  0.014%	 73.963%	     0.000	        1	[model/tfc3_4/concat]:136
177 	                 CONV_2D	          377.047	    5.632	    5.678	  1.114%	 75.077%	     0.000	        1	[model/tfc3_4/tf_conv_36/conv2d_36/BiasAdd;model/tfc3_4/tf_conv_36/conv2d_36/Conv2D;]:137
178 	                LOGISTIC	          382.726	    0.483	    0.484	  0.095%	 75.172%	     0.000	        1	[model/tfc3_4/tf_conv_36/Sigmoid]:138
179 	                     MUL	          383.211	    0.261	    0.270	  0.053%	 75.225%	     0.000	        1	[model/tfc3_4/tf_conv_36/mul_1]:139
180 	                 CONV_2D	          383.482	    2.844	    2.837	  0.557%	 75.781%	     0.000	        1	[model/tf_conv_39/conv2d_39/BiasAdd;model/tf_conv_39/conv2d_39/Conv2D;]:140
181 	                LOGISTIC	          386.320	    0.236	    0.238	  0.047%	 75.828%	     0.000	        1	[model/tf_conv_39/Sigmoid]:141
182 	                     MUL	          386.559	    0.132	    0.135	  0.027%	 75.855%	     0.000	        1	[model/tf_conv_39/mul_1]:142
183 	 RESIZE_NEAREST_NEIGHBOR	          386.695	    0.107	    0.107	  0.021%	 75.876%	     0.000	        1	[model/tf_upsample_1/resize/ResizeNearestNeighbor]:143
184 	           CONCATENATION	          386.802	    0.480	    0.487	  0.096%	 75.971%	     0.000	        1	[model/tf_concat_1/concat]:144
185 	                 CONV_2D	          387.290	    5.994	    5.988	  1.175%	 77.146%	     0.000	        1	[model/tfc3_5/tf_conv_40/conv2d_40/BiasAdd;model/tfc3_5/tf_conv_40/conv2d_40/Conv2D;]:145
186 	                LOGISTIC	          393.280	    0.477	    0.485	  0.095%	 77.241%	     0.000	        1	[model/tfc3_5/tf_conv_40/Sigmoid]:146
187 	                     MUL	          393.765	    0.272	    0.273	  0.054%	 77.295%	     0.000	        1	[model/tfc3_5/tf_conv_40/mul_1]:147
188 	                 CONV_2D	          394.039	    1.884	    1.947	  0.382%	 77.677%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/conv2d_43/BiasAdd;model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/conv2d_43/Conv2D;]:148
189 	                LOGISTIC	          395.988	    0.460	    0.471	  0.092%	 77.770%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/Sigmoid]:149
190 	                     MUL	          396.459	    0.258	    0.268	  0.053%	 77.822%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_43/mul_1]:150
191 	                 CONV_2D	          396.727	   12.717	   12.610	  2.474%	 80.297%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/BiasAdd;model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/Conv2D;]:151
192 	                LOGISTIC	          409.339	    0.494	    0.497	  0.098%	 80.394%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/Sigmoid]:152
193 	                     MUL	          409.837	    0.271	    0.272	  0.053%	 80.447%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/mul_1]:153
194 	                 CONV_2D	          410.109	    5.952	    5.984	  1.174%	 81.622%	     0.000	        1	[model/tfc3_5/tf_conv_41/conv2d_41/BiasAdd;model/tfc3_5/tf_conv_41/conv2d_41/Conv2D;]:154
195 	                LOGISTIC	          416.095	    0.483	    0.485	  0.095%	 81.717%	     0.000	        1	[model/tfc3_5/tf_conv_41/Sigmoid]:155
196 	                     MUL	          416.580	    0.268	    0.274	  0.054%	 81.771%	     0.000	        1	[model/tfc3_5/tf_conv_41/mul_1]:156
197 	           CONCATENATION	          416.855	    0.146	    0.140	  0.028%	 81.798%	     0.000	        1	[model/tfc3_5/concat]:157
198 	                 CONV_2D	          416.996	    6.663	    6.593	  1.294%	 83.092%	     0.000	        1	[model/tfc3_5/tf_conv_42/conv2d_42/BiasAdd;model/tfc3_5/tf_conv_42/conv2d_42/Conv2D;]:158
199 	                LOGISTIC	          423.591	    1.008	    0.993	  0.195%	 83.287%	     0.000	        1	[model/tfc3_5/tf_conv_42/Sigmoid]:159
200 	                     MUL	          424.585	    0.567	    0.544	  0.107%	 83.394%	     0.000	        1	[model/tfc3_5/tf_conv_42/mul_1]:160
201 	                     PAD	          425.130	    1.381	    1.393	  0.273%	 83.667%	     0.000	        1	[model/tf_conv_45/sequential_11/tf_pad_5/Pad]:161
202 	                 CONV_2D	          426.524	   11.894	   11.920	  2.339%	 86.006%	     0.000	        1	[model/tf_conv_45/sequential_11/conv2d_45/BiasAdd;model/tf_conv_45/sequential_11/conv2d_45/Conv2D;]:162
203 	                LOGISTIC	          438.446	    0.240	    0.250	  0.049%	 86.055%	     0.000	        1	[model/tf_conv_45/Sigmoid]:163
204 	                     MUL	          438.696	    0.141	    0.141	  0.028%	 86.083%	     0.000	        1	[model/tf_conv_45/mul_1]:164
205 	           CONCATENATION	          438.837	    0.118	    0.117	  0.023%	 86.106%	     0.000	        1	[model/tf_concat_2/concat]:165
206 	                 CONV_2D	          438.954	    2.827	    2.846	  0.558%	 86.664%	     0.000	        1	[model/tfc3_6/tf_conv_46/conv2d_46/BiasAdd;model/tfc3_6/tf_conv_46/conv2d_46/Conv2D;]:166
207 	                LOGISTIC	          441.801	    0.233	    0.233	  0.046%	 86.710%	     0.000	        1	[model/tfc3_6/tf_conv_46/Sigmoid]:167
208 	                     MUL	          442.035	    0.146	    0.141	  0.028%	 86.738%	     0.000	        1	[model/tfc3_6/tf_conv_46/mul_1]:168
209 	                 CONV_2D	          442.176	    1.503	    1.524	  0.299%	 87.037%	     0.000	        1	[model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/conv2d_49/BiasAdd;model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/conv2d_49/Conv2D;]:169
210 	                LOGISTIC	          443.701	    0.233	    0.236	  0.046%	 87.083%	     0.000	        1	[model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/Sigmoid]:170
211 	                     MUL	          443.938	    0.131	    0.133	  0.026%	 87.109%	     0.000	        1	[model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_49/mul_1]:171
212 	                 CONV_2D	          444.071	   11.840	   11.815	  2.318%	 89.427%	     0.000	        1	[model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/conv2d_50/BiasAdd;model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/conv2d_50/Conv2D;]:172
213 	                LOGISTIC	          455.888	    0.245	    0.250	  0.049%	 89.476%	     0.000	        1	[model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/Sigmoid]:173
214 	                     MUL	          456.138	    0.137	    0.137	  0.027%	 89.503%	     0.000	        1	[model/tfc3_6/sequential_12/tf_bottleneck_9/tf_conv_50/mul_1]:174
215 	                 CONV_2D	          456.276	    2.948	    2.939	  0.577%	 90.080%	     0.000	        1	[model/tfc3_6/tf_conv_47/conv2d_47/BiasAdd;model/tfc3_6/tf_conv_47/conv2d_47/Conv2D;]:175
216 	                LOGISTIC	          459.216	    0.243	    0.237	  0.046%	 90.127%	     0.000	        1	[model/tfc3_6/tf_conv_47/Sigmoid]:176
217 	                     MUL	          459.453	    0.136	    0.143	  0.028%	 90.155%	     0.000	        1	[model/tfc3_6/tf_conv_47/mul_1]:177
218 	           CONCATENATION	          459.597	    0.066	    0.069	  0.014%	 90.168%	     0.000	        1	[model/tfc3_6/concat]:178
219 	                 CONV_2D	          459.666	    5.645	    5.642	  1.107%	 91.275%	     0.000	        1	[model/tfc3_6/tf_conv_48/conv2d_48/BiasAdd;model/tfc3_6/tf_conv_48/conv2d_48/Conv2D;]:179
220 	                LOGISTIC	          465.309	    0.476	    0.481	  0.094%	 91.370%	     0.000	        1	[model/tfc3_6/tf_conv_48/Sigmoid]:180
221 	                     MUL	          465.790	    0.270	    0.266	  0.052%	 91.422%	     0.000	        1	[model/tfc3_6/tf_conv_48/mul_1]:181
222 	                     PAD	          466.056	    0.742	    0.749	  0.147%	 91.569%	     0.000	        1	[model/tf_conv_51/sequential_13/tf_pad_6/Pad]:182
223 	                 CONV_2D	          466.805	   12.001	   11.951	  2.345%	 93.914%	     0.000	        1	[model/tf_conv_51/sequential_13/conv2d_51/BiasAdd;model/tf_conv_51/sequential_13/conv2d_51/Conv2D;]:183
224 	                LOGISTIC	          478.759	    0.124	    0.128	  0.025%	 93.939%	     0.000	        1	[model/tf_conv_51/Sigmoid]:184
225 	                     MUL	          478.887	    0.070	    0.071	  0.014%	 93.953%	     0.000	        1	[model/tf_conv_51/mul_1]:185
226 	           CONCATENATION	          478.958	    0.027	    0.032	  0.006%	 93.959%	     0.000	        1	[model/tf_concat_3/concat]:186
227 	                 CONV_2D	          478.990	    2.703	    2.701	  0.530%	 94.489%	     0.000	        1	[model/tfc3_7/tf_conv_52/conv2d_52/BiasAdd;model/tfc3_7/tf_conv_52/conv2d_52/Conv2D;]:187
228 	                LOGISTIC	          481.692	    0.123	    0.120	  0.023%	 94.513%	     0.000	        1	[model/tfc3_7/tf_conv_52/Sigmoid]:188
229 	                     MUL	          481.812	    0.069	    0.072	  0.014%	 94.527%	     0.000	        1	[model/tfc3_7/tf_conv_52/mul_1]:189
230 	                 CONV_2D	          481.884	    1.415	    1.419	  0.278%	 94.805%	     0.000	        1	[model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/conv2d_55/BiasAdd;model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/conv2d_55/Conv2D;]:190
231 	                LOGISTIC	          483.304	    0.118	    0.117	  0.023%	 94.828%	     0.000	        1	[model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/Sigmoid]:191
232 	                     MUL	          483.422	    0.067	    0.071	  0.014%	 94.842%	     0.000	        1	[model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_55/mul_1]:192
233 	                 CONV_2D	          483.493	   11.967	   11.914	  2.338%	 97.180%	     0.000	        1	[model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/conv2d_56/BiasAdd;model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/conv2d_56/Conv2D;]:193
234 	                LOGISTIC	          495.408	    0.128	    0.128	  0.025%	 97.205%	     0.000	        1	[model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/Sigmoid]:194
235 	                     MUL	          495.537	    0.071	    0.073	  0.014%	 97.219%	     0.000	        1	[model/tfc3_7/sequential_14/tf_bottleneck_10/tf_conv_56/mul_1]:195
236 	                 CONV_2D	          495.610	    2.767	    2.758	  0.541%	 97.760%	     0.000	        1	[model/tfc3_7/tf_conv_53/conv2d_53/BiasAdd;model/tfc3_7/tf_conv_53/conv2d_53/Conv2D;]:196
237 	                LOGISTIC	          498.369	    0.118	    0.120	  0.024%	 97.784%	     0.000	        1	[model/tfc3_7/tf_conv_53/Sigmoid]:197
238 	                     MUL	          498.490	    0.073	    0.071	  0.014%	 97.798%	     0.000	        1	[model/tfc3_7/tf_conv_53/mul_1]:198
239 	           CONCATENATION	          498.561	    0.025	    0.026	  0.005%	 97.803%	     0.000	        1	[model/tfc3_7/concat]:199
240 	                 CONV_2D	          498.587	    5.486	    5.494	  1.078%	 98.881%	     0.000	        1	[model/tfc3_7/tf_conv_54/conv2d_54/BiasAdd;model/tfc3_7/tf_conv_54/conv2d_54/Conv2D;]:200
241 	                LOGISTIC	          504.083	    0.242	    0.239	  0.047%	 98.928%	     0.000	        1	[model/tfc3_7/tf_conv_54/Sigmoid]:201
242 	                     MUL	          504.322	    0.137	    0.141	  0.028%	 98.956%	     0.000	        1	[model/tfc3_7/tf_conv_54/mul_1]:202
243 	                 CONV_2D	          504.463	    0.258	    0.267	  0.052%	 99.008%	     0.000	        1	[model/tf_detect/tf_conv2d_2/conv2d_59/BiasAdd;model/tf_detect/tf_conv2d_2/conv2d_59/Conv2D;]:203
244 	                 RESHAPE	          504.731	    0.002	    0.002	  0.000%	 99.008%	     0.000	        1	[model/tf_detect/Reshape_4]:204
245 	           STRIDED_SLICE	          504.734	    0.018	    0.019	  0.004%	 99.012%	     0.000	        1	[model/tf_detect/strided_slice_19]:205
246 	                LOGISTIC	          504.753	    0.004	    0.004	  0.001%	 99.013%	     0.000	        1	[model/tf_detect/Sigmoid_6]:206
247 	                     MUL	          504.758	    0.006	    0.006	  0.001%	 99.014%	     0.000	        1	[model/tf_detect/mul_16]:207
248 	                     ADD	          504.764	    0.036	    0.036	  0.007%	 99.021%	     0.000	        1	[model/tf_detect/add_2]:208
249 	                     MUL	          504.801	    0.003	    0.003	  0.001%	 99.022%	     0.000	        1	[model/tf_detect/mul_17]:209
250 	                     MUL	          504.804	    0.021	    0.021	  0.004%	 99.026%	     0.000	        1	[model/tf_detect/truediv_4]:210
251 	           STRIDED_SLICE	          504.827	    0.012	    0.012	  0.002%	 99.028%	     0.000	        1	[model/tf_detect/strided_slice_21]:211
252 	                LOGISTIC	          504.841	    0.004	    0.004	  0.001%	 99.029%	     0.000	        1	[model/tf_detect/Sigmoid_7]:212
253 	                     MUL	          504.846	    0.003	    0.003	  0.001%	 99.030%	     0.000	        1	[model/tf_detect/pow_2;]:213
254 	                     MUL	          504.850	    0.016	    0.016	  0.003%	 99.033%	     0.000	        1	[model/tf_detect/mul_18]:214
255 	                     MUL	          504.868	    0.020	    0.020	  0.004%	 99.037%	     0.000	        1	[model/tf_detect/truediv_5]:215
256 	           STRIDED_SLICE	          504.890	    0.012	    0.012	  0.002%	 99.039%	     0.000	        1	[model/tf_detect/strided_slice_22]:216
257 	                LOGISTIC	          504.904	    0.004	    0.005	  0.001%	 99.040%	     0.000	        1	[model/tf_detect/Sigmoid_8]:217
258 	                QUANTIZE	          504.909	    0.003	    0.004	  0.001%	 99.041%	     0.000	        1	[model/tf_detect/Sigmoid_81]:218
259 	           CONCATENATION	          504.913	    0.033	    0.033	  0.007%	 99.047%	     0.000	        1	[model/tf_detect/concat_2]:219
260 	                 RESHAPE	          504.947	    0.001	    0.001	  0.000%	 99.048%	     0.000	        1	[model/tf_detect/Reshape_5]:220
261 	                 CONV_2D	          504.949	    0.581	    0.545	  0.107%	 99.155%	     0.000	        1	[model/tf_detect/tf_conv2d_1/conv2d_58/BiasAdd;model/tf_detect/tf_conv2d_1/conv2d_58/Conv2D;]:221
262 	                 RESHAPE	          505.494	    0.005	    0.003	  0.001%	 99.155%	     0.000	        1	[model/tf_detect/Reshape_2]:222
263 	           STRIDED_SLICE	          505.497	    0.043	    0.042	  0.008%	 99.163%	     0.000	        1	[model/tf_detect/strided_slice_11]:223
264 	                LOGISTIC	          505.540	    0.013	    0.012	  0.002%	 99.166%	     0.000	        1	[model/tf_detect/Sigmoid_3]:224
265 	                     MUL	          505.554	    0.010	    0.008	  0.002%	 99.168%	     0.000	        1	[model/tf_detect/mul_9]:225
266 	                     ADD	          505.563	    0.119	    0.120	  0.024%	 99.191%	     0.000	        1	[model/tf_detect/add_1]:226
267 	                     MUL	          505.684	    0.008	    0.008	  0.002%	 99.193%	     0.000	        1	[model/tf_detect/mul_10]:227
268 	                     MUL	          505.693	    0.076	    0.075	  0.015%	 99.207%	     0.000	        1	[model/tf_detect/truediv_2]:228
269 	           STRIDED_SLICE	          505.770	    0.043	    0.041	  0.008%	 99.215%	     0.000	        1	[model/tf_detect/strided_slice_13]:229
270 	                LOGISTIC	          505.813	    0.012	    0.012	  0.002%	 99.218%	     0.000	        1	[model/tf_detect/Sigmoid_4]:230
271 	                     MUL	          505.825	    0.008	    0.009	  0.002%	 99.219%	     0.000	        1	[model/tf_detect/pow_1;]:231
272 	                     MUL	          505.835	    0.059	    0.059	  0.012%	 99.231%	     0.000	        1	[model/tf_detect/mul_11]:232
273 	                     MUL	          505.896	    0.075	    0.075	  0.015%	 99.246%	     0.000	        1	[model/tf_detect/truediv_3]:233
274 	           STRIDED_SLICE	          505.973	    0.040	    0.040	  0.008%	 99.254%	     0.000	        1	[model/tf_detect/strided_slice_14]:234
275 	                LOGISTIC	          506.016	    0.012	    0.013	  0.003%	 99.256%	     0.000	        1	[model/tf_detect/Sigmoid_5]:235
276 	                QUANTIZE	          506.030	    0.008	    0.007	  0.001%	 99.258%	     0.000	        1	[model/tf_detect/Sigmoid_51]:236
277 	           CONCATENATION	          506.038	    0.097	    0.095	  0.019%	 99.276%	     0.000	        1	[model/tf_detect/concat_1]:237
278 	                 RESHAPE	          506.133	    0.003	    0.002	  0.000%	 99.277%	     0.000	        1	[model/tf_detect/Reshape_3]:238
279 	                 CONV_2D	          506.136	    1.194	    1.226	  0.241%	 99.517%	     0.000	        1	[model/tf_detect/tf_conv2d/conv2d_57/BiasAdd;model/tf_detect/tf_conv2d/conv2d_57/Conv2D;]:239
280 	                 RESHAPE	          507.363	    0.008	    0.010	  0.002%	 99.519%	     0.000	        1	[model/tf_detect/Reshape]:240
281 	           STRIDED_SLICE	          507.373	    0.171	    0.176	  0.035%	 99.554%	     0.000	        1	[model/tf_detect/strided_slice_3]:241
282 	                LOGISTIC	          507.549	    0.047	    0.048	  0.009%	 99.563%	     0.000	        1	[model/tf_detect/Sigmoid]:242
283 	                     MUL	          507.599	    0.025	    0.027	  0.005%	 99.569%	     0.000	        1	[model/tf_detect/mul_2]:243
284 	                     ADD	          507.627	    0.466	    0.482	  0.095%	 99.663%	     0.000	        1	[model/tf_detect/add]:244
285 	                     MUL	          508.109	    0.024	    0.025	  0.005%	 99.668%	     0.000	        1	[model/tf_detect/mul_3]:245
286 	                     MUL	          508.136	    0.299	    0.302	  0.059%	 99.728%	     0.000	        1	[model/tf_detect/truediv]:246
287 	           STRIDED_SLICE	          508.439	    0.170	    0.162	  0.032%	 99.759%	     0.000	        1	[model/tf_detect/strided_slice_5]:247
288 	                LOGISTIC	          508.601	    0.044	    0.044	  0.009%	 99.768%	     0.000	        1	[model/tf_detect/Sigmoid_1]:248
289 	                     MUL	          508.647	    0.028	    0.026	  0.005%	 99.773%	     0.000	        1	[model/tf_detect/pow;]:249
290 	                     MUL	          508.675	    0.244	    0.245	  0.048%	 99.821%	     0.000	        1	[model/tf_detect/mul_4]:250
291 	                     MUL	          508.923	    0.316	    0.298	  0.058%	 99.880%	     0.000	        1	[model/tf_detect/truediv_1]:251
292 	           STRIDED_SLICE	          509.221	    0.168	    0.157	  0.031%	 99.910%	     0.000	        1	[model/tf_detect/strided_slice_6]:252
293 	                LOGISTIC	          509.380	    0.044	    0.045	  0.009%	 99.919%	     0.000	        1	[model/tf_detect/Sigmoid_2]:253
294 	                QUANTIZE	          509.427	    0.020	    0.019	  0.004%	 99.923%	     0.000	        1	[model/tf_detect/Sigmoid_21]:254
295 	           CONCATENATION	          509.446	    0.381	    0.377	  0.074%	 99.997%	     0.000	        1	[model/tf_detect/concat]:255
296 	                 RESHAPE	          509.823	    0.006	    0.005	  0.001%	 99.998%	     0.000	        1	[model/tf_detect/Reshape_1]:256
297 	           CONCATENATION	          509.829	    0.012	    0.010	  0.002%	100.000%	     0.000	        1	[PartitionedCall:0]:257
298 
299 ============================== Top by Computation Time ==============================
300 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
301 	                 CONV_2D	            0.017	   33.055	   32.181	  6.315%	  6.315%	     0.000	        1	[model/tf_conv/sequential/conv2d/BiasAdd;model/tf_conv/sequential/conv2d/Conv2D;model/tf_conv/sequential/conv2d/BiasAdd/ReadVariableOp/resource;;model/tf_conv/sequential/tf_pad/Pad]:0
302 	                 CONV_2D	           44.003	   28.280	   28.208	  5.535%	 11.850%	     0.000	        1	[model/tf_conv_1/sequential_1/conv2d_1/BiasAdd;model/tf_conv_1/sequential_1/conv2d_1/Conv2D;]:4
303 	                 CONV_2D	          125.836	   24.168	   24.293	  4.767%	 16.617%	     0.000	        1	[model/tf_conv_7/sequential_3/conv2d_7/BiasAdd;model/tf_conv_7/sequential_3/conv2d_7/Conv2D;]:25
304 	                 CONV_2D	          283.805	   23.778	   23.751	  4.661%	 21.278%	     0.000	        1	[model/tf_conv_25/sequential_7/conv2d_25/BiasAdd;model/tf_conv_25/sequential_7/conv2d_25/Conv2D;]:88
305 	                 CONV_2D	          202.608	   23.515	   23.534	  4.618%	 25.896%	     0.000	        1	[model/tf_conv_15/sequential_5/conv2d_15/BiasAdd;model/tf_conv_15/sequential_5/conv2d_15/Conv2D;]:53
306 	                 CONV_2D	           85.855	   16.922	   16.824	  3.301%	 29.197%	     0.000	        1	[model/tfc3/sequential_2/tf_bottleneck/tf_conv_6/conv2d_6/BiasAdd;model/tfc3/sequential_2/tf_bottleneck/tf_conv_6/conv2d_6/Conv2D;]:13
307 	                 CONV_2D	          158.481	   12.662	   12.634	  2.479%	 31.676%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_12/conv2d_12/BiasAdd;model/tfc3_1/sequential_4/tf_bottleneck_1/tf_conv_12/conv2d_12/Conv2D;]:34
308 	                 CONV_2D	          175.112	   12.470	   12.630	  2.478%	 34.155%	     0.000	        1	[model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_14/conv2d_14/BiasAdd;model/tfc3_1/sequential_4/tf_bottleneck_2/tf_conv_14/conv2d_14/Conv2D;]:41
309 	                 CONV_2D	          396.727	   12.717	   12.610	  2.474%	 36.629%	     0.000	        1	[model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/BiasAdd;model/tfc3_5/sequential_10/tf_bottleneck_8/tf_conv_44/conv2d_44/Conv2D;]:151
310 	                 CONV_2D	          466.805	   12.001	   11.951	  2.345%	 38.974%	     0.000	        1	[model/tf_conv_51/sequential_13/conv2d_51/BiasAdd;model/tf_conv_51/sequential_13/conv2d_51/Conv2D;]:183
311 
312 Number of nodes executed: 258
313 ============================== Summary by node type ==============================
314 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
315 	                 CONV_2D	       60	   444.767	    87.296%	    87.296%	     0.000	       60
316 	                LOGISTIC	       66	    28.190	     5.533%	    92.829%	     0.000	       66
317 	                     MUL	       75	    17.285	     3.393%	    96.221%	     0.000	       75
318 	                     PAD	        6	    12.599	     2.473%	    98.694%	     0.000	        6
319 	                     ADD	       10	     2.859	     0.561%	    99.255%	     0.000	       10
320 	           CONCATENATION	       17	     2.214	     0.435%	    99.690%	     0.000	       17
321 	             MAX_POOL_2D	        3	     0.692	     0.136%	    99.826%	     0.000	        3
322 	           STRIDED_SLICE	        9	     0.657	     0.129%	    99.954%	     0.000	        9
323 	 RESIZE_NEAREST_NEIGHBOR	        2	     0.135	     0.026%	    99.981%	     0.000	        2
324 	                QUANTIZE	        4	     0.075	     0.015%	    99.996%	     0.000	        4
325 	                 RESHAPE	        6	     0.022	     0.004%	   100.000%	     0.000	        6
326 
327 Timings (microseconds): count=50 first=510862 curr=510289 min=508139 max=510862 avg=509611 std=631
328 Memory (bytes): count=0
329 258 nodes observed
ML-TN-009-htop-model-int8-cpu.png
Test #4: INT8, NPU[edit | edit source]

The last round of testing refers to the most interesting configuration: INT8 precision with the NPU enabled. As expected, this set-up delivered a significant performance boost: the average inference time dropped to 32.4 ms, corresponding to about 31 FPS, more than an order of magnitude higher throughput than the first configuration. Regarding the CPU load, no significant differences were observed compared to the previous case.
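For reference, the following minimal sketch shows how an application could load the INT8 model and offload it to the NPU through the VX delegate from Python. File names and variable names are illustrative assumptions; this is not the code used for the measurements, which were taken with the stock TFLite benchmark tool whose output is reported below.

# Minimal sketch: running the INT8 TFLite model through the VX delegate (NPU).
# Paths and names are assumptions for illustration only.
import numpy as np
import tflite_runtime.interpreter as tflite

# The VX delegate offloads the whole graph to the NPU, as in the benchmark log below.
delegate = tflite.load_delegate("/usr/lib/libvx_delegate.so")
interpreter = tflite.Interpreter(model_path="model_int8.tflite",
                                 experimental_delegates=[delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Dummy input tensor with the shape and dtype declared by the model.
frame = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()  # the very first invocation includes the long NPU warm-up
detections = interpreter.get_tensor(out["index"])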

 1 STARTING!
 2 Log parameter values verbosely: [0]
 3 Graph: [model_int8.tflite]
 4 Enable op profiling: [1]
 5 External delegate path: [/usr/lib/libvx_delegate.so]
 6 Loaded model model_int8.tflite
 7 Vx delegate: allowed_cache_mode set to 0.
 8 Vx delegate: device num set to 0.
 9 Vx delegate: allowed_builtin_code set to 0.
10 Vx delegate: error_during_init set to 0.
11 Vx delegate: error_during_prepare set to 0.
12 Vx delegate: error_during_invoke set to 0.
13 EXTERNAL delegate created.
14 Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
15 The input model file size (MB): 7.34413
16 Initialized session in 17.692ms.
17 Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
18 count=1 curr=32272774
19 
20 Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
21 count=50 first=32567 curr=32498 min=31887 max=32679 avg=32434.9 std=225
22 
23 Inference timings in us: Init: 17692, First inference: 32272774, Warmup (avg): 3.22728e+07, Inference (avg): 32434.9
24 Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
25 Memory footprint delta from the start of the tool (MB): init=9.5625 overall=95.2344
26 Profiling Info for Benchmark Initialization:
27 ============================== Run Order ==============================
28 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
29 	 ModifyGraphWithDelegate	            0.000	    0.915	    0.915	 26.437%	 26.437%	     0.000	        1	ModifyGraphWithDelegate/0
30 	         AllocateTensors	            0.972	    2.546	    2.546	 73.563%	100.000%	     0.000	        1	AllocateTensors/0
31 
32 ============================== Top by Computation Time ==============================
33 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
34 	         AllocateTensors	            0.972	    2.546	    2.546	 73.563%	 73.563%	     0.000	        1	AllocateTensors/0
35 	 ModifyGraphWithDelegate	            0.000	    0.915	    0.915	 26.437%	100.000%	     0.000	        1	ModifyGraphWithDelegate/0
36 
37 Number of nodes executed: 2
38 ============================== Summary by node type ==============================
39 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
40 	         AllocateTensors	        1	     2.546	    73.563%	    73.563%	     0.000	        1
41 	 ModifyGraphWithDelegate	        1	     0.915	    26.437%	   100.000%	     0.000	        1
42 
43 Timings (microseconds): count=1 curr=3461
44 Memory (bytes): count=0
45 2 nodes observed
46 
47 
48 
49 Operator-wise Profiling Info for Regular Benchmark Runs:
50 ============================== Run Order ==============================
51 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
52 	             Vx Delegate	            0.017	   32.506	   32.386	100.000%	100.000%	     0.000	        1	[PartitionedCall:0]:258
53 
54 ============================== Top by Computation Time ==============================
55 	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
56 	             Vx Delegate	            0.017	   32.506	   32.386	100.000%	100.000%	     0.000	        1	[PartitionedCall:0]:258
57 
58 Number of nodes executed: 1
59 ============================== Summary by node type ==============================
60 	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
61 	             Vx Delegate	        1	    32.385	   100.000%	   100.000%	     0.000	        1
62 
63 Timings (microseconds): count=50 first=32506 curr=32445 min=31846 max=32628 avg=32385.9 std=223
64 Memory (bytes): count=0
65 1 nodes observed
ML-TN-009-htop-model-int8-npu.png
Recap[edit | edit source]

The results achieved are summarized in the following table.

Linux distribution                    NPU enabled   Model precision   Avg. inference time [ms]   Avg. throughput, inference only [FPS]   Power consumption [W]   FPS/W
Yocto Kirkstone (DESK-MX8M-L 4.2.1)   no            FP16              545                        1.83                                    1.95                    0.94
Yocto Kirkstone (DESK-MX8M-L 4.2.1)   yes           FP16              10769                      0.09                                    1.98                    0.05
Yocto Kirkstone (DESK-MX8M-L 4.2.1)   no            INT8              510                        1.96                                    1.85                    1.1
Yocto Kirkstone (DESK-MX8M-L 4.2.1)   yes           INT8              32.4                       30.9                                    1.65                    18.7

To determine the achievable frame rate of a real product, additional times must be considered as well, e.g. acquisition time, image preprocessing, and so on. However, it is reasonable to assume that these contributions are an order of magnitude smaller than the inference time. Therefore, the last round of testing allows us to claim that the ORCA SOM is a serious candidate for the design of a real endoscope, as the expected frame rate exceeds the conventional minimum threshold (20 FPS) with a good safety margin.
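As a back-of-the-envelope check of this claim (the per-frame overhead figure below is an assumed value, not a measurement):

# Rough end-to-end frame-rate estimate for the INT8/NPU configuration.
# The acquisition + preprocessing overhead is an assumed figure, not measured data.
inference_ms = 32.4          # measured average inference time (INT8, NPU)
overhead_ms = 3.0            # assumed per-frame acquisition and preprocessing cost

fps = 1000.0 / (inference_ms + overhead_ms)
print(f"Estimated end-to-end throughput: {fps:.1f} FPS")  # ~28 FPS, still above the 20 FPS threshold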

Conclusions[edit | edit source]

This section discusses the experimental results by answering the research questions outlined in the introduction, providing insights into the effectiveness and limitations of the proposed approach.

1. How can Federated Learning be integrated with a deep-learning YOLOv5 model for polyp detection while preserving patient data privacy?

To answer this question, the YOLOv5 architecture was successfully integrated into the Federated Learning (FL) environment using the NVFlare framework. The implementation was carefully designed to ensure not only high detection accuracy but also strict compliance with data privacy regulations.

Object detectors like YOLOv5 are more complex than traditional deep learning models because they involve additional components, such as anchor regression and modules that shape the network's architecture. Therefore, the YOLOv5 implementation was preceded by a sequence of preliminary steps to ensure its applicability to federated training. To facilitate model deployment across devices, a checkpoint file containing the model weights in PyTorch format was prepared in advance for each architecture; after a warm-up training phase on a small polyp dataset to fine-tune the anchor settings, the resulting Torch model was instantiated on the server and distributed to the clients. This structure, aligned with NVFlare's Persistor component, enabled model distribution to the two clients and facilitated the subsequent weight aggregation phase, ensuring consistency in model updates while keeping the federated deployment efficient.

To preserve patient data privacy, local training was conducted on both nodes, and only the model weight updates were communicated to the central server for aggregation. This decentralized approach kept the polyp medical data within the local nodes, thereby enhancing data security. Moreover, an attempt was made to further strengthen data security by enabling homomorphic encryption (HE): instead of transmitting the full model weights, the PolypModelLearner was developed to share only the weight differences between the locally trained and global models for aggregation, in line with the requirements of the HE algorithm. However, due to limited resources and time constraints, a full-scale test of this approach could not be conducted, and its effectiveness remains to be fully evaluated.
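Conceptually, the weight-difference exchange can be sketched as follows; names and signatures are illustrative and do not correspond to the actual PolypModelLearner implementation.

# Conceptual sketch of the "share only weight differences" idea used for the
# HE-compatible aggregation. Names are illustrative, not the actual learner code.
import torch

def compute_weight_deltas(local_model: torch.nn.Module, global_weights: dict) -> dict:
    """Return local - global for every floating-point tensor in the state dict."""
    deltas = {}
    local_weights = local_model.state_dict()
    for name, global_tensor in global_weights.items():
        local_tensor = local_weights[name]
        if torch.is_floating_point(local_tensor):
            deltas[name] = (local_tensor - global_tensor).cpu()
        else:
            # Non-trainable buffers (e.g. BatchNorm counters) are forwarded unchanged.
            deltas[name] = local_tensor.cpu()
    return deltas

# On the server side, the aggregated deltas would be added back to the global weights
# before the next round is dispatched to the clients.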

2. How does the Federated Learning approach impact model generalization and robustness across different nodes?

A comparative analysis of the centralized and federated learning approaches showed that FL provided better model generalization than centralized training on the polyp KVASIR-SEG Dataset 1, achieving performance metrics close to those of the model trained on the larger, multi-source Dataset 2. The assessment was carried out through NVFlare's Cross-Site Evaluation workflow, which enabled global model evaluation across test datasets such as Etis-Larib and PolypGEN. Consistent performance was further demonstrated by mAP scores above 0.7 after several training epochs and federated rounds. Furthermore, the hybrid inference-federated system showed that iteratively retraining on difficult cases over time improved the local models' overall knowledge of polyp characteristics, further enhancing detection accuracy. However, a trade-off in convergence speed was noted: FL required more training rounds than centralized training to reach optimal performance, yet it preserved data privacy and enhanced model robustness across different sites.

3. What is the trade-off between model accuracy and computational efficiency during YOLOv5 model deployment in real-time polyp detection for an IoT-edge device system?

Deploying YOLOv5 on an edge-based IoT system required several trade-off decisions, specifically regarding computational limitations, inference speed, and optimization techniques. Because of the limited processing capability of the SBC ORCA embedded board, training and detection parameters, such as batch size, image size, and learning rate, had to be reduced to achieve reasonable training times. In the federated context, training trials conducted on the Center 1 training set (more than 300 images) revealed severe performance bottlenecks, with training taking more than 2 hours. This also precluded the integration of homomorphic encryption (HE), since the communication latency for transferring the encrypted weights from the devices to the server was unacceptable. It is worth remembering, however, that no hardware acceleration was leveraged for these computations. In terms of inference speed, additional work is needed to improve real-time detection. The experiments showed that smaller YOLOv5 models, with fewer than 10 million parameters, are suitable for IoT use cases but still need further optimization for deployment in a federated setup, which requires more computational resources. YOLOv5n achieved faster inference at the expense of lower accuracy, whereas YOLOv5s achieved better detection performance at an increased computational cost.

Overall, the results demonstrated that training PyTorch-based models on small CPUs is quite challenging. To improve training efficiency, a few changes were tried out, such as freezing the backbone during training to save computational effort (a minimal example is sketched after this list); still, in the majority of cases, training with the backbone unfrozen worked better. These findings demonstrate that edge-level deployment of FL requires

  • Careful tuning of model settings
  • Leveraging optimization techniques
  • Exploring hardware-acceleration modules, if available

to achieve an optimal balance between efficiency and accuracy.
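As an illustration of the backbone-freezing optimization mentioned above, the sketch below freezes the first N layers of a YOLOv5-style PyTorch model; the "model.{i}." parameter-name prefix is an assumption about the checkpoint layout, not a detail confirmed by this work.

# Sketch of backbone freezing for a YOLOv5-style PyTorch model: parameters belonging
# to the first N layers are excluded from gradient computation and optimizer updates.
# The "model.{i}." naming convention is assumed for illustration.
import torch

def freeze_backbone(model: torch.nn.Module, num_layers: int = 10) -> None:
    frozen_prefixes = tuple(f"model.{i}." for i in range(num_layers))
    for name, param in model.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False  # no gradients, hence no weight updates

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters after freezing: {trainable:,}")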

4. What are the practical challenges in implementing the FL-based AI models in real-world medical environments and how could they be addressed?

This work aimed to address a problem widely discussed in the literature: the feasibility and efficacy of using ML models in a federated environment for medical settings. The real-world application of FL in this area surfaced numerous issues that showcased both the potential and the constraints of such an approach.

One of the primary limitations when adopting IoT devices in this context concerns hardware capabilities. The embedded device used in this work demonstrated that both processing power and memory are critical factors when training models that require large datasets. The hardware constraints limited the selection of model parameters and significantly influenced the feasibility of real-time training and inference, showing the need for optimization techniques, such as model pruning or conversion into a lightweight format, to maintain efficiency. These issues highlight the need for dedicated hardware accelerators, such as GPUs or NPUs, to support on-device training in edge-based Federated Learning systems. Security was also a central concern, as medical AI use cases demand robust defenses against data leakage and adversarial attacks. Homomorphic encryption (HE) emerged as a potential solution to safeguard model updates and protect sensitive data from inference attacks; however, the attempted implementation introduced significant communication overhead, resulting in substantial latency.
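For illustration only, the snippet below shows one way such a pruning step could look with PyTorch's built-in utilities; it was not part of the experiments described here, and the sparsity level is an arbitrary example value.

# Illustrative (not used in this work): magnitude-based pruning of convolutional
# weights as one option for reducing model footprint on edge devices.
import torch
import torch.nn.utils.prune as prune

def prune_conv_layers(model: torch.nn.Module, amount: float = 0.3) -> None:
    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # make the zeroed weights permanent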

Moreover, another critical factor for FL performance concerns the distribution of data among the various clinical sites. Variations in data characteristics, such as different lighting conditions, imaging equipment, and patient demographics, can affect model convergence and generalizability. To compensate for these hurdles and guarantee stable model performance across different clinical scenarios, other federated learning optimization algorithms could be explored that leverage adaptive weighting mechanisms and improve performance on heterogeneous (non-IID) data distributions.
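As a reference point, the simplest form of such adaptive weighting is sample-size-weighted FedAvg, sketched below with illustrative placeholder inputs (it is not the aggregator configured in this work).

# Sketch of sample-size-weighted FedAvg aggregation: each client's contribution is
# proportional to the number of training samples it holds. Inputs are placeholders.
import torch

def weighted_fedavg(client_weights, client_samples):
    """Average client state_dicts, weighting each client by its sample count."""
    total = sum(client_samples)
    aggregated = {}
    for name in client_weights[0]:
        # Integer buffers would need separate handling; floats are enough for a sketch.
        aggregated[name] = sum(
            w[name].float() * (n / total)
            for w, n in zip(client_weights, client_samples)
        )
    return aggregated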

5. Is the selected embedded platform suitable, in terms of hardware resources, for implementing an actual product?

On the basis of the results illustrated in the previous sections, from a computational standpoint the ORCA SOM proved to be reasonably suitable for implementing actual IoT AI-powered endoscopes operating within a Federated Learning system. Thanks to its built-in security features, such as Secure/Encrypted Boot, the ORCA SOM, working in tandem with a properly designed software stack, is also capable of satisfying the security standards required by the challenging environment of this project.


In conclusion, this work explored the use of Federated Learning (FL) in the medical imaging domain, specifically for the detection of colorectal polyps using the YOLOv5 architecture. The study focused on a decentralized way of training with the intention of improving model performance while being more respectful of patient data privacy. After significant experimentation with centralized and federated training configurations, some key insights emerged. The results of centralized training indicated that YOLOv5 is well suited for polyp detection and that the YOLOv5s and YOLOv5n variants offered the best trade-off between accuracy and computational/inference cost. Furthermore, the performance metrics, especially the mAP, indicated that training on an enriched dataset led to significant improvements in detection accuracy compared to a single-dataset approach, thus demonstrating model generalizability.

Possible improvements[edit | edit source]

Building on the findings of this work, several future research directions can be explored to enhance Federated Learning (FL) in healthcare when adopting AI models and to deploy such solutions in real industrial, production-ready environments. Exploring these directions could significantly improve the feasibility, security, and reliability of FL-based medical systems, making them more viable for practical clinical implementation.

  • In the context of object detectors, one of the focal points for optimizing model efficiency is to investigate lightweight versions of YOLO, which have the potential to boost performance on embedded systems without compromising accuracy. Additionally, applying quantization and pruning techniques would accelerate inference and boost the frame rate, making the model more suitable for real-time applications (a minimal quantization example is sketched after this list).
  • In the federated environment, involving a larger number of sites could provide interesting insights into performance variations with more data, as this research was limited to just two sites. Another important direction is the application of the state-of-the-art privacy-preserving methods mentioned in the article. While this research explored Homomorphic Encryption (HE), future work should focus on applying differential privacy mechanisms to protect data while maintaining model accuracy and reducing computational costs.
  • Ideally, to improve model training on non-IID (non-independent and identically distributed) data, other federated learning techniques that have demonstrated their potential in the literature, such as SCAFFOLD and FedOpt, could be explored. Personalized federated learning strategies, in which models are tailored to the specific settings of hospitals and their different patient populations, could also enhance detection rates and enable wider clinical application, an element that was investigated in this study only to a certain extent.
  • The combination of FL with cloud-edge hybrid systems presents a promising research direction. Leveraging cloud computing for large-scale model aggregation and edge devices for on-device inference could balance computational efficiency and data privacy. This hybrid scheme would be especially beneficial in healthcare applications where inference must be performed in real time while maintaining regulatory compliance.
  • Last but not least, the IoT application described in this article could benefit vastly from migrating to the ToloMEO ecosystem. Thanks to ToloMEO's ready-to-use features, for instance, the fleet of endoscopes could be managed remotely, and the inference results could not only be collected in the cloud for further analysis but also notarized on an eIDAS-compliant blockchain, and so on.
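As an example of the quantization step mentioned in the first bullet above, the following sketch shows full-integer post-training quantization with the TensorFlow Lite converter; the SavedModel path and the calibration data are placeholders, not the actual export pipeline used in this work.

# Hedged sketch of full-integer post-training quantization with the TFLite converter.
# The SavedModel path and the random calibration data are placeholders.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # In a real pipeline, yield a few hundred preprocessed frames from the training set.
    for _ in range(100):
        yield [np.random.rand(1, 320, 320, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("yolov5_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())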