ML-TN-009 — AI at the edge: IoT real-time endoscopes and Federated Learning

Applies to: Machine Learning


History[edit | edit source]

  Version | Date       | Notes
  0.1.0   | March 2025 | First public draft

Abstract[edit | edit source]

This article summarizes the work carried out by Niccolò Brusadin during his internship at DAVE Embedded Systems. He prototyped a Federated Learning system whose edge devices are machines emulating IoT smart endoscopes for automatic early detection of gastrointestinal tract polyps.

Early detection of colorectal polyps is crucial for colorectal cancer prevention. However, current endoscopic techniques have limitations in detection accuracy and efficiency: most studies have shown that a significant number of polyps are missed during routine examination procedures. Deep Learning-based computer-aided detection systems have been developed to assist endoscopists with real-time polyp detection, but training these models on medical data raises concerns about privacy, data security, and regulatory compliance.

This thesis explores the integration of Federated Learning (FL), a machine learning paradigm that allows multiple institutions to collaboratively train a distributed AI model without sharing sensitive patient data. The work develops a Federated Learning system for automatic GI polyp detection built on an IoT platform. In this context, the role of edge devices will be played by a small fleet of prototypes of a "smart" endoscope able to process video streams in real time through a YOLOv5 deep learning model. The training infrastructure will be created using the NVFlare framework, an open-source FL platform for privacy-preserving AI applications. This approach ensures data privacy while improving the accuracy and reliability of polyp detection.

Experimental assessments of the centralized approach demonstrated the efficacy and high accuracy of the YOLOv5 model for polyp detection, achieving strong performance in terms of mAP. Evaluations of the federated training scenarios indicated performance comparable to the centralized learning environment while maintaining data privacy. Moreover, the applicability of the hybrid inference-plus-federated-training system was demonstrated, enabling real-time polyp detection. Limitations were identified, including hardware capabilities and communication latency. This suggests future optimizations in the growing area of privacy-preserving AI for medical imaging and creates opportunities for using federated AI in practice.

The work is based on the achievements of two previous internships detailed here and here.

If you are interested in reading Brusadin's entire thesis, please send a request to this address.

Introduction[edit | edit source]

Colorectal cancer (CRC) is a prominent malignant tumour in the digestive system, primarily affecting the colon or rectum, and it significantly contributes to global cancer mortality, accounting for around 10% of all cases. Early detection is crucial, as survival rates can rise from approximately 63% to 91% when diagnosed early. Colonoscopy is the gold standard for identifying CRC, allowing detailed examination of polyps. However, the endoscopist’s expertise greatly influences its sensitivity, and polyps can be easily missed due to various factors, highlighting the need for advanced digital solutions in detection.

Recent research has turned toward AI-assisted screening, employing Deep Learning techniques for object detection and semantic segmentation in polyp diagnosis. YOLO models, known for their efficacy in medical image analysis, offer real-time polyp detection capabilities, balancing speed and performance. Nevertheless, training AI models in healthcare raises significant data privacy and security concerns due to the centralized nature of data collection.

Federated Learning (FL) addresses these issues by enabling devices to collaboratively train a shared AI model while keeping patient data localized, thus maintaining privacy. This decentralized approach not only protects sensitive information but also enhances model generalization across varied clinical contexts. Various FL frameworks have been developed for healthcare applications, with NVFlare offering high-efficiency simulation tools for research and robust production capabilities for enterprise users.

Purpose of the thesis and research questions[edit | edit source]

This work aims to develop a Federated Learning system for automatic detection of gastrointestinal polyps, by integrating deep-learning inference tasks with privacy-preserving training. This research addresses the challenges of AI-assisted colonoscopy, with a particular focus on diagnostic accuracy, computational efficiency, and protection of patient data.

The proposed system will employ an IoT-based smart endoscope prototyped on an embedded device running a Debian-derived GNU/Linux distribution and integrated with a YOLOv5 deep learning model for real-time polyp detection. The FL training infrastructure will be implemented using NVFlare, with two separate nodes jointly training the global model without sharing raw medical data. This approach aims to enhance model generalization across various clinical environments while ensuring compliance with data privacy regulations. The key points of this research will be as follows:

  1. Development of a deep-learning model for polyp detection: a YOLOv5 deep learning model will be selected to serve both as the training model in the federated learning framework and as the inference model performing detection tasks on smart endoscopic devices.
  2. Integration of the Nvidia NVFlare Federated Learning paradigm: the developed model will be implemented in the NVFlare framework as the starting point for the federated learning scenario.
  3. Investigation of the impact of decentralized training: FL training and centralized training will be compared to determine whether a federated learning environment improves model robustness across different nodes.
  4. Implementation of FL polyp detection in actual practice: the purpose is to investigate the applicability of the NVFlare framework in real-world environments and evaluate its ability to maintain model accuracy, reduce reliance on centralized datasets, and ensure compliance with healthcare privacy regulations.
  5. Design of a system that integrates federated training and detection: the goal is to design a server-edge system where devices initially perform inference locally, then select only poorly detected images to be used for federated training. This iterative process focuses on enhancing model accuracy by training the model only on challenging cases and avoiding redundant training on well-classified images.
  6. Model optimization for the designed workflow: finally, the aim is to ensure that the model runs on resource-constrained edge devices without compromising detection accuracy.

Based on these objectives, the research aims to answer the following questions:

  1. How can Federated Learning be integrated with a deep-learning YOLOv5 model for polyp detection while preserving patient data privacy?
  2. How does the Federated Learning approach impact model generalization and robustness across different nodes?
  3. What is the trade-off between accuracy and computational efficiency during YOLOv5 model deployment in real-time polyp detection for an IoT-edge device system?
  4. What are the practical challenges in implementing the FL-based AI models in real-world medical environments, and how could they be addressed?
  5. Is the selected embedded platform suitable, in terms of hardware resources, for implementing an actual product?

This work attempts to find answers to the questions above while demonstrating the applicability of the federated framework in polyp detection.

Hardware/software test-bed[edit | edit source]

The following picture illustrates the test-bed used for this work.

ML-TN-009-testbed.png

It includes a host machine acting as both the FL server and one client (client 2), while the other client (client 1) runs on an embedded device, specifically DAVE Embedded Systems' ORCA Single Board Computer, powered by the NXP i.MX8M Plus SoC. This system-on-chip, in turn, integrates a Neural Processing Unit (NPU). The NPU hardware-accelerates ML workloads during the execution of inference algorithms that make use of the most common Deep Neural Network (DNN) architectures. Basically, an NPU is a dedicated processor optimized for executing the mathematical computations required by DNNs. Typical advantages of leveraging an NPU are:

  • Off-loading the CPU so that it can perform other processing
  • Reducing the inference time of DNN-based algorithms, thereby increasing the throughput of processed samples (in this case, expressed in frames per second)
  • Improving power efficiency.

Regarding the host machine, to ensure a consistent and reproducible environment, a Docker-based containerized architecture was deployed. The Docker image is built on top of a recommended PyTorch image, with the necessary dependencies added.

For what concerns client 1, to keep its configuration simple, no containers were used. Instead, a Python virtual environment was created to reproduce the containerized environment of client 2. To facilitate these steps, a Debian-derived distribution called [https://www.armbian.com/ Armbian] was installed. This distribution allows pre-built packages to be installed easily with apt and the other well-known Debian tools. Armbian is very convenient for development and testing tasks, but it is not highly optimized, as it generally does not provide the software modules required to exploit proprietary hardware accelerators such as the i.MX8MP's NPU. As it was more important to verify the functionality of the proposed solution than to optimize its performance, it was deemed appropriate to prioritize speed of implementation and testing at the expense of performance in executing the inference algorithms. This is taken into account in this section, where the achieved results are discussed.

Specifications of both machines are detailed in the following table.

  Spec / Machine       | Host (server + client 2) | Client 1          | Notes
  Architecture         | AMD64                    | AARCH64           |
  Processor / SoC      | AMD Ryzen 9 5950X        | NXP i.MX8M Plus   |
  Hardware accelerator | NVIDIA RTX 3080 Ti GPU   | 2.3 TOPS NPU (1)  | (1) Not exploited.
  RAM [GB]             | 64                       | 6                 |

As explained here, NVFlare had been chosen as the FL framework because of its suitability for developing FL applications on embedded systems and for addressing the privacy regulations inherent in medical data processing.

NVFlare[edit | edit source]

NVFlare adopts a modular design focused on essential collaboration components, including a Controller that manages communication between the FL server and clients. The main workflow involves parameter initialization by the FL server, task delegation to clients, local model training, submission of model updates back to the server, and aggregation of the updates using algorithms like FedAvg.

Among the available open-source frameworks, Nvidia NVFlare has been selected for developing the Federated Learning environment in this project. A recent study conducted at DAVE Embedded Systems supports this choice, emphasizing the applicability of NVFlare for developing FL applications on embedded systems [7]. The findings of the study confirm that NVFlare facilitates the creation of real-world Federated Learning environments using Linux-powered embedded platforms, thus showing its adaptability. Furthermore, in the context of medical data and real-world clinical scenarios, this software development kit (SDK) addresses the requirements of both real-time processing and privacy regulations. With its suite of powerful tools designed for enhanced collaboration and efficiency, NVFlare emerges as the most suitable and reliable framework for building a Federated Learning system aimed at polyp detection using embedded devices.

Overview of NVFlare[edit | edit source]

NVFlare adopts a "less is more" philosophy in its construction and is designed around an Application Programming Interface (API) approach that emphasizes essential functionality while maintaining flexibility and reduced complexity. This design allows developers to easily customize and build Federated Learning workflows. The framework's architecture includes various components, such as Controllers, Task Executors, and Filters, which streamline the process of client coordination and task execution in federated environments.

The Federated Learning Server plays a central role in NVFlare, managing client communications, assigning tasks, aggregating model updates, and overseeing the overall workflow. It interacts with a Job component that specifies the federated learning tasks, while the FL Client represents the distributed nodes executing these tasks. Each Client includes an Executor responsible for locally processing the training tasks assigned by the Controller.

Furthermore, NVFlare ensures secure and efficient deployment of FL applications through an end-to-end operational environment. It provides security credentials and secure communication capabilities essential for real-world applications. Researchers can carry out FL studies and simulations using either admin commands through Notebooks or the NVFlare Console, an interactive command tool, facilitating streamlined operations.

Architecture and Workflow of NVFlare[edit | edit source]

The NVFlare workflow aligns with common FL algorithms, such as FedAvg, consisting of several key operational steps:

  1. The FL Server initiates a job with parameters, including a global model to be distributed to clients.
  2. The Controller assigns training tasks to clients, requesting model updates based on local data.
  3. Each Client's Executor processes the assigned tasks by training the model locally.
  4. Once training concludes, Executors upload model updates to the FL Server.
  5. The Controller collects and aggregates these updates using the designated federated learning algorithm to refine the global model (a minimal sketch of this aggregation step follows the list).
  6. The updated global model is redistributed to clients, continuing this iterative process until the desired model accuracy is achieved.
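
As a concrete illustration of step 5 above, the following minimal sketch shows how a FedAvg-style weighted average of client updates can be computed. It is a simplified stand-in for NVFlare's built-in aggregators, not the framework's actual implementation, and all names in it are illustrative.

  import numpy as np

  def fedavg(client_updates):
      """Weighted average of client weight updates (FedAvg).
      client_updates: list of (weights, n_samples) tuples, where
      weights maps layer names to numpy arrays."""
      total = sum(n for _, n in client_updates)
      return {
          name: sum(w[name] * (n / total) for w, n in client_updates)
          for name in client_updates[0][0]
      }

  # Example: two clients with differently sized local datasets.
  c1 = ({"conv1.weight": np.ones((3, 3))}, 800)
  c2 = ({"conv1.weight": np.zeros((3, 3))}, 200)
  global_weights = fedavg([c1, c2])  # conv1.weight -> 0.8 everywhere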

In addition, optional filters can be integrated into task interactions to enhance data privacy through techniques such as differential privacy or homomorphic encryption, which do not hinder the training process.

Structured communication between the Controller and Executor is organized using Shareable Objects, which contain information transmitted between the client and server, and Data Exchange Objects (DXOs) that specify the content of these communications. An important element of NVFlare is the FLComponent class, which serves as the foundation for various components within the system, offering built-in mechanisms for auditing, event handling, logging, and error handling, facilitating organized FL activity.
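
To illustrate how these classes fit together, here is a minimal sketch of a custom Executor that unpacks an incoming Shareable into a DXO, runs a local training step, and returns the updated weights. It assumes NVFlare 2.x, and local_train() is a hypothetical placeholder rather than real training code.

  from nvflare.apis.dxo import DXO, DataKind, from_shareable
  from nvflare.apis.executor import Executor
  from nvflare.apis.fl_constant import ReturnCode
  from nvflare.apis.fl_context import FLContext
  from nvflare.apis.shareable import Shareable, make_reply
  from nvflare.apis.signal import Signal

  class SimpleTrainer(Executor):
      def execute(self, task_name: str, shareable: Shareable,
                  fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
          if task_name != "train":
              return make_reply(ReturnCode.TASK_UNKNOWN)
          # Unpack the global model sent by the server.
          dxo = from_shareable(shareable)
          weights = dxo.data  # dict: layer name -> array
          # local_train() is a hypothetical stand-in for a real
          # YOLOv5 training pass over the client's local data.
          updated = local_train(weights)
          # Pack the updated weights into a DXO and send them back.
          return DXO(data_kind=DataKind.WEIGHTS, data=updated).to_shareable()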

NVFlare Simulator[edit | edit source]

The NVFlare Simulator is a crucial tool that allows developers and data scientists to expedite the development of FL Components and learning workflows. It enables local testing and debugging of applications on a single machine without needing a realistic project setup, as all clients and servers are simulated within the same process. The Simulator manages client instances and executes multiple federated learning rounds in a controlled environment, allowing components developed here to transition seamlessly into real-world federated scenarios.
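
For instance, a simulation can be launched programmatically through the SimulatorRunner API. The sketch below assumes NVFlare 2.x; the job folder path is illustrative.

  from nvflare.private.fed.app.simulator.simulator_runner import SimulatorRunner

  # Run a two-client, two-thread simulation of the job defined in
  # ./jobs/yolov5-fedavg (illustrative path); outputs go to the workspace.
  runner = SimulatorRunner(
      job_folder="./jobs/yolov5-fedavg",
      workspace="/tmp/nvflare_sim",
      n_clients=2,
      threads=2,
  )
  runner.run()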

Real-world deployment and provisioning tools[edit | edit source]

To facilitate the implementation of FL systems in real-world contexts, NVFlare includes a comprehensive provisioning system that secures communications and simplifies deployment processes. The Provisioning tool generates security credentials and configurations for all participants in an FL study. Each participant receives a Startup Kit containing essential configuration files, certificates, and local authorization policies, ensuring a consistent and secure setup across different locations.

In practical applications, NVFlare employs client-server communication channels secured by signed certificates for identity verification and SSL to establish secure connections between clients and servers. The framework utilizes its own Certificate Authority (CA) to generate and sign certificates for each participant, thus ensuring unique identities. The gRPC protocol facilitates efficient and secure communication, verifying credentials via generated tokens before allowing clients to join the training process, thereby reinforcing security and preventing unauthorized access.

In summary, Nvidia NVFlare empowers developers to build highly adaptable and secure Federated Learning environments, especially suited for applications like medical data processing. Its architecture, tools, and focus on privacy and security make it a frontrunner in the Federated Learning framework realm.

Model development[edit | edit source]

Selection[edit | edit source]

The YOLO (You Only Look Once) family stands out for its remarkable efficiency and accuracy among the various object detection algorithms, and it has become a popular choice for real-time object detection. Our study implements YOLOv5, developed by Ultralytics, a decision supported by prior research from DAVE Embedded Systems that highlighted its performance in detecting polyps. The name YOLO reflects its unique approach: it examines the entire image at once to identify objects and their locations, unlike traditional methods that use a two-stage detection process. In the YOLO framework, object detection is treated as a regression problem: a single convolutional neural network predicts bounding boxes and class probabilities for the entire image.
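
As a reference point, the snippet below shows the standard PyTorch Hub way of loading a pre-trained YOLOv5 model and running single-image inference; the image path is illustrative, and the thesis model was further trained on polyp datasets.

  import torch

  # Load the small YOLOv5 variant with pre-trained weights via PyTorch Hub.
  model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

  # Run inference on a single image (the file name is illustrative).
  results = model("endoscopy_frame.jpg")
  results.print()          # per-image detections with confidence scores
  boxes = results.xyxy[0]  # tensor rows: xmin, ymin, xmax, ymax, conf, class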

YOLOv5 models[edit | edit source]

The YOLOv5 architecture consists of five distinct models, ranging from the computationally efficient YOLOv5n to the high-precision YOLOv5x. Each version is tailored for different deployment scenarios and varies in speed, size, and accuracy.

  • YOLOv5n (Nano): designed for resource-constrained environments, it is the smallest and fastest model in the series. With a compact size of less than 2.5 MB in INT8 format and approximately 4 MB in FP16 format, it is ideal for deployment on edge devices and IoT platforms.
  • YOLOv5s (Small): YOLOv5s consists of approximately 7.2 million parameters. Its balance between efficiency and accuracy makes it suitable for CPU-based inference tasks as well as IoT platforms.
  • YOLOv5m (Medium): this mid-sized model contains 21.2 million parameters, offering a trade-off between speed and accuracy. YOLOv5m is often considered a versatile option for a broad range of object detection applications and datasets.
  • YOLOv5l (Large): with 46.5 million parameters, YOLOv5l is designed for scenarios that require higher precision, particularly in detecting smaller objects within images.
  • YOLOv5x (Extra Large): YOLOv5x boasts 86.7 million parameters, achieving the highest mean Average Precision (mAP) among its counterparts. However, this increased performance comes at the cost of higher computational requirements.

Performance metrics[edit | edit source]

The performance metrics discussed in this thesis are those used by the YOLOv5 model, which is employed in both centralized and federated training. Since federated training is implemented using the NVFlare framework, the evaluation metrics remain the same for both approaches, as YOLOv5 is an integral part of the NVFlare environment. To evaluate the performance of object detection models effectively, several metrics are employed, each providing insight into a different aspect of the model's accuracy and reliability. Below are the metrics used for evaluating YOLOv5 models, focusing on AP, mAP, and confidence scores; these are essential for assessing the effectiveness of object detection models in identifying and localizing objects within images.

  • Intersection over Union (IoU): quantifies the overlap between a predicted bounding box and a ground-truth bounding box. It plays a crucial role in evaluating the accuracy of object localization (a minimal implementation is sketched after this list).
  • Precision (P): quantifies the percentage of true positives among all positive predictions, evaluating the model's ability to avoid false positives.
  • Recall (R): quantifies the proportion of actual positive instances correctly identified by the object detector, evaluating its ability to avoid false negatives.
  • F1-score: the harmonic mean of precision and recall, providing a balanced evaluation of a model's performance by considering both false positives and false negatives.
  • Average Precision (AP): calculated from the Precision-Recall (PR) curve. Basically, it is the area under the PR curve (AUC), providing a single value that encapsulates the model's precision and recall performance.
  • Mean Average Precision (mAP): extends the concept of AP by averaging the AP values across multiple object classes. This is the primary metric used to evaluate federated learning (FL) training performance and to compare results between the centralized and federated scenarios.
  • Confidence score: for inference tasks, the confidence score is the metric to consider. It is part of the model's predicted output and reflects both how confident the model is that the box contains an object and how accurate it believes the predicted box to be. If no object exists in that cell, the confidence score should be zero; otherwise, it should equal the IoU between the predicted box and the ground-truth box.
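
To make the IoU definition concrete, a minimal implementation for axis-aligned boxes could look as follows; this is a generic sketch, not YOLOv5's internal code.

  def iou(box_a, box_b):
      """IoU of two axis-aligned boxes in [xmin, ymin, xmax, ymax] format."""
      ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
      ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
      inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
      area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
      area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
      return inter / (area_a + area_b - inter)

  # Example: two partially overlapping boxes.
  print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143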

Datasets[edit | edit source]

High-quality data that accurately reflects the size, shape, texture, and variability of polyps is essential for achieving accurate and robust detection of polyps during colonoscopy. To train and test machine learning models, several public datasets — including Kvasir-SEG, PolypDB, PolypGEN, Etis-LaribPolypDB, and CVC-ColonDB — were used to design algorithms for polyp detection and segmentation. These datasets provide a representative sample of images and videos captured during colonoscopy procedures in clinical practice. The datasets used in this research were selected following a thorough search of public resources, including an online benchmark table. In essence, the method involved a phased approach utilizing different datasets for pre-training, architecture evaluation, and performance testing of the YOLOv5 model in a federated learning context.

To enhance model generalization and reduce overfitting, image augmentation was also applied during training. YOLOv5 incorporates various augmentation techniques, including:

  • Mosaic augmentation: an image processing technique that combines four training images into one to encourage object detection models to better handle various object scales and translations.
  • Random affine transformations: include random rotation, scaling, translation, and cropping of images.
  • HSV augmentation: random modifications to the hue, saturation, and value of images (a sketch of this technique follows the list).
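
As an example of the last technique, the sketch below applies random HSV gains to an image with OpenCV. It is modeled on, but not identical to, YOLOv5's internal augment_hsv routine, and the default gains are illustrative.

  import cv2
  import numpy as np

  def augment_hsv(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
      """Randomly shift hue, saturation, and value of a BGR image."""
      gains = 1 + np.random.uniform(-1, 1, 3) * (h_gain, s_gain, v_gain)
      hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
      hsv[..., 0] = (hsv[..., 0] * gains[0]) % 180  # hue wraps around
      hsv[..., 1] = np.clip(hsv[..., 1] * gains[1], 0, 255)
      hsv[..., 2] = np.clip(hsv[..., 2] * gains[2], 0, 255)
      return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)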

Central training[edit | edit source]

Before the central training of the initial model, a pre-training phase was executed to fine-tune the network's head for use in both central and federated training. After performing data preprocessing, a YOLOv5 model was trained to reproduce the results obtained in this prior project.
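
For illustration, with the Ultralytics YOLOv5 repository cloned and on the Python path, such a head fine-tuning run could be launched as sketched below; the dataset YAML, freeze depth, and hyperparameters are illustrative assumptions, not the exact values used in the thesis.

  import train  # train.py from the Ultralytics YOLOv5 repository

  # Fine-tune only the head: freeze the first 10 layers (the backbone)
  # and train on a polyp dataset described by a dataset YAML file.
  train.run(
      data="polyps.yaml",    # illustrative dataset description
      weights="yolov5s.pt",  # pre-trained starting weights
      imgsz=640,
      epochs=50,             # illustrative hyperparameters
      batch_size=16,
      freeze=[10],
  )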

Example of a polyp image annotation with bounding boxes in VOC format to be converted into YOLO format.
EndoCV_C2_0197.jpg with predicted bounding boxes. Image from: PolypGEN dataset, Center 2.
Associated class and coordinates of the bounding boxes.
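
The VOC-to-YOLO conversion mentioned in the caption above boils down to turning absolute corner coordinates into normalized center/size values. A minimal sketch:

  def voc_to_yolo(box, img_w, img_h):
      """Convert a VOC box (xmin, ymin, xmax, ymax, in pixels) to YOLO
      format (x_center, y_center, width, height, normalized to [0, 1])."""
      xmin, ymin, xmax, ymax = box
      return ((xmin + xmax) / 2 / img_w,
              (ymin + ymax) / 2 / img_h,
              (xmax - xmin) / img_w,
              (ymax - ymin) / img_h)

  # Example: a 100x200 px box in a 1000x1000 image.
  print(voc_to_yolo((50, 100, 150, 300), 1000, 1000))  # (0.1, 0.2, 0.1, 0.2)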

In doing so, comparisons with the previous work and the literature were drawn to ensure model robustness. Moreover, an effort was undertaken to guarantee model reproducibility by enriching the original dataset with additional publicly available datasets. The newly trained model was evaluated on a dedicated test dataset of endoscopic images to demonstrate its improvement. A comparative analysis among various YOLOv5 versions was conducted to further refine the development process and to determine the most suitable model for the Federated Learning implementation.

Project result directory of YOLOv5 containing all training outputs and visualizations of performance metrics.

Integration into the Federated Learning workflow[edit | edit source]

The integration of the model into a Federated Learning workflow was facilitated using the NVFlare open-source framework and the pre-trained YOLOv5 weights. A simulation was carried out to test NVFlare tools and evaluate the performance of the model trained across two distinct clients. This was performed using NVFlare FL Simulator, leveraging the original dataset and the widely adopted FedAvg algorithm, which is well-documented in existing literature and included in NVFlare’s repository examples. Moreover, advanced techniques such as Secure Aggregation and Homomorphic Encryption were rigorously tested to enhance the security and integrity of the training process.

Federated training[edit | edit source]

The NVFlare Provisioning tool was then employed to execute federated training, paying attention to the computational constraints inherent in the embedded device. To simplify the process and make it reproducible, a software framework was devised for smart endoscopes, integrating both inference capabilities and federated training. The starting point of this framework was the YOLOv5 model trained on the enriched dataset. The workflow began with the distribution of the global YOLO model to individual clients, followed by the conversion of the model into the streamlined TFLite format to optimize the usage of hardware resources. Inference was then executed, and images that met predetermined quality criteria were segregated and excluded from the next federated training round. The concept was to train only on images where the model still needs to learn, while discarding those on which it performed well, as these would not bring new information to the next training round (a minimal sketch of this selection logic follows).
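
The selection logic could look as follows; the threshold and the run_inference() helper are illustrative assumptions, not the exact criteria used in the thesis.

  CONF_THRESHOLD = 0.6  # illustrative quality criterion

  def select_for_training(frames, run_inference):
      """Keep only frames the model handles poorly.
      run_inference(frame) is a hypothetical helper returning a list of
      detections, each exposing a .confidence attribute."""
      selected = []
      for frame in frames:
          detections = run_inference(frame)
          best = max((d.confidence for d in detections), default=0.0)
          if best < CONF_THRESHOLD:
              selected.append(frame)  # model is unsure: keep for FL training
      return selected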

Model exportation in .tflite format on the embedded board.
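
For reference, the conversion and the subsequent on-device loading could be sketched as follows, using the export.py entry point of the Ultralytics YOLOv5 repository and the lightweight tflite-runtime interpreter; file names and parameters are illustrative assumptions.

  import export  # export.py from the Ultralytics YOLOv5 repository

  # Convert the trained weights to TFLite (file names are illustrative).
  export.run(weights="best.pt", include=("tflite",), imgsz=(640, 640))

  # On the embedded board, load the converted model with tflite-runtime.
  from tflite_runtime.interpreter import Interpreter

  interpreter = Interpreter(model_path="best-fp16.tflite")
  interpreter.allocate_tensors()
  print(interpreter.get_input_details()[0]["shape"])  # expected input shape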