__FORCETOC__
== History ==
{| class="wikitable" border="1"
!Version
|-
|1.0.0
|August October 2023
|First public release
|-
|}
== Introduction ==
According to Wikipedia, [https://en.wikipedia.org/wiki/Federated_learning Federated Learning] (FL) is defined as ''a machine learning technique that trains an algorithm via '''multiple independent sessions, each using its own dataset'''. This approach stands in contrast to traditional centralized machine learning techniques where local datasets are merged into one training session, as well as to approaches that assume that local data samples are identically distributed.''
''Federated learning enables multiple actors to build a common, robust machine learning model '''without sharing data, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data'''. Its applications engage industries including defense, telecommunications, Internet of Things, and pharmaceuticals. A major open question is when/whether federated learning is preferable to pooled data learning. Another open question concerns the trustworthiness of the devices and the impact of malicious actors on the learned model.''
In principle, FL can be an extremely useful technique to address critical issues of industrial IoT (IIoT) applications. As such, it matches perfectly [https://tolomeo.io ToloMEO], DAVE Embedded Systems' IIoT platform. This Technical Note (TN) illustrates how DAVE Embedded Systems explored, tested, and characterized some of the most promising open-source FL frameworks available to date. One of these frameworks might equip ToloMEO-compliant products in the future, allowing our customers to implement federated learning systems easily. From the point of view of machine learning, therefore, we investigated whether typical embedded architectures used today for industrial applications are suited for acting not only as inference platforms — we already dealt with this issue [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 1|here]] — but as training platforms as well.
In brief, the work consisted of the following steps:
* Selecting the most promising FL frameworks according to well-defined criteria.
* Testing the selected frameworks and comparing the results to isolate the best one.
* Deep investigation of the best framework.
A detailed dissertation of the work that led to this Technical Note is available [https://cloud.dave.eu/public/dc19e3 here TBD ] [[#LDLThesis|[1]]].
== Choosing Federated learning frameworks ==
=== Criteria and initial, long list ===
For selecting the frameworks, several factors were taken into account:
* '''ML frameworks flexibility''': The adaptability of the framework to manage different ML frameworks.
* '''Licensing''': It is mandatory that the framework has an open-source, permissive license to cope with the typical requirements of real-world use cases.
* '''Repository rating and releases''': A good rating in a repository is important for a FL framework, as it indicates a high level of community interest and support, potentially leading to more contributions and improvements. Meanwhile, the first and latest releases indicate respectively the maturity and the support of the framework, and whether it is released or still in a beta version.
* '''Documentation and tutorials''': The provided documentation with related tutorials has to be complete and well-made.
* '''Readiness for commercial usage''': The readiness of the framework to be deployed in a real-world scenario. In order to establish the readiness, the version of the framework and its license were checked.
According to the previous criteria, an initial list including the most promising FL frameworks was compiled. It comprised the following products:
* [https://github.com/NVIDIA/NVFlare NVIDIA FL Application Runtime Environment] (NVFlare)
* Federated AI Technology Enabler (FATE)
* Flower
* PySyft
* IBM Federated Learning (IBMFL)
* Open Federated Learning (OpenFL)
* TensorFlow Federated (TFF)
* FedML
In the next sections, the aforementioned factors will be treated individually, justifying the reasons why some frameworks were discarded in favor of others.
==== ML frameworks flexibility ====
Flexibility in ML frameworks is crucial when choosing a FL framework, as it allows adapting the system to diverse use cases and data distributions. A flexible framework can support various ML algorithms, models, and data types, accommodating the specific requirements of different scenarios. This adaptability enhances the framework’s applicability in real-world deployments, ensuring it can effectively handle the heterogeneity and dynamic nature of distributed data sources across clients. It is noteworthy to mention that most of the frameworks listed earlier, including NVFlare, FATE, Flower, PySyft, IBMFL, OpenFL, and FedML, are designed to be agnostic to the underlying ML framework used. This agnosticism provides users with the flexibility to employ various ML libraries, such as TensorFlow, PyTorch, and scikit-learn, among others, based on their preferences and project requirements. However, it is important to highlight that one exception to this trend is TFF. Unlike the other frameworks, TFF is specifically tailored for the TensorFlow ecosystem. While it harnesses the powerful capabilities of TensorFlow, it inherently limits the utilization of other ML libraries in the FL context. As a result, users opting for TFF should be aware of this framework’s dependency on TensorFlow for their FL endeavors. For that reason, TFF was discarded from the potential frameworks to be considered for comparison.
==== Licensing ====
The choice of a suitable license is of paramount importance for any FL framework. A well-crafted license provides a legal foundation that governs the usage, distribution, and modification of the framework’s source code and associated components. A permissive license, like the MIT License or Apache License, allows users to use, modify, and distribute the framework with relatively few restrictions. This encourages widespread adoption, fosters innovation, and facilitates contributions from a broader community of developers and researchers. The permissiveness of these licenses empowers users to incorporate the framework into their projects, even if they have proprietary components. On the other hand, copyleft licenses, like the GNU GPL, require derived works to be distributed under the same terms, ensuring that any modifications or extensions to the framework remain open-source. While this may be more restrictive, it encourages a collaborative ecosystem where improvements are shared back with the community. A clear and well-defined license also provides legal protection to both developers and users, helping to mitigate potential legal risks and disputes. It ensures that contributors have granted appropriate rights to their work and helps maintain a healthy and sustainable development environment. Most of the frameworks previously described are under the Apache-2.0 license except one: IBMFL. In fact, it is under an unspecified license that makes the framework not suitable for commercial use. For that reason, IBMFL was discarded from the comparison too.
==== Repository rating and releases ====
Ratings in public repositories, such as "stars" on GitHub, are important because they serve as a measure of popularity and community interest in the project. When a repository achieves a good rating, it indicates that more developers and users find the project valuable and relevant. This can lead to several benefits:
* '''Visibility''': Repositories with good ratings are likely to appear higher in the platform's search results, making it easier for others to discover and use the project.
* '''Credibility''': High-rating repositories are often perceived as more trustworthy and reliable, as they are vetted and endorsed by a larger user base.
* '''Contributions''': Popular repositories tend to attract more contributions from developers, leading to a more active and vibrant community around the project.
* '''Feedback''': Projects with good ratings are more likely to receive feedback, bug reports, and feature requests, helping the developers improve the software.
* '''Maintenance''': Higher ratings can also stimulate the maintainers to keep the project updated and actively supported.
Other important, rating-related aspects are the first and latest releases. Thanks to these, it is possible to see the maturity of the framework and also how often it is updated, and thus the support behind it. Obviously, a framework that was born earlier than others is much more likely to have better ratings. Having this in mind, at the time of this writing, the ranking in terms of received stars, correlated with the first release of each framework, is as follows:
* '''PySyft''': 8.9k stars / Jan 19, 2020
* '''FATE''': 5.1k stars / Feb 18, 2019
* '''FedML''': 3.1k stars / Apr 30, 2022
* '''Flower''': 2.8k stars / Nov 11, 2020
* '''TFF''': 2.1k stars / Feb 20, 2019
* '''OpenFL''': 567 stars / Feb 1, 2021
* '''IBMFL''': 438 stars / Aug 28, 2020
* '''NVFlare''': 413 stars / Nov 23, 2021
These characteristics, although they certainly have a bearing on the choice of frameworks, were not enough to go so far as to discard any of the selected frameworks.
==== Documentation and tutorials ====
High-quality documentation and well-crafted tutorials are essential considerations when selecting a FL framework. There are several reasons for this, presented here below:
* '''Accessibility and Ease of Use''': Comprehensive documentation allows users to understand the framework’s functionalities, APIs, and usage quickly. It enables developers, researchers, and practitioners to get started with the framework efficiently, reducing the learning curve.
* '''Accelerated Development''': Well-structured tutorials and examples demonstrate how to use the framework to build practical FL systems. They provide step-by-step guidance on setting up experiments, running code, and interpreting results. This expedites the development process and encourages experimentation with different configurations.
* '''Error Prevention''': Clear documentation and good examples help users avoid common mistakes and errors during implementation. It provides troubleshooting tips and addresses frequently asked questions, reducing frustration and increasing user satisfaction.
* '''Reliability and Robustness''': A well-documented framework indicates that developers have invested time in organizing their code and explaining its functionalities. This attention to detail suggests a more reliable and stable framework.
Regarding these aspects, there are several frameworks that still don’t have good documentation and tutorials. Among the latter are PySyft, OpenFL, and FedML. PySyft is still under construction, as the official repository says, and for that reason the documentation is often not up to date and not complete. OpenFL, on its side, has very meager documentation and only a few tutorials that don’t explore a lot of ML frameworks or a lot of scenarios. The FedML framework also has, like PySyft, incomplete documentation because the project was born very recently and is still under development. Finally, the FATE framework has complete and well-made documentation but very few tutorials and, because of its complex architecture, evaluating it would have taken too much time. Because of these reasons, these four frameworks were discarded from the comparison.
==== Readiness for commercial usage ====
In the context of FL, the significance of a framework being ready for commercial use cannot be overstated. As businesses increasingly recognize the value of decentralized ML solutions, the demand for robust and production-ready frameworks has intensified. A FL framework geared for commercial use offers several crucial advantages. Firstly, it provides a stable and scalable foundation to deploy large-scale FL systems across diverse devices and platforms. This ensures that businesses can seamlessly integrate the framework into their existing infrastructure, minimizing disruption and optimizing efficiency. Moreover, a commercially viable framework emphasizes security and privacy measures, a non-negotiable aspect when dealing with sensitive data across distributed environments. Advanced encryption techniques, secure communication protocols, and differential privacy methods guarantee that user data remains safeguarded, mitigating potential risks of data breaches or unauthorized access. Of the frameworks covered, only a few are ready to be used commercially. Among them are Flower, NVFlare, FATE, OpenFL, and TFF. On the other hand, there are some frameworks that are not yet ready. Among the latter are FedML, PySyft, and IBMFL. The first two are in fact still under development and not yet ready for commercial-level use, while the third has a private license that does not allow the framework to be used for commercial-level applications. As explained in the previous sub-sections, these frameworks were already discarded from the comparison.
=== Final choice ===
At the beginning of this section, a total of eight frameworks were considered. Each framework was assessed based on various aspects and, after an in-depth analysis, six frameworks were deemed unsuitable because some requisites were not met. The requirements that were considered are summarized in the following table:
{| class="wikitable" style="margin: 0 auto;"
|+ FL frameworks comparison
! align="center" | Framework
! align="center" | NVFlare
! align="center" | FATE
! align="center" | Flower
! align="center" | PySyft
! align="center" | IBMFL
! align="center" | OpenFL
! align="center" | TFF
! align="center" | FedML
|-
| align="center" | ML frameworks flexibility
| align="center" | high
| align="center" | fair
| align="center" | high
| align="center" | fair
| align="center" | high
| align="center" | high
| align="center" | low
| align="center" | high
|-
| align="center" | License
| align="center" | Apache
| align="center" | Apache
| align="center" | Apache
| align="center" | Apache
| align="center" | unspecified
| align="center" | Apache
| align="center" | Apache
| align="center" | Apache
|-
| align="center" | Repo rating (stars)
| align="center" | 413
| align="center" | 5.1k
| align="center" | 2.8k
| align="center" | 8.9k
| align="center" | 438
| align="center" | 567
| align="center" | 2.1k
| align="center" | 3.1k
|-
| align="center" | First release
| align="center" | Nov 23, 2021
| align="center" | Feb 18, 2019
| align="center" | Nov 11, 2020
| align="center" | Jan 19, 2020
| align="center" | Aug 28, 2020
| align="center" | Feb 1, 2021
| align="center" | Feb 20, 2019
| align="center" | Apr 30, 2022
|-
| align="center" | Documentation and tutorials
| align="center" | good
| align="center" | decent
| align="center" | good
| align="center" | bad
| align="center" | good
| align="center" | bad
| align="center" | good
| align="center" | bad
|-
| align="center" | Readiness for commercial usage
| align="center" | ready
| align="center" | ready
| align="center" | ready
| align="center" | not ready
| align="center" | not ready
| align="center" | ready
| align="center" | ready
| align="center" | not ready
|}
The two remaining frameworks are therefore '''Flower''' and '''NVFlare'''. They demonstrated the potential to address the research objectives effectively and were well-aligned with the specific requirements of the FL project. In the following, these two selected frameworks will be rigorously compared, examining their capabilities in handling diverse ML models, supporting various communication protocols, and accommodating heterogeneous client configurations. The comparison will delve into the frameworks’ performance, ease of integration, and potential for real-world deployment. By focusing on these two frameworks, this work aims to provide a detailed evaluation that can serve as a valuable resource for practitioners and researchers seeking to implement FL in a variety of scenarios. The selected frameworks will undergo comprehensive testing and analysis, enabling the subsequent sections to present an informed and insightful comparison, shedding light on their respective strengths and limitations.
= Flower vs NVFlare: an in-depth comparison =
== Functional testing ==
=== Introduction ===
To conduct an in-depth comparison of the two selected frameworks from a functional perspective, a problem simulating a real-world scenario was chosen. Specifically, a classification problem was used. The choice was due to the following reasons:
* Firstly, classification problems are a widely studied and well-understood domain in ML, making them suitable for benchmarking and evaluation purposes.
* Secondly, classification tasks are versatile and can be applied to various real-world scenarios, such as image classification, natural language processing, and medical diagnosis, making the evaluation results applicable to a wide range of applications.
* Classification problems often involve complex data patterns and require efficient model training and optimization techniques, making them a suitable challenge for assessing the performance and scalability of FL frameworks.
Additionally, classification problems allow for the evaluation of key metrics like accuracy, precision, recall, and F1-score, providing a comprehensive assessment of a framework’s capabilities in handling different aspects of ML tasks.
=== System design ===
This section describes the architectural aspects of the entire system used to test Flower and NVFlare.
==== Testing environments ====
Within the FL infrastructure, multiple parties are involved to create a collaborative environment for training ML models. These parties, including data providers (clients) and a model aggregator (server), play essential roles in the FL process. Data providers contribute their locally-held data, while the model aggregator facilitates the consolidation of the individual model updates from the different parties. The interactions between these parties enable the training of robust, privacy-preserving models, making FL an effective approach for decentralized data scenarios.
Two testing environments were used for testing the frameworks: the first one is denoted as ''local'', while the other is called ''cloud''.
===== Local environment =====
The local parties consist of a single desktop computer that acts both as the server and as four separate clients. This configuration mimics a decentralized environment where the desktop computer takes on the roles of multiple participants, simulating the interactions and data contributions of distinct entities. As the server, it coordinates and manages the FL process, while functioning as the individual clients allows it to provide diverse data contributions for training. This localized approach allowed for the use of Docker as the development environment, leveraging the power of a desktop computer equipped with an RTX 3080Ti GPU to enhance performance. The power given by the NVidia GPU allowed the use of a more complex model and the simulation of four clients on the same machine that also acts as the server. Being self-contained in a single host, the local environment is convenient for testing, especially when the focus is on functional verification.
{| class="wikitable" style="margin: 0 auto;"
|+Testbed configuration
!Item
!Name
!Version / Model / Qty
|-
|Operating system
|GNU/Linux Ubuntu
|20.04
|-
|ML frameworks
|Pytorch
|1.13.1
|-
| rowspan="2" |FL frameworks
|Flower
|1.4.0
|-
|NVFlare
|2.3.0
|-
| rowspan="3" |Hardware component
|Processor
|AMD Ryzen 9 5950x
|-
|Graphics card
|NVidia RTX 3080Ti 12GB GDDR6X
|-
|System RAM
|64GB DDR4
|-
|Software containerization system
|Docker
|24.0.2
|-
|Architecture
|AMD64
| -
|}
===== Cloud environment =====
In this case, the cloud parties consist of two embedded devices or virtual machines acting as clients, and a notebook serving as the server. This configuration facilitates a distributed learning approach, enabling the clients to process their data locally while contributing to the model’s training. The server coordinates the learning process and aggregates the updates from the clients to improve the global model. This setup ensures a decentralized and privacy-preserving approach to ML, as the data remains on the clients’ devices, and only the model updates are shared during the training process. Leveraging embedded devices as clients enables the inclusion of resource-constrained devices in the FL ecosystem, making the framework more versatile and applicable to a wide range of scenarios. A notebook acting as the server provides a centralized point of coordination and ensures smooth communication and collaboration between the clients, making the FL process efficient and effective in leveraging distributed resources for improved model performance. Of course, this environment is more complicated to set up, but it better simulates real configurations.
context. As ===== Virtual environment =====To simulate a resultreal-world scenario, users opting VMs were set up and configured to act as clients in a FL system. This simulation allowed for TFF should be aware the exploration of diverse data distributions, data processing capabilities, and communication constraints that might arise in a real deployment. The following table summarizes the characteristics of this frame-the test bed used for such tests.
{| class="wikitable" style="margin: 0 auto;"
! Machine
! Component
! Name / Type
! Version / Qty
|-
| rowspan="8" | Host PC
| Operating system
| GNU/Linux Ubuntu
| 22.04
|-
| ML frameworks
| Pytorch
| 1.13.1
|-
| rowspan="2" | FL frameworks
| Flower
| 1.4.0
|-
| NVFlare
| 2.3.0
|-
| CPU
| Intel i7 12700h
| 6+8 cores
|-
| RAM
| DDR4
| 16GB
|-
| Middleware
| Python
| 3.10.6
|-
| Architecture
| AMD64
| -
|-
| rowspan="10" | VMs
| Operating system
| GNU/Linux Ubuntu
| 22.04
|-
| ML frameworks
| Pytorch
| 1.13.1
|-
| rowspan="2" | FL frameworks
| Flower
| 1.4.0
|-
| NVFlare
| 2.3.0
|-
| CPU
| Intel i7 12700h
| 3 cores
|-
| RAM
| DDR4
| 4 GB
|-
| rowspan="4" | Middleware
| Python
| 3.10.6
|-
| Vagrant
| 2.3.4
|-
| Virtualbox
| 6.1.38
|-
| Ansible
| 7.5.0
|}
===== Embedded environment =====
As mentioned in the introduction, one of the tasks of this work is to test a FL architecture in a real-world environment. In a typical scenario related to industrial applications, clients are built upon embedded devices that are engineered to meet the usual tight constraints of such environments. For the experiments described in this TN, two embedded platforms were used to implement clients:
* [https://www.xilinx.com/products/boards-and-kits/zcu104.html Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit]
* [[ORCA SBC|DAVE Embedded Systems SBC ORCA]].
These platforms are both based on system-on-chips (SoCs) that are expressly designed to address industrial applications. NXP i.MX8M Plus and Xilinx Zynq UltraScale+ are indeed components that fit very well the demanding and challenging requirements our customers must meet in order to build successful products. For instance, these SoCs not only provide computational resources to implement complex architectures, but also have wide operating temperature ranges and long-term availability, as they belong to specific longevity programs issued by the respective manufacturers.
At its core, the ZCU104 integrates an array of processing elements including a quad-core ARM Cortex-A53 Application Processing Unit (APU)<ref>Microprocessor that combines both traditional CPU and GPU cores onto a single chip.</ref>, which is based on the ARM64 architecture, and a dual-core ARM Cortex-R5 Real-Time Processing Unit (RPU)<ref>Dedicated hardware component or processor designed to execute tasks or operations with strict timing constraints. RPUs are commonly employed in systems that require immediate and predictable responses, such as embedded systems, robotics, and real-time control applications.</ref>. This processing power allows for efficient and parallel execution of complex [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1|ML inference algorithms]], making it an ideal choice for applications that demand real-time processing capabilities. It also features a Mali-400 MP2 Graphics Processing Unit, 16nm FinFET+ Programmable Logic, and 2 GB of DDR4 RAM. The peculiarity of this SoC that distinguishes it from competitors’ products is the fact that it integrates a Field Programmable Gate Array (FPGA)<ref>Re-configurable hardware device that allows users to implement custom digital circuits and functions by programming its internal logic gates and interconnections.</ref>, which is strictly coupled to the ARM processors. The ZCU104 boasts an array of high-speed interfaces, such as Gigabit Ethernet, USB 3.0, and DisplayPort, enabling seamless connectivity with external devices and peripherals. On the other side, at the heart of the SBC ORCA lies the NXP i.MX8M Plus SoC, featuring a quad-core Arm Cortex-A53 CPU, which is based on the ARM64 architecture, and a powerful Neural Processing Unit (NPU). The inclusion of the NPU enhances the platform’s ability to accelerate ML workloads, providing significant speed-up and power efficiency for [[ML-TN-001_-_AI_at_the_edge:_comparison_of_different_embedded_platforms_-_Part_1|neural network-based inference algorithms]]. The SBC ORCA is equipped with ample memory resources, including 6 GB of LPDDR4 RAM, to accommodate large datasets and complex ML models. The SBC ORCA also offers a variety of high-speed interfaces, such as Gigabit Ethernet, USB, and HDMI, which enable seamless connectivity with external devices and peripherals. The research environment deployed on the two embedded devices was meticulously constructed, incorporating Python’s virtual environment (<code>python-venv</code>). The environment utilizes Python 3.10.6 for the Xilinx Zynq UltraScale+ MPSoC ZCU104 device and Python 3.9.1 for the DAVE Embedded Systems SBC ORCA device, ensuring precise version control and compatibility tailored to each device’s capabilities. To ensure uniformity and maintainability across different environments, the same <code>requirements.txt</code> file employed in the VM environment was seamlessly integrated into the Python virtual environments of both embedded devices. By employing this unified approach, it was guaranteed that each virtual environment within the embedded devices closely mirrors the environment within the VM, thereby enhancing reproducibility and streamlining research tasks and experiments across diverse hardware platforms. Even in this case, the role of the server was performed by the notebook machine described previously. The following table illustrates the characteristics of the machines used for the "embedded environment".
{| class="wikitable" style="margin: 0 auto;"! Machine ! Component ! Name / Type ! Version / Qty|-| rowspan="7" |PC| Operating system | GNU/Linux Ubuntu | 22.04|-| ML frameworks | Pytorch | 1.13.1|-| rowspan= Licensing "2" | FL frameworks | Flower | 1.4.0|-| NVFlare | 2.3.0|-|CPU| Intel i7 12700h | 6+8 core|-| Middleware | python-venv | 3.10.6|-| Architecture | AMD64 | -|-| rowspan="8" |ZCU104| Operating system | Xilinx Linux Ubuntu | 22.04|-| ML frameworks | Pytorch | 1.13.1|-| rowspan="2" | FL frameworks | Flower | 1.4.0|-| NVFlare | 2.3.0|-| CPU | ARM Cortex-A53 1.5Ghz | 4 cores|-|RAM| DDR4 | 2GB|-| Middleware | python venv | 3.10.6|-| Architecture | ARM64 | -|-| rowspan="8" |SBC ORCA| Operating system | Linux Armbian | 23.02|-| ML frameworks | Pytorch | 1.13.1|-| rowspan="2" | FL frameworks | Flower | 1.4.0|-| NVFlare | 2.3.0|-| CPU | ARM Cortex-A53 1.6Ghz | 4 core|-| RAM| LPDDR4 | 6 GB|-| Platform | python venv | 3.9.1|-| Architecture | ARM64 The choice of a suitable license is of paramount importance for any FL frame| -|}
==== ML framework ====
Another crucial factor in designing the testing setup is the ML framework to be used. To this end, PyTorch was selected as the primary ML framework. The flexibility of PyTorch allowed for the implementation of complex models and easy customization to meet specific project requirements. Also, the availability of pre-trained models and a vast collection of built-in functions expedited the development process and enabled focus on the core aspects of the project. Another pivotal factor is PyTorch’s ability to leverage the GPU for hardware acceleration, which is crucial for training models on distributed data in FL environments. Its integration with CUDA and optimization for GPU computing make it a pragmatic choice for applications requiring high performance. Lastly, PyTorch was chosen for its adaptability within the existing development environment, including its compatibility with Docker and '''embedded devices based on the ARM64 (AArch64) architecture'''. This interoperability has facilitated the integration of the framework into the research and development environment.
==== Data Preprocessing ====
Data preprocessing is an important step for ensuring the success and effectiveness of the entire process. This crucial phase involves the choice of the dataset and the transformation and preparation of data before it is used for training the ML model on distributed devices. The data preprocessing stage also plays a vital role in harmonizing the data collected from different parties, which might have varying data distributions and formats. By applying standardized preprocessing techniques across the data from multiple clients, the potential bias and inconsistencies arising from diverse data sources can be mitigated, leading to a more accurate and robust global model.
The data preprocessing step includes dataset selection, dataset splitting, and data augmentation. For more details about these operations, please refer to [1].
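As an illustration, the sketch below shows a typical CIFAR-10 loading and augmentation pipeline in PyTorch. The actual splitting and augmentation settings used in this work are detailed in [1]; the specific transforms and the batch size here are assumptions chosen for illustration only.

<syntaxhighlight lang="python">
import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative augmentation pipeline: random crops and horizontal flips
# are a common choice for CIFAR-10 (see [1] for the settings actually used).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=train_transform)
# batch_size is illustrative; each client would load only its own subset
trainloader = torch.utils.data.DataLoader(trainset, batch_size=10, shuffle=True)
</syntaxhighlight>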
==== Model configuration ====
After data preprocessing, three fundamental aspects were addressed in order to have a FL system that converges efficiently to meaningful solutions. Indeed, a correct model configuration plays a crucial role in FL, as it encompasses the selection of the appropriate model architecture, optimization algorithm, and criterion.
===== Model architecture =====
The model architecture plays a crucial role in defining how the data flows through the network, the number and type of layers, the connections between neurons, and the activation functions applied at different stages. Since the main purpose of this work is to compare two frameworks, the choice of architecture was made in order to:
* maximize the selected metrics at the end of the training;
* limit the resource demands and thus also the time required to complete the process.
====== Cloud environment ======
The model architecture chosen for the cloud environment prioritizes simplicity and compatibility with the resource-constrained CPUs of embedded devices and virtual machines. This decision was motivated by the need for a lightweight and efficient model that can be easily deployed and executed on the various devices participating in the FL system. For that reason, the chosen architecture is SimpleCNN, taken from the "Training a classifier" tutorial available on the official PyTorch website. This selection fits perfectly with the request for a lightweight model that could still deliver satisfactory results.
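For reference, the network defined in that PyTorch tutorial is essentially the following (the tutorial names the class <code>Net</code>; <code>SimpleCNN</code> is the name used in this work):

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Small CNN for 32x32 RGB images (CIFAR-10): two convolutional
    layers with max pooling, followed by three fully connected layers."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
</syntaxhighlight>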
tributions from a broader community ====== Local environment ======A more complex model architecture was chosen for the local environment to leverage the usage of an NVidia GPU for model training. This decision was driven by the aim to harness the computational power of developers the GPU and expedite the training process, ultimately leading to improved model performance and researchersmore efficient training. This led to better metrics results compared to the cloud counterpart. In this case, the architecture that was chosen is indeed the ResNet-18. ResNet-18 is a highly effective DL model for image classification tasks, known for its ability to handle deeper architectures without sacrificing performance. The perIt performs exceptionally well with the CIFAR-10 dataset and is computationally efficient.
===== Optimizer =====
The chosen optimizer is SGD, since it is a popular optimizer for classification problems due to its simplicity and effectiveness. It was tuned using two hyperparameters: momentum and learning rate. Momentum is a technique used to accelerate the convergence of the optimization process and improve its stability. It addresses the issue of slow convergence and oscillations in the loss function by introducing a "velocity" term that helps the optimizer navigate the optimization landscape more efficiently. The value of 0.9 was chosen, meaning that the optimizer gives more weight to the past accumulated gradients, leading to smoother updates. The learning rate is a hyperparameter that determines the step size at which the model updates its weights during the optimization process. It controls how much the model adjusts its internal parameters in response to the error calculated during training. In this case, the chosen learning rate of 0.001 strikes a balance between making smaller, more precise steps towards the optimal solution and avoiding overshooting or oscillations during the optimization process.
===== Criterion =====
Cross-entropy is commonly used in classification problems because it quantifies the difference between the predicted probabilities and the actual target labels, providing a measure of how well the model is performing in classifying the input data. In the context of CIFAR-10, where there are ten classes (e.g., airplanes,
On cars, birds, etc.), the other handCross-Entropy loss compares the predicted class probabilities with the true one-hot encoded labels for each input sample. It applies the logarithm to the probabilities and then sums up the negative log likelihoods across all classes. The objective is to minimize this loss function during the training process, copyleft licenseswhich effectively encourages the model to assign high probabilities to the correct class labels and low probabilities to the incorrect ones. One of the reasons why Cross-Entropy Loss is considered suitable for CIFAR-10 and classification tasks, in general, is its ability to handle multi-class scenarios efficiently. By transforming the model’s output into probabilities through the softmax activation, like it inherently captures the GNU GPLrelationships between different classes, require derivedallowing for a more expressive representation of class likelihoods.
==== Client-side settings ====
On the client side, three important tasks are performed: training, validation, and testing. Each task is executed by every client participating in the FL infrastructure. At the end of a cycle, the tasks are temporarily suspended so that the results accumulated up to that point are sent to the server, which takes care of aggregating them. Once the aggregation is done, each client starts again with the tasks assigned to it, starting from the updated model provided by the server. The model is trained for 3 epochs in the case of the cloud environment and for 4 epochs in the case of the local environment. In both cases, each task consists of 1250 steps per epoch.
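With Flower, for instance, these client-side tasks are typically wrapped in a <code>NumPyClient</code> subclass. The sketch below is a hypothetical, simplified client for the cloud setup: <code>train()</code> and <code>test()</code> stand for the local training and evaluation loops (not shown), and the server address is a placeholder. NVFlare organizes the same tasks through its own executor and job configuration instead.

<syntaxhighlight lang="python">
import flwr as fl
import torch

class CifarClient(fl.client.NumPyClient):
    """Hypothetical Flower client wrapping the train/validate/test tasks."""

    def get_parameters(self, config):
        # Serialize the local model weights for the server
        return [v.cpu().numpy() for v in model.state_dict().values()]

    def set_parameters(self, parameters):
        # Load the aggregated weights received from the server
        state_dict = {
            k: torch.tensor(v)
            for k, v in zip(model.state_dict().keys(), parameters)
        }
        model.load_state_dict(state_dict, strict=True)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train(model, trainloader, epochs=3)  # 3 local epochs in the cloud setup
        return self.get_parameters(config={}), len(trainloader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        loss, accuracy = test(model, testloader)
        return float(loss), len(testloader.dataset), {"accuracy": float(accuracy)}

fl.client.start_numpy_client(server_address="192.168.0.10:8080",  # placeholder
                             client=CifarClient())
</syntaxhighlight>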
==== Aggregation algorithm ====
In this FL scenario, the Federated Averaging (FedAvg) algorithm was employed as the aggregation method. FedAvg is a fundamental and widely adopted algorithm used to aggregate model updates from multiple clients (or participants) in a FL setting. The primary objective of FedAvg is to allow collaborative model training while preserving data privacy. After local training, clients communicate their model updates (gradients) to the server, where these updates are aggregated to create a global model. The global model is then sent back to the clients, which use it as the starting point for the next round of training. This iterative process continues until the global model converges to a satisfactory solution.
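Conceptually, the aggregation step reduces to a per-layer weighted average, as in this minimal sketch. Note that this is a simplification for illustration: both Flower and NVFlare ship their own FedAvg implementations, which are the ones used in this work.

<syntaxhighlight lang="python">
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average each layer across clients, weighting every client's
    contribution by the number of samples it trained on.

    client_weights: list (one entry per client) of lists of numpy arrays
    client_sizes:   list of local dataset sizes, one per client
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]
</syntaxhighlight>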
==== Metrics ====
In order to make a good comparison, three of the most common and essential metrics were chosen to evaluate model performance and effectiveness. The chosen metrics are the following:
* '''Loss''': the loss function quantifies the dissimilarity between the predicted output of the model and the actual ground truth labels in the training data. It provides a measure of how well the model is performing during training. The goal is to minimize the loss function, as a lower loss indicates that the model is better aligned with the training data.
* '''Accuracy''': accuracy is a fundamental metric used to assess the model’s overall performance. It represents the proportion of correctly predicted samples to the total number of samples in the dataset. A higher accuracy indicates that the model is making accurate predictions, while a lower accuracy suggests that the model might need further improvements. Calculating the accuracy of individual clients in a FL classification problem is important to assess the performance of each client’s local model. This helps in understanding how well each client is adapting to its local data distribution and making accurate predictions.
* '''F1-score''': the F1-score is a metric that combines both precision and recall to provide a balanced evaluation of the model’s performance, especially when dealing with imbalanced datasets. Precision measures the ratio of correctly predicted positive samples to all predicted positive samples, while recall measures the ratio of correctly predicted positive samples to all actual positive samples. The F1-score is the harmonic mean of precision and recall, providing a single metric that considers both aspects.
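As a toy illustration, accuracy and F1-score can be computed from a client’s predictions with scikit-learn. The macro averaging over the ten classes is an assumption, since the exact averaging mode used in this work is not stated here.

<syntaxhighlight lang="python">
from sklearn.metrics import accuracy_score, f1_score

# Toy stand-ins for a client's test labels and model predictions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)       # fraction of correct predictions
f1 = f1_score(y_true, y_pred, average="macro")  # per-class F1, then averaged
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
</syntaxhighlight>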
==== Server-side settings ====
After the choice of the metrics to evaluate, the last thing to decide was the server settings. In fact, two important parameters remained to be set: the number of rounds and the number of clients participating in the FL infrastructure. A round represents a communication cycle between clients and the central server in the FL training process. During each round, participating clients perform local training using their available local data. Subsequently, the updated model weights trained locally are sent to the central server or coordination node. Here, the weights are centrally aggregated to obtain an updated global model, which represents the combined knowledge of all participating clients. At this point, the round is concluded and the aggregated model is sent back to the clients, who will use this updated model to perform a new round. In this case, the number of rounds chosen for the cloud part is equal to 4, while for the local part there is a total of 10 rounds. The second parameter to be chosen on the server side is the number of clients that will participate in the various rounds of FL. In this case, the number of clients is 2 on the cloud side and 4 on the local side.
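In Flower, for example, these two parameters map directly onto the server configuration and the aggregation strategy. The snippet below is a hypothetical server start-up for the cloud setup (4 rounds, 2 clients); NVFlare expresses the same settings through its own job configuration files instead.

<syntaxhighlight lang="python">
import flwr as fl

# Require both cloud clients before starting each round
strategy = fl.server.strategy.FedAvg(
    min_fit_clients=2,
    min_evaluate_clients=2,
    min_available_clients=2,
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=4),
    strategy=strategy,
)
</syntaxhighlight>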
A clear === Results ==={| class="wikitable" style="margin: 0 auto;"|+Flower running on SBC ORCA!# of cores!!|-|1|[[File:Flower 1-core htop MX8M+.png|center|thumb|300x300px]]|[[File:Flower log 1-core MX8M+.png|center|thumb|300x300px]]|-|4|[[File:Flower 4-cpu htop MX8M+.png|center|thumb|300x300px]]|[[File:Flower log 4-core MX8M+.png|center|thumb|300x300px]]|}For each experiment, three evaluations were performed:* '''Global evaluation''': accuracy and F1-score at the end of each round of FL were tested.* '''Local evaluation''': accuracy and F1-score at the end of each round of FL were tested.* '''Training evaluation''': it was computed loss, accuracy, and wellF1-defined license score.Experiments were run both in the local and cloud environments. Detailed results are illustrated in [1]. In essence, the results are very similar for both frameworks. Thus, they can be considered equivalent from the point of view of the metrics considered. It is also provides legal protection very important to note that for all the results regarding the cloud environment, there are '''very similar values between the testbed based on virtual machines and the one based on embedded devices'''. This is not obvious because moving from virtualized, x86-powered clients to ARM64-powered clients entails several issues that can affect the results of the FL application. Among these, it is worth to remember the following:* '''Limited hardware resources''': Embedded devices often have limited hardware resources, such as CPU, memory and computing power. This restriction can affect the performance of FL, especially if models are complex or operations require many resources.* '''Hardware variations''': Embedded devices may have hardware variations between them, even if they belong to the same class. These hardware differences may lead to different behaviors in FL models, requiring more robustness in adapting to different devices.* '''Variations in workload''': Embedded device applications may have very different workloads from those simulated in a virtual environment. These variations may lead to different requirements for FL.In conclusion, from a functional perspective, bothframeworks passed the test suite. More details about their performances in terms of execution time can be found in [[#Execution time|this section]].
== Privacy and security ==
In the comparison between the two FL frameworks, NVFlare and Flower, a crucial aspect that was assessed is the level of privacy and security offered by each framework. Both NVFlare and Flower implement strategies to ensure data privacy and security during the training process. These strategies involve techniques such as data encryption, secure communication protocols, and differential privacy mechanisms. By analyzing the privacy and security features, it was possible to evaluate how well each framework protects sensitive user data while enabling effective collaborative model training. Factors such as the strength of encryption algorithms, the robustness of communication channels against potential breaches, and the extent to which the frameworks adhere to privacy regulations were considered. The assessment aimed to determine whether these frameworks implement state-of-the-art security measures, maintain data confidentiality, and provide assurances against potential attacks or data leaks. Additionally, a comparison of the frameworks’ approaches to preserving user privacy within a collaborative training environment was conducted. This analysis not only provides insights into the level of privacy and security that NVFlare and Flower offer, but also shows how ready these two frameworks are for industrial use and how their strategies can prevent some of the main attacks of the FL landscape. Specifically, the following items were investigated (for more details, please refer to [1]):
* Secure communication
* Differential privacy
* Secure aggregation
* Federated authorization
Privacy and security are two of the most important features to consider when analyzing FL frameworks. This is because FL itself was born out of the concerns that classical ML raised due to the lack of data security and privacy. In this regard, both frameworks provide various techniques for data security and privacy, enabling a secure implementation of a FL architecture that avoids a lot of possible attacks. In spite of this, the analysis we conducted led us to the conclusion that the newer NVFlare framework is more complete, both in terms of the tutorials provided for a possible implementation example and in terms of the provided features. Indeed, it can be seen how greater attention was paid to a possible development in a real-world scenario, thanks for example to the Federated Authorization system recently implemented by the framework.
== Ease of use ==
Another important characteristic to consider is the ease of use of a framework. This is a relevant aspect because it influences the practicality and circulation of a framework. Reference [1] illustrates the comparison between NVFlare and Flower with respect to the following relevant items concerning the ease of use:
* Code development
* Feature support
* Technical support.
== Execution time ==
As part of the comparison between NVFlare and Flower, a further method of evaluation involved the analysis of execution time across 4-core and 1-core settings. This approach aimed to assess both the degree of parallelism and the overall speed of the frameworks in completing the assigned tasks. The entire analysis was conducted using the DAVE Embedded Systems SBC ORCA embedded device presented earlier, reflecting scenarios in industrial or corporate contexts where not all cores might be available within an FL architecture due to other ongoing tasks. In such cases, there might be only one core available for utilization. The reduction of cores from 4 to 1 was done by means of a kernel-level setting of the cores via the command line parameter <code>maxcpus=1</code>. The setup used for this test was the following:
{| class="wikitable" style="margin: 0 auto;"
! Framework
! # clients
! # rounds
! # epochs
! Steps
! Model
! Device
|-
| Flower
| 2
| 1
| 1
| 1250
| SimpleCNN
| SBC ORCA
|-
| NVFlare
| 2
| 1
| 1
| 1250
| SimpleCNN
| SBC ORCA
|}
In order to have two clients, the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit was also used. Since the two boards have similar CPUs, the execution time for the two embedded devices is the same, allowing only the DAVE Embedded Systems SBC ORCA device to be considered for measuring the execution time as the number of available cores varies. The model architecture used is the same SimpleCNN already mentioned previously. For a better understanding, the execution time was divided into two parts: the training time and the total time.
[[File:Flower-NVFlare-execution-time-1.png|center|thumb|716x716px|Total execution time.]]
[[File:Flower-NVFlare-execution-time-2.png|center|thumb|702x702px|Training execution time.]][[File:Flower-NVFlare-execution-time-3.png|center|thumb|707x707px|CPU utilization: single-core vs quad-core.]]
Although the execution times of the two frameworks are very different from each other due to their different infrastructures, the 4-core and 1-core execution times are comparable. In fact, for both frameworks, in terms of both training and overall execution times, the decrease in cores from 4 to 1 resulted in a loss in performance ranging from 18% to a maximum of 22%. It can therefore be concluded that the degree of parallelism for both frameworks is not very high. This is also due to the fact that, during execution, the device did not utilize all 4 cores at 100% but only at about 50%, compared to 100% utilization when the device was set to 1 core.
From the histograms, it can be seen that the training times of the two frameworks are comparable, with Flower being slightly faster and thus making more efficient use of computing power. On the other hand, there is a clear difference between the two as far as the total running time is concerned. This discrepancy in execution speed could be attributed to various factors, including differences in algorithmic optimizations, parallel processing efficiency, network communication strategies, and underlying architectural design. In fact, Flower has a much simpler architecture than NVFlare, resulting in a total execution time about three times lower. This is important since such agility is particularly valuable in scenarios where real-time decision-making or rapid response to changing data is crucial. Moreover, in resource-limited environments, such as the one used in this work that makes use of embedded devices, conservation of computing power may be essential. Flower’s efficiency in this regard makes it a more suitable choice for applications where hardware resources are limited.
= Applying NVFlare to a real-world case =
In the above comparison, among the two contenders, NVFlare demonstrated its superiority in terms of addressing the intricate challenges posed by the modern landscape of FL. While Flower demonstrated efficiency in terms of execution time, NVFlare excelled in other topics that are crucial in real-world deployments. NVFlare’s robust privacy and security architecture significantly outperformed Flower’s, ensuring that sensitive data and model parameters are adequately protected during the FL process. This advantage is of paramount importance, especially in applications involving confidential or sensitive information. Furthermore, NVFlare’s ease of use and comprehensive support infrastructure were pivotal in its selection as the preferred framework. Its straightforward setup and configuration, along with user-friendly APIs and well-documented resources, make it accessible to a wide range of developers and data scientists. While Flower may have a simpler, more centralized server-side architecture, NVFlare’s user-centric approach ensures that users can efficiently leverage its capabilities for their FL projects. However, it is essential to recognize that real-world application scenarios can be considerably more intricate and multifaceted. To address this aspect, a more intricate simulated scenario involving NVFlare warrants examination. In this extended analysis, a more complex use case will be explored, involving the testing of various algorithms and data heterogeneity to assess the impact of this factor on the results and on the different algorithms under consideration. By assessing NVFlare’s performance within a more intricate context, the aim is to provide a holistic view of its capabilities and limitations, contributing to a more comprehensive understanding of its viability in various practical scenarios, without forgetting the scalability of the framework.
== Advanced system design ==
For the purpose of performing more advanced testing, the same problem described in chapter "Flower vs NVFlare: an in-depth comparison" was leveraged. However, the test bed was tuned in order to increase the complexity of the use case, as detailed in the rest of this chapter. The design settings of this advanced FL system remain consistent with those utilized in the previous comparison, referred to as the "Local Environment", apart from some changes. In this scenario, the same desktop machine was utilized, equipped with an NVidia RTX 3080 Ti GPU. The ML framework, PyTorch, remained consistent, as did the data preprocessing involving dataset selection, dataset splitting, and data augmentation. However, a significant change was introduced regarding data heterogeneity. Model configuration and client-side settings also remained unchanged. Minor adjustments were made to the metrics taken into consideration, focusing exclusively on two: local training loss and server validation accuracy. On the server side, the configuration underwent modifications: while maintaining a count of four clients, the number of communication rounds was raised to 20.
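As an illustration of the server-side settings just mentioned, the fragment below shows where the four clients and the 20 communication rounds would typically appear in an NVFlare job configuration. It is rendered as a Python dict for readability (the actual file is a <code>config_fed_server.json</code>), it assumes the stock <code>ScatterAndGather</code> workflow, and the component ids are placeholders: treat it as a sketch to be checked against the NVFlare release in use, not the exact configuration of this test bed.
<syntaxhighlight lang="python">
# Sketch of the relevant part of config_fed_server.json, written as a
# Python dict for readability. Assumes NVFlare's stock ScatterAndGather
# workflow; the component ids below are placeholders.
server_config = {
    "format_version": 2,
    "min_clients": 4,   # all four simulated clients must participate
    "num_rounds": 20,   # raised with respect to the previous test bed
    "workflows": [
        {
            "id": "scatter_and_gather",
            "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
            "args": {
                "min_clients": 4,
                "num_rounds": 20,
                "aggregator_id": "aggregator",
                "persistor_id": "persistor",
                "shareable_generator_id": "shareable_generator",
                "train_task_name": "train",
            },
        }
    ],
}
</syntaxhighlight>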
== FL algorithms and centralized simulation ==
One of the two main changes made to the system design was to simulate a centralized training baseline and to consider two other algorithms in addition to FedAvg. The centralized training was conducted using a single client for 20 local epochs, aiming to simulate a classical ML environment. This approach served as a reference point for comparison against the various instances of FL. The other two FL algorithms employed in this study are Federated Optimization (FedProx) and Stochastic Controlled Averaging for FL (Scaffold). Starting with FedProx, this algorithm extends the conventional FedAvg method by introducing a proximal term. The proximal term adds a regularization factor to the optimization process, enhancing the convergence rate and stability of the model across participating clients. FedProx achieves this by optimizing the global model using both local updates and a global proximal term, which balances the contributions of individual clients while preventing divergence.
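To make the proximal term concrete, here is a minimal PyTorch sketch of a FedProx-style local update. It is an illustrative, framework-agnostic implementation, not the exact code used in the experiments; <code>model</code>, <code>global_model</code> and <code>loader</code> are assumed to be supplied by the surrounding training code.
<syntaxhighlight lang="python">
import torch

def fedprox_local_update(model, global_model, loader, epochs=1, lr=0.01, mu=0.1):
    # Snapshot of the current global weights: the proximal anchor.
    global_params = [p.detach().clone() for p in global_model.parameters()]
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            # FedProx proximal term: (mu / 2) * ||w - w_global||^2 keeps the
            # local model close to the global one, limiting client drift.
            prox = sum(((p - g) ** 2).sum()
                       for p, g in zip(model.parameters(), global_params))
            (loss + 0.5 * mu * prox).backward()
            optimizer.step()
    return model.state_dict()
</syntaxhighlight>
With mu = 0 this reduces exactly to a FedAvg local update, which is why the two algorithms behave so similarly in practice.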
Moving on to Scaffold, this approach focuses on refining the update step of FL in the presence of client drift. It introduces control variates, maintained by both the server and each client, that act as correction terms applied to every local update. This allows Scaffold to compensate for the drift of each client's model caused by heterogeneous local data. By doing so, the algorithm mitigates the effects of inconsistent updates, improving the overall convergence of the FL process.
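For illustration, a simplified PyTorch sketch of a SCAFFOLD-style local update follows, based on the published formulation of the algorithm ("option II" of the original paper); <code>c_global</code> and <code>c_local</code> are the server-side and client-side control variates, and, as above, this is a sketch rather than the code actually deployed.
<syntaxhighlight lang="python">
import torch

def scaffold_local_update(model, global_model, c_global, c_local, loader, lr=0.01):
    criterion = torch.nn.CrossEntropyLoss()
    x_global = [p.detach().clone() for p in global_model.parameters()]
    steps = 0
    for inputs, labels in loader:
        loss = criterion(model(inputs), labels)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, cg, cl in zip(model.parameters(), c_global, c_local):
                # Corrected SGD step: w <- w - lr * (grad - c_local + c_global)
                p -= lr * (p.grad - cl + cg)
        steps += 1
    # Refresh the client control variate:
    # c_local <- c_local - c_global + (w_global - w_local) / (steps * lr)
    with torch.no_grad():
        new_c_local = [cl - cg + (xg - p) / (steps * lr)
                       for p, xg, cg, cl in zip(model.parameters(), x_global,
                                                c_global, c_local)]
    return model.state_dict(), new_c_local
</syntaxhighlight>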
== Data heterogeneity ==
In this advanced project, an additional feature was incorporated, involving the integration of classes aimed at performing dataset splitting among the designated clients, which, in this instance, were four in number. In addition to dividing the dataset into four subsets, the possibility of choosing the level of heterogeneity of the data was added by applying the Dirichlet sampling strategy (a minimal sketch is shown at the end of this section). Thus, it was possible to dynamically adjust the degree of data heterogeneity for each client. This functionality made it possible to simultaneously customize the level of data heterogeneity across all clients. In the context of FL, data heterogeneity can be characterized as follows:
* '''Low Data Heterogeneity''': the data across different clients is quite similar, or homogeneous; there is little variation among the data held by different clients. This leads to nearly balanced classes among clients, that is, a similar number of samples in each class.
* '''High Data Heterogeneity''': there is significant diversity in the data across different clients or nodes. Every subset assigned to a client contains unbalanced classes, i.e. some classes may be over-represented in some clients, while others may be under-represented.
In order to have a clear comparison within the experiments, the upper and lower extremes of the α factor affecting heterogeneity were considered, i.e. 0.1 and 1.0.
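A minimal sketch of this splitting strategy follows; the <code>dirichlet_partition</code> helper is hypothetical, shown instead of the exact classes integrated in the project, but it illustrates how the α parameter drives the degree of heterogeneity.
<syntaxhighlight lang="python">
import numpy as np

def dirichlet_partition(labels, n_clients=4, alpha=0.1, seed=0):
    """Split sample indices among clients with per-class proportions drawn
    from a Dirichlet distribution. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Share of this class assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices
</syntaxhighlight>
With <code>alpha=0.1</code> the proportions drawn from the Dirichlet distribution are highly skewed, reproducing the high-heterogeneity setting, whereas <code>alpha=1.0</code> yields nearly balanced subsets.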
== Results analysis ==
A series of seven experiments was conducted. The first experiment involved a centralized simulation, while the remaining six focused on testing the three different algorithms: FedAvg, FedProx, and Scaffold. Specifically, each algorithm was tested twice: the first time with α = 0.1 and the second time with α = 1.0.
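The resulting grid can be summarized in a few lines of Python; <code>run_experiment</code> below is a hypothetical placeholder for the actual job launcher, shown only to make the enumeration of the seven runs explicit.
<syntaxhighlight lang="python">
# One centralized baseline plus {FedAvg, FedProx, Scaffold} x {0.1, 1.0}.
def run_experiment(algorithm, alpha):
    print(f"running {algorithm} with alpha={alpha}")  # placeholder

experiments = [("centralized", None)] + [
    (alg, alpha)
    for alg in ("fedavg", "fedprox", "scaffold")
    for alpha in (0.1, 1.0)
]
for algorithm, alpha in experiments:
    run_experiment(algorithm, alpha)
</syntaxhighlight>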
The following figure represents the local <code>training_loss</code> obtained by running the quoted experiments. As can be seen, the loss of the centralized simulation cannot keep up with the other experiments, which reach slightly lower loss values. This shows the effectiveness of the FL algorithms compared to a classical ML approach. The same behavior observed with the previous test bed can also be noticed: at the beginning of each round the loss instantly jumps higher compared to the last epoch of the previous round, then decreases over the later epochs. Another important thing to note is that experiments with an alpha value of 0.1 perform worse than their counterparts with an alpha value of 1.0.
[[File:NVFlare-local-training-loss.png|center|thumb|727x727px|NVFlare: local training loss.]]
This factor becomes even more evident when observing the following chart, which illustrates the server <code>validation_accuracy</code>. This is due to the fact that, with α = 0.1, the classes within each client's dataset are more unbalanced, and this leads models trained on classes with less data to have difficulty generalizing correctly. Models become more inclined to predict the dominant classes, reducing accuracy on the less represented ones. The poor representation of some classes makes it difficult for the models to learn from them, leading to lower overall accuracy.
[[File:NVFlare-server-validation-accuracy.png|center|thumb|732x732px|NVFlare: server validation accuracy.]]
Analyzing the individual algorithms, a very similar behavior can be seen between FedAvg and FedProx, which achieve very similar results in terms of both local <code>training_loss</code> and server <code>validation_accuracy</code>. This is mainly due to the fact that the two algorithms differ only in the proximal term μ introduced by FedProx, which improves the convergence rate. The Scaffold algorithm, on the other hand, has a totally different implementation from its predecessors, correcting each client's local updates by means of control variates, and thus achieves better performance, especially when using unbalanced classes (α = 0.1). This can easily be seen in the server <code>validation_accuracy</code> graph.
The successful execution of this more complex use case on NVFlare, involving multiple tested algorithms and diverse data heterogeneity, further underscores the framework's robust capabilities and suitability for a wide range of scenarios. This result confirms the versatility of NVFlare as a FL framework, making it a reliable choice for real-world scenarios dealing with heterogeneous and complex data.
= Conclusions and future work =
The analysis detailed in the previous chapters shows that NVFlare is a viable framework for building real-world Federated Learning systems that make use of Linux-powered embedded platforms as clients. Nevertheless, it is worth remembering that one important issue was purposely not addressed. For the sake of simplicity, this work did not consider the problem of labeling new samples. In other words, it was implicitly assumed that new samples collected by the clients are somehow labelled prior to being used for training. This is a strong assumption: in reality, system architects cannot overlook this fundamental issue when designing FL-based solutions. This is especially true in industrial environments, where clients are often unattended devices that operate standalone. To date, this topic has not been investigated thoroughly yet. For instance, [https://bdtechtalks.com/2021/08/09/what-is-federated-learning/ this article] mentions it without providing practical solutions. Generally speaking, it seems that for industrial applications the use of unsupervised learning techniques could be a promising approach (see for example [https://arxiv.org/pdf/1805.03911.pdf this paper]). In any case, the problem of labeling new samples will have to be addressed in future works to make Federated Learning truly available for real applications in the space of Industrial IoT.
=Notes=
<references />
== References ==
* <div id="LDLThesis">[1] Leandro Di Lauro, ''Comparative Analysis and Evaluation of Federated Learning Frameworks'', 20th September 2023, https://cloud.dave.eu/public/dc19e3</div>