Introduction
''Federated learning enables multiple actors to build a common, robust machine learning model '''without sharing data, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data'''. Its applications engage industries including defense, telecommunications, Internet of Things, and pharmaceuticals. A major open question is when/whether federated learning is preferable to pooled data learning. Another open question concerns the trustworthiness of the devices and the impact of malicious actors on the learned model.''
In principle, FL can be an extremely useful technique to address critical issues of industrial IoT (IIoT) applications. As such, it is a natural fit for [[ToloMEO Embedded Assistant|https://tolomeo.io DAVE Embedded Systems' IIoT platform, ToloMEO]]. This Technical Note (TN) illustrates how DAVE Embedded Systems explored, tested, and characterized some of the most promising open-source FL frameworks available to date. One of these frameworks might equip ToloMEO-compliant products in the future, allowing our customers to implement federated learning systems easily. From the point of view of machine learning, therefore, we investigated whether typical embedded architectures used today for industrial applications are suited to act not only as inference platforms (an issue we already dealt with [[ML-TN-001 - AI at the edge: comparison of different embedded platforms - Part 1|here]]) but as training platforms as well.
In brief, the work consisted of the following steps:
== Criteria and initial, long list ==
For selecting the frameworks, several factors were taken into account:
* '''ML frameworks flexibility''': The adaptability of the framework to manage different ML frameworks.
* '''Licensing''': It is mandatory that the framework has an open-source, permissive license to cope with the typical requirements of real-world use cases.
* '''Repository rating and releases''': Rating in a repository is important for an FL framework as it indicates a high level of community interest and support, potentially leading to more contributions and improvements. Meanwhile, the first and latest releases indicate respectively the maturity and the support of the framework, and whether it is released or still in a beta version.
* '''Documentation and tutorials''': The provided documentation with related tutorials has to be complete and well-made.
* '''Readiness for commercial usage''': The readiness of the framework to be deployed in a real-world scenario. To establish the readiness, the version of the framework and its license were checked.
According to the previous criteria, an initial list including the most promising FL frameworks was compiled. It comprised the following products:
* [https://github.com/NVIDIA/NVFlare NVIDIA FL Application Runtime Environment] (NVFlare)
=== Licensing ===
The choice of a suitable license is of paramount importance for any FL framework. A well-crafted license provides a legal foundation that governs the usage, distribution, and modification of the framework’s source code and associated components. A permissive license, like the MIT License or Apache License, allows users to use, modify, and distribute the framework with relatively few restrictions. This encourages widespread adoption, fosters innovation, and facilitates contributions from a broader community of developers and researchers. The permissiveness of these licenses empowers users to incorporate the framework into their projects, even if they have proprietary components. On the other hand, copyleft licenses, like the GNU GPL, require derived works to be distributed under the same terms, ensuring that any modifications or extensions to the framework remain open-source. While this may be more restrictive, it encourages a collaborative ecosystem where improvements are shared back with the community.

A clear and well-defined license also provides legal protection to both developers and users, helping to mitigate potential legal risks and disputes. It ensures that contributors have granted appropriate rights to their work and helps maintain a healthy and sustainable development environment.

Most of the frameworks previously described are under the Apache-2.0 license, with one exception: IBMFL. It is released under an unspecified license, which makes the framework unsuitable for commercial use. For that reason, IBMFL was discarded from the comparison as well.
=== Repository rating and releases ===
Ratings in public repositories such as "stars" in GitHub are important because they serve as a measure of popularity and community interest in the project. When a repository achieves a good rating, it indicates that more developers and users find the project valuable and relevant. This can lead to several benefits:
* '''Visibility''': Repositories with good ratings are likely to appear higher in the platform's search results, making it easier for others to discover and use the project.
* '''Credibility''': Highly rated repositories are often perceived as more trustworthy and reliable, as they are vetted and endorsed by a larger user base.
* '''Contributions''': Popular repositories tend to attract more contributions from developers, leading to a more active and vibrant community around the project.
* '''Feedback''': Projects with good ratings are more likely to receive feedback, bug reports, and feature requests, helping the developers improve the software.
* '''Maintenance''': Higher ratings can also stimulate the maintainers to keep the project updated and actively supported.
Other important, rating-related aspects are the first and latest releases. They show, respectively, the maturity of the framework and how often it is updated, and thus the support behind it. Obviously, a framework that was born earlier than others is much more likely to have better ratings. With this in mind, at the time of writing this thesis, the ranking in terms of received stars, together with the first release date of each framework, is as follows:
* '''PySyft''': 8.9k stars / Jan 19, 2020
* '''FATE''': 5.1k stars / Feb 18, 2019
* '''FedML''': 3.1k stars / Apr 30, 2022
* '''Flower''': 2.8k stars / Nov 11, 2020
* '''TFF''': 2.1k stars / Feb 20, 2019
* '''OpenFL''': 567 stars / Feb 1, 2021
* '''IBMFL''': 438 stars / Aug 28, 2020
* '''NVFlare''': 413 stars / Nov 23, 2021
These characteristics, although they certainly have a bearing on the choice of frameworks, were not enough to go so far as to discard any of the selected frameworks.
=== Documentation and tutorials ===
High-quality documentation and well-crafted tutorials are essential considerations when selecting an FL framework, for several reasons:
* '''Accessibility and Ease of Use''': Comprehensive documentation allows users to understand the framework’s functionalities, APIs, and usage quickly. It enables developers, researchers, and practitioners to get started with the framework efficiently, reducing the learning curve.
* '''Accelerated Development''': Well-structured tutorials and examples demonstrate how to use the framework to build practical FL systems. They provide step-by-step guidance on setting up experiments, running code, and interpreting results. This expedites the development process and encourages experimentation with different configurations.
* '''Error Prevention''': Clear documentation and good examples help users avoid common mistakes and errors during implementation. They provide troubleshooting tips and address frequently asked questions, reducing frustration and increasing user satisfaction.
* '''Reliability and Robustness''': A well-documented framework indicates that developers have invested time in organizing their code and explaining its functionalities. This attention to detail suggests a more reliable and stable framework.
* '''Maintenance''': Higher stars can also stimulate the maintainers to keep the project updated and actively supported.
Regarding these aspects, several frameworks still lack good documentation and tutorials, namely PySyft, OpenFL, and FedML. PySyft is still under construction, as its official repository states, so its documentation is often incomplete or out of date. OpenFL has very meager documentation and only a few tutorials, which cover few ML frameworks and scenarios. Like PySyft, FedML has incomplete documentation because the project was started very recently and is still under development. Finally, FATE has complete and well-made documentation but very few tutorials and, because of its complex architecture, evaluating it would have taken too much time. For these reasons, these four frameworks were discarded from the comparison.
== Final choice ==
At the beginning of this section, a total of eight frameworks were considered. Each framework was assessed based on various aspects and after an in-depth analysis, six frameworks were deemed unsuitable due to some requisites not being met. The requirements that were considered are summarized in the following table:
{| class="wikitable" style="margin: 0 auto;"
|}
The two remaining frameworks are '''Flower''' and '''NVFlare'''. They demonstrated the potential to address the research objectives effectively and were well-aligned with the specific requirements of the FL project. Later, these two frameworks will be rigorously compared, examining their capabilities in handling diverse ML models, supporting various communication protocols, and accommodating heterogeneous client configurations. The comparison will cover the frameworks’ performance, ease of integration, and potential for real-world deployment. By focusing on these two frameworks, this research aims to provide a detailed evaluation that can serve as a resource for practitioners and researchers seeking to implement FL in a variety of scenarios. The subsequent sections present the results of this testing and analysis, highlighting the respective strengths and limitations of the two frameworks.
= Flower vs NVFlare: an in-depth comparison =
Cross-entropy is commonly used in classification problems because it quantifies the difference between the predicted probabilities and the actual target labels, providing a measure of how well the model is performing in classifying the input data. In the context of CIFAR-10, where there are ten classes (e.g., airplanes, cars, birds, etc.), the Cross-Entropy loss compares the predicted class probabilities with the true one-hot encoded labels for each input sample. It applies the logarithm to the probabilities and then sums the negative log-likelihoods across all classes, weighted by the one-hot labels, so that only the term of the true class contributes. The objective is to minimize this loss function during the training process, which effectively encourages the model to assign high probabilities to the correct class labels and low probabilities to the incorrect ones. One of the reasons why Cross-Entropy Loss is considered suitable for CIFAR-10, and for classification tasks in general, is its ability to handle multi-class scenarios efficiently. Because the softmax activation turns the model’s outputs into probabilities that sum to one, raising the probability of one class necessarily lowers the others, which gives a more expressive representation of class likelihoods.
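The computation described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the experiments: the function names and the example logits are hypothetical, and a real training loop would rely on the loss implementation of the chosen ML framework.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the true class.

    With a one-hot label, every other term of the sum is multiplied
    by zero, so only the true-class term remains.
    """
    probs = softmax(logits)
    return -math.log(probs[true_class])

# Hypothetical logits for one sample over the ten CIFAR-10 classes.
uniform_logits = [0.0] * 10            # model has no preference
confident_logits = [8.0] + [0.0] * 9   # model strongly prefers class 0

loss_uniform = cross_entropy(uniform_logits, true_class=0)
loss_confident = cross_entropy(confident_logits, true_class=0)
loss_wrong = cross_entropy(confident_logits, true_class=3)
```

A model with no preference among the ten classes yields a loss of ln(10), a confident correct prediction yields a loss close to zero, and a confident wrong prediction is penalized heavily.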
==== Client-side settings ====
==== Metrics ====
In order to make a good comparison, three of the most common and essential metrics were chosen to evaluate model performance and effectiveness. The chosen metrics are the following:
* '''Loss''': The loss function quantifies the dissimilarity between the predicted output of the model and the actual ground truth labels in the training data. It provides a measure of how well the model is performing during training. The goal is to minimize the loss function, as a lower loss indicates that the model is better aligned with the training data.
* '''Accuracy''': Accuracy is a fundamental metric used to assess the model’s overall performance. It represents the proportion of correctly predicted samples to the total number of samples in the dataset. A higher accuracy indicates that the model is making accurate predictions, while a lower accuracy suggests that the model might need further improvements. Calculating the accuracy of individual clients in an FL classification problem is important to assess the performance of each client’s local model. This helps in understanding how well each client is adapting to its local data distribution and making accurate predictions.
* '''F1-score''': The F1-score is a metric that combines both precision and recall to provide a balanced evaluation of the model’s performance, especially when dealing with imbalanced datasets. Precision measures the ratio of correctly predicted positive samples to all predicted positive samples, while recall measures the ratio of correctly predicted positive samples to all actual positive samples. The F1-score is the harmonic mean of precision and recall, providing a single metric that considers both aspects.
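As a minimal sketch of the accuracy and F1-score definitions above, the following plain-Python functions compute both metrics from lists of true and predicted labels. The function names and the toy labels are hypothetical. The macro average (unweighted mean of the per-class F1-scores) is an assumption made here for the multi-class case; the text does not state which averaging was used in the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1-scores (assumed averaging scheme)."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall; 0 when both are 0.
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / num_classes

# Toy example with three classes (hypothetical labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, num_classes=3)
```

In this toy example four of six predictions are correct, while the per-class F1-scores are 0.5, 0.8, and 2/3, which shows how the F1-score can differ from accuracy when errors are unevenly spread across classes.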
==== Server-side settings ====
|}
For each experiment, three evaluations were performed:
* '''Global evaluation''': accuracy and F1-score at the end of each round of FL were tested.
* '''Local evaluation''': accuracy and F1-score at the end of each round of FL were tested.
* '''Training evaluation''': loss, accuracy, and F1-score were computed.
Experiments were run in both the local and cloud environments. Detailed results are illustrated in [1]. In essence, the results are very similar for both frameworks, which can thus be considered equivalent from the point of view of the metrics considered. It is also very important to note that, for all the results regarding the cloud environment, there are '''very similar values between the testbed based on virtual machines and the one based on embedded devices'''. This is not obvious, because moving from virtualized, x86-powered clients to ARM64-powered clients entails several issues that can affect the results of the FL application. Among these, it is worth recalling the following:
* '''Limited hardware resources''': Embedded devices often have limited hardware resources, such as CPU, memory and computing power. This restriction can affect the performance of FL, especially if models are complex or operations require many resources.
* '''Hardware variations''': Embedded devices may have hardware variations between them, even if they belong to the same class. These hardware differences may lead to different behaviors in FL models, requiring more robustness in adapting to different devices.
* '''Variations in workload''': Embedded device applications may have very different workloads from those simulated in a virtual environment. These variations may lead to different requirements for FL.
In conclusion, from a functional perspective, both frameworks passed the test suite. More details about their performance in terms of execution time can be found in [[#Execution time|this section]].
== Data heterogeneity ==
In this advanced project, an additional feature was incorporated: classes that split the dataset among the designated clients, which in this instance were four. In addition to dividing the dataset into four subsets, the level of data heterogeneity can be chosen by applying the Dirichlet sampling strategy. It was thus possible to adjust the degree of data heterogeneity dynamically and to set it for all clients simultaneously. In the context of FL, data heterogeneity can be defined as follows:
* '''Low Data Heterogeneity''': Low heterogeneity means that the data across different clients is quite similar or homogeneous. There is little variation among the data held by different clients. This leads to nearly balanced classes among clients, that is, each client holds a similar number of samples for each class.
* '''High Data Heterogeneity''': High heterogeneity means that there is significant diversity in the data across different clients or nodes. This means that every subset assigned to each client contains unbalanced classes, i.e. some classes may be over-represented in some clients, while others may be under-represented.
In order to have a clear comparison within the experiments, the upper and lower extremes of the α factor affecting heterogeneity were considered, i.e. 0.1 and 1.0.
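The following is a minimal sketch of how a Dirichlet-based split among four clients could look, using only the Python standard library. It is not the implementation used in the experiments: the function names are hypothetical, and the per-class scheme (drawing, for each class, the share of samples given to each client from a symmetric Dirichlet distribution with parameter α) is an assumed, commonly used variant. With this scheme, a smaller α produces more skewed shares and therefore higher heterogeneity.

```python
import random
from collections import Counter

def dirichlet_sample(alpha, size, rng):
    """One draw from a symmetric Dirichlet(alpha), via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for tiny alpha
        return [1.0 / size] * size
    return [d / total for d in draws]

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices among clients, class by class (assumed scheme)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet_sample(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for client_id, share in enumerate(shares):
            cumulative += share
            # The last client takes whatever remains, so no sample is lost.
            if client_id == num_clients - 1:
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients

# Toy dataset: 1000 samples over ten classes (hypothetical labels).
labels = [i % 10 for i in range(1000)]
high_het = dirichlet_partition(labels, num_clients=4, alpha=0.1)
low_het = dirichlet_partition(labels, num_clients=4, alpha=1.0)

# Per-client class histograms, useful to inspect the resulting skew.
histograms = [Counter(labels[i] for i in part) for part in high_het]
```

Printing `histograms` for both values of α gives a quick view of how unevenly the classes are spread across the four clients in each setting.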