ML-TN-007 — AI at the edge: exploring Federated Learning solutions


Applies to: Machine Learning


History

Version | Date        | Notes
1.0.0   | August 2023 | First public release

Introduction

According to Wikipedia, Federated Learning (FL) is defined as a machine learning technique that trains an algorithm via multiple independent sessions, each using its own dataset. This approach stands in contrast to traditional centralized machine learning techniques where local datasets are merged into one training session, as well as to approaches that assume that local data samples are identically distributed.

Federated learning enables multiple actors to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data. Its applications engage industries including defense, telecommunications, Internet of Things, and pharmaceuticals. A major open question is when/whether federated learning is preferable to pooled data learning. Another open question concerns the trustworthiness of the devices and the impact of malicious actors on the learned model.

In principle, FL can be an extremely useful technique to address critical issues of industrial IoT (IIoT) applications. As such, it is a perfect match for DAVE Embedded Systems' IIoT platform, ToloMEO. This Technical Note (TN) illustrates how DAVE Embedded Systems explored, tested, and characterized some of the most promising open-source FL frameworks available to date. One of these frameworks might equip ToloMEO-compliant products in the future, allowing our customers to implement federated learning systems easily. From the machine learning point of view, therefore, we investigated whether typical embedded architectures used today for industrial applications are suited to act not only as inference platforms (an issue we already dealt with here) but as training platforms as well.

In brief, the work consisted of the following steps:

  • Selecting the FL frameworks to test.
  • Testing the selected frameworks.
  • Comparing the results to isolate the best framework.
  • Deep investigation of the best framework.

A detailed dissertation of the work that led to this Technical Note is available here TBD [1].

Choosing Federated learning frameworks

Criteria and initial, long list

For selecting the frameworks, several factors were taken into account:

  • ML frameworks flexibility: The adaptability of the framework to manage different ML frameworks.
  • Licensing: It is mandatory that the framework has an open-source, permissive license to cope with the typical requirements of real-world use cases.
  • Repository rating and releases: Rating in a repository is important for a FL framework as it indicates a high level of community interest and support, potentially leading to more contributions and improvements. Meanwhile, the first and latest releases indicate respectively the maturity and the support of the framework and whether it is released or still in a beta version.
  • Documentation and tutorials: The provided documentation with related tutorials has to be complete and well-made.
  • Readiness for commercial usage: The readiness of the framework to be used in a real-world scenario. Readiness was assessed by checking each framework's version and license.

According to the previous criteria, an initial list of the most promising FL frameworks was compiled. It comprised the following products: NVFlare, FATE, Flower, PySyft, IBM FL, OpenFL, FedML, and TFF (TensorFlow Federated).

The limitation to only eight FL frameworks arises from the evolving nature of the field. As a relatively recent and rapidly evolving technique, FL continues to witness the emergence of various frameworks, each with its unique features and capabilities. In this context, the choice to focus on these frameworks reflects the attempt to capture the current state of the art and provide an analysis of the most prominent and well-established options available. This selection aims to offer valuable insights into the leading frameworks that are currently considered among the best choices in the evolving landscape of FL.

In the next sections, the aforementioned factors will be treated individually, explaining why some frameworks were discarded and others retained.

ML frameworks flexibility

Flexibility in ML frameworks is crucial when choosing a FL framework, as it allows adapting the system to diverse use cases and data distributions. A flexible framework can support various ML algorithms, models, and data types, accommodating the specific requirements of different scenarios. This adaptability enhances the framework's applicability in real-world deployments, ensuring it can effectively handle the heterogeneity and dynamic nature of distributed data sources across clients.

It is noteworthy to mention that most of the frameworks discussed earlier, including NVFlare, FATE, Flower, PySyft, IBM FL, OpenFL, and FedML, are designed to be agnostic to the underlying ML framework used. This agnosticism provides users with the flexibility to employ various ML libraries, such as TensorFlow, PyTorch, and scikit-learn, among others, based on their preferences and project requirements.

However, it is important to highlight that one exception to this trend is TFF. Unlike the other frameworks, TFF is specifically tailored for the TensorFlow ecosystem. While it harnesses the powerful capabilities of TensorFlow, it inherently limits the utilisation of other ML libraries in the FL context. As a result, users opting for TFF should be aware of this framework's dependency on TensorFlow for their FL endeavours.

For that reason, TFF was discarded from the potential frameworks to be considered for comparison.

Licensing

The choice of a suitable license is of paramount importance for any FL framework [19]. A well-crafted license provides a legal foundation that governs the usage, distribution, and modification of the framework's source code and associated components.

A permissive license, like the MIT License or Apache License, allows users to use, modify, and distribute the framework with relatively few restrictions. This encourages widespread adoption, fosters innovation, and facilitates contributions from a broader community of developers and researchers. The permissiveness of these licenses empowers users to incorporate the framework into their projects, even if they have proprietary components.

On the other hand, copyleft licenses, like the GNU GPL, require derived works to be distributed under the same terms, ensuring that any modifications or extensions to the framework remain open-source. While this may be more restrictive, it encourages a collaborative ecosystem where improvements are shared back with the community.

A clear and well-defined license also provides legal protection to both developers and users, helping to mitigate potential legal risks and disputes. It ensures that contributors have granted appropriate rights to their work and helps maintain a healthy and sustainable development environment.

Most of the frameworks previously described are under the Apache-2.0 license except one: IBMFL. In fact, it is under an unspecified license that makes the framework not suitable for commercial use. For that reason, IBMFL was discarded from the comparison too.

Repository stars and releases

Stars in a GitHub repository are important because they serve as a measure of popularity and community interest in the project. When a repository receives more stars, it indicates that more developers and users find the project valuable and relevant. This can lead to several benefits:

  • Visibility: Repositories with more stars are likely to appear higher in GitHub search results, making it easier for others to discover and use the project.
  • Credibility: High-starred repositories are often perceived as more trustworthy and reliable, as they are vetted and endorsed by a larger user base.
  • Contributions: Popular repositories tend to attract more contributions from developers, leading to a more active and vibrant community around the project.
  • Feedback: Projects with many stars are more likely to receive feedback, bug reports, and feature requests, helping the developers improve the software.
  • Maintenance: Higher star counts can also stimulate the maintainers to keep the project updated and actively supported.

Other important aspects, related to the stars obtained by a framework, are its first and latest releases. These indicate, respectively, the maturity of the framework and how often it is updated, and thus the support behind it. Obviously, a framework that was born earlier than others is much more likely to have more stars. With this in mind, at the time of writing, the ranking in terms of stars received, together with the date of each framework's first release, is as follows:

  1. PySyft: 8.9k stars (first release Jan 19, 2020)
  2. FATE: 5.1k stars (Feb 18, 2019)
  3. FedML: 3.1k stars (Apr 30, 2022)
  4. Flower: 2.8k stars (Nov 11, 2020)
  5. TFF: 2.1k stars (Feb 20, 2019)
  6. OpenFL: 567 stars (Feb 1, 2021)
  7. IBMFL: 438 stars (Aug 28, 2020)
  8. NVFlare: 413 stars (Nov 23, 2021)

These characteristics, although they certainly have a bearing on the choice of frameworks, were not enough to justify discarding any of the selected frameworks.

Documentation and tutorials

High quality documentation and well-crafted tutorials are essential considerations when selecting a FL framework, for several reasons:

  • Accessibility and ease of use: Comprehensive documentation allows users to understand the framework's functionalities, APIs, and usage quickly. It enables developers, researchers, and practitioners to get started with the framework efficiently, reducing the learning curve.
  • Accelerated development: Well-structured tutorials and examples demonstrate how to use the framework to build practical FL systems. They provide step-by-step guidance on setting up experiments, running code, and interpreting results. This expedites the development process and encourages experimentation with different configurations.
  • Error prevention: Clear documentation and good examples help users avoid common mistakes and errors during implementation. They provide troubleshooting tips and address frequently asked questions, reducing frustration and increasing user satisfaction.
  • Reliability and robustness: A well-documented framework indicates that its developers have invested time in organising their code and explaining its functionalities. This attention to detail suggests a more reliable and stable framework.

Regarding these aspects, several frameworks still lack good documentation and tutorials. Among them are PySyft, OpenFL, and FedML. PySyft is still under construction, as the official repository states, and for that reason its documentation is often not up to date and incomplete. OpenFL, on its side, has very meager documentation and only a few tutorials, which do not cover many ML frameworks or scenarios. The FedML framework also has, like PySyft, incomplete documentation, because the project is very recent and still under development. Finally, the FATE framework has complete and well-made documentation but very few tutorials and, because of its complex architecture, evaluating it would have taken too much time. For these reasons, these four frameworks were discarded from the comparison.

Readiness for commercial usage

In the context of FL, the significance of a framework being ready for commercial use cannot be overstated. As businesses increasingly recognise the value of decentralised ML solutions, the demand for robust and production-ready frameworks has intensified.

A FL framework geared for commercial use offers several crucial advantages. Firstly, it provides a stable and scalable foundation to deploy large-scale FL systems across diverse devices and platforms. This ensures that businesses can seamlessly integrate the framework into their existing infrastructure, minimising disruption and optimising efficiency.

Moreover, a commercially viable framework emphasises security and privacy measures, a non-negotiable aspect when dealing with sensitive data across distributed environments. Advanced encryption techniques, secure communication protocols, and differential privacy methods guarantee that user data remains safeguarded, mitigating potential risks of data breaches or unauthorised access.

Of the frameworks covered, only a few are ready to be used commercially: Flower, NVFlare, FATE, OpenFL, and TFF. On the other hand, some frameworks are not yet ready, namely FedML, PySyft, and IBMFL. The first two are in fact still under development and not yet ready for commercial-level use, while the third has a license that does not allow the framework to be used for commercial-level applications. As explained in the previous sub-sections, these frameworks were already discarded from the comparison.

Final choice

At the beginning of this section, a total of eight frameworks were considered. Each framework was assessed based on various aspects and, after an in-depth analysis, six frameworks were deemed unsuitable because some requisites were not met.

The requirements that were met and not met by each framework are summarised in Table 3.1:

TBD

The two remaining frameworks are Flower and NVFlare. They demonstrated the potential to address the research objectives effectively and were well-aligned with the specific requirements of the FL project.

In the next section, these two selected frameworks will be rigorously compared, examining their capabilities in handling diverse ML models, supporting various communication protocols, and accommodating heterogeneous client configurations. The comparison will delve into the frameworks' performance, ease of integration, and potential for real-world deployment.

By focusing on these two frameworks, this research aims to provide a detailed evaluation that can serve as a valuable resource for practitioners and researchers seeking to implement FL in a variety of scenarios. The selected frameworks underwent comprehensive testing and analysis, enabling the following sections to present an informed and insightful comparison, shedding light on their respective strengths and limitations.

Frameworks in-depth comparison: Flower vs NVFlare

Introduction

To conduct an in-depth comparison of the two selected frameworks, a problem simulating a real-world scenario was chosen. Specifically, a classification problem was used. The choice was due to the following reasons:

  • Firstly, classification problems are a widely studied and well-understood domain in ML, making them suitable for benchmarking and evaluation purposes.
  • Secondly, classification tasks are versatile and can be applied to various real-world scenarios, such as image classification, natural language processing, and medical diagnosis, making the evaluation results applicable to a wide range of applications.
  • Thirdly, classification problems often involve complex data patterns and require efficient model training and optimization techniques, making them a suitable challenge for assessing the performance and scalability of FL frameworks.
  • Additionally, classification problems allow for the evaluation of key metrics like accuracy, precision, recall, and F1-score, providing a comprehensive assessment of a framework's capabilities in handling different aspects of ML tasks.

System design

This section describes the architectural aspects of the entire system used to test Flower and NVFlare.

Testing environments

Within the FL infrastructure, multiple parties are involved to create a collaborative environment for training ML models. These parties, including data providers (clients) and a model aggregator (server), play essential roles in the FL process. Data providers contribute their locally-held data, while the model aggregator facilitates the consolidation of the individual model updates from the different parties. The interactions between these parties enable the training of robust, privacy-preserving models, making FL an effective approach for decentralized data scenarios.

Two testing environments were used for testing the frameworks: the first one is denoted as local, while the other is called cloud.

  • Local environment: The local parties consist of a single desktop computer that acts both as the server and as four separate clients. This configuration mimics a decentralized environment where the desktop computer takes on the roles of multiple participants, simulating the interactions and data contributions of distinct entities. As the server, it coordinates and manages the FL process, while functioning as individual clients allows it to provide diverse data contributions for training. This localized approach allowed for the use of Docker as the development environment, leveraging the power of a desktop computer equipped with an RTX 3080 Ti GPU to enhance performance. The power of the NVidia GPU allowed the use of a more complex model and the simulation of four clients on the same machine that also acts as the server. Being self-contained in a single host, the local environment is convenient for testing, especially when the focus is on functional verification.
  • Cloud environment: In this case, cloud parties consist of two embedded devices or virtual machines acting as clients, and a notebook serving as the server. This configuration facilitates a distributed learning approach, enabling the clients to process their data locally while contributing to the model’s training. The server coordinates the learning process and aggregates the updates from the clients to improve the global model. This setup ensures a decentralized and privacy-preserving approach to ML, as the data remains on the clients’ devices, and only the model updates are shared during the training process. Leveraging embedded devices as clients enables the inclusion of resource-constrained devices in the FL ecosystem, making the framework more versatile and applicable to a wide range of scenarios. The notebook acting as the server provides a centralized point of coordination and ensures smooth communication and collaboration between the clients, making the FL process efficient and effective in leveraging distributed resources for improved model performance. Of course, this environment is more complicated to set up, but it better simulates real configurations.

ML framework

Another crucial factor in designing the testing setup is the ML framework to be used. To this end, PyTorch was selected as the primary ML framework. The flexibility of PyTorch allowed for the implementation of complex models and easy customization to meet specific project requirements. Also, the availability of pre-trained models and a vast collection of built-in functions expedited the development process and enabled focus on the core aspects of the project. Another pivotal factor is PyTorch's ability to leverage GPUs for hardware acceleration, which is crucial for training models on distributed data in FL environments. Its integration with CUDA and optimisation for GPU computing make it a pragmatic choice for applications requiring high performance. Lastly, PyTorch was chosen for its adaptability within the existing development environment, including its compatibility with Docker and with embedded devices based on the ARM64 (AArch64) architecture. This interoperability facilitated the integration of the framework into the research and development environment.

Data Preprocessing

"Data Preprocessing" in an important step for ensuring the success and effectiveness of the entire process. This crucial phase involves the choice of the dataset and the transformation and preparation of data before it is used for training the ML model on distributed devices. The "Data Preprocessing" stage plays a vital role too in harmonizing the data collected from different parties, which might have varying data distributions and formats. By applying standardized preprocessing techniques across the data from multiple clients, the potential bias and inconsistencies arising from diverse data sources can be mitigated, leading to a more accurate and robust global model.

The data preprocessing step includes dataset selection, dataset splitting, and data augmentation. For more details about these operations, please refer to [1].
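As an illustrative sketch only (the exact transforms and split used in this work are documented in [1]), a typical PyTorch preprocessing pipeline for CIFAR-10 split among a few simulated clients could look like this; N_CLIENTS, the batch size, and the normalization values are assumptions, not the project's actual settings:

import torch
import torchvision
import torchvision.transforms as transforms

# Convert images to tensors and normalize each RGB channel (illustrative values).
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Download CIFAR-10 and split the training set evenly among the clients,
# simulating the per-client datasets of the FL infrastructure.
N_CLIENTS = 2  # hypothetical: 2 clients cloud-side, 4 local-side
trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)
client_sets = torch.utils.data.random_split(
    trainset, [len(trainset) // N_CLIENTS] * N_CLIENTS)
trainloaders = [torch.utils.data.DataLoader(s, batch_size=32, shuffle=True)
                for s in client_sets]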

Model configuration

After data preprocessing, three fundamental aspects were addressed in order to obtain a FL system that converges efficiently to meaningful solutions. Indeed, a correct model configuration plays a crucial role in FL, as it encompasses the selection of the appropriate model architecture, optimization algorithm, and criterion.

Model architecture

The model architecture plays a crucial role in defining how the data flows through the network, the number and type of layers, the connections between neurons, and the activation functions applied at different stages.

Since the main purpose of this work is to compare two frameworks, the architecture was chosen so as to maximise the metrics measured at the end of training while limiting the model's resource demands and, consequently, the time required to complete the process.

Cloud environment

The model architecture chosen for the cloud part prioritises simplicity and compatibility with the resource-constrained CPUs of embedded devices and virtual machines. This decision was motivated by the need for a lightweight and efficient model that can be easily deployed and executed on the various devices participating in the FL system.

For that reason, the architecture chosen for the cloud part of this work is the model used in the "Training a classifier" tutorial available on the official PyTorch website [22] (Listing 4.2). This selection fits perfectly with the requirement for a lightweight model that can still deliver satisfactory results.

The model of choice is a sequential one, since such models are often suitable for and widely used in classification problems [61]. It is a type of NN architecture composed of a plain stack of layers, where each layer has exactly one input tensor and one output tensor. As shown in the sketch below, the model consists of several layers, starting with two convolutional layers (Conv2d) that process the input data, followed by max pooling layers (MaxPool2d) that reduce the spatial dimensions. Afterwards, there are three fully connected layers (Linear) responsible for the final classification. The model has a total of 62,006 parameters, which are the trainable weights and biases within the layers. These parameters are optimised during the training process to achieve accurate predictions. The architecture is suitable for tasks such as image classification and demonstrates a balance between depth and complexity, allowing for efficient training and satisfactory performance.
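For reference, here is a minimal sketch of this network, following the PyTorch "Training a classifier" tutorial [22] (the exact code used in this work is reported as Listing 4.2 in [1]); with these layer sizes the parameter count works out to the 62,006 mentioned above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    # CIFAR-10 classifier from the PyTorch "Training a classifier" tutorial.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)    # 3x32x32 input -> 6x28x28
        self.pool = nn.MaxPool2d(2, 2)     # halves the spatial dimensions
        self.conv2 = nn.Conv2d(6, 16, 5)   # 6x14x14 -> 16x10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)       # 10 CIFAR-10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)            # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = Net()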

Local environment

A more complex model architecture was chosen for the local environment to leverage an NVidia GPU for model training. This decision was driven by the aim to harness the computational power of the GPU and expedite the training process, ultimately leading to improved model performance and more efficient training. This led to better metric results compared to the cloud counterpart.

In this case, the architecture chosen is ResNet-18 [5], as shown in Listing 4.3. ResNet-18 is a highly effective DL model for image classification tasks, known for its ability to handle deeper architectures without sacrificing performance. It performs exceptionally well with the CIFAR-10 dataset and is computationally efficient.

In the context of this work, the choice was to use ResNet-18 with 'weights=DEFAULT', since this offers several advantages. The 'DEFAULT' mode simplifies integration into the FL system and allows for faster convergence and better generalisation through transfer learning with pre-trained weights. In practice, this means that the model was initially trained on a large dataset, typically for a different task (e.g., ImageNet classification), and its learned weights and parameters were saved. Instead of training the model from scratch on a new task, these pre-trained weights are used as an initialisation.

The layers can be summarised as follows:

1. Convolutional Layers (Conv2d): The initial layer of ResNet-18 consists of a 2D convolution operation that convolves the input image with a set of learnable filters. These filters help detect various low-level features, such as edges and corners, in the input image.

2. Batch Normalization Layers (BatchNorm2d): After each convolutional layer, batch normalization is applied, which normalizes the output of the previous layer to improve training stability and accelerate convergence.

3. Rectified Linear Unit Activation (ReLU): Following the batch normalization, a non-linear activation function called ReLU is applied element-wise to introduce non-linearity into the model and allow it to learn complex features.

4. Max Pooling Layers (MaxPool2d): After the initial convolutional block, max pooling layers are used to downsample the spatial dimensions of the feature maps, reducing computational complexity while retaining important information.

5. Basic Blocks: The ResNet-18 model utilizes a series of eight basic blocks (of which, for simplicity, only four are visible in Listing 4.3), each consisting of multiple convolutional and batch normalization layers. The basic blocks are designed to mitigate the vanishing gradient problem and allow the model to be deeper without performance degradation.

6. Adaptive Average Pooling (AdaptiveAvgPool2d): The adaptive average pooling layer aggregates the spatial dimensions of the feature maps into a fixed size, ensuring the model can handle input images of varying sizes and aspect ratios.

7. Fully Connected Layer (Linear): After the adaptive average pooling has converted the spatial dimensions of the feature maps into a fixed size, a fully connected layer is used to perform classification based on the learned features.

The ResNet-18 model used was also modified with a custom fully connected layer (Linear) in order to accommodate a different output classification task. The original ResNet-18 was designed for image classification with 1,000 output classes. In this modified version, however, the number of output classes was reduced to 10 to align with the specific classification problem at hand. By reducing the number of output classes, the model's architecture becomes more tailored to the target classification task, which, in turn, reduces the number of trainable parameters by about half a million.
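As a sketch of how such a model can be obtained with torchvision (the exact code is in Listing 4.3 of [1]), assuming a torchvision version whose weights API accepts the "DEFAULT" alias:

import torch.nn as nn
from torchvision import models

# Load ResNet-18 with its default pre-trained (ImageNet) weights for transfer
# learning, then replace the 1,000-class head with a 10-class one for CIFAR-10.
model = models.resnet18(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, 10)  # 512 inputs -> 10 outputs

Replacing the 1,000-way head (512 x 1,000 weights plus biases) with a 10-way one removes roughly 508,000 parameters, consistent with the "about half a million" reduction mentioned above.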

Optimizer

The optimiser chosen, presented in the introduction paragraph, is SGD, since it is a popular optimiser for classification problems due to its simplicity and effectiveness. It was tuned using two hyperparameters: momentum and learning rate.

Momentum is a technique used to accelerate the convergence of the optimisation process and improve its stability. It addresses the issue of slow convergence and oscillations in the loss function by introducing a "velocity" term that helps the optimiser navigate the optimisation landscape more efficiently. A value of 0.9 was chosen, meaning that the optimiser gives considerable weight to the past accumulated gradients, leading to smoother updates.

The learning rate is a hyperparameter that determines the step size at which the model updates its weights during the optimisation process. It controls how much the model adjusts its internal parameters in response to the error calculated during training. In this case, the chosen learning rate of 0.001 strikes a balance between making smaller, more precise steps towards the optimal solution and avoiding overshooting or oscillations during the optimisation process.
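In PyTorch this configuration translates directly into the optimiser below (model is the network defined in the previous listings):

import torch.optim as optim

# SGD with the hyperparameters discussed above: learning rate 0.001, momentum 0.9.
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)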

Criterion

Cross-entropy is commonly used in classification problems because it quantifies the difference between the predicted probabilities and the actual target labels, providing a measure of how well the model is performing in classifying the input data.

This can be expressed using the following formula:

CE = -\sum_{i=1}^{m} y_i \log(p_i)

where y is the target probability, p is the predicted probability, and m is the number of classes. In other words, it measures how "wrong" or "far away" the prediction is from the true distribution.

In the context of CIFAR-10, where there are ten classes (e.g., airplanes, cars, birds, etc.), the cross-entropy loss compares the predicted class probabilities with the true one-hot encoded labels for each input sample. It applies the logarithm to the probabilities and then sums up the negative log-likelihoods across all classes. The objective is to minimize this loss function during the training process, which effectively encourages the model to assign high probabilities to the correct class labels and low probabilities to the incorrect ones.

One of the reasons why cross-entropy loss is considered suitable for CIFAR-10 and classification tasks in general is its ability to handle multi-class scenarios efficiently. By transforming the model's output into probabilities through the softmax activation, it inherently captures the relationships between different classes, allowing for a more expressive representation of class likelihoods.
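A minimal usage sketch of PyTorch's cross-entropy criterion, which combines the softmax and the negative log-likelihood described above (the tensor shapes and label values are illustrative):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Example: raw (un-normalised) scores for a batch of 4 samples over the
# 10 CIFAR-10 classes, together with their integer class labels.
logits = torch.randn(4, 10)
labels = torch.tensor([3, 0, 7, 1])
loss = criterion(logits, labels)  # applies softmax + negative log-likelihood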

Client-side settings

On the client side, three tasks are performed: training, validation, and testing. Each task is executed by every client participating in the FL infrastructure. At the end of a cycle, the tasks are temporarily suspended so that the results accumulated up to that point can be sent to the server, which takes care of aggregating them. Once the aggregation is complete, each client starts its tasks again from the updated model produced by the server.

The model is trained for 3 epochs in the case of the cloud environment and for 4 epochs in the case of the local environment. In both cases, each task runs 1,250 steps per epoch.
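As an illustrative sketch only (not the actual client code of either Flower or NVFlare), a client-side local training pass can be summarised as follows, reusing the model, optimizer, criterion, and trainloaders defined in the previous listings:

def train_one_epoch(model, trainloader, optimizer, criterion, device="cpu"):
    # Run one local training epoch and return the average loss.
    model.train()
    model.to(device)
    running_loss = 0.0
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    return running_loss / len(trainloader)

# Hypothetical client cycle: 3 local epochs (cloud setup) before reporting
# the updated weights to the server.
for epoch in range(3):
    avg_loss = train_one_epoch(model, trainloaders[0], optimizer, criterion)
    print(f"epoch {epoch + 1}: loss = {avg_loss:.4f}")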

Aggregation algorithm

In this FL scenario, the Federated Averaging (FedAvg) [10] algorithm was employed as the aggregation method. FedAvg is a fundamental and widely adopted algorithm used to aggregate model updates from multiple clients (or participants) in a FL setting.

The primary objective of FedAvg is to allow collaborative model training while preserving data privacy. After local training, clients communicate their model updates to the server, where these updates are aggregated to create a global model. The global model is then sent back to the clients, which use it as the starting point for the next round of training. This iterative process continues until the global model converges to a satisfactory solution.

Listing 4.4 of [1] describes how FedAvg works in a FL infrastructure; a simplified sketch is shown below.

As the reader can see, the FedAvg algorithm works by averaging the model updates from individual clients, weighted by the proportion of data samples each client holds. This weighted average ensures that clients with larger datasets have a more significant influence on the global model, while maintaining fairness for clients with smaller datasets. By iteratively aggregating the updates and distributing the global model back to clients, FedAvg enables collaborative learning without sharing raw data.
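The following is a simplified, framework-agnostic sketch of the weighted averaging step; it illustrates the FedAvg idea rather than reproducing Listing 4.4:

import copy

def fedavg(client_states, client_num_samples):
    # Weighted average of the clients' state_dicts, with weights proportional
    # to the number of samples each client trained on. For simplicity, integer
    # buffers (e.g. BatchNorm counters) are averaged as floats as well.
    total = sum(client_num_samples)
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_num_samples)
        )
    return global_state

# Hypothetical round with two cloud clients holding 25,000 samples each:
# new_global = fedavg([model_a.state_dict(), model_b.state_dict()], [25000, 25000])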

Metrics

In order to make a good comparison, three of the most common and essential metrics were chosen to evaluate model performance and effectiveness. The chosen metrics are the following (a sketch of how they can be computed is shown after the list):

  • Loss: The loss function quantifies the dissimilarity between the predicted output of the model and the actual ground truth labels in the training data. It provides a measure of how well the model is performing during training. The goal is to minimize the loss function, as a lower loss indicates that the model is better aligned with the training data.

  • Accuracy: Accuracy is a fundamental metric used to assess the model's overall performance. It represents the proportion of correctly predicted samples to the total number of samples in the dataset. A higher accuracy indicates that the model is making accurate predictions, while a lower accuracy suggests that the model might need further improvements. Calculating the accuracy of individual clients in a FL classification problem is important to assess the performance of each client's local model. This helps in understanding how well each client is adapting to its local data distribution and making accurate predictions. The formula for the accuracy can be simply expressed as follows:

    Accuracy = (number of correct predictions) / (total number of predictions)

    It ranges between 0 and 1, where 1 indicates perfect accuracy (every sample correctly predicted) and 0 indicates that every prediction is wrong.

  • F1-score: The F1-score is a metric that combines both precision and recall to provide a balanced evaluation of the model's performance, especially when dealing with imbalanced datasets. Precision measures the ratio of correctly predicted positive samples to all predicted positive samples, while recall measures the ratio of correctly predicted positive samples to all actual positive samples. The F1-score is the harmonic mean of precision and recall, providing a single metric that considers both aspects.
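The following sketch shows one possible way to compute these metrics per client at evaluation time, assuming scikit-learn is available for accuracy and F1-score (they could equally be accumulated by hand):

import torch
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, dataloader, criterion, device="cpu"):
    # Return the average loss, the accuracy, and the macro F1-score.
    model.eval()
    model.to(device)
    preds, targets, total_loss = [], [], 0.0
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            total_loss += criterion(outputs, labels).item()
            preds += outputs.argmax(dim=1).cpu().tolist()
            targets += labels.cpu().tolist()
    return (total_loss / len(dataloader),
            accuracy_score(targets, preds),
            f1_score(targets, preds, average="macro"))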

Server-side settings

After the choice of the metrics to evaluate, the last thing to decide was the server settings. Two important parameters had to be set in this regard: the number of rounds and the number of clients participating in the FL infrastructure.

A round represents a communication cycle between clients and the central server in the FL training process. During each round, participating clients perform local training using their available local data. Subsequently, the updated model weights trained locally are sent to the central server or coordination node. Here, the weights are centrally aggregated to obtain an updated global model, which represents the combined knowledge of all participating clients. At this point, the round is concluded and the aggregated model is sent back to the clients, which will use this updated model to perform a new round.

In this case, the number of rounds chosen for the cloud part is 4, while the local part uses a total of 10 rounds.

On the server side, the second parameter to be chosen is the number of clients that will participate in the various rounds of FL. As also described in the Testing environments section, the number of cloud-side clients is 2, while on the local side it is 4.
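As an example of how these two parameters map onto one of the frameworks under test, the Flower server could be configured roughly as follows (a sketch assuming the Flower 1.x Python API and the cloud settings of 2 clients and 4 rounds; NVFlare expresses the same choices in its job configuration instead):

import flwr as fl

# FedAvg strategy that waits for the 2 cloud-side clients at every round.
strategy = fl.server.strategy.FedAvg(
    min_fit_clients=2,
    min_available_clients=2,
)

# Run 4 federated rounds, as chosen for the cloud environment.
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=4),
    strategy=strategy,
)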

Results

Flower running on SBC ORCA

# of cores | htop capture                  | Training log
1          | Flower 1-core htop MX8M+.png  | Flower log 1-core MX8M+.png
4          | Flower 4-cpu htop MX8M+.png   | Flower log 4-core MX8M+.png

Applying NVFlare to a real-world case

TBD

Conclusions and future work

TBD

One important issue that has not been addressed yet is the labeling of new samples. In other words, it was implicitly assumed that new samples collected by a device are somehow labelled prior to being used for training. This is a strong assumption, because it implies that some mechanism for labeling the newly collected samples is available; addressing this issue is left for future work.

References

  • [1]