Changes

ML-TN-007 — AI at the edge: exploring Federated Learning solutions

14,427 bytes added, 08:32, 28 September 2023

→‎Model configuration

==== Model configuration ====

After data processing, three fundamental aspects were treated in order to have a FL system that converges efficiently to meaningful solutions. Indeed, a correct model configuration plays a crucial role in FL as it encompasses the selection of the appropriate model architecture, optimization algorithm, and criterion.

===== Model architecture =====

The model architecture plays a crucial role in defining how the data flows

through the network, the number and type of layers, the connections between

neurons, and the activation functions applied at different stages.

Since the main purpose of this work is to compare two frameworks, the

choice of architecture was made so as to maximise as much as possible the

metrics chosen at the end of the training but limiting the resource demands

on the part of the model and thus also the time required to complete the

process.

====== Cloud environment ======

The model architecture chosen for the cloud part

prioritises simplicity and compatibility with the resource-constrained CPUs

of embedded devices and virtual machines. This decision was motivated by

the need for a lightweight and efficient model that can be easily deployed

and executed on various devices participating in the FL system.

For that reason, the architecture chosen for the cloud part of this work

is a model taken from the "Training a classifier" tutorial available on the

official PyTorch website [22] (Listing 4.2). This selection fits perfectly with

the request for a lightweight model that could still deliver satisfactory results.

The model of choice is a sequential one since they are often suitable and

widely used for classification problems [61]. It is a type of NN architecture

composed by a plain stack of layers where each layer has exactly one input

tensor and one output tensor. As shown in the figure, the model consists of

several layers, starting with two Convolutional layers (Conv2d) that process

the input data followed by MaxPooling layers (MaxPool2d) that reduce the

spatial dimensions. Afterwards, there are three fully connected layers (Lin-

ear) responsible for the final classification. The model has a total of 62,006

parameters, which are the trainable weights and biases within the layers.

These parameters are optimised during the training process to achieve accu-

rate predictions. The model architecture is suitable for tasks such as image

classification and demonstrates a balance between depth and complexity,

allowing for efficient training and satisfactory performance.

====== Local environment ======

A more complex model architecture was chosen for

the local environment to leverage the usage of an NVidia GPU for model

training. This decision was driven by the aim to harness the computational

power of the GPU and expedite the training process, ultimately leading to

improved model performance and more efficient training. This led to better

metrics results compared to the cloud counterpart.

In this case, the architecture that was chosen is indeed the ResNet-18

[5], as shown in Listing 4.3. ResNet-18 is a highly effective DL model for

image classification tasks, known for its ability to handle deeper architectures

without sacrificing performance. It performs exceptionally well with the

CIFAR-10 dataset and is computationally efficient.

In the context of this work, the choice was to use ResNet-18 in the

’weights=DEFAULT’ mode since it offers several advantages. The ’DE-

FAULT’ mode simplifies integration into the FL system and allows for faster

convergence and better generalisation through transfer learning with pre-

trained weights. In fact, this means that the model was initially trained on

a large dataset, typically for a different task (e.g., ImageNet classification),

and its learned weights and parameters were saved. Instead of training the

model from scratch on a new task, you start with these pre-trained weights

as an initialisation.

The layers can be summarised as follows:

1. Convolutional Layers (Conv2d): The initial layer of ResNet-18 consists

of a 2D convolution operation that convolves the input image with a set

of learnable filters. These filters help detect various low-level features,

such as edges and corners, in the input image.

2. Batch Normalization Layers (BatchNorm2d): After each convolutional

layer, batch normalization is applied, which normalizes the output of

the previous layer to improve training stability and accelerate conver-

gence.

3. Rectified Linear Unit Activation (ReLU): Following the batch nor-

malization, a non-linear activation function called ReLU is applied

element-wise to introduce non-linearity into the model and allow it

to learn complex features.

4. Max Pooling Layers (MaxPool2d): After the initial convolutional block,

max pooling layers are used to downsample the spatial dimensions of

the feature maps, reducing computational complexity while retaining

important information.

5. Basic Blocks: The ResNet-18 model utilizes a series of eight basic

blocks, of which for simplicity only four are visible visible in the code

above, each consisting of multiple convolutional and batch normaliza-

tion layers. The basic blocks are designed to mitigate the vanishing

gradient problem and allow the model to be deeper without perfor-

mance degradation.

6. Adaptive Average Pooling (AdaptiveAvgPool2d): The adaptive average

pooling layer aggregates the spatial dimensions of the feature maps into

a fixed size, ensuring the model can handle input images of varying sizes

and aspect ratios.

7. Fully Connected Layers (Linear): Towards the end of the model, adap-

tive average pooling is applied to convert the spatial dimensions of the

feature maps into a fixed size. Subsequently, fully connected layers are

used to perform classification based on the learned features.

The Resnet-18 model used was also modified with custom fully connected

layers (Linear) in order to accommodate a different output classification task.

In the original ResNet-18, the model was designed for image classification

with 1,000 output classes. However, in this modified version, the number of

output classes was reduced to just 10 to align with the specific classification

problem at hand. By reducing the number of output classes, the model’s

architecture becomes more tailored to the target classification task, which,

in turn, reduces the number of trainable parameters by about half million

parameters.

===== Optimizer =====

The optimiser chosen, presented in the introduction paragraph, is the SGD

since it is a popular optimiser for classification problems due to its simplicity

and effectiveness. It was tuned using some hyperparameters: momentum and

learning rate.

Momentum is a technique used to accelerate the convergence of the op-

timisation process and improve its stability. It addresses the issue of slow

convergence and oscillations in the loss function by introducing a "velocity"

term that helps the optimiser navigate the optimisation landscape more ef-

ficiently. The value of 0.9 that was chosen, meaning that the optimiser gives

more weight to the past accumulated gradients, leading to smoother updates.

Learning rate is a hyperparameter that determines the step size at which

the model updates its weights during the optimisation process. It controls

how much the model adjusts its internal parameters in response to the error

calculated during training. In this case the chosen learning rate of 0.001

strikes a balance between making smaller, more precise steps towards the

optimal solution and avoiding overshooting or oscillations during the opti-

misation process.

===== Criterion =====

Cross-entropy is commonly used in classification problems because it quan-

tifies the difference between the predicted probabilities and the actual target

labels, providing a measure of how well the model is performing in classifying

the input data.

This can be expressed using this formula:

where (y) is the target probability, (p) is the predicted probability, and

(m) is the number of classes. So that is how “wrong” or “far away” the

prediction is from the true distribution.

In the context of CIFAR-10, where there are ten classes (e.g., airplanes,

cars, birds, etc.), the Cross-Entropy loss compares the predicted class proba-

bilities with the true one-hot encoded labels for each input sample. It applies

the logarithm to the probabilities and then sums up the negative log like-

lihoods across all classes. The objective is to minimize this loss function

during the training process, which effectively encourages the model to as-

sign high probabilities to the correct class labels and low probabilities to the

incorrect ones.

One of the reasons why Cross-Entropy Loss is considered suitable for

CIFAR-10 and classification tasks, in general, is its ability to handle multi-

class scenarios efficiently. By transforming the model’s output into probabil-

ities through the softmax activation, it inherently captures the relationships

between different classes, allowing for a more expressive representation of

class likelihoods.

==== Client-side settings ====

On the client side, three important tasks including training, validation and

testing are being performed. Each task comes to be executed by every client

participating in the FL infrastructure. At the end of a cycle, tasks come to be

blocked temporarily so that the results accumulated up to that point are sent

to the server, which will take care of aggregating them. Once aggregated,

each client will start again with the tasks assigned to it from the data formed

by the server.

The model is trained for 3 epochs in the case of the cloud environment

and for 4 epochs in the case of the local environment. The total steps per

epoch in both cases is for each task of 1250 steps.

==== Aggregation algorithm ====

In this FL scenario, the Federated Averaging (FedAvg) [10] algorithm was

employed as the aggregation method. FedAvg is a fundamental and widely

adopted algorithm used to aggregate model updates from multiple clients

(or participants) in a FL setting.

The primary objective of FedAvg is to allow collaborative model training

while preserving data privacy. After local training, clients communicate their

model updates (gradients) to the server, where these updates are aggregated

to create a global model. The global model is then sent back to the clients

that, use it as the starting point for the next round of training. This iterative

process continues until the global model converges to a satisfactory solution.

The Listing 4.4 describes how FedAvg works in a FL infrastructure:

As reader can see, the FedAvg algorithm works by averaging the model

updates from individual clients, weighted by the proportion of data samples

each client holds. This weighted average ensures that clients with larger

datasets have a more significant influence on the global model, while main-

taining fairness for clients with smaller datasets. By iteratively aggregating

the updates and distributing the global model back to clients, FedAvg en-

ables collaborative learning without sharing raw data.

==== Metrics ====

In order to make a good comparison, three of the most common and essential

metrics were chosen to evaluate model performance and effectiveness.

The chosen metrics are the following:

• Loss: The loss function quantifies the dissimilarity between the pre-

dicted output of the model and the actual ground truth labels in the

training data. It provides a measure of how well the model is perform-

ing during training. The goal is to minimize the loss function, as a

lower loss indicates that the model is better aligned with the training

data.

• Accuracy: Accuracy is a fundamental metric used to assess the model’s

overall performance. It represents the proportion of correctly predicted

samples to the total number of samples in the dataset. A higher ac-

curacy indicates that the model is making accurate predictions, while

a lower accuracy suggests that the model might need further improve-

ments. Calculating the accuracy of individual clients in a FL classifi-

cation problem is important to assess the performance of each client’s

local model. This helps in understanding how well each client is adapt-

ing to its local data distribution and making accurate predictions.

The formula for the accuracy can be simply expressed as follows:

It ranges between 0 and 1, where 1 indicates perfect accuracy, meaning

correctly predicted sample, and 0 indicates it will always fail in the

prediction.

F1-score: The F1-score is a metric that combines both precision and

recall to provide a balanced evaluation of the model’s performance,

especially when dealing with imbalanced datasets. Precision measures

the ratio of correctly predicted positive samples to all predicted positive

samples, while recall measures the ratio of correctly predicted positive

samples to all actual positive samples. The F1-score is the harmonic

mean of precision and recall, providing a single metric that considers

both aspects.

==== Server-side settings ====

After the choice of the metrics to evaluate, the last thing to decide were the

server settings. In fact, two important parameters are missing in this regard:

the number of rounds and the number of clients that would have participated

in the FL infrastructure.

A round represents a communication cycle between clients and the cen-

tral server in the FL training process. During each round, participating

clients perform local training using their available local data. Subsequently,

the updated model weights trained locally are sent to the central server or

coordination node. Here, the weights are centrally aggregated to obtain an

updated global model, which represents the combined knowledge of all par-

ticipating clients. At this point, the round is concluded and the aggregate

model is sent back to the clients, who will use this updated model to perform

a new round.

In this case, the number of rounds chosen for the cloud part is equal to

4 meanwhile for the local part there is a total of 10 rounds.

On the server-side, the second parameter to be chosen is the number of

clients that will participate in the various rounds of FL. As also seen in the

subsection 4.1.1, in this case, the number of cloud-side clients is 2 while on

the local-side is equal to 4.

=== Results ===

|}

== ~~Deep investigation of~~ Applying NVFlare to a real-word case ==

TBD

U0001

Bureaucrats, dave_user, Administrators

4,650

edits

DAVE Developer's Wiki β

Changes

ML-TN-007 — AI at the edge: exploring Federated Learning solutions

DAVE Developer's Wiki ^β