
DAVE Developer's Wiki β


= Frameworks Comparison =
This article presents an in-depth comparison of two federated learning (FL) frameworks, Flower and NVFlare, shedding light on their respective strengths and limitations.

== In-depth comparison: Flower vs NVFlare ==
== Testing ==

=== Introduction ===
To conduct an in-depth comparison of the two selected frameworks, a problem simulating a real-world scenario was chosen. Specifically, a classification problem was used, for the following reasons:
* Classification problems are a widely studied and well-understood domain in ML, making them suitable for benchmarking and evaluation purposes.
* Classification tasks are versatile and can be applied to various real-world scenarios, such as image classification, natural language processing, and medical diagnosis, making the evaluation results applicable to a wide range of applications.
* Classification problems often involve complex data patterns and require efficient model training and optimization techniques, making them a suitable challenge for assessing the performance and scalability of FL frameworks.
Additionally, classification problems allow for the evaluation of key metrics such as accuracy, precision, recall, and F1-score, providing a comprehensive assessment of a framework's capabilities in handling different aspects of ML tasks.
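As a refresher on how the metrics mentioned above are defined, the following minimal, self-contained Python sketch computes accuracy, precision, recall, and F1-score for a binary classifier. It is illustrative only and is not taken from the test code described in this article; the labels are made up.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Illustrative ground-truth labels and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
```

Reporting all four metrics together, rather than accuracy alone, is what makes a classification benchmark informative when client data distributions are imbalanced, as is common in FL.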
=== Flower: system design ===

==== Testing environments ====
Within the FL infrastructure, multiple parties were involved to create a collaborative environment for training ML models. These parties, namely the data providers (clients) and the model aggregator (server), played essential roles in the FL process: the data providers contributed their locally-held data, while the aggregator consolidated the individual model updates coming from the different parties. The interactions between these parties enabled the training of robust, privacy-preserving models, making FL an effective approach for decentralized data scenarios. Two testing environments were used: the first one is denoted as "local", while the other is called "cloud".
* Local environment: a single desktop computer acts both as the server and as four separate clients. This configuration mimics a decentralized environment in which the desktop computer takes on the roles of multiple participants, simulating the interactions and data contributions of distinct entities. As the server, it coordinates and manages the FL process, while running the individual clients allows it to provide diverse data contributions for training. This localized approach allowed the use of Docker as the development environment, leveraging a desktop computer equipped with an NVIDIA RTX 3080 Ti GPU to enhance performance. The computing power provided by the GPU allowed the use of a more complex model and the simulation of four clients on the same machine that also acts as the server. Being self-contained in a single host, the local environment is convenient for testing, especially when the focus is on functional verification.
* Cloud environment: two embedded devices or virtual machines act as clients, and a notebook serves as the server.
This configuration facilitates a distributed learning approach: the clients process their data locally while contributing to the model's training, and the server coordinates the learning process and aggregates the updates from the clients to improve the global model. This setup ensures a decentralized and privacy-preserving approach to ML, as the data remains on the clients' devices and only the model updates are shared during training. Leveraging embedded devices as clients enables the inclusion of resource-constrained devices in the FL ecosystem, making the framework more versatile and applicable to a wide range of scenarios. The notebook acting as the server provides a centralized point of coordination and ensures smooth communication and collaboration between the clients, making the FL process efficient in leveraging distributed resources. Of course, this environment is more complicated to set up, but it better simulates real-world configurations.

==== ML framework ====
The next step was to decide on the ML framework to be used. To this end, [https://pytorch.org PyTorch] was selected as the primary ML framework. The flexibility of PyTorch allowed for the implementation of complex models and easy customisation to meet specific project requirements. The availability of pre-trained models and a vast collection of built-in functions expedited the development process and enabled focus on the core aspects of the project [21]. Another pivotal factor is PyTorch's ability to leverage GPUs for hardware acceleration, which is crucial for training models on distributed data in FL environments. Its integration with CUDA and optimisation for GPU computing make it a pragmatic choice for applications requiring high performance.
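The server-side consolidation of client updates described above is commonly implemented as federated averaging (FedAvg), a weighted mean of the clients' parameters. The sketch below is a minimal, framework-independent illustration of that step, not the actual implementation used in these tests (the aggregation algorithm used is covered in its own subsection); all names and numbers are made up.

```python
def fed_avg(client_updates):
    """Weighted average of client model parameters (FedAvg).

    client_updates: list of (weights, num_examples) pairs, where `weights`
    is a flat list of floats representing one client's model parameters.
    Each client's contribution is weighted by its number of training examples.
    """
    total_examples = sum(n for _, n in client_updates)
    num_params = len(client_updates[0][0])
    global_weights = [0.0] * num_params
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            global_weights[i] += w * n / total_examples
    return global_weights

# Four simulated clients (matching the local environment described above),
# each holding the same number of training examples in this toy case.
updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 100),
    ([1.0, 0.0], 100),
    ([3.0, 2.0], 100),
]
global_model = fed_avg(updates)
```

In a real deployment, a framework such as Flower or NVFlare performs this step on the server after each training round, weighting each client by its local dataset size so that clients with more data have proportionally more influence on the global model.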
Lastly, PyTorch was chosen for its adaptability within the existing development environment, including its compatibility with Docker and with embedded devices based on the ARM64 (AArch64) architecture, which was a key factor in this decision. This interoperability facilitated the integration of the framework into the research and development environment.

==== Data Preprocessing ====
The "Data Preprocessing" step [51] holds significant importance in ensuring the success and effectiveness of the entire process. This crucial phase involves the choice of the dataset and the transformation and preparation of data before it is used for training the ML model on distributed devices. The "Data Preprocessing" stage plays a vital role in harmonising the data collected from different parties, which might have varying data distributions and formats. By applying standardised preprocessing techniques across the data from multiple clients, the potential bias and inconsistencies arising from diverse data sources can be mitigated, leading to a more accurate and robust global model. The next subsections will take a closer look at the steps taken to complete this important aspect of the project.

TBD

==== Model configuration ====

==== Client-side settings ====

==== Aggregation algorithm ====

==== Metrics ====

==== Server-side settings ====

=== Results ===
{| class="wikitable"
|[[File:Flower log 4-core MX8M+.png|center|thumb|300x300px]]
|}
 
=== NVFlare ===
[https://developer.nvidia.com/flare NVFlare] (NVIDIA Federated Learning Application Runtime Environment) is NVIDIA's open-source, domain-agnostic SDK for federated learning.
 
TBD
 
== Comparing test results ==
TBD
== Deep investigation of NVFlare ==