Author

Tim Leonhardt

Date of Award

5-2025

Degree Type

Honors College Thesis

Academic Program

Computer Science BS

Department

Computing

First Advisor

Ahmed Sherif, Ph.D.

Advisor Department

Computing

Abstract

Federated Learning (FL) is a Machine Learning (ML) approach that decentralizes training across distributed devices, eliminating the need to centralize data. Unlike traditional ML, where models are trained on aggregated data, FL sends a global model to multiple nodes for local training, with the updated parameters transmitted back to the server for aggregation. This process preserves data privacy, making FL well suited to sensitive applications such as cybersecurity. However, FL introduces challenges such as data heterogeneity, communication overhead, and difficulty achieving model convergence, all of which can degrade performance.
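For intuition, the following is a minimal NumPy sketch of the round-trip described above: the server broadcasts global parameters, each node runs a few local training steps (here, plain logistic-regression gradient descent on synthetic data), and the server aggregates the returned parameters with a FedAvg-style weighted average. The model, data, and all names are illustrative assumptions, not the study's actual setup.

```python
import numpy as np

def local_train(global_weights, X, y, lr=0.1, epochs=5):
    """Hypothetical local update: a few gradient-descent steps on a
    logistic-regression objective, starting from the global weights."""
    w = global_weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid predictions
        grad = X.T @ (preds - y) / len(y)      # gradient of the log loss
        w -= lr * grad
    return w                                    # only parameters leave the node

def federated_round(global_weights, client_data):
    """One FL round: broadcast, local training, FedAvg aggregation."""
    updates, sizes = [], []
    for X, y in client_data:                    # each node trains on its own shard
        updates.append(local_train(global_weights, X, y))
        sizes.append(len(y))
    # Weighted average of client parameters; raw data never reaches the server.
    return np.average(updates, axis=0, weights=sizes)

# Synthetic stand-in for data already split across three nodes.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 8)), rng.integers(0, 2, 50).astype(float))
           for _ in range(3)]

w = np.zeros(8)
for _ in range(10):                             # ten communication rounds
    w = federated_round(w, clients)
```

In a real deployment, the local step would be the actual model's training routine and the exchange would happen over a network; the sketch only shows the parameter flow that keeps raw data on each device.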

This study investigates a fundamental assumption in ML and FL research: that a model's superior performance in a centralized setting will carry over to an FL setup. Specifically, it examines whether models that excel in centralized ML, such as Neural Networks (NNs) known for handling unstructured data, perform equally well in FL environments where data is split across nodes, each holding only a limited dataset. This constraint can limit a model's ability to capture intricate patterns, potentially leading to suboptimal performance once parameters are aggregated. Given the critical role of cybersecurity, particularly with the rise of autonomous and connected vehicles, it is essential to understand how FL can preserve privacy without sacrificing model performance.

The Car Hacking: Attack & Defense Challenge 2020 Dataset, which contains CAN bus traffic from a Hyundai Avante CN7 with both normal and attack messages, was used to simulate a real-world cybersecurity environment. The task was framed as binary classification (normal vs. attack) to evaluate how effectively FL models detect intrusions under distributed data. Preprocessing consisted of label encoding the categorical fields and collapsing the multi-class attack labels into this binary target, as sketched below.
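As a rough illustration of that preprocessing, not the study's actual pipeline, the sketch below label-encodes categorical CAN fields and derives a binary attack label. The file name and column names (Arbitration_ID, DLC, Data, Class) are assumptions about the CSV schema.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical file and column names; the real Car Hacking 2020 schema may differ.
df = pd.read_csv("car_hacking_2020.csv")

# Label-encode categorical CAN fields (e.g., hexadecimal arbitration IDs).
for col in ["Arbitration_ID", "Data"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Collapse the multi-class attack labels into binary detection:
# 0 = normal traffic, 1 = any attack message.
df["attack"] = (df["Class"] != "Normal").astype(int)

X = df[["Arbitration_ID", "DLC", "Data"]]   # illustrative feature subset
y = df["attack"]
```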

The results reveal that strong centralized ML performance does not always translate to FL. While Naive Bayes excelled in the centralized setting, XGBoost performed better in FL, highlighting the need for tailored model selection in distributed environments with limited local training data.

These findings underscore the need to tailor model selection in FL to the constraints of distributed systems. Future research should examine the effects of larger client counts and additional training rounds to further characterize how centralized performance maps to FL. This study offers insights into developing FL-based Intrusion Detection Systems (IDS), enhancing privacy and adaptability in cybersecurity for connected and autonomous vehicles.
