TY - GEN
T1 - Enhancing Flight Delay Prediction with Network-Aware Ensemble Learning
AU - Afrane, Mary Dufie
AU - Xu, Yao
AU - Li, Lixin
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2025/10/31
Y1 - 2025/10/31
N2 - This study presents a comprehensive framework for predicting departure delays in U.S. domestic aviation by integrating advanced feature engineering, network analysis, and ensemble learning methods. Using a dataset of 2,638,673 flights across 354 airports from May to August 2024, we engineered predictors using temporal features (cyclical time), operational metrics (airport congestion), and network characteristics (in-/out-degree centrality and cluster labels). We extracted data for the five airlines with the highest number of flights: Southwest (WN), American (AA), Delta (DL), United (UA), and SkyWest (OO). A novel greedy mutual information and correlation-based feature selection method was then applied to each dataset to improve prediction performance. Multiple classifiers, including Random Forest (RF), Extra Trees (ET), XGBoost, and LightGBM, were evaluated. RF and ET consistently outperformed the others, motivating their inclusion in a Voting ensemble. The Voting classifier achieved robust performance across all five airlines, with overall accuracy ranging from 88.9% to 91.8%, F1–scores between 88.5% and 91.4%, and AUC–ROC values all above 95%. DL yielded the highest performance (91.8% accuracy and 96.8% AUC–ROC). These results demonstrate that combining network–cluster information with rich historical features substantially improves delay prediction, providing a scalable approach for airlines and air traffic managers to mitigate operational disruptions.
KW - Ensemble learning
KW - Feature selection
KW - Flight delay prediction
KW - Network clustering
KW - Voting classifier
UR - https://www.scopus.com/pages/publications/105021801904
U2 - 10.1007/978-3-032-06744-9_9
DO - 10.1007/978-3-032-06744-9_9
M3 - Conference article
AN - SCOPUS:105021801904
SN - 9783032067432
T3 - Lecture Notes in Computer Science
SP - 109
EP - 121
BT - Database Engineered Applications - 29th International Symposium, IDEAS 2025, Proceedings
A2 - Bergami, Giacomo
A2 - Ezhilchelvan, Paul
A2 - Manolopoulos, Yannis
A2 - Ilarri, Sergio
A2 - Bernardino, Jorge
A2 - Leung, Carson K.
A2 - Revesz, Peter Z.
PB - Springer Science and Business Media Deutschland GmbH
T2 - 29th International Database Engineered Applications Symposium, IDEAS 2025
Y2 - 14 July 2025 through 16 July 2025
ER -