Abstract
The need for data integration is becoming ubiquitous across many disciplines, owing to technological advances in instrumentation. Combining information from distinct data sources in a single model, so as to improve prediction accuracy and gain a holistic view of the problem, remains a challenge for statisticians. In this paper, we present a flexible statistical framework for integrating various types of data from distinct sources through model-based boosting (IMBoost), with two types of base models: regression trees and penalized splines. The performance of IMBoost is illustrated through two recent studies in environmental soil science, in which multiple sensors were used to quantify several soil parameters. The empirical results are promising and show that the proposed algorithms substantially improve prediction performance by combining the strengths of distinct data sources. We also propose a surrogate model approach, which allows IMBoost to handle situations in which some samples are missing from distinct sources.
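To make the general idea of source-wise model-based boosting concrete, the following is a minimal sketch of generic componentwise L2 boosting with one base learner per data source. It is not the IMBoost algorithm from the paper, whose details are not given in this abstract. The function names (`fit_stump`, `stump_predict`, `boost`), the source labels `sensor_a` and `sensor_b`, the shrinkage `nu`, and the synthetic data are all assumptions made for illustration. Depth-1 regression trees (stumps) stand in for the regression-tree base models; penalized splines and the surrogate model for missing samples are not covered.

```python
import random

def fit_stump(X, r):
    """Least-squares depth-1 regression tree fitted to residuals r,
    using only the features of one data source (rows of X)."""
    n = len(X)
    mean = sum(r) / n
    # Fallback: a constant fit, kept if no feature admits a useful split.
    best = (sum((v - mean) ** 2 for v in r), 0, float("inf"), mean, mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between neighbours
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best  # (sse, feature index, threshold, left mean, right mean)

def stump_predict(stump, row):
    _, j, t, lm, rm = stump
    return lm if row[j] <= t else rm

def boost(sources, y, n_rounds=100, nu=0.1):
    """Componentwise L2 boosting: each round fits one stump per source
    to the current residuals and keeps only the best-fitting one."""
    n = len(y)
    offset = sum(y) / n
    pred = [offset] * n
    ensemble = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        fits = {name: fit_stump(X, resid) for name, X in sources.items()}
        name = min(fits, key=lambda k: fits[k][0])  # lowest residual SSE
        stump = fits[name]
        pred = [p + nu * stump_predict(stump, row)
                for p, row in zip(pred, sources[name])]
        ensemble.append((name, stump))
    return offset, ensemble, pred

# Synthetic illustration (assumed data): two hypothetical sensors, one
# feature each, measured on the same 60 samples.
random.seed(0)
n = 60
sources = {
    "sensor_a": [[random.random()] for _ in range(n)],
    "sensor_b": [[random.random()] for _ in range(n)],
}
y = [2.0 * (a[0] > 0.5) + 1.0 * (b[0] > 0.5)
     for a, b in zip(sources["sensor_a"], sources["sensor_b"])]

offset, ensemble, fitted = boost(sources, y)
mse_before = sum((yi - offset) ** 2 for yi in y) / n
mse_after = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
used = {name for name, _ in ensemble}
print(mse_before, mse_after, sorted(used))
```

Because each round selects a single source, the sequence of selected names in `ensemble` gives a rough indication of how much each source contributes to the fit. That selection step is the part of the sketch that corresponds to combining strengths across sources.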
Original language | English |
---|---|
Article number | 400 |
Journal | SN Computer Science |
Volume | 2 |
Issue number | 5 |
State | Published - Sep 2021 |
Keywords
- Boosting
- Data integration
- Missing data
- Penalized splines
- Regression tree