Insurance Claim Analysis Using Extreme Gradient Boosting Trees: A Machine Learning Approach
Abstract/Overview
The emergence of big data has revolutionized the way insurance companies handle the data
they receive in the course of their business. Big data involves huge volumes of data of different
varieties. As a result, the methods currently used for analysis in the insurance sector, such as
statistical methods and actuarial formulas, are becoming inadequate for addressing the problems
and opportunities that emerge from advances in technology. Moreover, the data may be prone to missing values.
The Extreme Gradient Boosting algorithm (XGBoost) is an ensemble learning method that has
the capacity to effectively address these two characteristics of the data: large, heterogeneous
volumes and missing values. This research utilized the XGBoost algorithm to process insurance
claim data in order to model claim frequency and claim severity for claim prediction. XGBoost
creates tree-based models by iteratively fitting decision trees to the residuals of the previous
predictions, effectively reducing the error in each iteration. Using the algorithm, we aim to
enhance the accuracy of predictions and so obtain better estimates for improved risk assessment
and pricing of insurance products within the insurance sector. The XGBoost algorithm models
were evaluated using Root Mean Squared Error
(RMSE), Mean Absolute Error (MAE) and R-squared (RSQ). Results showed that the XGBoost model
for claim frequency had an RMSE of 0.949, an MAE of 0.7741 and an RSQ of 0.781, while the
claim severity model had an RMSE of 899.12, an MAE of 736.77 and an RSQ of 0.9625. We also
compared the performance of the XGBoost models with a zero-inflated Poisson model, multiple
linear regression and a generalized Pareto model. The XGBoost models had the best metrics
(RMSE, MAE and RSQ); we therefore concluded that Extreme Gradient Boosting was the optimal
model among those compared.
Keywords: Big data, Frequency, Severity, Machine learning, Gradient boosting, XGBoost
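
To make the residual-fitting idea and the three evaluation metrics concrete, the following is a minimal sketch and not the implementation used in this study. It is plain gradient boosting with squared-error loss and one-split trees (stumps) on a single feature. XGBoost itself adds regularization, second-order gradient information and handling of missing values, none of which is shown here. The data are synthetic, and all function names (fit_stump, predict_stump, boost, rmse, mae, r_squared) and parameter values are assumptions chosen for illustration.

```python
import numpy as np


def fit_stump(x, residuals):
    """Find the single threshold on x that minimizes squared error on the residuals.

    Assumes x has at least two distinct values.
    """
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    """Return the stump's leaf value for every element of x."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Start from the mean of y, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction          # what the current model still gets wrong
        stump = fit_stump(x, residuals)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    """R-squared: 1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic illustration only; these are not the insurance data analysed in the study.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = np.sin(x) + rng.normal(0.0, 0.1, 200)

base, stumps, prediction = boost(x, y)
baseline = np.full_like(y, y.mean())

print("baseline RMSE:", rmse(y, baseline))
print("boosted  RMSE:", rmse(y, prediction))
print("boosted  MAE: ", mae(y, prediction))
print("boosted  RSQ: ", r_squared(y, prediction))
```

The metrics here are computed on the training data, so they only show that each round lowers the training error. A proper model comparison would compute them on held-out data.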