The topic of this master thesis is the prediction of potential risk insurance contracts with respect to their unpaid payments. A custom machine learning framework proposed in the thesis is used for building and optimization of a classification model identifying potentially risky insurance contracts. The aim of the project is to create a functional, high quality and implementable model through which a more efficient reminder process can be achieved, thanks to its ability to detect contracts with ... show full abstractThe topic of this master thesis is the prediction of potential risk insurance contracts with respect to their unpaid payments. A custom machine learning framework proposed in the thesis is used for building and optimization of a classification model identifying potentially risky insurance contracts. The aim of the project is to create a functional, high quality and implementable model through which a more efficient reminder process can be achieved, thanks to its ability to detect contracts with a higher risk of defaulting on their obligations to the company. This tool provides the possibility of prioritizing contracts, making communication to clients more effective and providing early intervention, which can help achieve financial stability and improve service delivery. In this thesis, particular attention is paid to the description and analysis of the used predictive algorithms. The used models work based on different approaches and using different assumptions. The set of algorithms includes models assuming a linear relationship between the independent variables and the target variable, also models capturing non-linearity in the data, up to models based on the principle of decision trees. In the selection of evaluation metrics to assess the predictive ability of the models, the greatest weight is placed on the statistics of recall, F1-score, and model fitting time. Variable selection and hyperparameter tuning techniques are used in the optimization phase of the modelling. By evaluating the models, the best performing model is determined, which turns out to be the gradient boosting model. This model is tested on a new dataset to evaluate its true predictive ability in a real-world setting. The model evaluation is complemented by an assessment of the interpretability of the model. |