How do you find a Credit Card DEFAULTER?

Prinkesh Shotri
4 min read · Nov 29, 2021

A data-based approach using Kaggle data on credit card clients in Taiwan from April 2005 to September 2005.

Introduction

Suppose you are running a credit card company that is plagued by customers defaulting on their repayments. Across the world, financial institutions are reeling under a mountain of NPAs (Non-Performing Assets), with credit cards (a collateral-free form of credit) being a prime suspect.

Having been a banker for the past 11 years, I have faced a similar situation. On one hand, we had to increase credit disbursement by growing our customer base; on the other hand, we had to ensure that the new customers did not become NPAs.

Many banks and financial institutions have now started applying various data analysis techniques to better understand the characteristics of customers who default on their repayments. From my experience, I can say that defaults can be predicted with a reasonable level of accuracy.

I have used one such technique based on Machine Learning. Although what I present here is a very basic model, I feel that it still provides some good insights. For this model I’ve used a dataset from Kaggle, which contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

I will try to answer the following three questions.

· How does the probability of default relate to Sex/Education/Marital Status?

· Does the age of a customer play an important role?

· How accurately can we predict default, and what are the shortcomings?

Before starting our analysis, it helps to describe the important features in the dataset. The main features used in our model are coded into categories as follows:

· SEX: Gender (1=male, 2=female)

· EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)

· MARRIAGE: Marital status (1=married, 2=single, 3=others)

· default payment next month: Default payment (1=yes, 0=no)

(Note: default payment next month is our target variable in the dataset)
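The coding above can be turned into simple lookup tables. What follows is a minimal sketch, not the code behind this analysis: the mappings are copied from the list above, while the helper name decode_row, the sample record, and the "undocumented" fallback label are assumptions made purely for illustration.

```python
# Code-to-label mappings, copied from the feature list above.
SEX = {1: "male", 2: "female"}
EDUCATION = {
    1: "graduate school",
    2: "university",
    3: "high school",
    4: "others",
    5: "unknown",
    6: "unknown",
}
MARRIAGE = {1: "married", 2: "single", 3: "others"}
TARGET = "default payment next month"


def decode_row(row):
    """Return a copy of `row` with coded categories replaced by labels.

    Codes that are not in the list above are labelled "undocumented"
    (an assumption of this sketch, not something the dataset defines).
    """
    decoded = dict(row)
    for column, labels in (("SEX", SEX), ("EDUCATION", EDUCATION), ("MARRIAGE", MARRIAGE)):
        decoded[column] = labels.get(row[column], "undocumented")
    return decoded


# Hypothetical record, invented for illustration (not taken from the dataset).
sample = {"SEX": 2, "EDUCATION": 1, "MARRIAGE": 2, TARGET: 0}
decoded_sample = decode_row(sample)
print(decoded_sample)
```

Decoding the categories up front keeps later charts and tables readable, since they show labels such as "female" or "university" instead of bare numbers.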

So, let’s try answering our questions with the help of supporting graphs.

How does the Probability of default relate to Sex/Education/Marital Status?

The question I was most interested in:

“Does gender play any role in guessing who can potentially default on their repayment schedule?”

Although not much can be inferred from the graph above, we can still see that among the non-defaulting customers, females have been fairly regular with their repayments, i.e. female customers outnumbered male customers in making timely payments.
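Raw counts depend on how many customers each group contains, so a per-group default rate is a useful complement to the comparison above. Here is a minimal sketch of that calculation, assuming each record is a dictionary keyed by the column names listed earlier; the function name default_rate_by and the toy records are hypothetical and do not reflect the real data.

```python
from collections import defaultdict

TARGET = "default payment next month"


def default_rate_by(rows, column):
    """Share of defaulters (target == 1) within each value of `column`."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        defaults[row[column]] += row[TARGET]
    return {value: defaults[value] / totals[value] for value in totals}


# Toy records invented for illustration; they say nothing about the real dataset.
rows = [
    {"SEX": 1, TARGET: 1},
    {"SEX": 1, TARGET: 0},
    {"SEX": 2, TARGET: 0},
    {"SEX": 2, TARGET: 0},
    {"SEX": 2, TARGET: 1},
    {"SEX": 2, TARGET: 0},
]
rates = default_rate_by(rows, "SEX")
print(rates)
```

The same function works for EDUCATION or MARRIAGE by changing the column argument.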

Another question that comes to mind is:

“Does education play a prominent role in deciding the default rate?”

Sadly, the data is inconclusive as to who would default based on their education level. The same can be said about the marital status of our customers, as shown below:

Does Age help in making a decision?

So, we move on to our most important question:

“How big a role does age/age group play? It is commonly assumed that customers from certain age groups are less careful with their spending, either due to a lack of financial knowledge or due to lifestyle-related peer pressure. They may sometimes indulge in excessive spending that is more than they can afford.”

Let’s see if our data conforms to the above-mentioned characteristics.

Although the difference in creditworthiness between age groups is not large, our data suggests that creditworthiness increases with age: compared with customers in their 40s and above, customers in their 20s and 30s are more prone to default.
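Comparing age groups requires bucketing each customer's age first. The sketch below shows one way to do that, assuming the three buckets mentioned above (20s, 30s, 40s and above); the "under 20" bucket, the function name age_group, and the toy (age, defaulted) pairs are assumptions made for illustration only.

```python
from collections import Counter


def age_group(age):
    """Map an age in years to the buckets used in the discussion above."""
    if age < 20:
        return "under 20"  # assumed bucket, included only for completeness
    if age < 30:
        return "20s"
    if age < 40:
        return "30s"
    return "40s and above"


# Toy (age, defaulted) pairs invented for illustration; not real observations.
toy_customers = [(23, 1), (27, 0), (34, 0), (38, 1), (45, 0), (52, 0)]

totals = Counter()
defaults = Counter()
for age, defaulted in toy_customers:
    group = age_group(age)
    totals[group] += 1
    defaults[group] += defaulted

# Default rate per age group: defaulters divided by customers in that group.
rate_by_age_group = {group: defaults[group] / totals[group] for group in totals}
print(rate_by_age_group)
```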

How accurate is our model at predicting default in repayment?

We used a combination of all the above-mentioned features, along with some others that have a greater correlation with our target variable. Out of the many classification techniques available in Machine Learning, we used “Logistic Regression” as well as the “Random Forest” ensemble method to predict the probability of default in repayment. As our metrics show, we achieved an accuracy of around 82%.
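For readers who want to try the two classifiers named above, here is a minimal scikit-learn sketch. It is not the code behind the reported result: the data is synthetic (generated with make_classification using an arbitrary class imbalance), and the train/test split, scaling step, and hyperparameters are all assumptions. A majority-class baseline is included because accuracy is easiest to interpret next to what always predicting "no default" would score.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption: in practice
# the Kaggle data would be loaded here instead). The imbalance is arbitrary.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters below are illustrative choices, not the article's settings.
models = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy = share of test customers whose label was predicted correctly.
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")
```

Because the data here is synthetic, the printed numbers say nothing about the 82% figure; the sketch only shows the shape of the workflow.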

Conclusion

Although no individual feature is conclusive enough to decide the probability of default, we can reach a reasonable accuracy (i.e. 82%) in predicting defaults in repayment by combining various features. We used very basic machine learning methods to make the above predictions, and readers are encouraged to use more modern techniques (even Deep Learning methods) to achieve better accuracy.

For more on this analysis, see the link to my GitHub.

Written by Prinkesh Shotri

A bank manager by profession and a data science enthusiast by heart. Hoping to transition into the exciting field of Machine Learning and make a mark for myself
