Credit Risk — Application Scorecard Modeling Using R

Jaiprakash Prajapati
12 min read · Sep 29, 2020

Introduction

Lending agencies (banks, NBFCs, etc.) have gathered a great deal of data describing the default behaviour of their customers: for example, demographic data such as a borrower's date of birth, gender, income and employment status. On top of this, agencies have accumulated substantial business experience about their credit products.

By analysing these two sources of information, a credit risk application scoring model is built for future credit applications, helping decide which ones to accept and which to reject.

The analysis of credit risk and the decision to grant a loan is one of the most important operations for financial institutions. Using past history, a data-driven risk model is created that estimates the chance that a borrower will default on a loan. In banking terms, we need to build Probability of Default, Loss Given Default and Exposure at Default models as per the Basel norms.

What topics will we cover?

Advantages of Credit Scoring

  1. There is no longer a need to depend on the experience, instinct, or common sense of one or more business/domain specialists.
  2. Better speed and accuracy: decisions can be made faster than with the judgmental approach.
  3. A reduction in operating costs and bad debts, which in turn improves portfolio management.

Techniques to build Scorecards

Supervised classification techniques such as logistic regression and decision trees are well known in lending for building application and behavioural scorecards. We will not discuss these techniques in detail here.

Basics of Credit Risk Modelling

  1. Probability of Default (PD): The likelihood that a borrower will default on their loan, and the most important part of a credit risk model. For individuals, this is driven by factors such as their debt-to-income ratio and existing credit score.
  2. Loss Given Default (LGD): The amount of money a bank or other financial institution loses when a borrower defaults on a loan, expressed as a percentage of total exposure at the time of default.
  3. Exposure at Default (EAD): The total value a lender is exposed to when a loan defaults. Under the internal ratings-based (IRB) approach, financial institutions calculate their own risk parameters, and banks often use internal default models to estimate EAD.
  4. Information Value (IV): One of the most useful techniques for selecting important variables in a predictive model. It ranks variables by their importance. IV is calculated using the following formula:

IV = Σ (% of non-events - % of events) × WOE

IV Inference Table:
◾ IV < 0.02: not useful for prediction
◾ 0.02 to 0.1: weak predictive power
◾ 0.1 to 0.3: medium predictive power
◾ 0.3 to 0.5: strong predictive power
◾ IV > 0.5: suspicious, likely too good to be true

5. Weight of Evidence (WOE): The weight of evidence measures the predictive power of an independent variable in relation to the dependent variable. It is described as a measure of the separation of good and bad customers, where "bad customers" are those who defaulted on a loan and "good customers" are those who paid the loan back. WOE is calculated using the following formula:

WOE = ln(% of non-events / % of events)
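To make these two formulas concrete, here is a minimal, self-contained R sketch that computes WOE per bin and the total IV for a single binned variable. The bin labels and good/bad counts below are made up purely for illustration; they are not taken from the HMEQ data used later in this post.

# Hypothetical bin counts for one variable (illustration only)
bins <- data.frame(
  bin  = c("low", "medium", "high"),
  good = c(400, 350, 250),   # non-events (paid back)
  bad  = c(30, 60, 110)      # events (defaulted)
)
bins$pct_good <- bins$good / sum(bins$good)    # % of non-events per bin
bins$pct_bad  <- bins$bad / sum(bins$bad)      # % of events per bin
bins$woe <- log(bins$pct_good / bins$pct_bad)  # WOE = ln(% non-events / % events)
bins$iv  <- (bins$pct_good - bins$pct_bad) * bins$woe
sum(bins$iv)                                   # total IV for this variable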

Logistic Regression versus Decision Trees in Scorecard

Logistic regression is the most popular scorecard construction technique used in lending. Its key advantage over decision trees is that it provides a continuous range of scores (predicted probabilities between 0 and 1).

For decision trees, every leaf node corresponds to a particular score. Hence, only a limited set of score values is provided, which may not be sufficient to provide a fine, granular distinction between obligors in terms of default risk.
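To illustrate this point, the short sketch below compares the two on simulated data (simulated only so the snippet is self-contained; this is not the HMEQ data and not part of the scorecard workflow that follows): a small decision tree produces one distinct score per leaf, while logistic regression yields an essentially continuous score.

library(rpart)   # recursive partitioning trees, shipped with R
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))                # simulated default flag
tree_fit <- rpart(factor(y) ~ x, method = "class")
glm_fit  <- glm(y ~ x, family = binomial)
length(unique(predict(tree_fit, type = "prob")[, 2]))  # a handful of scores (one per leaf)
length(unique(predict(glm_fit, type = "response")))    # roughly one score per observation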

Other classification techniques to build scorecards are discriminant analysis, neural networks, and support vector machines, as well as ensemble methods such as bagging, boosting, and random forests. However, these techniques yield very complex models that are difficult to comprehend and thus not helpful for building credit scoring models, where model interpretability is a key concern.

Application Scorecard

Application scoring is the primary statistical credit scoring approach. Its purpose is to produce a credit score that reflects the default risk of a customer at the moment of loan application. This is a significant scoring instrument, as it enables the lender to decide whether a credit application should be accepted or rejected.

Let us look at the data considered in application scoring. This data is mostly taken from the loan application form, e.g. age, sex, marital status, income, time at residence, employment tenure, time in industry, postal code, geographical location, residential status, employment status, lifestyle code, existing customer (Y/N), number of years as a client, number of products held internally, total liabilities, total debt, total debt service ratio, gross debt service ratio, revolving debt/total debt, and number of credit cards. All of this information is internally available to the bank and can be supplemented with credit bureau data.

Developing Application Scorecard with Logistic Regression

Two methods are widely used to select variables for a scorecard:

  1. IV Value
  2. BIC Criterion

The Data

Our data set, hmeq.csv, comes from Credit Risk Analytics. It contains 5,960 observations and 13 features.

The data set HMEQ reports characteristics and delinquency information for home equity loans. A home equity loan is a loan where the obligor uses the equity of his or her home as the underlying collateral. The data set has the following characteristics:

◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio

In this blog we will focus on the IV value for variable selection.

Load the data

Let’s load the hmeq dataset into a data frame:

#=======================================
# Step 1: Data Pre-processing
#=======================================
# Loading required packages:
library(tidyverse)
library(magrittr)
library(psych)
library(caTools)
library(summarytools)
library(purrr)
library(stringr)
library(scorecard)   # iv(), woebin(), perf_eva(), scorecard() used later
data <- read.csv("hmeq.csv", stringsAsFactors = FALSE)

Exploration — getting a feel for our data

mydata <- round(summarytools::descr(data),2)
View(mydata)
Statistical Data Summary

Do we have missing data?

We will not go into the basics of how to "handle" missing data. In the summary above, if you look at the N.valid and pct.valid rows, you will notice that the variables CLAGE, CLNO, DEBTINC, DELINQ, DEROG, MORTDUE, NINQ, VALUE and YOJ have missing values.

# Function for detecting missing observations:
na_rate <- function(x) {x %>% is.na() %>% sum() / length(x)}
sapply(data, na_rate) %>% round(2) * 100
## Impute missing values
# Return the column names containing missing observations
list_na <- colnames(data)[apply(data, 2, anyNA)]
# Missing continuous variables will be replaced by the mean
average_missing <- apply(data[, colnames(data) %in% list_na], 2, mean,
                         na.rm = TRUE)
# Missing count variables will be replaced by the median
median_missing <- apply(data[, colnames(data) %in% list_na], 2, median,
                        na.rm = TRUE)
# Missing categorical variables will be replaced by the mode
calculate_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}
# Create a data frame with the mean, median & mode imputation
data_replace <- data %>% mutate(
  MORTDUE = ifelse(is.na(MORTDUE), average_missing["MORTDUE"], MORTDUE),
  VALUE   = ifelse(is.na(VALUE),   average_missing["VALUE"],   VALUE),
  CLAGE   = ifelse(is.na(CLAGE),   average_missing["CLAGE"],   CLAGE),
  DEBTINC = ifelse(is.na(DEBTINC), average_missing["DEBTINC"], DEBTINC),
  YOJ     = ifelse(is.na(YOJ),     average_missing["YOJ"],     YOJ),
  DEROG   = ifelse(is.na(DEROG),   median_missing["DEROG"],    DEROG),
  DELINQ  = ifelse(is.na(DELINQ),  median_missing["DELINQ"],   DELINQ),
  NINQ    = ifelse(is.na(NINQ),    median_missing["NINQ"],     NINQ),
  CLNO    = ifelse(is.na(CLNO),    median_missing["CLNO"],     CLNO),
  REASON  = ifelse(REASON == "", calculate_mode(REASON), REASON),
  JOB     = ifelse(JOB == "",    calculate_mode(JOB),    JOB)
)
# After treatment, check for missings:
sapply(data_replace, na_rate) %>% round(2) * 100
After Missing Treatment

Conversion of Data Type

We will convert all character-type data into categorical type (factor):

## Changing character columns to factor
data_replace <- data_replace %>%
  mutate_if(is.character, as.factor)

Splitting Datasets

While building a model we want to train and test it on separate data sets. To do this we split the data into two sets, one for training and the other for testing, and we need to do this before training the model. We will split the data in a 70:30 ratio.

## Setting the seed
set.seed(123)
## Use 70% data set for training model.
split <- sample.split(data_replace$BAD,SplitRatio = 0.7)
train = subset(data_replace, split==TRUE)
test = subset(data_replace, split==FALSE)
## Checking Proportions in Train and Test
prop.table(table(train$BAD)) * 100
prop.table(table(test$BAD)) * 100

Why variable selection?

1. To eliminate redundant predictors.
2. Unnecessary predictors add noise to the model.
3. Too many variables increase the chances of collinearity.
4. Accuracy: if the model is to be used for prediction, we want quick and accurate predictions.

Hence we can conclude that variable selection is intended to select the "best" subset of predictors. In this blog we will use the IV (Information Value) method for variable selection.

Variable selection is a means to an end and not an end itself. The aim is to construct a model that predicts well or explains the relationships in the data. Automatic variable selections are not guaranteed to be consistent with these goals.

Prior to variable selection, the following steps are a must, although we will not go into detail here:
1. Identify outliers and influential points; impute, replace or exclude them on a case-by-case basis.
2. Check whether any transformations of the variables seem appropriate.

## Calculate information values (IV):
iv_values <- iv(train, y = "BAD", positive = "BAD|1") %>%
  arrange(desc(info_value))
print(iv_values)
   variable info_value
1      LOAN 0.79509838
2    DELINQ 0.67942645
3     DEROG 0.40258122
4       YOJ 0.34991504
5      CLNO 0.24815360
6      NINQ 0.14831341
7   MORTDUE 0.09694989
8       JOB 0.08980137
9     VALUE 0.07341026
10    CLAGE 0.05762004
11  DEBTINC 0.02684238
12   REASON 0.01226045
If you refer to the IV inference table above, values between 0.1 and 0.5 indicate medium to strong predictors, while an IV below 0.1 indicates weak predictive power. We will only use variables with IV >= 0.1 to train our model, which leaves six variables for modelling (LOAN, DELINQ, DEROG, YOJ, CLNO, NINQ).

## Keep variable names with IV >= 0.1
variables_selected <- iv_values %>%
  filter(info_value >= 0.1) %>%
  pull(1)
print(variables_selected)
[1] "LOAN"   "DELINQ" "DEROG"  "YOJ"    "CLNO"   "NINQ"
## Create new data frames with only the selected variables:
train_iv <- train %>% select(all_of(variables_selected), "BAD")
test_iv <- test %>% select(all_of(variables_selected), "BAD")

Weight of Evidence (WOE) Binning

Binning is a categorization process that transforms a continuous variable into a small set of groups, or bins. It is widely used in credit scoring, and the resulting bins can also be used to compute Kolmogorov-Smirnov (KS) statistics and lift charts from scores.

Four binning algorithms are commonly used in machine learning: equal-width binning, equal-size binning, optimal binning and multi-interval discretization binning.

A good binning algorithm should follow the rules below:
1. Missing values are binned separately.
2. Each bin should contain at least 5% of the observations.
3. No bin should have zero good or bad accounts.

WOE is widely used in credit scoring to separate good accounts from bad accounts. It compares the proportion of good accounts to bad accounts at each attribute level, and measures the strength of an independent variable's attributes in separating good and bad accounts. It is therefore a quantitative method for combining evidence in support of a statistical hypothesis.

Here we will take an automated approach to WOE binning using the scorecard package:

## Creating optimal WOE bins on our data
bins_var <- woebin(train_iv,
                   y = "BAD",
                   no_cores = 4,
                   positive = "BAD|1")
## Create data frames of WOE-binned variables
train_iv_woe <- woebin_ply(train_iv, bins_var)
test_woe <- woebin_ply(test_iv, bins_var)
head(train_iv_woe)
BAD LOAN_woe DELINQ_woe DEROG_woe YOJ_woe CLNO_woe NINQ_woe
  1     1.25      -0.47     -0.27   -0.03    -0.15    -0.14
  1     1.25      -0.47     -0.27    0.26    -0.15    -0.14
  0     1.25      -0.47     -0.27    0.26    -0.15    -0.26
  1     1.25       1.70      1.84    0.26    -0.15    -0.14
  1     1.25      -0.47     -0.27   -0.03     0.76    -0.26
  1     1.25      -0.47     -0.27    0.09    -0.15    -0.26
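Before modelling, it is worth checking the resulting bins against the rules listed earlier (each bin large enough, no bins with zero goods or bads). A quick way to do this, assuming the scorecard package is attached and the objects created above:

## Inspect bin ranges, counts, bad rates and WOE for one variable
bins_var$LOAN
## Plot bin distributions and WOE for all selected variables
woebin_plot(bins_var)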

Model Using Logistic Regression

# Logistic Regression:
score_model <- glm(BAD ~ ., family = binomial, data = train_iv_woe)
# Show results:
summary(score_model)
Call:
glm(formula = BAD ~ ., family = binomial, data = train_iv_woe)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0960 -0.5746 -0.4327 -0.3245 2.4676
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.38625 0.04441 -31.217 < 2e-16 ***
LOAN_woe 1.07404 0.10816 9.930 < 2e-16 ***
DELINQ_woe 0.91438 0.05323 17.179 < 2e-16 ***
DEROG_woe 0.77030 0.06961 11.066 < 2e-16 ***
YOJ_woe 0.86897 0.16420 5.292 1.21e-07 ***
CLNO_woe 0.95275 0.15249 6.248 4.16e-10 ***
NINQ_woe 0.85697 0.11520 7.439 1.02e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4168.7 on 4171 degrees of freedom
Residual deviance: 3374.9 on 4165 degrees of freedom
AIC: 3388.9
Number of Fisher Scoring iterations: 5
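Earlier, the BIC criterion was mentioned as an alternative to IV for variable selection. Purely as an illustration (we keep the IV-selected model above), a BIC-penalised stepwise pass over the fitted model could look like the sketch below; using k = log(n) in step() gives the BIC penalty.

## Optional: BIC-based stepwise selection on the fitted model (illustration only)
bic_model <- step(score_model, direction = "both",
                  k = log(nrow(train_iv_woe)), trace = 0)
summary(bic_model)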

Calculate PD (Probability of default)

# Calculate PD on the test data
test_pred <- predict(score_model, test_woe, type = "response")
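Before formally evaluating the model, a quick sanity check on the distribution of the predicted PDs is useful; for example:

## Quick look at the predicted PDs on the test set
summary(test_pred)
hist(test_pred, breaks = 50,
     main = "Predicted PD - test data", xlab = "PD")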

Model Evaluation and Diagnostics

A logistic regression credit scoring model has been built and the coefficients have been examined. However, some critical questions remain. Is the model good? How well does the model fit the data? Are the predictions accurate? In the rest of this blog we will try to answer these questions with the help of R code.

The most commonly used evaluation criteria in credit scoring are:

Gain or lift -
A measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. In contrast to the confusion matrix, which evaluates the model on the whole population, a gain or lift chart evaluates model performance on a portion of the population.

K-S or Kolmogorov-Smirnov Chart -
A measure of the degree of separation between the positive and negative distributions. The K-S is 100 if the scores partition the population into two separate groups, one containing all the positives and the other all the negatives. If the model cannot differentiate between positives and negatives, it is as if it selects cases randomly from the population, and the K-S would be 0. In most classification models the K-S falls between 0 and 100; the higher the value, the better the model is at separating positive from negative cases.

Area under ROC curve -
The ROC is a probability curve and the AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. A random classifier has an AUC of 0.5, while a perfect classifier has an AUC of 1. In practice, most classification models have an AUC between 0.5 and 1. The ROC curve plots the TPR (y-axis) against the FPR (x-axis).

Precision and Recall -
Precision is the percentage of your results that are relevant, while recall is the percentage of the total relevant results correctly classified by your algorithm.
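As a sanity check on the K-S idea described above, the statistic can also be computed by hand from the predicted PDs (this assumes test_pred and test_iv from the earlier steps); it should agree closely with the KS value reported by perf_eva() below.

## Manual K-S: maximum gap between the score distributions of bads and goods
bad_pd  <- test_pred[test_iv$BAD == 1]
good_pd <- test_pred[test_iv$BAD == 0]
grid <- sort(unique(test_pred))
max(abs(ecdf(bad_pd)(grid) - ecdf(good_pd)(grid)))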

# Model performance on the test data:
perf_eva(test_pred,
         test_iv$BAD,
         binomial_metric = c("mse", "rmse", "logloss", "r2",
                             "ks", "auc", "gini"),
         confusion_matrix = TRUE,
         show_plot = c("ks", "lift", "roc", "pr"),
         positive = "bad|1",
         title = "Test Data")
[INFO] The threshold of confusion matrix is 0.2438.
$binomial_metric
$binomial_metric$`Test Data`
MSE RMSE LogLoss R2 KS AUC Gini
1: 0.126 0.3556 0.4064 0.2084 0.4437 0.7843 0.5687
$confusion_matrix
$confusion_matrix$`Test Data`
label GOOD BAD error
1: 0 1188 243 0.1698113
2: 1 141 216 0.3949580
3: total 1329 459 0.2147651
Model Performance

Generate a Scorecard from the Model

## Calculate scorecard scores
hmeq_Scorecard <- scorecard(bins_var,
                            score_model,
                            points0 = 600,
                            odds0 = 1/19,
                            pdo = 50,
                            basepoints_eq0 = TRUE)
Scorecard on Binned Variables

Consider a loan applicant with LOAN = 1300, DELINQ = 2, DEROG = 0, YOJ = 7, CLNO = 14, NINQ = 0.

He/she will be classified into the respective bins: LOAN = 1300 (less than 6000), DELINQ = 2 (1 to 2), DEROG = 0 (less than 1), YOJ = 7 (6 to 10), CLNO = 14 (9 to 27), NINQ = 0 (less than 1).

Score will be (-15) + (-31) + 96 + 100 + 92 + 97 = 339.
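In practice you would not add up the points by hand; the scorecard package can apply the generated card to raw applicant data directly. A minimal sketch, using a hypothetical one-row applicant data frame with the same fields as the worked example above:

## Score a new applicant with the generated card
new_applicant <- data.frame(LOAN = 1300, DELINQ = 2, DEROG = 0,
                            YOJ = 7, CLNO = 14, NINQ = 0)
scorecard_ply(new_applicant, hmeq_Scorecard)   # total score
scorecard_ply(new_applicant, hmeq_Scorecard,
              only_total_score = FALSE)        # points contributed per variable
## Scores for the whole test set:
test_scores <- scorecard_ply(test_iv, hmeq_Scorecard)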

This is how the scores are generated. The steps that follow score generation will be covered in the next blog.

End Notes

In this article we built a credit risk application scorecard using the IV variable-selection method. There are other techniques too, which we will discuss in upcoming blogs on this topic. I strongly believe that knowledge sharing is the ultimate form of learning.

References

1. http://www.m-hikari.com/ams/ams-2014/ams-65-68-2014/zengAMS65-68-2014.pdf

2. http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf

3. https://rpubs.com/chidungkt/442168

4. http://www.creditriskanalytics.net/
