
Predicting Customer Churn Using Logistic Regression

  • Writer: Andrew Cole
  • May 13, 2020
  • 9 min read

Part 2: Building the Model

In my previous post, we completed a fairly in-depth walkthrough of the exploratory data analysis process for a customer churn dataset. Our data, sourced from Kaggle, centers on customer churn: the rate at which customers leave the platform they currently pay for, in this case the telecommunications company Telco. Now that the EDA process is complete and we have a good sense of what our data tells us before processing, we can move on to building a Logistic Regression classification model that will allow us to predict whether a customer is at risk of churning from Telco’s platform.

The complete GitHub repository with notebooks and data walkthrough can be found here.



The Logistic Regression

Our problem boils down to a binary separation: we want to classify each observation as a customer who “will churn” or “won’t churn” from the platform. A logistic regression model estimates the probability of belonging to one group or the other. Logistic regression is essentially an extension of linear regression, except the predicted outcome is bounded between [0, 1]. The model identifies relationships between our target feature, Churn, and the remaining features, and applies probabilistic calculations to determine which class each customer most likely belongs to. We will be using the scikit-learn package in Python.
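
Under the hood, the model passes a linear combination of the features through the sigmoid (logistic) function, which squashes any real number into the (0, 1) interval so it can be read as a probability. A minimal sketch, purely for intuition and not part of the modeling code:

import numpy as np

def sigmoid(z):
    # z is the linear combination of feature values and coefficients (plus any intercept)
    return 1 / (1 + np.exp(-z))

# Large negative scores map near 0, large positive scores map near 1
print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0))  # ~0.12, 0.50, ~0.88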

Recapping the Data

As a reminder, our dataset has 7043 rows (each representing a unique customer) and 21 columns: a customer ID, 19 features, and 1 target feature (Churn). The data is composed of both numerical and categorical features, so we will need to address each datatype accordingly.

Target:

  1. Churn — Whether the customer churned or not (Yes, No)

Numeric Features:

  1. Tenure — Number of months the customer has been with the company

  2. MonthlyCharges — The monthly amount charged to the customer

  3. TotalCharges — The total amount charged to the customer

Categorical Features:

  1. CustomerID

  2. Gender — M/F

  3. SeniorCitizen — Whether the customer is a senior citizen or not (1, 0)

  4. Partner — Whether customer has a partner or not (Yes, No)

  5. Dependents — Whether customer has dependents or not (Yes, No)

  6. PhoneService — Whether the customer has a phone service or not (Yes, No)

  7. MultipleLines — Whether the customer has multiple lines or not (Yes, No, No Phone Service)

  8. InternetService — Customer’s internet service type (DSL, Fiber Optic, None)

  9. OnlineSecurity — Whether the customer has Online Security add-on (Yes, No, No Internet Service)

  10. OnlineBackup — Whether the customer has Online Backup add-on (Yes, No, No Internet Service)

  11. DeviceProtection — Whether the customer has Device Protection add-on (Yes, No, No Internet Service)

  12. TechSupport — Whether the customer has Tech Support add-on (Yes, No, No Internet Service)

  13. StreamingTV — Whether the customer has streaming TV or not (Yes, No, No Internet Service)

  14. StreamingMovies — Whether the customer has streaming movies or not (Yes, No, No Internet Service)

  15. Contract — Term of the customer’s contract (Monthly, 1-Year, 2-Year)

  16. PaperlessBilling — Whether the customer has paperless billing or not (Yes, No)

  17. PaymentMethod — The customer’s payment method (E-Check, Mailed Check, Bank Transfer (Auto), Credit Card (Auto))

Preprocessing Our Data For Modeling

We moved our data around a bit during the EDA process, but that pre-processing was mainly for ease of use and digestion, rather than functionality for a model. For the purposes of our Logistic Regression, we must pre-process our data in a different way, particularly to accommodate the categorical features which we have in our data. Let’s take a look at our data info one more time to get a sense of what we are working with.
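
If you are following along in a notebook, this summary comes from pandas’ DataFrame.info(), called on the DataFrame carried over from the EDA post (named df here):

# Print column names, non-null counts, and dtypes for every column
df.info()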



We do not have any missing data and our data types are in order. Note that the majority of our columns are of ‘object’ type, i.e., our categorical data. This will be our primary focus in the preprocessing step. At the top of the data we also see two unnecessary columns, ‘Unnamed: 0’ and ‘customerid’. Both are irrelevant to the model: the former carries no meaningful values and the latter is a unique identifier for each customer, which we do not want as a predictor. We remove these features from our DataFrame with a quick pandas slice:

# Drop the first two columns ('Unnamed: 0' and 'customerid') by position
df2 = df.iloc[:, 2:]

The next step is addressing our target variable, Churn. Currently the values of this feature are “Yes” and “No”. This is a binary outcome, which is what we want, but our model cannot meaningfully interpret these values in their current string form. Instead, we replace them with numeric binary values:

df2.churn.replace({"Yes":1, "No":0}, inplace = True)
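
As a quick sanity check that the conversion worked, and to see the class balance of our new 0/1 target, one small sketch:

# Proportion of non-churners (0) vs. churners (1)
print(df2.churn.value_counts(normalize = True))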

Up next, we must deal with our remaining categorical variables. A dummy variable is a way of incorporating nominal variables into a regression as binary values, allowing the model to interpret each level of a categorical variable as a high (1) or low (0) score. Because the variables are now numeric, the model can assess directionality and significance instead of trying to figure out what “Yes” or “No” means. Creating dummy variables adds new binary features with [0, 1] values that the model can interpret. Pandas has a simple function to perform this step.

dummy_df = pd.get_dummies(df2)

Note: It is very important to pay attention to the “drop_first” parameter when a categorical variable has more than two values. We cannot use every level of a categorical variable as its own feature, because the full set of dummy columns is redundant, raising the issue of multicollinearity (the model would place false significance on redundant information) and potentially breaking the model. We must leave one category out of each variable as a reference category.
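
The call above keeps every level of every categorical variable. If you would rather have pandas drop the reference categories for you, a minimal sketch (note that the resulting column set will differ slightly from what is shown below):

# Drop the first level of each categorical variable to avoid redundant columns
dummy_df = pd.get_dummies(df2, drop_first = True)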



Our new DataFrame features are above and now include dummy variables.

Splitting our Data

We must now separate our data into a target feature and predicting features.

# Establish target feature, churn
y = dummy_df.churn.values
# Drop the target feature from remaining features
X = dummy_df.drop('churn', axis = 1)
# Save dataframe column titles to list, we will need them in next step
cols = X.columns

Feature Scaling

Our data is almost fully pre-processed, but there is one more glaring issue to address: scaling. Our data is now entirely numeric, but the features are on very different scales. Comparing a binary value of 1 for ‘streamingtv_Yes’ with a continuous price value of ‘monthlycharges’ is not meaningful because they have different units, so the variables would not contribute equally to the model. To fix this problem, we rescale every variable to a common range. For our purposes we will use Min-Max scaling, which maps each feature to [0, 1] by computing (x - min) / (max - min), so every scaled value lies within the same range as our binary dummy features.

# Import the necessary sklearn method
from sklearn.preprocessing import MinMaxScaler
# Instantiate a Min-Max scaling object
mm = MinMaxScaler()
# Fit and transform our feature data into a pandas dataframe, restoring the column names saved earlier
X = pd.DataFrame(mm.fit_transform(X), columns = cols)

Train — Test — Split

We now conduct our standard train test split to separate our data into a training set and testing set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, random_state = 33)

Building the Model

Building the model can be done relatively quickly now, once we choose some parameters:

from sklearn.linear_model import LogisticRegression
# Instantiate a logistic regression model without an intercept; the arbitrarily large C value effectively turns off regularization
logreg = LogisticRegression(fit_intercept = False, C = 1e12, solver = 'liblinear')
# Fit the model to our X and y training sets
logreg.fit(X_train, y_train)

Now that our model is fit, we can generate predictions for both the training and test sets.

y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)
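
Because logistic regression is ultimately probabilistic, it can also be useful to inspect the raw predicted probabilities rather than the hard 0/1 labels; a quick sketch using scikit-learn’s predict_proba:

# Probability of churn (class 1) for each customer in the test set
churn_probabilities = logreg.predict_proba(X_test)[:, 1]
print(churn_probabilities[:5])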

At this point, our model is actually completely built even though we don’t see an output. Let’s take a look at evaluating our performance.

Evaluating Model Performance

How many times was the classifier correct on the training set?

Because we’re trying to predict whether a customer will leave or not, what better way to check our model performance than to see how often it was correct! To do so, we will take the residual distance between actual training data and predicted training data, as well as actual test data and predicted test data.

import numpy as np

# Find residual differences between actual train data and predicted train data
residuals = np.abs(y_train - y_hat_train)
# Print the number of times our model was correct ('0') and incorrect ('1')
print(pd.Series(residuals).value_counts())
# Print the normalized counts, i.e., the percentage of correct and incorrect predictions
print(pd.Series(residuals).value_counts(normalize = True))

Non-Normalized Train Results:

  1. Correct: 4270

  2. Incorrect: 1012

Normalized Train Results:

  1. Correct: .8084

  2. Incorrect: .1916

This is pretty good! On our first pass, an 80% correct rate is a strong number. Remember, 100% accuracy on the training data would actually be a red flag, suggesting the model is completely overfit to our data. Let’s check our test data (perform the same code block as above, using y_test and y_hat_test as the residual arguments).

Non-Normalized Test Results:

  1. Correct: 1409

  2. Incorrect: 352

Normalized Test Results

  1. Correct: .8001

  2. Incorrect: .1999

Again, positive results! Our test and train set sizes are different, so the normalized results are more meaningful here. The fact that our model performs about equally well on the train and test sets is a good sign that it is not overfitting.

Confusion Matrix

A confusion matrix is an extremely useful method of evaluating the performance of our classifier. It is a visual representation that breaks our predictions down into four important classification outcomes:

  1. True Positives (TP): The number of observations where the model predicted the customer would churn (1), and they actually do churn (1)

  2. True Negatives (TN): The number of observations where the model predicted the customer would not churn (0), and they actually do not churn (0).

  3. False Positives (FP): The number of observations where the model predicted the customer will churn (1), but in real life they do not churn (0).

  4. False Negatives (FN): The number of observations where the model predicted the customer will not churn (0), but in real life they do churn (1).

One axis of a confusion matrix represents the ground-truth values, while the other represents the predicted values. At this step it is very important to have business domain knowledge, because certain outcomes matter more for our use case than others. For example, if we were modeling whether a patient has a disease, a high number of false negatives would be much worse than a high number of false positives. Many false positives just means some patients undergo unnecessary testing and perhaps an annoying doctor visit or two. Many false negatives, however, means genuinely sick patients are diagnosed as healthy, with potentially dire consequences. For our churn problem, it is worse to predict that a customer will not churn when that customer actually churns, meaning our False Negatives deserve the most attention.

from sklearn.metrics import confusion_matrix
# Pass actual test and predicted target test outcomes to function
cnf_matrix = confusion_matrix(y_test, y_hat_test)
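
To actually see the matrix, here is a minimal plotting sketch using plain matplotlib (a seaborn heatmap works just as well):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.imshow(cnf_matrix, cmap = 'Blues')
# Annotate each cell with its count
for i in range(cnf_matrix.shape[0]):
    for j in range(cnf_matrix.shape[1]):
        ax.text(j, i, str(cnf_matrix[i, j]), ha = 'center', va = 'center')
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.show()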

Confusion Matrix:



We have 224 out of 1761 test observations as False Negatives. This is slightly more than we would like, but it is still a promising number. In order to derive real meaning from the confusion matrix, we must use these four counts to produce more descriptive metrics (a short worked sketch follows the list below):

1. Precision: How precise the predictions are

  1. Precision = TP/(TP + FP)

  2. “Out of all the times the model said the customer would churn, how many times did the customer actually churn?”

2. Recall: Indicates what percentage of the classes we’re interested in were actually captured by the model

  1. Recall = TP/(TP + FN)

  2. “Out of all customers that actually churned, what percentage did our model correctly identify as ‘going to churn’?”

3. Accuracy: Measures the total number of predictions a model gets right, including both true positives and true negatives

  1. Accuracy = (TP + TN)/(TP + FP + TN + FN)

  2. “Out of all predictions made, what percentage were correct?”

4. F1 Score: Harmonic Mean of Precision and Recall — a strong indicator of precision and recall (cannot have a high F1 score without strong model underneath)

  1. F1 = 2(Precision * Recall)/(Precision + Recall)

  2. Penalizes models heavily if they are skewed towards precision or recall

  3. Generally the most used metric for model performance
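
To make those formulas concrete, here is a minimal sketch computing each metric directly from the confusion matrix we built earlier (scikit-learn lays the counts out as [[TN, FP], [FN, TP]]):

# Unpack the confusion matrix counts
tn, fp, fn, tp = cnf_matrix.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * (precision * recall) / (precision + recall)
print(precision, recall, accuracy, f1)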

These metrics can be calculated by hand as above, but fortunately for us, sklearn has modules which will calculate them for us. All we have to do is pass the actual and predicted target arrays.

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
precision_train = precision_score(y_train, y_hat_train)
precision_test = precision_score(y_test, y_hat_test)
recall_train = recall_score(y_train, y_hat_train)
recall_test = recall_score(y_test, y_hat_test)
accuracy_train = accuracy_score(y_train, y_hat_train)
accuracy_test = accuracy_score(y_test, y_hat_test)
f1_train = f1_score(y_train, y_hat_train)
f1_test = f1_score(y_test, y_hat_test)
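
As a side note, scikit-learn’s classification_report prints precision, recall, F1, and support for each class in a single call, if you prefer one summary:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_hat_test))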

Precision:

  1. Train: 0.6615

  2. Test: 0.6666

Recall:

  1. Train: 0.5558

  2. Test: 0.5333

Accuracy:

  1. Train: 0.8084

  2. Test: 0.8001

F1 Score:

  1. Train: 0.6041

  2. Test: 0.5926

Our results are encouraging, yet not completely satisfying. Our recall and precision scores are a bit lower than we would like, but our accuracy is the strongest metric and a very good sign, especially on the first try. Remember, building models is an iterative process, so strong scores on the first go-around are encouraging!

ROC Curve & AUC

Another comprehensive way of evaluating our model performance and an alternative to confusion matrices is the AUC metric and an ROC-curve graph.

ROC Curve — Receiver Operator Characteristic Curve

This visual graph will illustrate the true positive rate (recall — TPR) against the false positive rate (FPR) of our classifier. Best performing models will have an ROC curve that hugs the upper left corner of the graph. This would represent that we correctly classify the positives much more often than we incorrectly classify them. The dotted blue line in the graph below represents a 1:1 linear relationship and is representative of a bad classifier, because the model guesses one incorrectly for every correct guess, making it no better than just flipping a coin!

AUC — Area Under Curve

The AUC gives us a single numeric metric to compare instead of a visual representation. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 represents a classifier that does no better than random guessing. This metric summarizes how well our classifier separates the two classes across all possible thresholds.

The code for obtaining these metrics follows:

from sklearn.metrics import roc_curve, auc
# Calculate the decision score of each point in the training set
y_train_score = logreg.decision_function(X_train)
# Calculate false positive rate (fpr), true positive rate (tpr), and thresholds for the train set
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)
# Calculate the decision score of each point in the test set
y_test_score = logreg.decision_function(X_test)
# Calculate fpr, tpr, and thresholds for the test set
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_score)
# Area under each curve
train_auc = auc(train_fpr, train_tpr)
test_auc = auc(test_fpr, test_tpr)

Code for plotting the ROC Curve:

import matplotlib.pyplot as plt

# Plot the training and test FPR vs. TPR
plt.plot(train_fpr, train_tpr, label = 'Train ROC Curve')
plt.plot(test_fpr, test_tpr, label = 'Test ROC Curve')
# Plot positively sloped 1:1 line for reference
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xticks([i/20.0 for i in range(21)])
plt.yticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

  1. Training AUC = 0.8517

  2. Testing AUC = 0.8388



These metrics look great! Notice how both our test and train curves hug the upper left corner and have very strong AUC values. With such strong models, we can now turn our eyes to tuning some model parameters/hyperparameters to slowly elevate our scores.

Conclusion

We built a pretty strong model for our first go-around. Building any machine learning model is an iterative process, and classification modeling itself offers several types of models. In following posts, I will explore other methods, such as Random Forest, Support Vector Machines, and XGBoost, to see if we can improve on this customer churn model!

 
 
 
