Customer Churn Prediction Case Study

End-to-End Machine Learning Project with Business Impact


Project Overview

Customer churn is one of the biggest challenges for subscription-based businesses. For telecom companies in particular, losing a customer often costs significantly more than retaining one.

In this case study, I built an end-to-end machine learning solution to predict customer churn and translate the results into actionable retention strategies. The focus was not just on model performance, but on interpretability, decision-making, and real business impact.


Problem Statement

A telecommunications company was experiencing increasing customer churn and needed data-driven insights to support its retention efforts.

The business wanted answers to three key questions:

  1. Which customers are likely to churn?

  2. What factors are driving churn?

  3. How can the retention team act on these insights to reduce customer loss?


Objective

The goal of this project was to:

  • Build a churn prediction model

  • Identify key churn drivers

  • Recommend practical, data-backed retention strategies

  • Design a solution that could realistically be used by business stakeholders


Dataset Description

The dataset contains 500 customer records with 19 features, covering customer demographics, billing, service usage, and support interactions.

Feature categories

Demographics

  • Age

  • Gender

Account and contract details

  • Tenure

  • Contract type

  • Payment method

Billing and usage

  • Monthly charges

  • Total charges

  • Internet service

  • Phone service

Customer experience

  • Support calls

  • Customer satisfaction score

Service add-ons

  • Streaming TV

  • Streaming movies

  • Online security

  • Tech support

Target variable

  • Churn (0 = active, 1 = churned)

Approach and Methodology

I approached the project in structured phases to mirror a real-world data science workflow.

1. Data understanding and cleaning

  • Inspected data types and distributions

  • Checked for missing values and inconsistencies

  • Ensured the dataset was suitable for modeling

      import pandas as pd

      # Load the dataset
      df = pd.read_csv("/kaggle/input/week-16-regression-3/customer_churn_prediction.csv")
      df.head()

      # Data overview: column types, non-null counts, and summary statistics
      df.info()
      df.describe()

      # Check for missing values
      df.isnull().sum()

2. Exploratory Data Analysis (EDA)

EDA was used to understand customer behavior and uncover churn patterns.

Key findings included:

  • Churn distribution is fairly balanced, with slightly more non-churned customers than churned ones, so the dataset can be used for classification without special handling for severe class imbalance.

  • Age shows a mild relationship with churn. Customers who churn tend to be slightly older on average, but the two groups overlap heavily, so age alone is not a strong predictor.

  • Monthly charges are higher for churned customers. Customers who left generally paid more per month, which suggests price sensitivity may be an important driver of churn.

  • Tenure is clearly related to churn. Customers with shorter tenure are more likely to churn, while long-term customers tend to stay, which suggests loyalty increases over time.

  • Overall, financial and engagement factors matter more than demographics. Monthly charges and tenure separate churned from non-churned customers more clearly than age does.
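Comparisons like these can be reproduced with a simple group-by. The sketch below runs on a tiny made-up frame rather than the project data, and the column names `Age`, `Monthly_Charges`, and `Tenure` are assumptions, since the write-up does not list every column name.

```python
import pandas as pd

# Tiny made-up sample for illustration only; this is not the project data.
# Column names other than "Churn" are assumed.
sample = pd.DataFrame({
    "Age": [25, 40, 52, 33, 61, 47],
    "Monthly_Charges": [40.0, 85.0, 95.0, 55.0, 90.0, 60.0],
    "Tenure": [36, 4, 6, 48, 3, 24],
    "Churn": [0, 1, 1, 0, 1, 0],
})

# Share of each class: a quick check for class imbalance.
churn_share = sample["Churn"].value_counts(normalize=True)

# Mean of each numeric feature within the churned and non-churned groups.
group_means = sample.groupby("Churn")[["Age", "Monthly_Charges", "Tenure"]].mean()

print(churn_share)
print(group_means)
```

Group means are only a first look. Because the groups overlap, box plots or histograms per group would show how much separation each feature really provides.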

3. Feature engineering and preprocessing

  • Encoded categorical variables

  • Scaled numerical features

  • Split the data into training and test sets to evaluate generalization

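      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import LabelEncoder

      categorical_cols = [
          "Gender", "Contract_Type", "Internet_Service", "Payment_Method"
      ]

      # Fit one encoder per column and keep it, so the integer codes
      # can be mapped back to the original labels later
      encoders = {}
      for col in categorical_cols:
          encoders[col] = LabelEncoder()
          df[col] = encoders[col].fit_transform(df[col])

      # Define features and target
      X = df.drop(columns=["Customer_ID", "Churn"])
      y = df["Churn"]

      # Stratified train-test split (75% train, 25% test)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.25, random_state=42, stratify=y
      )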
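The scaling step listed above does not appear in the snippet. A minimal sketch of how it could look, assuming `StandardScaler` and two hypothetical numeric column names; the small frames stand in for `X_train` and `X_test`. The scaler is fitted on the training split only, so no information from the test split leaks into preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for X_train / X_test; column names are assumptions.
X_train_demo = pd.DataFrame({
    "Tenure": [1.0, 12.0, 24.0, 48.0],
    "Monthly_Charges": [30.0, 50.0, 70.0, 90.0],
})
X_test_demo = pd.DataFrame({
    "Tenure": [6.0, 36.0],
    "Monthly_Charges": [45.0, 80.0],
})

numeric_cols = ["Tenure", "Monthly_Charges"]

scaler = StandardScaler()
X_train_scaled = X_train_demo.copy()
X_test_scaled = X_test_demo.copy()

# Learn mean and standard deviation from the training data only...
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train_demo[numeric_cols])
# ...then apply the same transformation to the test data.
X_test_scaled[numeric_cols] = scaler.transform(X_test_demo[numeric_cols])
```

Scaling matters mainly for Logistic Regression, where it also makes coefficient sizes comparable across features. Random Forest is largely insensitive to it.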

Modeling and Evaluation

I trained and evaluated two classification models:

  • Logistic Regression

  • Random Forest Classifier

Because churn prediction has asymmetric business costs (a missed churner typically costs more than a false alarm), accuracy alone can be misleading, so I evaluated the models on accuracy, precision, recall, F1 score, and AUC.
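A comparison of this kind can be sketched as follows. The data here is synthetic (generated with `make_classification`), and the hyperparameters are assumptions rather than the exact settings used in the project, so the numbers it prints will not match the summary in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same number of rows as the case study.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Hyperparameters below are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)              # hard labels at the default 0.5 threshold
    y_prob = model.predict_proba(X_te)[:, 1]  # probability of the positive (churn) class
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision_score(y_te, y_pred, zero_division=0),
        "recall": recall_score(y_te, y_pred, zero_division=0),
        "f1": f1_score(y_te, y_pred, zero_division=0),
        "auc": roc_auc_score(y_te, y_prob),  # AUC uses probabilities, not labels
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```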

Model performance summary

Model                 Accuracy   Recall   F1 Score   AUC
Logistic Regression   0.576      0.544    0.539      0.586
Random Forest         0.528      0.491    0.487      0.578

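Logistic Regression scored higher than Random Forest on every reported metric. The margins are small, though: both AUC values are below 0.6, and with a 25% hold-out from 500 records the test set contains only 125 customers, so the ranking should be read with caution.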


Model Selection Rationale

Logistic Regression was selected for deployment for three main reasons:

  1. Better overall performance and higher recall, which is critical for identifying at-risk customers

  2. Strong interpretability, allowing business stakeholders to understand why customers churn

  3. Better alignment with business needs, where missing a churner is more costly than a false alarm

To further improve recall, I recommended lowering the probability threshold from 0.5 to approximately 0.4, accepting more false alarms in exchange for catching more churners.
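The effect of moving the threshold can be illustrated with a small sketch. The probabilities and labels below are made up; in practice the probabilities would come from something like `model.predict_proba(X_test)[:, 1]`. The helper names `predict_with_threshold`, `recall`, and `false_alarms` are hypothetical.

```python
import numpy as np


def predict_with_threshold(churn_probabilities, threshold=0.5):
    """Label a customer as churn (1) when the predicted probability meets the threshold."""
    return (np.asarray(churn_probabilities) >= threshold).astype(int)


def recall(y_true, y_pred):
    """Share of actual churners that were flagged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    actual_positives = np.sum(y_true == 1)
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    return true_positives / actual_positives if actual_positives else 0.0


def false_alarms(y_true, y_pred):
    """Number of non-churners that were flagged as churners."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum((y_true == 0) & (y_pred == 1)))


# Made-up predicted probabilities and true labels, for illustration only.
probs = np.array([0.15, 0.41, 0.42, 0.45, 0.55, 0.70])
labels = np.array([0, 0, 1, 1, 0, 1])

for threshold in (0.5, 0.4):
    preds = predict_with_threshold(probs, threshold)
    print(threshold, recall(labels, preds), false_alarms(labels, preds))
```

In this toy example, lowering the threshold from 0.5 to 0.4 raises recall from one in three churners to all three, while false alarms go from one to two. The right trade-off depends on the relative cost of a missed churner and an unnecessary retention offer.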


Key Churn Drivers

Using feature importance analysis, the strongest churn drivers were identified as:

  • Monthly charges

  • Tenure

  • Total charges

  • Customer satisfaction score

  • Contract type

  • Support call frequency

  • Age

These results show that churn is driven primarily by pricing, customer experience, and relationship duration rather than static demographic attributes.
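This post does not specify which importance method was used. For Logistic Regression, one common approach is to rank standardized coefficients by absolute value; the sketch below shows that approach on synthetic data, with assumed column names and a target constructed so that higher charges and shorter tenure raise the odds of churn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic features; names and distributions are assumptions.
demo = pd.DataFrame({
    "Monthly_Charges": rng.normal(70, 20, n),
    "Tenure": rng.normal(24, 12, n),
    "Age": rng.normal(45, 12, n),
})

# Synthetic target: churn odds rise with charges and fall with tenure.
# Age is deliberately left out, so it has no real effect here.
log_odds = 0.05 * (demo["Monthly_Charges"] - 70) - 0.1 * (demo["Tenure"] - 24)
churn = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Standardize first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(demo)
model = LogisticRegression(max_iter=1000).fit(X_scaled, churn)

coef_table = (
    pd.DataFrame({"feature": demo.columns, "coefficient": model.coef_[0]})
    .assign(abs_coefficient=lambda t: t["coefficient"].abs())
    .sort_values("abs_coefficient", ascending=False)
    .reset_index(drop=True)
)
print(coef_table)
```

The sign of each coefficient gives the direction of the effect (positive means higher churn odds), which is part of what makes Logistic Regression easy to explain to stakeholders.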


Business Recommendations

Based on the insights from the model and EDA, I proposed the following actions:

  • Offer targeted discounts or flexible pricing to high-billing customers

  • Strengthen onboarding and engagement during the first three to six months

  • Trigger proactive outreach when customer satisfaction scores drop

  • Improve support quality for customers with frequent service calls

  • Encourage long-term contracts through incentives

  • Personalize retention strategies by age group

These recommendations directly link model insights to measurable business actions.


Implementation Strategy

To ensure the solution remains effective in production, I recommended:

  • Retraining the model every three to six months

  • Monitoring recall, AUC, churn rate, and false negative rate

  • Measuring business impact through retention campaign success and customer lifetime value
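A minimal sketch of the monitoring step, using made-up outcomes. The function name `monitoring_metrics` and the alert level are hypothetical, and AUC is left out because it needs predicted probabilities rather than hard labels.

```python
def monitoring_metrics(y_true, y_pred):
    """Return churn rate, recall, and false negative rate for one scoring period."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    churners = true_positives + false_negatives

    return {
        "churn_rate": churners / len(y_true) if len(y_true) else 0.0,
        "recall": true_positives / churners if churners else 0.0,
        "false_negative_rate": false_negatives / churners if churners else 0.0,
    }


# Made-up outcomes for one period: 1 = churned, 0 = active.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = monitoring_metrics(actual, predicted)

# Hypothetical alert level: flag the model for review if recall drops below it.
RECALL_ALERT_LEVEL = 0.5
needs_review = metrics["recall"] < RECALL_ALERT_LEVEL
print(metrics, needs_review)
```

Tracking these numbers per period makes it easier to tell whether a drop in campaign results comes from model drift or from a change in the underlying churn rate.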


Limitations and Future Improvements

While the model provides useful insights, its predictive power is modest: a test AUC of about 0.59 is only somewhat better than chance.

Future improvements could include:

  • Adding time-series and behavioral usage data

  • Incorporating complaint resolution history

  • Testing advanced models such as Gradient Boosting or XGBoost

  • Re-checking class balance as new data arrives, since the current sample is fairly balanced but production data may not be

  • Integrating near real-time customer activity


Results and Impact

Although this was an offline project, the expected business impact includes:

  • Earlier identification of at-risk customers

  • More targeted and cost-effective retention campaigns

  • Reduced customer churn and improved customer lifetime value

The project demonstrates how even moderately performing models can deliver meaningful value when combined with strong business understanding.


Key Skills Demonstrated

  • Business problem framing

  • Exploratory data analysis

  • Feature engineering and preprocessing

  • Classification modeling

  • Model evaluation and selection

  • Translating machine learning outputs into business strategy

  • Communicating insights to non-technical stakeholders


Final Reflection

This case study highlights an important principle of applied data science: models do not need to be perfect to be useful. What matters most is understanding the problem, interpreting results correctly, and turning insights into action.

This project showcases my ability to think beyond metrics and build solutions that support real business decisions.


Image credit: Pinterest