Ogunyemi Ezekiel Timilehin

End-to-End Employee Attrition Prediction System

OGUNYEMI EZEKIEL TIMILEHIN — Tue, 17 Feb 2026 19:00:58 GMT

Decision Tree & Random Forest Classification

Employee attrition is one of the most expensive silent risks in any organization. Replacing talent costs money, time, productivity, and morale.

In this project, I built a complete machine learning pipeline to predict whether an employee is likely to leave the company using Decision Trees and Random Forest classifiers.

The goal is simple but powerful:

Help HR identify at-risk employees early and design proactive retention strategies.

Section A: Data Loading and Exploration

We begin by loading the dataset and examining its structure, feature types, and overall distribution.

This step ensures:

There are no structural inconsistencies
Target distribution is understood
Data types are correctly identified

import pandas as pd

# Load dataset
df = pd.read_csv("/kaggle/input/week-18/employee_attrition_prediction.csv")

df.head()
df.shape
df.info()
df.describe()

At this stage, I verified:

No critical structural issues
An imbalanced but realistic attrition distribution (employees who left are the minority class)
Presence of both numerical and categorical features

Section B: Exploratory Data Analysis (EDA)

Exploratory Data Analysis is where the business story begins to unfold.

The objective here was to understand how each feature relates to employee attrition.

Distribution of Numerical Features by Attrition Status

I analyzed how numerical variables differ between employees who left and those who stayed.
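A comparison like this can be sketched with a simple group-by. The snippet below is a minimal illustration on toy data, not the original notebook code; the column names `Monthly_Income` and `Years_at_Company` and all values are assumptions made for the example.

```python
import pandas as pd

# Toy data standing in for the real dataset (values and column names are assumed).
toy = pd.DataFrame({
    "Monthly_Income":   [3200, 4100, 8800, 9500, 3000, 7600],
    "Years_at_Company": [1, 2, 8, 10, 1, 6],
    "Left_Company":     [1, 1, 0, 0, 1, 0],
})

# Mean of each numerical feature, split by attrition status (0 = stayed, 1 = left).
group_means = toy.groupby("Left_Company")[["Monthly_Income", "Years_at_Company"]].mean()
print(group_means)
```

Swapping `.mean()` for `.median()` or `.describe()` gives a fuller picture when the distributions are skewed.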

Insights

Clear patterns emerged:

Employees who left tend to have lower monthly income
Early-tenure employees (low Years at Company) show higher exit probability
Job satisfaction and work-life balance appear negatively correlated with attrition

This suggests attrition is not random — it is structurally influenced by engagement and compensation variables.

Categorical Feature Analysis

Next, I examined categorical features such as:

Department
Job Role
Overtime
Education Level

import matplotlib.pyplot as plt
import seaborn as sns

# Analyze categorical features by attrition
categorical_cols = ['Gender','Education_Level','Department','Job_Role','Overtime']

for col in categorical_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(x=col, hue='Left_Company', data=df)
    plt.xticks(rotation=45)
    plt.title(f"{col} vs Attrition")
    plt.show()

Insights

One variable stood out strongly:

Overtime

Employees working overtime were significantly more likely to leave.

Departmental differences were also visible, particularly in:

Sales
Engineering

This highlights operational pressure and workload imbalance as major drivers.

Correlation Heatmap

To understand relationships among the numerical features, I created a correlation heatmap.
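The correlation step can be sketched as follows. This is an illustrative example on synthetic data, not the original code; the column names and the way the target is generated are assumptions. The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` to draw the heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features (assumed names and ranges).
satisfaction = rng.integers(1, 6, size=n)      # 1 (low) to 5 (high)
income = rng.normal(6000, 1500, size=n)

# Synthetic target: lower satisfaction raises the chance of leaving.
left = (rng.random(n) < (0.6 - 0.1 * satisfaction)).astype(int)

toy = pd.DataFrame({
    "Job_Satisfaction": satisfaction,
    "Monthly_Income": income,
    "Left_Company": left,
})

# Pearson correlation matrix, then each feature's correlation with the target.
corr = toy.corr()
target_corr = corr["Left_Company"].drop("Left_Company").sort_values()
print(target_corr)
```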

Insights

While no extreme multicollinearity was observed, engagement variables such as:

Job Satisfaction
Work-Life Balance

showed meaningful negative relationships with attrition.

This reinforces the behavioral nature of employee exits.

EDA Summary Findings

Overtime is a dominant risk factor.
Lower job satisfaction significantly increases exit likelihood.
Early-career employees are more vulnerable.
Income plays a stabilizing role.
Attrition appears organizational and behavioral rather than demographic.

Section C: Data Preprocessing

Before modeling, categorical variables were encoded appropriately.

Tree-based models do not require feature scaling because:

They split on thresholds
They are not distance-based
They are invariant to monotonic transformations
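The scaling claim can be checked empirically. The sketch below, on synthetic data with assumed feature names, trains one tree on raw features and one on standardized features and compares their test predictions; it is an illustration rather than part of the original pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 400
income = rng.normal(50_000, 15_000, size=n)   # large-scale feature
satisfaction = rng.uniform(1, 5, size=n)      # small-scale feature
X = np.column_stack([income, satisfaction])
y = ((income < 45_000) & (satisfaction < 3)).astype(int)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Tree trained on the raw features.
raw_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Tree trained on standardized features (scaler fitted on training data only).
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    scaler.transform(X_train), y_train
)

# Share of test rows on which both trees make the same prediction.
agreement = np.mean(
    raw_tree.predict(X_test) == scaled_tree.predict(scaler.transform(X_test))
)
print(f"Share of identical test predictions: {agreement:.2f}")
```

Because standardization preserves the ordering of values within each feature, both trees should find equivalent split points, so the agreement is expected to be at or very near 1.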

Encoding Categorical Variables

Gender → Binary encoding
Education Level → Ordinal encoding
Department → Label encoded
Job Role → Label encoded
Overtime → Binary encoding

from sklearn.preprocessing import LabelEncoder

# Handle categorical variables
df['Gender'] = df['Gender'].map({'Male':1, 'Female':0})

# Education
edu_map = {'Bachelor':0, 'Master':1, 'PhD':2}
df['Education_Level'] = df['Education_Level'].map(edu_map)

# Overtime
df['Overtime'] = df['Overtime'].map({'Yes':1, 'No':0})

# Encode Department and Job_Role
le_dept = LabelEncoder()
df['Department'] = le_dept.fit_transform(df['Department'])

le_role = LabelEncoder()
df['Job_Role'] = le_role.fit_transform(df['Job_Role'])

Creating Feature Matrix and Target Vector

Features:
All columns except Employee_ID and Left_Company

Target:
Left_Company

# Create X and y
X = df.drop(columns=['Employee_ID','Left_Company'])
y = df['Left_Company']

Train-Test Split

80% training
20% testing
random_state = 42

from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
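Since employees who left are a minority class, it may be worth checking how a split distributes them. The sketch below uses synthetic labels with an assumed 13% positive rate to compare a plain split with one that passes `stratify=y`; it is a possible refinement, not the split used in this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 500 rows, 65 positives (an assumed 13% attrition rate).
y_demo = np.array([1] * 65 + [0] * 435)
X_demo = np.arange(500).reshape(-1, 1)

# Plain random split.
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: the class ratio is preserved in both partitions.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positives in plain test split:     ", int(y_test_plain.sum()))
print("Positives in stratified test split:", int(y_test_strat.sum()))
```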

Section D: Model Building

Now we move into modeling.

D1: Decision Tree Classifier

I first built a Decision Tree using:

criterion = 'entropy'
random_state = 0

Then experimented with different max_depth values:

3, 5, 7, 10, None

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Test different max_depth values
depths = [3,5,7,10,None]
train_acc = []
test_acc = []

for d in depths:
    dt = DecisionTreeClassifier(
        criterion='entropy',
        max_depth=d,
        random_state=0
    )

    dt.fit(X_train, y_train)

    train_acc.append(accuracy_score(y_train, dt.predict(X_train)))
    test_acc.append(accuracy_score(y_test, dt.predict(X_test)))

After experimentation, I selected the optimal depth and trained the final model.
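One way to make the depth choice reproducible is a cross-validated search over the same candidate depths. The sketch below runs on synthetic data from `make_classification` (the data, the F1 scoring choice, and the fold count are assumptions); it is an alternative to picking the depth from a single train/test comparison, not the procedure used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the attrition data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.87, 0.13], random_state=0
)

# Search the same depth grid with 5-fold stratified cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"max_depth": [3, 5, 7, 10, None]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
search.fit(X_demo, y_demo)

print("Best depth:", search.best_params_["max_depth"])
print("Mean cross-validated F1:", round(search.best_score_, 3))
```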

from sklearn.metrics import classification_report

# Final Decision Tree
dt_final = DecisionTreeClassifier(
    criterion='entropy',
    max_depth=5,  # replace with optimal
    random_state=0
)

dt_final.fit(X_train, y_train)
dt_pred = dt_final.predict(X_test)

print(classification_report(y_test, dt_pred))

Decision Tree Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        87
           1       1.00      1.00      1.00        13

    accuracy                           1.00       100

The model perfectly classified all test samples.

Confusion Matrix:

Zero false positives.
Zero false negatives.

D2: Random Forest Classifier

Next, I implemented Random Forest with:

criterion = 'entropy'
random_state = 0

I experimented with n_estimators values:
10, 50, 100, 150

from sklearn.ensemble import RandomForestClassifier

# Test different n_estimators
n_values = [10,50,100,150]
rf_train = []
rf_test = []

for n in n_values:
    rf = RandomForestClassifier(
        n_estimators=n,
        criterion='entropy',
        random_state=0
    )

    rf.fit(X_train, y_train)

    rf_train.append(accuracy_score(y_train, rf.predict(X_train)))
    rf_test.append(accuracy_score(y_test, rf.predict(X_test)))

# Plot n_estimators vs accuracy
plt.plot(n_values, rf_train, label='Train Accuracy')
plt.plot(n_values, rf_test, label='Test Accuracy')
plt.legend()
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy")
plt.show()

The final model was then trained using the selected number of estimators.

# Final Random Forest
rf_final = RandomForestClassifier(
    n_estimators=100,  # replace with optimal
    criterion='entropy',
    random_state=0
)

rf_final.fit(X_train, y_train)
rf_pred = rf_final.predict(X_test)

print(classification_report(y_test, rf_pred))

Random Forest Results

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        87
           1       1.00      1.00      1.00        13

    accuracy                           1.00       100

Confusion Matrix:

Identical performance to Decision Tree.

Decision Tree CM
[[87  0]
 [ 0 13]]
Random Forest CM
[[87  0]
 [ 0 13]]
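For readers who want to reproduce matrices like these, the sketch below builds one with `confusion_matrix` and unpacks the four cells. The labels are reconstructed from the reported support (87 stayed, 13 left) with perfect predictions; the arrays are illustrative, not the actual test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels reconstructed from the reported support: 87 stayed (0), 13 left (1).
y_true = np.array([0] * 87 + [1] * 13)
y_pred = y_true.copy()  # perfect predictions, as reported for both models

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

In an attrition setting, `fn` (employees who left but were predicted to stay) is usually the cell to watch most closely.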

D3: Feature Importance Analysis

Tree-based models provide interpretability through feature importance scores.

# Extract and visualize feature importance
importances = rf_final.feature_importances_
features = X.columns

importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

importance_df.head()

Top Influential Features

Overtime
Job Satisfaction
Work-Life Balance
Monthly Income
Years at Company
Department
Job Role

Overtime consistently ranked as the strongest predictor.

Section E: Model Comparison and Selection

Comparison Table

# Create comparison table
models = {
    "Decision Tree": dt_pred,
    "Random Forest": rf_pred
}

for name, pred in models.items():
    print(name)
    print(classification_report(y_test, pred))

Model Selection Assessment (Most Important Section)

Both models achieved:

Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1-Score: 1.00

However, identical performance does not automatically imply identical quality.

Why Random Forest is Preferred

Although Decision Tree achieved perfect accuracy, it is highly prone to overfitting, especially in moderate-sized datasets.

Random Forest:

Reduces variance via ensemble averaging
Handles nonlinear patterns more robustly
Generalizes better to unseen data
Is less sensitive to minor dataset changes
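When two models tie on a single 100-row test split, cross-validation gives a broader view of how stable each one is. The sketch below compares both classifiers over five stratified folds on synthetic data; the dataset, noise level, and fold count are assumptions, and the numbers it prints say nothing about the real attrition data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with some label noise (flip_y).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.87, 0.13], flip_y=0.05, random_state=1
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
candidates = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, criterion="entropy", random_state=0
    ),
}

# Mean and spread of accuracy across folds for each model.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A smaller standard deviation across folds is one practical signal of the lower variance attributed to ensembles.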

In business-critical systems such as attrition prediction — where false negatives are costly — stability matters more than simplicity.

Therefore:

Random Forest is the recommended deployment model.

Section F: Final Report

Summary of Findings

The analysis shows that employee attrition is primarily driven by:

Excessive overtime
Low job satisfaction
Poor work-life balance
Lower income
Short tenure

Attrition risk is concentrated among early-career employees working overtime in high-pressure departments.

Both Decision Tree and Random Forest achieved perfect classification performance. However, Random Forest offers superior theoretical generalization and lower overfitting risk.

Business Recommendations

Workload Management

Monitor overtime hours aggressively
Implement burnout prevention policies
Balance departmental workload distribution

Engagement Enhancement

Quarterly satisfaction surveys
Managerial coaching programs
Clear career progression structures

Compensation Review

Benchmark salaries
Introduce performance-based incentives
Target retention bonuses for high-risk roles

Early Career Programs

Structured onboarding
Mentorship systems
First 3-year retention strategy

Retention strategies should prioritize employees flagged as high-risk by the model.

Technical Recommendations

Deploy Random Forest in production
Monitor recall for attrition class
Retrain model quarterly
Implement drift detection mechanisms

How tree-based methods compare with the alternatives:

KNN:

Distance-based, less interpretable

SVM:

Strong but less explainable

Tree-based methods:

Naturally handle nonlinearities
Provide feature importance
Require minimal preprocessing

For HR datasets, tree-based models are particularly suitable.

Final Conclusion

This project demonstrates the successful development of a full machine learning pipeline for employee attrition prediction.

While both models achieved perfect performance, Random Forest is the recommended deployment model due to ensemble stability and reduced overfitting risk.

Most importantly, the analysis reveals that attrition is driven by workload intensity, engagement levels, compensation dissatisfaction, and early tenure vulnerability.

A data-driven HR strategy focused on these areas can significantly reduce employee turnover and organizational risk.

Photo credit : Pinterest

Customer Churn Prediction Case Study

OGUNYEMI EZEKIEL TIMILEHIN — Fri, 30 Jan 2026 17:49:40 GMT

End-to-End Machine Learning Project with Business Impact

Project Overview

Customer churn is one of the biggest challenges for subscription-based businesses. For telecom companies in particular, losing a customer often costs significantly more than retaining one.

In this case study, I built an end-to-end machine learning solution to predict customer churn and translate the results into actionable retention strategies. The focus was not just on model performance, but on interpretability, decision-making, and real business impact.

Problem Statement

A telecommunications company was experiencing increasing customer churn and needed data-driven insights to support its retention efforts.

The business wanted answers to three key questions:

Which customers are likely to churn?
What factors are driving churn?
How can the retention team act on these insights to reduce customer loss?

Objective

The goal of this project was to:

Build a churn prediction model
Identify key churn drivers
Recommend practical, data-backed retention strategies
Design a solution that could realistically be used by business stakeholders

Dataset Description

The dataset contains 500 customer records with 19 features, covering customer demographics, billing, service usage, and support interactions.

Feature categories

Demographics

Age
Gender

Account and contract details

Tenure
Contract type
Payment method

Billing and usage

Monthly charges
Total charges
Internet service
Phone service

Customer experience

Support calls
Customer satisfaction score

Service add-ons

Streaming TV
Streaming movies
Online security
Tech support

Target variable

Churn (0 = active, 1 = churned)

Approach and Methodology

I approached the project in structured phases to mirror a real-world data science workflow.

1. Data understanding and cleaning

Inspected data types and distributions
Checked for missing values and inconsistencies

Ensured the dataset was suitable for modeling

import pandas as pd

# Load dataset
df = pd.read_csv("/kaggle/input/week-16-regression-3/customer_churn_prediction.csv")
df.head()

# Data overview
df.info()
df.describe()

# Check missing values
df.isnull().sum()

2. Exploratory Data Analysis (EDA)

EDA was used to understand customer behavior and uncover churn patterns.

Key findings included:
- Churn distribution is fairly balanced, with slightly more non-churn customers than churned ones. This means the dataset is suitable for classification without severe class imbalance.
- Age shows a mild relationship with churn. Customers who churn tend to be slightly older on average, though the overlap is large, so age alone is not a strong predictor.
- Monthly charges are higher for churned customers. Customers who left generally pay more per month, suggesting price sensitivity is a major factor influencing churn.
- Tenure is clearly related to churn. Customers with shorter tenure are more likely to churn, while long-term customers tend to stay, indicating loyalty increases over time.
- Overall, financial and engagement factors matter more than demographics. Monthly charges and tenure show stronger separation between churn and non-churn compared to age.

3. Feature engineering and preprocessing

Encoded categorical variables
Scaled numerical features

Split the data into training and test sets to evaluate generalization

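from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Label-encode categorical columns (one encoder per column keeps each mapping recoverable)
categorical_cols = [
    "Gender", "Contract_Type", "Internet_Service", "Payment_Method"
]

encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Define features and target
X = df.drop(columns=["Customer_ID", "Churn"])
y = df["Churn"]

# Train-test split (stratified to preserve the churn ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)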
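The scaling step mentioned in the list above is not shown in the code. One way to do it without leaking test-set statistics is to put the scaler inside a pipeline, as sketched below on a toy frame. The column values, category labels, and the use of one-hot encoding are assumptions for illustration, not the preprocessing actually used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Toy churn data (all values and category labels are made up).
toy = pd.DataFrame({
    "Tenure": rng.integers(1, 72, size=n),
    "Monthly_Charges": rng.normal(70, 20, size=n),
    "Contract_Type": rng.choice(["Monthly", "One Year", "Two Year"], size=n),
})
toy["Churn"] = (rng.random(n) < 0.45).astype(int)

X_toy = toy.drop(columns="Churn")
y_toy = toy["Churn"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Tenure", "Monthly_Charges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract_Type"]),
])

# The pipeline fits the scaler and encoder on the training rows only.
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)

test_predictions = clf.predict(X_te)
print(f"Test accuracy on toy data: {clf.score(X_te, y_te):.3f}")
```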

Modeling and Evaluation

I trained and evaluated two classification models:

Logistic Regression
Random Forest Classifier

Because churn prediction has asymmetric business costs, I evaluated models using accuracy, precision, recall, F1 score, and AUC.

Model performance summary

Model                 Accuracy   Recall   F1 Score   AUC
Logistic Regression   0.576      0.544    0.539      0.586
Random Forest         0.528      0.491    0.487      0.578

Logistic Regression consistently outperformed Random Forest across all metrics.
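A comparison of this kind can be produced with a small helper that fits each model and collects the same metrics. The sketch below uses synthetic data from `make_classification`, so its numbers will not match the table; the helper name `evaluate` and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=17, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

def evaluate(model):
    """Fit the model and return test-set metrics as a dictionary."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    return {
        "Accuracy": accuracy_score(y_te, pred),
        "Precision": precision_score(y_te, pred),
        "Recall": recall_score(y_te, pred),
        "F1 Score": f1_score(y_te, pred),
        "AUC": roc_auc_score(y_te, proba),
    }

summary = {
    "Logistic Regression": evaluate(LogisticRegression(max_iter=1000)),
    "Random Forest": evaluate(RandomForestClassifier(n_estimators=100, random_state=42)),
}

for name, metrics in summary.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```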

Model Selection Rationale

Logistic Regression was selected for deployment for three main reasons:

Better overall performance and higher recall, which is critical for identifying at-risk customers
Strong interpretability, allowing business stakeholders to understand why customers churn
Better alignment with business needs, where missing a churner is more costly than a false alarm

To further improve recall, I recommended lowering the probability threshold from 0.5 to approximately 0.4.
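The effect of moving the threshold can be sketched by applying it to predicted probabilities instead of calling `predict`, which uses 0.5. The example below uses synthetic, noisy data, so the exact figures are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
churn_proba = model.predict_proba(X_te)[:, 1]  # probability of churn

# Compare the default threshold with the lower one.
threshold_report = {}
for threshold in (0.5, 0.4):
    pred = (churn_proba >= threshold).astype(int)
    threshold_report[threshold] = {
        "recall": recall_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "flagged": int(pred.sum()),
    }
    print(threshold, threshold_report[threshold])
```

Lowering the threshold can only add customers to the flagged set, so recall never decreases; the cost is usually lower precision, meaning more retention offers go to customers who would have stayed.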

Key Churn Drivers

Using feature importance analysis, the strongest churn drivers were identified as:

Monthly charges
Tenure
Total charges
Customer satisfaction score
Contract type
Support call frequency
Age

These results show that churn is driven primarily by pricing, customer experience, and relationship duration rather than static demographic attributes.

Business Recommendations

Based on the insights from the model and EDA, I proposed the following actions:

Offer targeted discounts or flexible pricing to high-billing customers
Strengthen onboarding and engagement during the first three to six months
Trigger proactive outreach when customer satisfaction scores drop
Improve support quality for customers with frequent service calls
Encourage long-term contracts through incentives
Personalize retention strategies by age group

These recommendations directly link model insights to measurable business actions.

Implementation Strategy

To ensure the solution remains effective in production, I recommended:

Retraining the model every three to six months
Monitoring recall, AUC, churn rate, and false negative rate
Measuring business impact through retention campaign success and customer lifetime value

Limitations and Future Improvements

While the model provides useful insights, its predictive power is modest: the best AUC of 0.586 is only slightly above the 0.5 expected from random guessing.

Future improvements could include:

Adding time-series and behavioral usage data
Incorporating complaint resolution history
Testing advanced models such as Gradient Boosting or XGBoost
Addressing potential class imbalance
Integrating near real-time customer activity

Results and Impact

Although this was an offline project, the expected business impact includes:

Earlier identification of at-risk customers
More targeted and cost-effective retention campaigns
Reduced customer churn and improved customer lifetime value

The project demonstrates how even moderately performing models can deliver meaningful value when combined with strong business understanding.

Key Skills Demonstrated

Business problem framing
Exploratory data analysis
Feature engineering and preprocessing
Classification modeling
Model evaluation and selection
Translating machine learning outputs into business strategy
Communicating insights to non-technical stakeholders

Final Reflection

This case study highlights an important principle of applied data science: models do not need to be perfect to be useful. What matters most is understanding the problem, interpreting results correctly, and turning insights into action.

This project showcases my ability to think beyond metrics and build solutions that support real business decisions.

image credit : Pinterest

Building a Real-World Car Price Prediction System with Machine Learning

OGUNYEMI EZEKIEL TIMILEHIN — Wed, 21 Jan 2026 16:28:19 GMT

Pricing used cars accurately is a major challenge in the automotive industry. Overpricing leads to slow sales, while underpricing reduces profit margins. In this project, I built an end-to-end Car Price Prediction System that uses machine learning to estimate fair market prices for used cars based on their features.

This project applies concepts from Weeks 14–15 of machine learning, covering data preprocessing, exploratory data analysis, regression models, evaluation techniques, and real-world business interpretation.

Project Objective

The goal of this project is to build an intelligent pricing system that helps an automotive company:

Price vehicles competitively
Identify undervalued cars for purchase
Maximize profit margins
Provide instant price estimates to customers

The system predicts car prices using structured vehicle data such as brand, mileage, engine size, and service history.

Dataset Overview

The dataset used is:

assessment_car_price_prediction.csv

It contains 200 records of used cars with a mix of numerical and categorical features.

Key Features

Brand
Year
Mileage
Engine Size
Horsepower
Fuel Type
Transmission
Previous Owners
Accident History
Service Records

Target Variable

Price (USD)

Phase 1: Data Understanding & Preprocessing

1.1 Data Loading and Initial Exploration

The first step was to load the dataset and inspect its structure. This included checking the number of rows and columns, reviewing data types, examining sample records, and confirming data completeness.

# Data Loading and Exploration
# This section loads the dataset and performs initial inspection to understand
# its structure, data types, and completeness.
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("/kaggle/input/week15-dataset/assessment_car_price_prediction.csv")

# Basic information
df.shape, df.info()

# View data
df.head(), df.tail()

# Statistical summary
df.describe()

# Missing values
df.isnull().sum()

This inspection revealed that:

The dataset contains 200 rows and 11 columns
There are no missing values
Five features are categorical, while the rest are numerical

This clean structure allowed us to proceed directly to exploratory analysis.

1.2 Exploratory Data Analysis (EDA)

EDA helps uncover patterns, trends, and anomalies that influence pricing behavior.

Key visual analyses performed include:

Distribution of car prices
Price variation by brand
Price variation by fuel type
Correlation analysis among numerical features
Relationship between mileage and price
Relationship between vehicle year and price

Key Insights from EDA

Car prices are right-skewed, indicating more affordable cars than luxury ones.
Premium brands command higher median prices.
Mileage shows a strong negative relationship with price.
Newer vehicles consistently sell for higher prices.

1.3 Data Preprocessing

Before modeling, the data was transformed into a machine-learning-ready format.

Categorical Encoding

Brand, Fuel Type, and Transmission were one-hot encoded.
Accident History and Service Records were label encoded (Yes = 1, No = 0).

Feature Scaling

Numerical features such as mileage and horsepower were standardized to ensure fair contribution during training.

Train-Test Split

The dataset was split into:

70% training data
30% testing data
with random_state = 42 for reproducibility.

# Data Preprocessing (Categorical Encoding)
import time
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Binary encoding of the Yes/No columns
df['Accident_History'] = df['Accident_History'].map({'Yes':1, 'No':0})
df['Service_Records'] = df['Service_Records'].map({'Yes':1, 'No':0})

# Features and target
X = df.drop('Price', axis=1)
y = df['Price']

# 70/30 train-test split (done before the preprocessor is fitted)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

categorical = ['Brand', 'Fuel_Type', 'Transmission']
numerical = [
    'Year', 'Mileage', 'Engine_Size', 'Horsepower',
    'Previous_Owners', 'Accident_History', 'Service_Records'
]

# Categorical pipeline
# Note: drop='first' together with handle_unknown='ignore' needs scikit-learn 1.0 or later
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# Numerical pipeline
num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('cat', cat_pipeline, categorical),
    ('num', num_pipeline, numerical)
])

# Fit on the training data only, then apply the same transformation to the test data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Missing values in X_train:",
      np.isnan(X_train_processed.toarray() if hasattr(X_train_processed, "toarray") else X_train_processed).sum())

print("Missing values in X_test:",
      np.isnan(X_test_processed.toarray() if hasattr(X_test_processed, "toarray") else X_test_processed).sum())

print("X_train shape:", X_train_processed.shape)
print("X_test shape:", X_test_processed.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

All checks confirmed that the processed datasets contained no missing values and were correctly shaped.

Phase 2: Model Development

To identify the most suitable model, multiple regression techniques were tested.

2.1 Baseline Model: Multiple Linear Regression

Linear Regression was chosen as the baseline model due to its simplicity and interpretability.

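# Baseline Model: Multiple Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error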
lr_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

start = time.time()
lr_pipeline.fit(X_train, y_train)
train_time_lr = time.time() - start

y_train_pred_lr = lr_pipeline.predict(X_train)
y_test_pred_lr = lr_pipeline.predict(X_test)

lr_results = {
    "Model": "Linear Regression",
    "Train R2": r2_score(y_train, y_train_pred_lr),
    "Test R2": r2_score(y_test, y_test_pred_lr),
    "MAE": mean_absolute_error(y_test, y_test_pred_lr),
    "RMSE": mean_squared_error(y_test, y_test_pred_lr) ** 0.5,
    "Training Time": train_time_lr
}

Evaluation metrics included:

R² Score
Mean Absolute Error (MAE)
Root Mean Squared Error (RMSE)
Training time
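For reference, the three error metrics can be written out directly from their definitions. The helper below is a small NumPy sketch (the function name `regression_metrics` and the sample values are made up) that mirrors what `r2_score`, `mean_absolute_error`, and the square root of `mean_squared_error` compute.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MAE and RMSE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation

    return {
        "R2": 1 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(residuals)),
        "RMSE": np.sqrt(np.mean(residuals ** 2)),
    }

# Tiny worked example with made-up values.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)
```

MAE and RMSE are in the same unit as the target (USD here); RMSE penalizes large misses more heavily than MAE.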

2.2 Polynomial Regression

Polynomial Regression was tested with degrees 2 and 3 to capture non-linear relationships.

# Polynomial Regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import time

# Store evaluation results (FOR TABLE)
poly_results = []

# Store trained models (FOR PLOTTING / SELECTION)
poly_models = {}

for degree in [2, 3]:
    poly_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('model', LinearRegression())
    ])

    start = time.time()
    poly_pipeline.fit(X_train, y_train)
    train_time = time.time() - start

    y_train_pred = poly_pipeline.predict(X_train)
    y_test_pred = poly_pipeline.predict(X_test)

    # Save model
    poly_models[degree] = poly_pipeline

    # Save evaluation metrics
    poly_results.append({
        "Model": f"Polynomial Regression (deg {degree})",
        "Train R2": r2_score(y_train, y_train_pred),
        "Test R2": r2_score(y_test, y_test_pred),
        "MAE": mean_absolute_error(y_test, y_test_pred),
        "RMSE": mean_squared_error(y_test, y_test_pred) ** 0.5,
        "Training Time": train_time
    })

# Select best polynomial degree 
best_poly_degree = 2
poly_pipeline = poly_models[best_poly_degree]

y_train_pred_poly = poly_pipeline.predict(X_train)
y_test_pred_poly = poly_pipeline.predict(X_test)

While higher degrees improved training performance, they showed reduced generalization on test data.
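Part of the reason is how quickly the feature count grows with the degree. The sketch below counts the columns that `PolynomialFeatures` would generate, assuming 10 input columns after preprocessing (the real count depends on how many one-hot columns the encoder produces).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 10  # assumed number of columns after preprocessing
placeholder = np.zeros((1, n_inputs))  # only the shape matters here

# Number of generated features for each polynomial degree.
feature_counts = {}
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(placeholder)
    feature_counts[degree] = poly.n_output_features_

print(feature_counts)  # {1: 10, 2: 65, 3: 285}
```

With about 140 training rows (70% of 200), a degree-3 expansion under this assumption has more features than observations, so the model can fit the training set almost exactly while generalizing poorly.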

2.3 Support Vector Regression (SVR)

Support Vector Regression with an RBF kernel was evaluated using different hyperparameter configurations.

#SVR
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import time

svr_results = []

svr_configs = [
    {'C': 100, 'gamma': 'auto'},
    {'C': 1000, 'gamma': 'scale'}
]

for cfg in svr_configs:
    svr_pipeline = Pipeline([
        ('preprocessor', preprocessor),  # encoding + imputation + scaling
        ('svr', SVR(kernel='rbf', **cfg))
    ])

    start = time.time()
    svr_pipeline.fit(X_train, y_train)
    train_time = time.time() - start

    y_train_pred_svr = svr_pipeline.predict(X_train)
    y_test_pred_svr = svr_pipeline.predict(X_test)

    # Use the SVR predictions (not the leftover polynomial ones) for the metrics
    svr_results.append({
        "Model": f"SVR (RBF) C={cfg['C']} gamma={cfg['gamma']}",
        "Train R2": r2_score(y_train, y_train_pred_svr),
        "Test R2": r2_score(y_test, y_test_pred_svr),
        "MAE": mean_absolute_error(y_test, y_test_pred_svr),
        "RMSE": mean_squared_error(y_test, y_test_pred_svr) ** 0.5,
        "Training Time": train_time})

Although SVR performed well on training data, it showed signs of overfitting.

2.4 Decision Tree Regression

Decision Trees were tested with varying depths to balance bias and variance.

#Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor

dt_results = []

for depth in [3, 5, 10, None]:
    dt_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('dt', DecisionTreeRegressor(
            max_depth=depth,
            random_state=42))])

    start = time.time()
    dt_pipeline.fit(X_train, y_train)
    train_time = time.time() - start

    y_train_pred_dt = dt_pipeline.predict(X_train)
    y_test_pred_dt = dt_pipeline.predict(X_test)

    # Use the Decision Tree predictions (not the leftover polynomial ones) for the metrics
    dt_results.append({
        "Model": f"Decision Tree depth={depth}",
        "Train R2": r2_score(y_train, y_train_pred_dt),
        "Test R2": r2_score(y_test, y_test_pred_dt),
        "MAE": mean_absolute_error(y_test, y_test_pred_dt),
        "RMSE": mean_squared_error(y_test, y_test_pred_dt) ** 0.5,
        "Training Time": train_time})

Deeper trees achieved perfect training scores but failed to generalize well.

Phase 3: Model Evaluation & Comparison

3.1 Model Comparison Table

All models were evaluated side-by-side using consistent metrics.

results_df = pd.DataFrame(
    [lr_results] + poly_results + svr_results + dt_results
)
results_df

The comparison revealed that Linear Regression achieved the best test performance, with the highest R² and lowest RMSE.

3.2 Predicted vs Actual Price Visualization

Predicted prices were plotted against actual prices to visually assess model accuracy.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

def plot_predicted_vs_actual(y_true, y_pred, model_name):
    errors = np.abs(y_true - y_pred)
    r2 = r2_score(y_true, y_pred)

    plt.figure(figsize=(7, 6))
    scatter = plt.scatter(y_true, y_pred, c=errors)
    plt.plot([y_true.min(), y_true.max()],
             [y_true.min(), y_true.max()],
             linestyle='--')
    plt.xlabel("Actual Price")
    plt.ylabel("Predicted Price")
    plt.title(f"{model_name} | R² = {r2:.3f}")
    plt.colorbar(scatter, label="Prediction Error")
    plt.show()

# Example usage
# (the SVR and Decision Tree predictions come from the last configuration fitted in each loop)
plot_predicted_vs_actual(y_test, y_test_pred_lr, "Linear Regression")
plot_predicted_vs_actual(y_test, y_test_pred_poly, "Polynomial Regression")
plot_predicted_vs_actual(y_test, y_test_pred_svr, "Support Vector Regression")
plot_predicted_vs_actual(y_test, y_test_pred_dt, "Decision Tree Regression")

The closer the points are to the diagonal line, the better the model performance.

3.3 Residual Analysis

Residual plots and histograms were used to analyze prediction errors.

def residual_analysis(y_true, y_pred, model_name):
    residuals = y_true - y_pred

    # Residuals vs Predicted
    plt.figure(figsize=(7, 5))
    plt.scatter(y_pred, residuals)
    plt.axhline(0, linestyle='--')
    plt.xlabel("Predicted Price")
    plt.ylabel("Residuals")
    plt.title(f"{model_name} Residuals vs Predicted")
    plt.show()

    # Histogram of residuals
    plt.figure(figsize=(7, 5))
    plt.hist(residuals, bins=30)
    plt.xlabel("Residual")
    plt.ylabel("Frequency")
    plt.title(f"{model_name} Residual Distribution")
    plt.show()

# Apply to models
# (the SVR and Decision Tree predictions come from the last configuration fitted in each loop)
residual_analysis(y_test, y_test_pred_lr, "Linear Regression")
residual_analysis(y_test, y_test_pred_poly, "Polynomial Regression")
residual_analysis(y_test, y_test_pred_svr, "SVR")
residual_analysis(y_test, y_test_pred_dt, "Decision Tree")

The residuals for Linear Regression were randomly distributed, indicating good model assumptions.

Phase 4: Model Selection & Business Application

4.1 Final Model Selection

After evaluating accuracy, overfitting risk, interpretability, and computational efficiency, Linear Regression was selected as the final model.

It achieved:

High test R² (≈ 0.935)
Lowest RMSE
Stable generalization
High interpretability for business users
Fast training and prediction times

This balance makes it ideal for real-world deployment.

4.2 Price Prediction for New Cars

The final model was used to predict prices for three hypothetical cars:

A budget Toyota sedan
A low-mileage BMW luxury sedan
An older Ford with accident history

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

num_cols = ["Year", "Mileage", "Engine_Size", "Horsepower", "Previous_Owners"]
cat_cols = ["Brand", "Fuel_Type", "Transmission", "Accident_History", "Service_Records"]

numeric_tf = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_tf = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_tf, num_cols),
    ("cat", categorical_tf, cat_cols)
])
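The code above defines the preprocessing step only. As a hedged sketch of the prediction step, the block below builds an input frame for the three hypothetical cars. Only the brand and year of each car come from this article; every other value, including the category labels, is a placeholder assumption, and `final_model` is a hypothetical name for the fitted pipeline.

```python
import pandas as pd

num_cols = ["Year", "Mileage", "Engine_Size", "Horsepower", "Previous_Owners"]
cat_cols = ["Brand", "Fuel_Type", "Transmission", "Accident_History", "Service_Records"]

# Only Brand and Year come from the article. All other values and labels
# ("Petrol", "Yes", "Full", ...) are placeholders chosen for illustration.
new_cars = pd.DataFrame([
    {"Brand": "Toyota", "Year": 2015, "Mileage": 90000, "Engine_Size": 1.8,
     "Horsepower": 140, "Previous_Owners": 2, "Fuel_Type": "Petrol",
     "Transmission": "Automatic", "Accident_History": "No",
     "Service_Records": "Full"},
    {"Brand": "BMW", "Year": 2020, "Mileage": 15000, "Engine_Size": 3.0,
     "Horsepower": 335, "Previous_Owners": 1, "Fuel_Type": "Petrol",
     "Transmission": "Automatic", "Accident_History": "No",
     "Service_Records": "Full"},
    {"Brand": "Ford", "Year": 2012, "Mileage": 160000, "Engine_Size": 2.0,
     "Horsepower": 160, "Previous_Owners": 3, "Fuel_Type": "Petrol",
     "Transmission": "Manual", "Accident_History": "Yes",
     "Service_Records": "Partial"},
])

# Keep exactly the columns the preprocessor was defined with
new_cars = new_cars[num_cols + cat_cols]
print(new_cars)

# With a fitted pipeline (hypothetical name), prediction would then be:
# predicted_prices = final_model.predict(new_cars)
```

Because the encoder uses handle_unknown="ignore", a category label that never appeared in training is encoded as all zeros without raising an error, so it is worth checking that the labels in new rows match the training data exactly.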

Prediction Summary

Toyota (2015): ≈ $29,107
BMW (2020): ≈ $65,855
Ford (2012): ≈ $9,980

These results align well with market expectations.

Business Insights & Recommendations

Key Findings

Brand, mileage, vehicle age, accident history, and engine performance are the strongest drivers of car prices. Premium brands retain value better, while high mileage and accident history significantly reduce resale value.

Business Recommendations

Dealers should prioritize low-mileage, accident-free vehicles with complete service records. Pricing strategies should leverage model predictions to identify undervalued listings and apply data-driven price adjustments. Customers should be educated on how mileage and maintenance history affect long-term value.

Model Limitations

The linear regression model assumes linear relationships and does not capture complex interactions. It may struggle with rare brands, extreme market conditions, or heavily modified vehicles.

Future Improvements

Future work could involve advanced models such as Gradient Boosting or Random Forest, incorporation of real transaction prices, location data, and periodic retraining to maintain accuracy.

Conclusion

This project demonstrates how machine learning can be applied to solve a real business problem in the automotive industry. By combining solid data preprocessing, careful model evaluation, and business-focused interpretation, the resulting pricing system provides both technical reliability and practical value.

Data and Decisions: Building a Housing Price Prediction Model with Multiple Linear Regression

OGUNYEMI EZEKIEL TIMILEHIN — Thu, 08 Jan 2026 13:58:33 GMT

Introduction

Property valuation is one of the most critical tasks in real estate. Traditionally, this process relies heavily on human judgment and market intuition. While experience matters, data offers an opportunity to make pricing more consistent, transparent, and defensible.

In this project, I worked as a data scientist in a real estate company tasked with developing a machine learning model to predict house prices based on property characteristics. The goal was not just to predict prices accurately, but also to understand what truly drives housing prices and translate those insights into real business value.

This assessment was structured into four phases, moving from raw data understanding to actionable business recommendations.

Objective

The objective of this project was to apply core machine learning concepts end-to-end by building a multiple linear regression model that predicts house prices and then optimizing it through feature selection.

Dataset

File: Assessment-Dataset/housing_price_data.csv
The dataset contains information about house size, location, amenities, accessibility, and pricing.

Phase 1: Data Understanding and Preprocessing

The project began with a thorough understanding of the dataset.

Exploratory Data Analysis (EDA)

I first examined the dataset’s structure, size, and feature types to understand what kind of data I was working with. Statistical summaries were generated to observe the range, mean, and variability of numerical features. The distribution of the target variable (house price) was analyzed to check for skewness and outliers.

To understand relationships between variables, a correlation heatmap was created. This helped reveal which features had strong linear relationships with house prices and which ones contributed little signal.

Figure 1.0 Target Variable Distribution

Figure 1.1 Correlation heatmap

Data Quality Assessment

Next, I checked for missing values and inconsistencies. Any missing entries were handled appropriately to prevent bias. Potential outliers were reviewed to ensure they represented valid market cases rather than data errors. All observations and decisions taken during this stage were documented to maintain transparency.

import pandas as pd

df = pd.read_csv("/kaggle/input/data-preprocessing-week-14/housing_price_data.csv")
df.head()

df.shape

df.info()

df.describe()

Preprocessing Pipeline

Categorical variables such as Neighborhood, Garage, and Pool were encoded using dummy variables. To avoid the dummy variable trap, one category from each encoded variable was dropped.

The dataset was then split into 70% training data and 30% test data. Feature scaling was applied where necessary to ensure fair contribution of numerical variables to the regression model.

# Preprocessing Pipeline (Reusable)
from sklearn.model_selection import train_test_split

def preprocess_data(df):
    # One-hot encode the categorical columns; drop_first avoids the dummy variable trap
    df_encoded = pd.get_dummies(
        df, columns=["Neighborhood", "Garage", "Pool"], drop_first=True
    )
    # Separate the features from the target
    X = df_encoded.drop("House_Price", axis=1)
    y = df_encoded["House_Price"]

    return X, y

X, y = preprocess_data(df)

# Train-Test Split (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
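The scaling step mentioned earlier is not shown in the code. A minimal sketch of one way to do it follows, assuming plain standardization with statistics taken from the training set only so that no information from the test set leaks into the fit. `standardize` and the `Area` column are hypothetical names, and the numbers are placeholders; scikit-learn's StandardScaler is the usual alternative.

```python
import pandas as pd

def standardize(train, test, columns):
    """Scale `columns` to zero mean and unit variance using training statistics only."""
    mean = train[columns].mean()
    std = train[columns].std(ddof=0)
    train_scaled, test_scaled = train.copy(), test.copy()
    train_scaled[columns] = (train[columns] - mean) / std
    # The test set reuses the training mean and std on purpose
    test_scaled[columns] = (test[columns] - mean) / std
    return train_scaled, test_scaled

# Placeholder frames for illustration; "Area" is a hypothetical column name
train = pd.DataFrame({"Area": [100.0, 200.0, 300.0]})
test = pd.DataFrame({"Area": [200.0, 400.0]})
train_s, test_s = standardize(train, test, ["Area"])
print(train_s)
print(test_s)
```

For ordinary least squares, scaling does not change the predictions; it mainly makes coefficient sizes comparable across features.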

Phase 2: Model Development

Two regression models were developed and compared.

Model 1: Multiple Linear Regression (All Features)

The first model included all available features. It served as a baseline to understand overall performance before optimization. The model was trained on the training set and evaluated on the test set using standard regression metrics.

Model 2: Optimized Multiple Linear Regression

To improve interpretability and reduce noise, I applied backward elimination with a significance level of 0.05. Using statistical p-values, features that did not contribute meaningfully to the model were removed iteratively.

Each elimination step was documented and justified based on statistical evidence.

# Feature Selection: Backward Elimination
import statsmodels.api as sm

X_sm = sm.add_constant(X.astype(float))
y_sm = y.astype(float)

X_opt = X_sm.copy()

while True:
    model = sm.OLS(y_sm, X_opt).fit()
    # Exclude the intercept so it is never a candidate for removal
    p_vals = model.pvalues.drop("const", errors="ignore")
    max_p = p_vals.max()

    if max_p > 0.05:
        # Drop the least significant feature and refit
        feature = p_vals.idxmax()
        X_opt = X_opt.drop(columns=[feature])
        print(f"Removed: {feature} (p = {max_p:.4f})")
    else:
        break

# Model summary
model.summary()

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.33e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Evaluation Metrics Used

Both models were evaluated using:

R² score
Adjusted R² score
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
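As a reference for how these five metrics relate to each other, here is a minimal NumPy sketch. `regression_metrics` is a hypothetical helper, not the function used in the original notebook, and the demo values are placeholders. It returns the metrics in the order R², Adjusted R², MAE, MSE, RMSE.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return [R2, adjusted R2, MAE, MSE, RMSE] for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred

    ss_res = float(np.sum(errors ** 2))                      # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total variation
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes the number of predictors used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mae = float(np.mean(np.abs(errors)))
    mse = ss_res / n
    rmse = mse ** 0.5
    return [r2, adj_r2, mae, mse, rmse]

# Placeholder values for illustration only
demo = regression_metrics([1.0, 2.0, 3.0, 4.0, 5.0],
                          [1.0, 2.0, 3.0, 4.0, 6.0],
                          n_features=1)
print(demo)
```

Adjusted R² is the metric to watch when comparing the full and optimized models, since plain R² can only stay flat or rise as features are added.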

Phase 3: Model Evaluation and Validation

First, a model comparison table was created to show the performance differences between the initial and optimized models. The optimized model achieved better interpretability while maintaining strong predictive performance, with no clear signs of overfitting or underfitting.

# Model Comparison Table
comparison = pd.DataFrame({
    "Metric": ["R²", "Adjusted R²", "MAE", "MSE", "RMSE"],
    "Full Model": metrics_full,
    "Optimized Model": metrics_opt,
})

comparison

Figure 1.3 Model Comparison Table

Visual analysis played a key role in validating model behavior.

Figure 1.4 Predicted vs Actual price scatter plots (both models)

Figure 1.5 Residual plots

Figure 1.6 Feature importance bar chart

Phase 4: Business Insights and Recommendations

Model Interpretation

The optimized regression model indicates that house prices are largely driven by property size, neighborhood quality, and accessibility. House area emerged as the most influential variable, showing a strong positive relationship with price. This confirms a familiar market truth: larger homes consistently command higher prices.

Neighborhood quality, especially properties located in high-end or luxury areas, also had a significant positive influence. Buyers are clearly willing to pay a premium for better infrastructure, security, and social amenities.

Distance from the city center showed a negative effect on house prices. Homes located farther away tend to be less valuable due to reduced accessibility to economic and social opportunities. Property tax displayed a positive association with price, acting as a proxy for overall property value and municipal service quality. Bathrooms contributed moderately to price increases by improving comfort and functionality.

A notable and somewhat surprising finding was that the number of bedrooms had a weak impact once total house area was considered. This suggests buyers care more about usable space than room count. House age also showed minimal influence, implying that maintenance and location matter more than construction year. Amenities such as garages and pools add value, but not as strongly as size and location.

Business Applications and Recommendations

The real estate company can use this model as a decision-support tool for pricing properties more objectively. Agents can justify listing prices using data rather than intuition alone, increasing client trust. The model can also help identify undervalued properties with strong fundamentals, opening up profitable investment opportunities.

However, the model has limitations. It assumes linear relationships, which may oversimplify real housing markets. It does not account for macroeconomic factors such as interest rates, inflation, or income levels. Additionally, the lack of time-series data limits trend analysis.

Future improvements could include incorporating economic indicators, testing non-linear models like Random Forest or Gradient Boosting, and retraining the model periodically with updated data.

Sample Predictions and Explanations

To demonstrate real-world usage, three hypothetical houses were evaluated:

House 1: A small house far from the city center in a standard neighborhood. The model predicts a relatively low price due to limited size, average location quality, and reduced accessibility.
House 2: A medium-sized house in a luxury neighborhood at a moderate distance from the city center. The predicted price is higher than House 1, mainly driven by neighborhood quality.
House 3: A large house in a luxury neighborhood close to the city center. This house receives the highest predicted price because it combines all major value-driving features.

These predictions highlight how the model balances different features to arrive at realistic price estimates.

Conclusion

This project developed a housing price prediction model using multiple linear regression and feature optimization. The final model provides strong explanatory power and actionable insights. Property size, neighborhood quality, and proximity to the city center emerged as the most important pricing factors, making the model valuable for real estate valuation and strategic decision-making.

Appendix

These visualizations support the conclusions drawn and reinforce confidence in the model.

Useful Resources in This Article

Real-world housing price dataset for regression analysis
Step-by-step EDA with distributions and correlation heatmaps
Data preprocessing workflow (encoding, scaling, train–test split)
Reusable Python functions for preprocessing, training, and evaluation
Multiple Linear Regression and optimized model using backward elimination
Clear explanation of evaluation metrics (R², MAE, MSE, RMSE)
Model diagnostics: predicted vs actual plots, residuals, feature importance
Business insights translated from model results

cover photo credit: Pinterest(pngtree)

World Population Analysis

OGUNYEMI EZEKIEL TIMILEHIN — Thu, 18 Dec 2025 08:51:00 GMT

Introduction

Population plays a central role in shaping economies, infrastructure, healthcare systems, education planning, and long-term development strategies. While global population figures are often discussed as a single headline number, the underlying distribution across countries tells a much more detailed story.

Some countries account for a significant share of the world’s population, while many others contribute relatively small proportions. Understanding this imbalance is essential for effective planning, policy formulation, and sustainable development.

In this article, I explore global population distribution using publicly available data scraped from Wikipedia. Through data cleaning, analysis, and visualization, the goal is to uncover patterns in population concentration, examine how population declines by country rank, and highlight what these trends imply for stakeholders.

Data Source and Setup

The dataset was obtained from Wikipedia’s world population tables using web scraping techniques in Python.

Data Source

Source: Wikipedia, "List of countries and dependencies by population"
Data type: Country-level population estimates
Coverage: Over 200 countries and territories

Tools Used

Python
Requests (for fetching HTML content)
Pandas (for data manipulation)
Matplotlib (for visualization)

Web Scraping the Data

# requests + pandas.read_html used to scrape population tables
from io import StringIO

import requests
import pandas as pd
import matplotlib.pyplot as plt

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

# A browser-like User-Agent avoids the request being rejected
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
tables = pd.read_html(StringIO(response.text))
population_df = tables[0]   # main population table

population_df.head()

Data Cleaning and Preparation

To ensure accurate analysis, the following steps were applied:

Removed the global aggregate entry (“World”) from country-level analysis to avoid double counting.
Ranked countries by population size.
Created cumulative population metrics for further analysis.

These steps ensured the dataset was consistent and suitable for statistical exploration.

# Keep only the relevant columns
population_df = population_df[["Location", "Population", "% of world", "Date"]]

# Remove the global aggregate row
population_df = population_df[population_df["Location"] != "World"]

# Order countries from most to least populous
population_df = population_df.sort_values(by="Population", ascending=False)

population_df.head()
population_df.tail()

A population share visualization was created to show how the world’s population is distributed among the most populous countries.

Figure 1: Pie chart - Population share of the top 10 countries

Figure 2: Bar chart - Top 10 most populous countries

Insight

The visualization shows that a small number of countries account for a very large share of the global population. India and China alone represent a substantial proportion, while all other countries and territories share the remainder.

This highlights how population is far from evenly distributed across the globe.

Cumulative Population Contribution

To understand how population accumulates as more countries are considered, a cumulative population curve was plotted based on country rank.

Figure 3: Line plot - Cumulative global population contribution

Insight

The curve rises sharply at the beginning and gradually flattens. This means that the top-ranked countries contribute most of the global population, while additional countries add smaller increments.

In practical terms, a relatively small group of countries determines most global population outcomes.
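The rank and cumulative metrics behind this curve are not shown in the cleaning code. A minimal pandas sketch of how they could be computed follows. The `Rank`, `Share_pct`, and `Cumulative_pct` column names are assumptions, and the four rows are placeholder values, not real population figures.

```python
import pandas as pd

# Placeholder values for illustration only (not real population figures)
toy = pd.DataFrame({
    "Location": ["A", "B", "C", "D"],
    "Population": [500, 300, 150, 50],
})

# Rank from most to least populous
toy = toy.sort_values("Population", ascending=False).reset_index(drop=True)
toy["Rank"] = toy.index + 1

# Share of the total, then the running total of that share
toy["Share_pct"] = 100 * toy["Population"] / toy["Population"].sum()
toy["Cumulative_pct"] = toy["Share_pct"].cumsum()

print(toy)
```

Plotting `Cumulative_pct` against `Rank` gives the cumulative contribution curve: steep at first, then flattening as smaller countries are added.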

Population Decay by Country Rank

Population size was plotted against country rank to observe how quickly population declines as rank increases.

Figure 4: Scatter plot - Population decay by country rank

Insight

The pattern shows a steep decline in population after the first few ranks, followed by a long tail of countries with much smaller populations. This kind of heavy-tailed distribution is common in large-scale systems and reflects structural imbalance rather than random variation.

Key Findings

From the analysis, several important points emerge:

Global population is highly concentrated in a small number of countries.
Most countries contribute only a small fraction to the total population.
Population distribution follows a consistent decay pattern by rank.
These trends are likely to persist without major structural changes.

Stakeholder Implications

Policymakers

Population-heavy countries require focused attention in infrastructure development, healthcare provision, and education planning. Broad policies that ignore population concentration risk inefficiency.

Development Organizations

Targeting high-population regions can improve the impact of development initiatives, while smaller countries benefit from tailored approaches.

Urban and Regional Planners

Population concentration increases pressure on cities and surrounding regions. Long-term planning must account for projected population trends.

Economists and Investors

Population size influences market potential, labor availability, and future growth. Demographic patterns provide valuable context for economic decision-making.

Limitations

Population figures are estimates and may change over time.
The analysis does not account for migration shocks, environmental factors, or sudden demographic changes.
Country-level data masks important regional and urban differences.

Conclusion

This analysis provides a clear view of how the world’s population is distributed and why that distribution matters. Rather than being evenly spread, global population is concentrated in a small number of countries, shaping economic, social, and developmental outcomes.

Understanding these patterns helps stakeholders make informed decisions and plan more effectively for the future. Population data, when examined closely, offers valuable insights into how societies are structured and how they may evolve.

Author’s Reflection

Working through this project reinforced the importance of looking beyond headline numbers. Seeing how quickly population declines after the most populous countries provided a deeper understanding of global imbalance.

This analysis showed that data is more than a collection of figures. It is a way to understand systems, challenge assumptions, and support better decision-making. Population trends, in particular, offer a powerful lens through which to view global development.

Useful Resources

In this article, readers will find:

A web-scraped global population dataset
Clear data cleaning and preparation steps
Multiple visualizations beyond simple bar charts
Insights into population concentration and inequality
Stakeholder-focused interpretation of results

Thank you for reading. If you found this analysis helpful or have questions, feel free to share your thoughts.

cover photo credit : Freepik.com

A Look at Nigeria's Mobile Phone Trade: Imports and Exports in 2023

OGUNYEMI EZEKIEL TIMILEHIN — Tue, 18 Nov 2025 22:05:03 GMT

Introduction

Every modern economy depends on the steady movement of goods across its borders. In today’s connected world, mobile phones are more than luxury items; they are essential tools for communication, business, education, and access to digital services. Understanding how a country trades mobile phones can reveal deeper insights into its technological landscape, market dependency, and global partnerships.

Using 2023 data from the United Nations Comtrade database, this project examines Nigeria’s trade activity in mobile phones, focusing on the product category HS Code 8517: telephone sets, including telephones for cellular networks or other wireless networks.

The analysis aims to answer the following key questions:
How much does Nigeria import and export? Who are the major trading partners? Which countries trade consistently? Are there bidirectional trading relationships? What monthly patterns exist? And how do Nigeria’s biggest import sources compare to global totals?

This study provides a comprehensive overview of Nigeria’s mobile phone trade ecosystem for the year 2023.

Data Source and Setup

All data was retrieved from the UN Comtrade database with the following parameters.

Type of Data: Goods
Reporter: Nigeria
Period: All months in 2023
Partners: All available
Trade Flow: Imports and Exports
Commodity: HS Code 8517
Frequency: Monthly

After downloading the dataset, it was loaded with:

df = pd.read_csv("Nigeria_MobilePhones_2023.csv", encoding="latin1")
df.head()

The dataset was then divided into import and export subsets:

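# Assumption: MobilePhone_countries is df without the aggregate "World" partner row
# (its definition is not shown in this article)
MobilePhone_countries = df[df["Partner"] != "World"]

imports = MobilePhone_countries[MobilePhone_countries["Trade Flow"] == "Import"]
exports = MobilePhone_countries[MobilePhone_countries["Trade Flow"] == "Export"]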

This separation made it easier to explore each component of Nigeria’s trade individually.

Total Imports and Total Exports

Total Imports in 2023

Total imports were calculated using:

total_imports_by_year = imports.groupby("Year")["Trade Value (US$)"].sum()

Result:

Year	Total Imports (US$)
2023	655,014,300

Total Exports in 2023

Total exports were calculated using:

total_exports_year = exports.groupby("Year")["Trade Value (US$)"].sum()

Result:

Year	Total Exports (US$)
2023	2,104.21

Trade Balance

The trade balance is simply:

Balance = Exports − Imports
Balance ≈ 2,104 − 655,014,300
Balance ≈ −655 million USD

Nigeria’s mobile phone trade in 2023 shows a very large deficit. Imports dominate overwhelmingly, with practically no export activity.
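For readers who want to reproduce the arithmetic, here is a short sketch using the two totals reported above. The variable names are illustrative assumptions; the figures are the ones from the result tables.

```python
# Totals taken from the result tables above (US$)
total_imports = 655_014_300.00
total_exports = 2_104.21

# Trade balance: exports minus imports
balance = total_exports - total_imports

# Export dollars earned for every million dollars of imports
exports_per_million_imported = total_exports / total_imports * 1_000_000

print(f"Balance: {balance:,.2f} USD")
print(f"Exports per $1M of imports: {exports_per_million_imported:.2f} USD")
```

Put differently, for every million dollars of phones imported, Nigeria exported roughly three dollars' worth.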

Main Trade Partners

Top Import Partners

These are the countries Nigeria buys the most mobile phones from:

imports_by_country = imports.groupby("Partner")["Trade Value (US$)"].sum()
top_country = imports_by_country.sort_values(ascending=False).head(5)

Rank	Partner	Import Value (US$)
1	China	496,363,200
2	Sweden	56,484,860
3	Mexico	19,519,240
4	USA	19,346,690
5	China, Hong Kong SAR	10,074,570

China is the dominant supplier in Nigeria's mobile phone market. By itself, it accounts for roughly three-quarters of all imports by value (about 76% of the US$655 million total).

Top Export Partner

Nigeria’s exports are minimal and flow to only one country:

exports_by_country = exports.groupby("Partner")["Trade Value (US$)"].sum()
top_export_country = exports_by_country.sort_values(ascending=False).head(5)

Rank	Partner	Export Value (US$)
1	Zimbabwe	2,104.21

Mobile phone exports from Nigeria are almost nonexistent.

Regular Customers

Regular customers are partners that appear in the trade records every month of the year.

The calculation:

regular_customers = df.groupby("Partner")["Month"].nunique()
regular_customers = regular_customers[regular_customers == 12]

Result:

Partner
China
World

"World" is the aggregate row rather than a country, so China is the only individual partner that appears in all twelve months. This highlights how deeply Nigeria's mobile phone market depends on Chinese imports.
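The calculation above counts months across both trade flows and keeps the aggregate "World" row. A slightly stricter variant, sketched below, filters to one flow and drops the aggregate first. `regular_partners` is a hypothetical helper, and the toy records use placeholder partner names; only the column names come from the dataset.

```python
import pandas as pd

def regular_partners(trade, flow, n_months=12):
    """Partners that appear in every one of `n_months` months for one trade flow."""
    subset = trade[(trade["Trade Flow"] == flow) & (trade["Partner"] != "World")]
    months_active = subset.groupby("Partner")["Month"].nunique()
    return sorted(months_active[months_active == n_months].index)

# Placeholder records for illustration only
rows = []
for month in range(1, 13):
    rows.append({"Partner": "World", "Trade Flow": "Import", "Month": month})
    rows.append({"Partner": "Country A", "Trade Flow": "Import", "Month": month})
rows.append({"Partner": "Country B", "Trade Flow": "Import", "Month": 3})
toy = pd.DataFrame(rows)

result = regular_partners(toy, "Import")
print(result)
```

On the real data, the equivalent call would be `regular_partners(df, "Import")`.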

Countries with Both Imports and Exports

To check if Nigeria had any two-way trade partners:

import_countries = set(imports["Partner"])
export_countries = set(exports["Partner"])
bidirectional = import_countries & export_countries

Result:

set()

In 2023, there was no country with which Nigeria both imported and exported mobile phones.

Monthly Import and Export Trends

Monthly Import Totals

monthly_imports = imports.groupby("Month")["Trade Value (US$)"].sum().sort_values(ascending=False)

Month	Import Value (US$)
12	93,369,930
6	81,661,610
8	80,932,270
11	62,044,820
5	53,551,740
10	53,031,720
7	51,138,010
9	49,726,210
2	41,802,410
3	40,219,080
4	25,027,430
1	22,509,030

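Imports peak in December (about US$93.4 million), with secondary peaks in June and August; January and April are the weakest months. The December peak is consistent with end-of-year consumer demand, promotions, and increased market activity, although a single year of data cannot confirm a recurring seasonal cycle.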
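The table above is sorted by value, which hides the calendar order. The sketch below re-enters the same figures and sorts them by month so the trend through the year is easier to read. The half-year split is an illustrative choice, not part of the original analysis.

```python
import pandas as pd

# Monthly import values copied from the table above (US$)
monthly_imports = pd.Series({
    12: 93_369_930, 6: 81_661_610, 8: 80_932_270, 11: 62_044_820,
    5: 53_551_740, 10: 53_031_720, 7: 51_138_010, 9: 49_726_210,
    2: 41_802_410, 3: 40_219_080, 4: 25_027_430, 1: 22_509_030,
}, name="Import Value (US$)")

# Calendar order instead of value order
chronological = monthly_imports.sort_index()

# Each month's share of the yearly total, in percent
share_pct = (100 * chronological / chronological.sum()).round(1)

peak_month = int(chronological.idxmax())
first_half = int(chronological.loc[1:6].sum())
second_half = int(chronological.loc[7:12].sum())

print(share_pct)
print(peak_month, first_half, second_half)
```

With these figures, the second half of the year carries noticeably more import value than the first half.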

Monthly Exports

Exports appear only in the month of August:

Month	Export Value (US$)
August	2,104.21

Exports were very low: Nigeria exported to just one country, in a single month.

Top Three Import Sources Compared to Total Imports

total_world = imports["Trade Value (US$)"].sum()

top3 = imports.groupby("Partner")["Trade Value (US$)"].sum() \
              .sort_values(ascending=False).head(3)

A bar chart was created with:

plt.bar(top3.index, top3.values)
plt.title("Top 3 Countries Exporting Mobile Phones to Nigeria (2023)")
plt.ylabel("Trade Value (US$)")
plt.show()

This visual emphasizes how far ahead China is compared to other suppliers. The gap is extremely wide, reflecting China’s dominance in Nigeria’s mobile phone supply chain.
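The comparison against the total can also be stated as percentages. This sketch uses the partner totals and the overall import total reported earlier in this article; the variable names mirror the code above, and the rounding to one decimal place is an illustrative choice.

```python
import pandas as pd

# Figures copied from the result tables earlier in this article (US$)
total_world = 655_014_300
top3 = pd.Series({
    "China": 496_363_200,
    "Sweden": 56_484_860,
    "Mexico": 19_519_240,
})

# Each supplier's share of total imports, in percent
share_pct = (100 * top3 / total_world).round(1)

# Combined share of the three largest suppliers
top3_share = round(float(100 * top3.sum() / total_world), 1)

print(share_pct)
print(f"Top 3 combined: {top3_share}% of total imports")
```

Reporting shares alongside the bar chart makes the concentration explicit: one supplier holds about three-quarters of the market, and the top three together hold close to nine-tenths.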

Conclusion

This analysis provides a clear picture of Nigeria’s mobile phone trade landscape in 2023. The evidence shows:

Nigeria is primarily an importer, bringing in over 655 million dollars in mobile phones while exporting just slightly above two thousand dollars.
China plays an outsized role, supplying most of the devices sold in Nigeria and appearing in trade records every single month.
There are no countries with both import and export interactions, highlighting a one-directional trade pattern.
Imports peak toward the end of the year, with December the strongest month.
Exports are extremely rare, concentrated in only one month and one partner.

Nigeria’s mobile phone trade structure is heavily imbalanced. The country depends almost entirely on external suppliers, particularly China, for mobile communication technology. Understanding this pattern is valuable for policymakers, investors, technology analysts, and anyone interested in the structure of Nigeria’s digital economy. Expanded datasets or multi-year comparisons would add even deeper insights and help reveal long-term trends.

Author’s Reflection

Working with this dataset forced me to confront a reality that many of us in Nigeria already feel, even if we do not often quantify it. Our economy is heavily tilted toward consumption. We buy, import, distribute, and retail, but when it comes to producing the technology we use every day, the numbers reveal just how little activity exists on the domestic side.

Seeing more than six hundred and fifty million dollars flowing outward in phone imports while only two thousand dollars came in as exports was not just surprising, it was unsettling. It showed how disconnected our consumption appetite is from our production capacity. It also raised a deeper question. How sustainable is an economy where almost everything we depend on is manufactured elsewhere?

Exploring the data made the issue even clearer. China’s presence throughout the entire year illustrates how dependent Nigeria is on a single global player for essential technology. The absence of any partner that both buys from and sells to Nigeria shows the one-directional nature of our participation in the global mobile phone market. We import. We consume. And then the cycle repeats.

This reflection is not about criticizing Nigeria for what it lacks, but about acknowledging the opportunity that lies within these numbers. A country of more than two hundred million people with one of the largest youth populations in the world should not remain only a marketplace for other nations' innovations. The demand already exists. The market size is undeniable. What is missing is the foundation for local production: infrastructure, incentives, research, manufacturing ecosystems, and long-term investment in technology.

This analysis reminded me that data is not just a collection of values. It is a mirror that reflects how a society functions. By seeing the gap clearly, we gain a better understanding of what must change if Nigeria is to move from being a consumer-driven economy to one that creates, builds, and exports value to the world.

Useful Resources

In this analysis, readers should expect a clear and data-driven exploration of Nigeria’s mobile phone trade ecosystem using official monthly records from the United Nations Comtrade database. The goal is to provide both technical insight and practical understanding. Specifically, the analysis covers the following:

A breakdown of Nigeria’s total mobile phone imports and exports for 2023, showing the scale of trade activity.
Identification of Nigeria’s largest trading partners and how much each country contributes to overall imports and exports.
Discovery of consistent monthly trading partners to reveal long-term patterns.
Examination of whether Nigeria has any partners with two-way phone trade involving both imports and exports.
A month-by-month trend analysis to help readers understand seasonal patterns in import demand.
A comparison of Nigeria’s top three import sources with global totals to highlight how concentrated the market is.
A discussion of Nigeria’s significant trade imbalance and what it means for production capacity, consumption habits, and economic vulnerability.
Insightful data visuals that make the trends easier to interpret.

This section equips the reader with all the necessary context to appreciate the depth of the analysis and understand the larger story behind Nigeria’s dependence on imported mobile phones.

How Renewable Energy Impacts CO₂ Emissions: A Data-Driven Exploration

OGUNYEMI EZEKIEL TIMILEHIN — Tue, 11 Nov 2025 22:08:16 GMT

“Clean data can help drive a cleaner planet.”

Introduction

Breathing in Carbon: Why Renewable Energy Matters More Than Ever

Every breath we take tells a story, and lately, that story has been getting darker.
Air pollution kills more than seven million people each year, more than tuberculosis, malaria, and hepatitis combined. Those tiny particles released by cars, power plants, and factories do not just cloud the sky; they enter our lungs, our blood, and even our economy.

In 2023 alone, humanity lost over 512 billion work hours due to extreme heat, a direct result of a warming planet. Construction workers, farmers, and outdoor laborers, especially in low-income countries, are bearing the brunt of this crisis. The economic cost of climate disasters over the past decade has already surpassed two trillion dollars and continues to rise.

Yet, despite all this, the world continues to subsidize fossil fuels far more than renewables—about seven trillion dollars in 2022 compared to just 168 billion dollars for clean energy.
Our global budget still rewards pollution more than prevention.

There is a paradox here. On one hand, the technology to solve much of this already exists. Solar energy is now the cheapest source of electricity on Earth, tied with onshore wind. On the other hand, our energy systems remain stuck in the past, powered by the same carbon-heavy fuels driving our climate and health crises.

This led me to a question:
Are countries that use more renewable energy actually producing less CO₂?
Can we see a measurable, data-backed relationship between clean energy use and carbon emissions?

To explore this, I turned to data from the World Bank’s World Development Indicators for 2013. Using Python, I analyzed how renewable energy consumption relates to CO₂ emissions per capita across all available countries.
The results, both visual and statistical, offer a clear picture of where the world stood and what that tells us about our clean energy future.

Understanding the Question: Does Clean Energy Really Cut Emissions?

At the heart of every major environmental discussion lies one deceptively simple question:
If we use more renewable energy, will we emit less carbon?

It sounds obvious—solar and wind do not produce CO₂ when generating electricity, while coal, oil, and gas do. But the real world is rarely that straightforward. Industrial activity, population size, energy efficiency, and geography all play a part.
To truly understand this relationship, I turned to the data.

Objective

The goal of this project was to explore how renewable energy consumption influences CO₂ emissions across countries.
In plain terms: Do nations that rely more on renewables tend to have smaller carbon footprints per person?

By analyzing this relationship, we can get a clearer sense of how effective renewable energy adoption really is in reducing emissions and which countries are leading or lagging in the global transition to cleaner energy.

Data Sources

The analysis is based on data from the World Bank’s World Development Indicators (WDI), one of the most widely used and trusted global datasets for economic and environmental analysis.

Indicator	Code	Description
CO₂ emissions (metric tons per capita)	EN.GHG.CO2.PC.CE.AR5	The average amount of CO₂ emitted per person in a country.
Renewable energy consumption (% of total final energy consumption)	EG.FEC.RNEW.ZS	The percentage of energy derived from renewable sources such as solar, wind, hydro, and bioenergy.

Year analyzed: 2013
Countries included: All available with valid data.

The year 2013 provides a clear snapshot of global energy use before the post-2015 renewable boom, offering insight into baseline patterns before major international climate policies such as the Paris Agreement took effect.

Data Preparation Process

The datasets were prepared and merged in Python using the Pandas library.

Loaded both datasets into Pandas DataFrames.
Removed unnecessary columns such as indicator codes and years.
Merged them on country name.
Excluded regional aggregates (like “Sub-Saharan Africa”).
Dropped incomplete entries to maintain data integrity.

After cleaning, the final dataset showed each country’s renewable energy share alongside CO₂ emissions per capita, ready for exploration.

import pandas as pd

# Load datasets
co2 = pd.read_csv("/kaggle/input/pandas-dataset-2/CO2_2013_ready.csv")
renew = pd.read_csv("/kaggle/input/pandas-dataset-2/RENEW_2013_ready.csv")

# Drop 'year' column
co2 = co2.drop(columns=['year'], errors='ignore')
renew = renew.drop(columns=['year'], errors='ignore')

# Merge on 'country'
merged = pd.merge(co2, renew, on='country', how='inner')

# Drop missing or invalid values
merged = merged.dropna()

# Save and preview
merged.to_csv("CO2_RENEW_merged_2013.csv", index=False)
print("Merged dataset saved as CO2_RENEW_merged_2013.csv")

merged.head()
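The preparation steps mention excluding regional aggregates, which the merge code does not show. A minimal sketch of one way to do it, assuming a hand-written set of aggregate labels; the `AGGREGATES` set and the toy rows are illustrative placeholders, not the list or data used in the notebook:

```python
import pandas as pd

# Hypothetical set of aggregate labels; a real list would be longer.
AGGREGATES = {"World", "Sub-Saharan Africa", "European Union", "High income"}

# Toy rows standing in for the merged dataset (placeholder values, not real data).
toy = pd.DataFrame({
    "country": ["Nigeria", "World", "Norway", "Sub-Saharan Africa"],
    "co2_per_capita": [1.0, 2.0, 3.0, 4.0],
    "renewable_energy_percent": [10.0, 20.0, 30.0, 40.0],
})

# Keep only rows whose label is not a regional or income-group aggregate.
countries_only = toy[~toy["country"].isin(AGGREGATES)].reset_index(drop=True)
print(countries_only)
```

Filtering on an explicit set keeps the rule easy to audit, at the cost of having to maintain the list by hand.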

Results and Visualization Insights

Once the data was cleaned and ready, I began the analysis.

Across all countries:

Average CO₂ emissions per person: approximately 4.8 metric tons
Average renewable energy share: approximately 29.4 percent

That might sound decent, but beneath the surface, differences between nations were enormous.



# Summary statistics
# Work on a copy of the merged data and make sure both columns are numeric
merged_clean = merged.copy()
merged_clean['renewable_energy_percent'] = pd.to_numeric(merged_clean['renewable_energy_percent'], errors='coerce')
merged_clean['co2_per_capita'] = pd.to_numeric(merged_clean['co2_per_capita'], errors='coerce')

# Drop rows that could not be converted to numbers
merged_clean = merged_clean.dropna(subset=['renewable_energy_percent', 'co2_per_capita'])

# Basic summary stats
summary = merged_clean[['renewable_energy_percent', 'co2_per_capita']].describe()
median_values = merged_clean[['renewable_energy_percent', 'co2_per_capita']].median()

summary_table = summary.T  # transposed for better readability
summary_table['median'] = median_values.values
summary_table

What the Numbers Say

Some countries were already leading the way, powering their economies with more than 80 percent renewable energy.
Others had virtually none.

High-income nations generally had higher CO₂ emissions per person, even if they had started adopting renewables.
Developing countries tended to rely more on renewables, often out of necessity—via hydropower or biomass—rather than advanced policy.

The Highest Emitters in 2013

Country	CO₂ Emissions (t CO₂e per capita)	Renewable Energy (%)
Palau	103.37	0.0
Qatar	53.29	0.0
Trinidad and Tobago	27.82	0.4
Kuwait	26.56	0.0
Bahrain	25.64	0.0
United Arab Emirates	25.34	0.1
Saudi Arabia	19.81	0.0
Luxembourg	19.00	5.7
Brunei Darussalam	18.94	0.0
Oman	18.42	0.0

Most of these nations are oil-rich economies, where energy production and national revenue depend heavily on fossil fuels. Their renewable adoption rates are almost nonexistent, which directly explains their extremely high emission levels.
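A ranking like the one above can be produced with a single pandas call. This is a minimal sketch on placeholder rows, assuming the same column names as the analysis; the country labels and values are made up:

```python
import pandas as pd

# Toy stand-in for the merged dataset; values are placeholders, not real data.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "co2_per_capita": [2.0, 25.0, 0.5, 11.0],
    "renewable_energy_percent": [60.0, 0.0, 90.0, 12.0],
})

# Keep the rows with the highest emissions (n=2 here; the table above lists 10).
top_emitters = toy.nlargest(2, "co2_per_capita").reset_index(drop=True)
print(top_emitters)
```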

Observing the Pattern

from scipy.stats import spearmanr

# Spearman rank correlation between renewable share and CO₂ per capita
renewable = merged_clean['renewable_energy_percent']
co2_pc = merged_clean['co2_per_capita']

corr, p_value = spearmanr(renewable, co2_pc)

print(f"Spearman correlation: {corr:.3f}")
print(f"P-value: {p_value:.2e}")

# Compare against the conventional 5% significance level
if p_value < 0.05:
    print('It is statistically significant.')
else:
    print('It is not statistically significant.')

Spearman correlation coefficient: –0.573
P-value: 3.44 × 10⁻¹⁹

This indicates a moderately strong and statistically significant negative correlation: countries that use more renewables tend to produce less CO₂ per person.
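Spearman's coefficient is the Pearson correlation computed on ranks rather than raw values, which makes it less sensitive to extreme outliers among the highest emitters. A small sketch of that idea, using made-up numbers rather than the World Bank data:

```python
import pandas as pd

# Synthetic values purely for illustration (not World Bank data).
renewable = pd.Series([0.0, 5.0, 20.0, 45.0, 80.0])
co2 = pd.Series([50.0, 19.0, 6.0, 7.0, 0.5])

# Rank each series, then take the ordinary (Pearson) correlation of the ranks.
rho = renewable.rank().corr(co2.rank())
print(round(rho, 3))  # -0.9
```

Only the ordering of the values matters here, so the extreme value of 50.0 has no more pull than any other top-ranked point.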

![Scatter plot showing the negative correlation between renewable energy and CO₂ emissions]

The scatter plot visually confirms the statistical finding:
There is a strong inverse relationship between renewable energy use and CO₂ emissions per person.

However, it also highlights that renewables alone don’t tell the full story.
Economic structure, industrialization, and population all influence the exact position of each country on the plot.

Why It Matters

The analysis confirms what sustainability experts have long suspected.
Scaling renewable energy is one of the most effective ways to reduce per-capita emissions globally.

However, renewables alone are not enough.
Energy efficiency, industrial reform, and equitable policy changes all play vital roles.
Without these, even renewable-heavy countries can remain moderately carbon-intensive.

Conclusion

The Data Speaks and So Should We

The takeaway from this study is clear.
Countries investing in renewables emit less carbon per person.

With a correlation of –0.573, the relationship is moderately strong and statistically significant. A correlation cannot prove causation on its own, but the pattern is consistent with the view that expanding solar, wind, and hydro power lowers a country’s carbon intensity.

Still, while technology has advanced, policy and investment have not kept pace. Fossil fuels continue to receive the majority of global subsidies, making it harder for clean energy to compete.

The good news is that solar and wind are already the cheapest sources of electricity in history.
The future of clean energy is not a distant goal—it is something we are already building.

What We Can Do

Advocate for clean energy policies and equitable subsidies.
Support data transparency and open datasets.
Invest in research that combines data science with sustainability.
Promote energy efficiency alongside renewable adoption.

Data alone will not save the planet, but people who use data can.

Author’s Reflection: Why I Care About This

This project was more than a data analysis exercise. It was personal.
I have always been intrigued by how we can reduce CO₂ emissions, not just in theory but in practice. The danger it poses to our planet—from rising heat to air pollution—is no longer a distant threat. It is here, and it is accelerating.

Before this analysis, I worked on developing a prediction model for viable partial replacements of cement, one of the largest industrial contributors to CO₂ emissions. That experience taught me how deeply emissions are tied to the materials and systems we depend on, and how innovation in one sector can influence others.

It also showed me that data and sustainability are not separate fields. They are deeply connected, because understanding the data behind climate change is the first step to solving it.

Every dataset, every analysis, and every model brings us closer to a more sustainable world—one where we live not just on the planet, but with it.

Project Resources

Data source: World Development Indicators (World Bank)
Kaggle notebook containing the analysis: Renewable Energy Use and CO₂ Emission
Cleaned datasets:
- CO2_2013_ready.csv
- RENEW_2013_ready.csv
- CO2_RENEW_merged_2013.csv
Python scripts: For data cleaning, merging, and correlation analysis
Visualizations: Scatter plots, bar charts, and descriptive summaries

If you found this project insightful, consider leaving a comment, sharing it with your network, or connecting with me. I am always eager to discuss sustainability, data, and the power of clean innovation.

Finding the Perfect Summer Break with Data: A Weather Analysis of London ☀️

OGUNYEMI EZEKIEL TIMILEHIN — Sun, 02 Nov 2025 19:47:32 GMT

Introduction

We all love good weather, especially when it aligns with our plans. No one enjoys booking a long-awaited holiday just to spend it indoors because of constant rain. That thought guided this project. I wanted to use data to answer a simple human question:
When is the best time to go on a vacation in London?

In this project, I explored London’s 2023 weather data to find the most comfortable two-week stretch to take a summer vacation. The dataset came from Meteostat, a platform that provides open access to historical weather and climate data.

Data description

The dataset covered one full year of London’s daily weather, from temperatures to rainfall, sunshine, and wind measurements.

Column	Meaning	Unit
date	Observation date	—
tavg	Mean air temperature	°C
tmin	Minimum air temperature	°C
tmax	Maximum air temperature	°C
prcp	Total precipitation	mm
snow	Snow depth	mm
wdir	Wind direction	° (empty for 2023)
wspd	Average wind speed	km/h
wpgt	Peak wind gust	km/h
pres	Sea-level air pressure	hPa
tsun	Sunshine duration	minutes

Each column provided a clue about London’s weather patterns and helped build a realistic idea of when the weather feels most pleasant.

Step 1: Data Preparation and Cleaning

Before bringing the dataset into Python, I made sure it looked right in Excel. At first, all the values were packed into one column, so I used:

Data → Text to Columns → Delimited → Comma ( , ) → Finish

That simple step split everything neatly into separate columns, making the file readable.

Once the data looked fine, I loaded it into Pandas for a proper clean-up. I used the parameter skipinitialspace=True while importing, which removed any extra spaces that appeared after commas in the CSV file.

Next, I converted the “date” column with pd.to_datetime() so Python could recognize it as an actual date instead of plain text. After that, I set the date as the index to make it easier to work with time-based operations like grouping by months or weeks.

By this point, the dataset was clean, organized, and ready for deeper analysis.

Here’s the code I used:

import pandas as pd

# Loading the dataset
london_2023 = pd.read_csv("London_2023.csv", skipinitialspace=True)

# Converting 'date' column to datetime format
london_2023['date'] = pd.to_datetime(london_2023['date'], errors='coerce')

# Setting 'date' as the index for time-based analysis
london_2023.set_index('date', inplace=True)

# Quick check
print(london_2023.info())
print(london_2023.head())

Step 2: Focusing on the Summer and Creating a Comfort Score

Since London is in the Northern Hemisphere, I decided to focus on June, July, and August, the summer months.
The idea was to find the two-week window that combined warm temperatures, low rainfall, and good sunshine.

But how do you define “good weather”?
To make that measurable, I created what I called a comfort score, a simple formula to rate each day based on three main factors:

Temperature (40%) – How close it was to the ideal 22°C.
Rainfall (30%) – Less rainfall means higher comfort.
Sunshine (30%) – More sunshine adds to comfort.

Each of these factors was normalized between 0 and 1, then combined with weights to calculate a daily comfort score.
Finally, I used a 14-day rolling average to find longer periods of consistently good weather.

Here’s the code that brought it all together:

import numpy as np

# Filtering for June, July, August
summer = london_2023[london_2023.index.month.isin([6, 7, 8])]
print(summer.head())

# Creating a copy of summer dataframe
summer = summer.copy()

# Ideal temperature for comfort
ideal_temp = 22

# Normalizing and calculating component scores
summer.loc[:, 'temp_score'] = 1 - (abs(summer['tavg'] - ideal_temp) / ideal_temp)
summer.loc[:, 'rain_score'] = 1 - (summer['prcp'] / summer['prcp'].max())
summer.loc[:, 'sun_score'] = summer['tsun'] / summer['tsun'].max()

# Clipping scores to the 0–1 range
summer.loc[:, ['temp_score', 'rain_score', 'sun_score']] = summer[['temp_score', 'rain_score', 'sun_score']].clip(0, 1)

# Weighted comfort score
summer.loc[:, 'comfort_score'] = (
    0.4 * summer['temp_score'] +
    0.3 * summer['rain_score'] +
    0.3 * summer['sun_score'])

# Rolling mean (14 days); each value is labelled with the LAST day of its window
summer.loc[:, 'rolling_comfort'] = summer['comfort_score'].rolling(window=14).mean()

# Finding the best 2-week window: idxmax gives the window's final day
best_end = summer['rolling_comfort'].idxmax()
best_start = best_end - pd.Timedelta(days=13)

print("Best vacation period:", best_start.date(), "to", best_end.date())

By combining temperature, rainfall, and sunshine into a single value, I could see which days felt the most balanced and pleasant overall.
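One detail worth keeping in mind when reading a rolling average: by default, pandas labels each rolling window with its last row, so the date returned by idxmax() is the final day of the best window, not the first. A minimal sketch on synthetic scores (a 3-day window instead of 14, values made up) shows the alignment:

```python
import pandas as pd

# Synthetic daily scores: a run of 1.0 values from 5 to 7 June, 0.0 elsewhere.
days = pd.date_range("2023-06-01", periods=10, freq="D")
scores = pd.Series([0, 0, 0, 0, 1, 1, 1, 0, 0, 0], index=days, dtype=float)

# 3-day rolling mean; each value is labelled with the window's final day.
rolling = scores.rolling(window=3).mean()

# The best window ends on the idxmax date and starts (window - 1) days earlier.
window_end = rolling.idxmax()
window_start = window_end - pd.Timedelta(days=2)
print(window_start.date(), "to", window_end.date())  # 2023-06-05 to 2023-06-07
```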

Step 4: Visualizing the Findings

To make the results more meaningful, I created a line chart showing the comfort score over time. Then, I shaded the best two-week window that the data identified.

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))

# Plot daily comfort score
plt.plot(summer.index, summer['comfort_score'], 
         label='Daily Comfort Score', color='skyblue', alpha=0.6)

# Plot 14-day rolling comfort score
plt.plot(summer.index, summer['rolling_comfort'], 
         label='14-day Rolling Average', color='blue', linewidth=2)

# Highlight best vacation period
plt.axvspan(best_start, best_end, color='green', alpha=0.3, label='Best Vacation Period')

# Labels and title
plt.title('Daily and Rolling Comfort Score – Summer 2023')
plt.xlabel('Date')
plt.ylabel('Comfort Score')
plt.legend()
plt.grid(True)

plt.show()

Step 5: Results

The analysis found that the best time to take a summer break in London was from June 27 to July 10, 2023.

During these two weeks, the temperature stayed between 20 and 23°C, rainfall was low, and sunshine hours were longer. It was the ideal balance of all three factors.

Here are the top three vacation periods based on the rolling comfort score:

Rank	Dates	Score
1	June 27 – July 10	0.833
2	June 26 – July 9	0.833
3	June 17 – June 30	0.825

These windows represent stretches of mild temperatures, minimal rain, and plenty of sunshine, exactly what most people hope for during a London summer.
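A ranking of windows like this can be sketched with nlargest on the rolling series. The example below uses made-up rolling scores and assumes each score is indexed by the last day of its window; it is an illustration, not the exact code behind the table:

```python
import pandas as pd

# Synthetic rolling scores indexed by window end date (made-up values).
ends = pd.date_range("2023-07-01", periods=6, freq="D")
rolling = pd.Series([0.70, 0.82, 0.83, 0.79, 0.75, 0.81], index=ends)

window = 14  # length of each window in days

# Take the three highest-scoring windows and rebuild their start dates.
top = rolling.nlargest(3)
table = pd.DataFrame({
    "start": top.index - pd.Timedelta(days=window - 1),
    "end": top.index,
    "score": top.values,
})
print(table)
```

Neighbouring windows share most of their days, so the top entries will often overlap heavily, as the first two rows of the table above do.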

Challenges I Faced

Like most projects, this one didn’t go smoothly at first.

I started by downloading 12 separate CSV files from Weather Underground, one for each month of 2023. Merging them manually was stressful and prone to errors. After a few hours of frustration, I switched to Meteostat, which offered a better alternative.
I also had to think carefully about how to balance the comfort score, since giving too much weight to one factor could skew the results.
Finally, I had to manage missing or incomplete data while keeping the analysis realistic.

These challenges helped me understand how real-world data works: it’s not perfect, and cleaning it up is as important as the analysis itself.

This week’s reflection

By week seven of my learning journey, I realized that data science isn’t just about accuracy or prediction.
It’s about understanding context and solving real problems, even simple ones like knowing when the weather might treat you kindly.

This project taught me how to turn data into a story, one that can influence real-life decisions.

Maybe the code didn’t predict stock prices or train a neural network, but it did something equally valuable: it helped find the best time to pause, breathe, and enjoy life.

And maybe that’s what good data work should be about.

Helpful Resources

Dataset: https://meteostat.net/en/station/03772?t=2023-01-01/2023-12-31

Practical Data Analysis: My Experience with NumPy and Pandas

OGUNYEMI EZEKIEL TIMILEHIN — Sun, 26 Oct 2025 19:36:00 GMT

Introduction

The most interesting thing about NumPy and pandas isn’t their speed. It’s how they quietly handle chaos. You can feed them messy, inconsistent data from six cities, and they’ll help you turn it into something that makes sense.

That’s what I worked on this week: exploring, cleaning, and analyzing real datasets using both tools.

Understanding the Tools

Before analysis, I wanted to understand what makes these two libraries work so well together.

NumPy handles numerical computation. It’s the foundation that powers most of data science in Python. It offers arrays, matrix operations, and mathematical functions that make large-scale calculations fast and efficient. For example, dividing every value in a list by 100 or slicing part of a 3D array takes one line.
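As a small illustration of those two one-liners, here is a sketch with made-up arrays:

```python
import numpy as np

values = np.array([250, 1200, 75])
# Divide every value by 100 in one vectorized step, with no explicit loop.
scaled = values / 100
print(scaled)

# A 3D array with shape (2, 3, 4): 2 blocks, 3 rows, 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
# Slice the first block, its last two rows, and the first two columns.
part = cube[0, 1:, :2]
print(part)
```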

Pandas builds on that foundation. It adds structure through DataFrames: tables of rows and columns you can manipulate easily. Think of it as Excel with Python-level control. You can clean, group, and summarize data in a way that’s both logical and flexible.

In short:

NumPy is for numbers.
Pandas is for meaning.

Core Tasks I Worked On

I worked with six CSV files: Beijing, Brasilia, Cape Town, Delhi, London, and Moscow. They looked uniform at first glance but weren’t. Once loaded into pandas, I found inconsistencies that made merging impossible without cleanup.

Here’s how I handled them.

Loading and Standardizing the Data

1. Encoding differences

When I tried loading the London dataset, it didn’t display correctly and showed a “bad delimiter” error. This usually happens when the file contains special characters that aren’t supported by the default encoding. To fix it, I reloaded the file using a different encoding format called Latin-1, which handles a wider range of characters. After that adjustment, the file loaded without issues.

pd.read_csv('London_2014.csv', encoding='latin1')

Once confirmed, I used this encoding only for that file.

2. Inconsistent column names

Some files had extra spaces, HTML tags, or inconsistent casing. I standardized all headers:

df.columns = (
    df.columns
    .str.replace(r'<[^>]+>', '', regex=True)  # strip any HTML tags
    .str.strip()                              # remove leading/trailing spaces
    .str.replace(' ', '_')                    # use underscores between words
)

3. Adding city identifiers

Each CSV represented one city. To track them after merging:

df['City'] = file.split('_')[0]

4. Combining all datasets

After cleanup:

all_data = pd.concat(dfs, ignore_index=True)

The merged dataset had 2,190 rows and 25 columns.
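The snippets in steps 2 to 4 can be tied together in one loop. This is a minimal sketch, assuming files named like 'London_2014.csv' in a single folder; the helper name load_city_files and the tiny demo files are hypothetical, and the real files need the extra encoding and header handling described above:

```python
import os
import tempfile
import pandas as pd

def load_city_files(folder):
    """Read every '<City>_<year>.csv' in folder and stack them into one DataFrame."""
    dfs = []
    for file in sorted(os.listdir(folder)):
        if not file.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(folder, file))
        # Standardize headers: trim spaces and use underscores.
        df.columns = df.columns.str.strip().str.replace(" ", "_")
        df["City"] = file.split("_")[0]  # 'London_2014.csv' -> 'London'
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)

# Demo with two tiny made-up files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("London_2014.csv", "Moscow_2014.csv"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("Date, Mean TemperatureC\n2014-01-01,5\n2014-01-02,6\n")
    all_data = load_city_files(tmp)

print(all_data)
```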

5. Verification

I confirmed the structure and data types before cleaning:

all_data.info()
all_data.head()

Cleaning and Preparing the Data

Real-world data always comes with missing or noisy values, and this dataset was no exception. I ran a few checks to decide how to handle it. The first check:

all_data.isnull().sum()

Findings:

CloudCover had 432 missing entries
Max_Gust_SpeedKm/h and GMT were mostly empty
Visibility columns had small gaps
Events had partial missing text

Fixes applied:

I removed the GMT and Max_Gust_SpeedKm/h columns because they were mostly empty and not useful for analysis. For the visibility columns I replaced the missing values with each column’s average to keep the data consistent. The Events column had missing text entries, so I filled those gaps with the word “None” to indicate no recorded event instead of leaving them blank.

all_data.drop(['GMT', 'Max_Gust_SpeedKm/h'], axis=1, inplace=True)
all_data['Mean_VisibilityKm'] = all_data['Mean_VisibilityKm'].fillna(all_data['Mean_VisibilityKm'].mean())
all_data['Max_VisibilityKm'] = all_data['Max_VisibilityKm'].fillna(all_data['Max_VisibilityKm'].mean())
all_data['Min_VisibilitykM'] = all_data['Min_VisibilitykM'].fillna(all_data['Min_VisibilitykM'].mean())
all_data['Events'] = all_data['Events'].fillna('None')

After this, the only remaining missing column was CloudCover. I left it unfilled since imputing it could distort results.

Exploring the Data

After cleaning the dataset, I grouped the data by City to understand weather patterns across locations. For each city, I calculated the average, maximum, and minimum temperatures, the total precipitation, and the average humidity. This summary gave a clear comparison of climate characteristics for all six cities in one view.

summary = all_data.groupby('City').agg({
    'Mean_TemperatureC': 'mean',
    'Max_TemperatureC': 'max',
    'Min_TemperatureC': 'min',
    'Precipitationmm': 'sum',
    'Mean_Humidity': 'mean'
})
print(summary)

City	Mean Temp (°C)	Max Temp (°C)	Min Temp (°C)	Precipitation (mm)	Mean Humidity (%)
Beijing	13.36	42	-13	405.89	50.75
Brasilia	22.90	36	9	751.65	58.07
Cape Town	17.57	37	1	428.25	68.76
Delhi	13.71	38	-17	225.31	50.64
London	12.33	30	-4	503.10	73.80
Moscow	5.99	33	-26	0.00	73.38

My key findings:

London and Moscow had the highest humidity.
Brasilia recorded the most rainfall.
Moscow showed zero precipitation, likely a recording issue rather than an actual dry year.

Correlation analysis

Next, I ran a correlation analysis to see how the numerical weather variables relate to each other. This step measures how changes in one variable (like temperature) correspond to changes in another (like humidity or visibility). By calculating correlations only for numeric columns, I could identify strong relationships. For example, the temperature columns showed high positive correlation among themselves, and humidity showed a moderate negative correlation with visibility.

corr = all_data.corr(numeric_only=True)
print(corr)

My key findings:

Temperature columns (Max, Mean, Min) were strongly correlated.
Visibility had a moderate negative correlation with humidity.
Cloud cover correlated negatively with both temperature and visibility.

Even without charts, this gave a clear sense of how weather variables interacted.
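A full correlation matrix can be hard to scan, so one option is to list each pair once and sort by strength. A minimal sketch on made-up readings, assuming column names like those in the weather data:

```python
import numpy as np
import pandas as pd

# Made-up readings standing in for the weather columns.
toy = pd.DataFrame({
    "Mean_TemperatureC": [5, 10, 15, 20, 25],
    "Max_TemperatureC": [9, 15, 19, 26, 30],
    "Mean_Humidity": [80, 72, 75, 60, 55],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once, then rank by |r|.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.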

Practicing with NumPy and the WHO POP TB Dataset

NumPy element-wise comparison

I practiced NumPy’s element-wise comparison, among other operations, to understand how it evaluates two arrays value by value. Using simple arrays of numbers, I compared them with operations like “greater than,” “less than,” and “equal to.” NumPy instantly returned Boolean results (True or False) for each position, showing how efficiently it can handle large-scale comparisons without loops. This helped me see how NumPy’s vectorized operations make numerical analysis both fast and intuitive.

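import numpy as np

a = np.array([2, 4, 9])
b = np.array([1, 5, 9])
print(np.greater(a, b))        # [ True False False]
print(np.greater_equal(a, b))  # [ True False  True]
print(np.less(a, b))           # [False  True False]
print(np.less_equal(a, b))     # [False  True  True]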

These quick operations show how efficiently NumPy performs comparisons across arrays. The same logic applies to larger datasets.

To strengthen my understanding of indexing and conditional selection, I also practiced with the WHO POP TB dataset, which includes country-level population and tuberculosis statistics.

# Display the 55th row
df.iloc[54]

# Display first 10 rows
df.head(10)

# Show first 8 rows of Country and TB deaths
df[['Country', 'TB deaths']].head(8)

# Find countries with TB deaths > 10,000
df[df['TB deaths'] > 10000]

# Find where population ≤ 50 million (the column is in thousands) or TB deaths ≥ 20,000
df[(df['Population (1000s)'] <= 50000) | (df['TB deaths'] >= 20000)]

This exercise helped reinforce selection logic and conditional filtering skills that make real-world data analysis smoother and faster.

Challenges I Faced

I ran into a few practical issues during the process. The London CSV file used a different encoding, which caused loading errors until I specified the correct one. Some column names had hidden spaces and HTML tags that made merging fail until I standardized them. The CloudCover column was incomplete and couldn’t be filled reliably without distorting results. I also noticed type mismatches when comparing numeric columns, which required conversions before analysis. These challenges might seem minor, but they show how unpredictable real datasets can be and why careful data handling is key to working effectively with pandas.

The London CSV used a different encoding.
Hidden spaces and HTML tags broke merges.
CloudCover was incomplete.
Type mismatches appeared when comparing numeric columns.

My Perspective and Takeaways

This week reinforced one main idea: most data analysis happens before visualization or modeling.
The key steps are loading, cleaning, and validating the data. That’s where real insight begins.

A few lessons stood out:

Check encoding before merging files.
Standardize column names early.
Fill missing values only after understanding their meaning.
Use correlation to identify patterns quickly.

NumPy and pandas aren’t flashy tools, but they’re dependable ones.
Once your data is structured and consistent, finding insights becomes the easy part.

HELPFUL RESOURCES

Youtube: https://youtu.be/wUSDVGivd-8?si=d05zHtoyNTABnIj1

Improving Methods for Predicting Concrete Strength

OGUNYEMI EZEKIEL TIMILEHIN — Thu, 23 Oct 2025 11:26:19 GMT

Introduction

This research proposal came from looking back at my final-year project. I had developed a strength prediction model for concrete using regression and ANOVA in Excel. The goal was to predict compressive strength from mix ratios and curing conditions.

The model worked within a narrow range. Once the material composition changed, accuracy dropped. That made me think deeper about prediction in concrete mix design and how traditional methods struggle when data doesn’t follow simple patterns.

Integrating Past Work into Future Vision

One of my recommendations in that project was to “apply advanced computational models to improve strength prediction accuracy.” It was a brief note at the time, but it later shaped my new proposal:

A Machine Learning–Based Predictive Framework for Estimating the Compressive Strength of Hybrid Concretes Incorporating Supplementary Cementitious Materials.

The proposal combines civil engineering practice with data science. It focuses on:

Using open-source concrete datasets
Cleaning and preprocessing data
Training and testing several ML models with cross-validation
Using SHAP analysis to identify which inputs (binder ratio, curing age, water-to-binder ratio, etc.) influence strength the most

The aim is to build a data-driven framework that helps engineers design concrete mixes that use less cement while maintaining performance.

Why This Matters

Concrete production accounts for about 8% of global CO₂ emissions, mainly from cement. Reducing cement content without losing strength is a key engineering problem.

If machine learning can predict the right mix combinations, you can:

Cut down on laboratory trials
Save time and material costs
Design more efficient mixes
Reduce cement use and emissions

This approach doesn’t replace engineering judgment. It supports it with data. Engineers still make decisions, but with better insight from models that learn from large datasets.

Using ML in mix design could make research more data-oriented, less repetitive, and more sustainable.

Looking Back: What I Learned

In most mix design practice, engineers rely on:

Trial-and-error methods
Empirical formulas
Linear regression models

These methods are easy to apply but assume relationships between inputs and strength stay constant. In reality, materials change. Cement from different sources, variations in aggregates, and the use of supplementary materials all shift outcomes.

Regression models can’t always handle these changes. They assume straight-line relationships where the real world behaves differently.

Why Machine Learning Makes Sense

Machine learning (ML) doesn’t rely on fixed formulas. It learns directly from data. That means it can capture how multiple variables interact even when the relationship isn’t linear.

For example, in concrete with additives like fly ash or laterite, strength gain doesn’t increase uniformly with mix ratio. ML models such as:

Random Forest
XGBoost
Neural Networks

can find those patterns automatically.

Here’s the difference:

Regression fits one global equation to all data.
ML adapts to local variations and complex interactions.

That’s why ML suits concrete strength prediction: it can work with the uncertainty that traditional models can’t explain.
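To make the contrast concrete, here is a rough sketch comparing a linear model with a Random Forest under cross-validation. Everything in it is an assumption for illustration: the data are synthetic, 'replacement' and 'age' are made-up stand-ins for an additive share and a scaled curing age, and the strength formula is invented so that strength peaks at a mid-range share. It is not concrete test data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (invented relationship, not laboratory results).
replacement = rng.uniform(0, 1, 300)
age = rng.uniform(0, 1, 300)
strength = 40 - 80 * (replacement - 0.5) ** 2 + 10 * age + rng.normal(0, 1, 300)
X = np.column_stack([replacement, age])

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5-fold cross-validated R² for each model.
scores = {
    name: cross_val_score(model, X, strength, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
print(scores)
```

On data shaped like this, a single straight line cannot follow the peak in the middle of the range, while the forest can approximate it piece by piece. Real mix data may behave differently.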

What’s Next

My next step is to build the technical foundation for this work. I’m focusing on:

Python for analysis and scripting
Scikit-learn, XGBoost, and TensorFlow for model development
Data visualization tools like Power BI and Excel for results interpretation

The plan is to turn this proposal into a capstone project once I’m confident with these tools. It will cover everything from data preprocessing to feature analysis and model validation.

This direction connects my civil engineering background with data science. It’s about improving how concrete performance is predicted using data, not guesswork.

The future of construction depends on smarter design. Machine learning can help make that happen.

Closing Thought

Construction is evolving. The way we design, predict, and test materials has to evolve too. Machine learning offers a practical way to make that shift, not by replacing engineers but by giving them better tools to make data-backed decisions.

For me, this research isn’t just academic; it’s a direction. It connects what I’ve done before with where I want to go next: using machine learning to improve how we predict and design sustainable concrete.

Comprehensive Guide to Python Modules: JSON, Math, and Beyond

OGUNYEMI EZEKIEL TIMILEHIN — Sun, 28 Sep 2025 20:00:51 GMT

Introduction: When Coding Gets Real

There’s a point in every beginner’s coding journey where things stop feeling like simple toy problems and start resembling the real world. For me, that point came in Week 3 of my learning journey with Dataraflow.

Up until last week, I was happily experimenting with Object-Oriented Programming (OOP) — creating classes, inheriting attributes, and even trying my hands on polymorphism. But now the story shifted. Suddenly, I wasn’t just writing my own code; I was asked to step into Python’s toolbox and start using modules — powerful pre-built functionalities that make programming more efficient.

And let me be honest: at first, it felt overwhelming.
Math functions, date manipulation, JSON file handling, virtual environments — it was like walking into a supermarket for the first time and realising you need more than a basket. But as I dug deeper, I realised that these modules are the very shortcuts that transform you from a beginner writing “Hello World” into a problem-solver who can build real systems.

Theory: Modules, Packages, and Why They Matter

So, what exactly did I learn?

A module is basically a Python file containing code (functions, classes, or variables) that you can reuse in different programs. For example, the math module saves you the pain of writing formulas from scratch. A package, on the other hand, is a collection of modules organised neatly in directories, often with an __init__.py file that signals “Hey, I’m a package!” (think of libraries like numpy or pandas).

Why does this matter? Because as projects get bigger, no one has time to reinvent the wheel. You don’t want to write your own trigonometry functions or build your own date formatter. Instead, you borrow tools from Python’s rich ecosystem and focus on solving your unique problem.

Alongside modules, I met some equally powerful companions:

JSON (JavaScript Object Notation): A universal data format that makes Python dictionaries talk to the outside world (APIs, files, web apps).
Datetime: Working with dates and times. Suddenly, birthdays, deadlines, and countdowns became programmable.
Error Handling (try/except): A lifesaver that catches mistakes gracefully instead of letting your program crash.
Virtual Environments: My own coding bubble where dependencies don’t clash. Essential for serious projects.

It felt like moving from toy blocks to a real workshop full of tools.

Practical Tasks: Building with Modules

The week wasn’t just theory. I had hands-on assignments, and with each one, I discovered something new about myself as a programmer. Some of these tasks are listed below, with short code snippets to show how each module is used in practice.

Task 1: Math Module

What I Learnt: I didn’t need to memorise or code formulas, Python had my back.

import math

print("Square root of 144:", math.sqrt(144))
print("Factorial of 6:", math.factorial(6))
print("Pi constant:", math.pi)

This was my first “oh, oh!” moment. Instead of writing multiple lines to calculate factorials, math.factorial(6) gave me the result instantly. It made me feel like I was wielding a scientific calculator built into Python.

Task 2: Datetime Module

What I Learnt: Dates can be messy. Is it 09/12/2025 or 12/09/2025? Python’s datetime stripped away confusion and let me format time however I wanted.

from datetime import datetime

now = datetime.now()
print("Current Date & Time:", now)
print("Formatted:", now.strftime("%d-%m-%Y"))

I also practiced calculating the number of days until my next birthday. Suddenly, math + datetime became personal.
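
The birthday countdown mentioned above could look something like the sketch below. The function name days_until_birthday and the example dates are assumptions made for illustration, and the sketch does not handle a 29 February birthday in a non-leap year.

```python
from datetime import date


def days_until_birthday(birth_month, birth_day, today=None):
    """Return the number of days from `today` to the next birthday."""
    today = today or date.today()
    next_birthday = date(today.year, birth_month, birth_day)
    if next_birthday < today:
        # This year's birthday has already passed, so use next year's.
        next_birthday = date(today.year + 1, birth_month, birth_day)
    return (next_birthday - today).days


# A fixed "today" keeps the result reproducible.
print(days_until_birthday(12, 25, today=date(2025, 9, 27)))  # 89
```

Subtracting one date from another gives a timedelta, and its .days attribute is the countdown.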

Task 3: JSON Module

What I Learnt: JSON is like a bridge between Python and the world.

import json

student = {"name": "Ezekiel", "age": 23, "grade": "A"}
student_json = json.dumps(student)   # dict → JSON
print("JSON String:", student_json)

parsed = json.loads(student_json)    # JSON → dict
print("Parsed Dict:", parsed)

When I first saw curly braces and strings, I thought: “Wait, isn’t this just a dictionary?” But then it clicked — JSON is the universal language of the internet. That’s how APIs talk. That’s how data travels between apps.

Task 4: Error Handling

What I Learnt: Errors don’t have to kill your code. They can be tamed.

try:
    num = int(input("Enter a number: "))
    print("100 divided by", num, "=", 100 / num)
except ZeroDivisionError:
    print("Oops! You can’t divide by zero.")
except ValueError:
    print("Please enter a valid integer.")

I remember the first time I divided by zero during practice. Instead of a scary “ZeroDivisionError” traceback, my program now politely said: “Oops! You can’t divide by zero.” That moment felt empowering — like I had just added safety rails to my code.

Task 5: Virtual Environments

What I Learnt: Every project deserves its own bubble.

python -m venv myenv
myenv\Scripts\activate    
pip install numpy
pip list

At first, I was annoyed: why couldn’t I just install everything globally? But then I understood. Virtual environments are like “separate kitchens” for each project. You don’t want to cook jollof rice in the same pot you used for egusi soup. Keeping dependencies isolated saves so much future headache.

Fun Project: Library Management System (Upgraded)

By combining OOP (Week 2) and Modules (Week 3), I upgraded my Library Management System.

Books and members were saved in a JSON file — so data persisted even after the program stopped.
I used datetime to record when a book was borrowed and calculate due dates.
Error handling prevented users from borrowing unavailable books.
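
Here is a rough sketch of how those three pieces can fit together. It is not the project's actual code: the function names, the dictionary layout, and the 14-day loan period are all assumptions made for illustration. Dates are stored as ISO strings because the json module cannot serialise date objects directly.

```python
import json
from datetime import date, timedelta

LOAN_DAYS = 14  # assumed loan period, not taken from the original project


def borrow_book(library, title, today):
    """Mark a book as borrowed and record when it is due back."""
    book = library[title]
    if not book["available"]:
        raise ValueError(f"'{title}' is already borrowed")
    book["available"] = False
    book["borrowed_on"] = today.isoformat()
    book["due_on"] = (today + timedelta(days=LOAN_DAYS)).isoformat()


def save_library(library, path):
    """Write the library dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(library, f, indent=2)


def load_library(path):
    """Read the library dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


library = {"Dune": {"available": True}}
borrow_book(library, "Dune", date(2025, 9, 27))
print(library["Dune"]["due_on"])  # 2025-10-11
```

Calling save_library(library, "library.json") before the program exits and load_library("library.json") on the next start is what makes the data persist.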

Suddenly, my project wasn’t just a code exercise — it felt like the foundation of a real-world app.

Challenges I Faced (And How I Solved Them)

This week wasn’t smooth. Honestly, it was frustrating at times. But each struggle shaped my understanding.

The “Import Confusion” Trap
I often mixed up import module and from module import function. For example, I’d write sqrt(144) without math. and wonder why Python complained. My solution? I slowed down and mapped it in my notes: “import keeps the namespace intact; from imports directly into yours.”
The JSON Gibberish Moment
The first time I wrote a dictionary to a file, I opened it and saw gibberish. I thought my code was broken. Turns out, I hadn’t converted it properly with json.dumps(). Lesson: computers need structure, not assumptions.
Virtual Environment Chaos
Activating my virtual environment on Windows gave me endless errors. I typed commands exactly as tutorials showed, but nothing worked. The problem? I hadn’t run my terminal in the right directory. Once I understood the file paths, everything clicked.
Catching Everything (Too Much Error Handling)
In the beginning, my try/except was too broad. I caught every possible error, which made debugging impossible. Later, I learned to target specific exceptions like ValueError or ZeroDivisionError. It was like switching from a giant fishing net to a precise hook.
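
To make the import mix-up from the first challenge concrete, here is a small sketch of the two import styles side by side. The deliberate NameError at the end is only there to show what Python complains about.

```python
import math              # binds the name `math`; functions live under it
from math import sqrt    # binds the name `sqrt` directly

print(math.sqrt(144))  # 12.0
print(sqrt(144))       # 12.0

# `factorial` was never imported by name, so a bare call fails:
try:
    factorial(6)
except NameError as err:
    print("NameError:", err)
```

Both names point at the same function; the only difference is which name ends up in your own namespace.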

Each challenge taught me something deeper than syntax. It taught me how to think as a programmer: slow down, test, debug logically, and trust the process.

Reflection & Key Takeaways

This week shifted how I view programming.

Modules are the real magic. They extend Python beyond imagination.
Errors aren’t enemies. They’re guides pointing me to what I don’t yet understand.
Practice beats theory. Reading about JSON didn’t help until I actually tried saving and loading data.
Integration matters. Week 2’s OOP + Week 3’s modules gave me my first real taste of building something “bigger.”

Most importantly, I realised programming is not about memorising; it’s about learning how to use the right tool at the right time.

✨ That’s a wrap for Week 3!

At the end of this week, I’ve realised that programming isn’t about how much code you can write, but how smartly you can use what already exists. Python’s modules taught me that efficiency often comes from standing on the shoulders of giants: using tools that others have perfected so I can focus on solving my own unique problems.

Week 2 gave me the foundation of building structures with OOP, and Week 3 handed me the toolbox of modules to bring those structures to life. Together, they’ve reshaped the way I see coding: not as a set of scary terms, but as a craft where ideas, tools, and persistence come together to create something meaningful.

And honestly? That’s when coding stops being intimidating and starts becoming fun.

Grasp Python OOP Effortlessly: My Breakthrough Moment

OGUNYEMI EZEKIEL TIMILEHIN — Sun, 21 Sep 2025 00:27:57 GMT

💡 “Classes, objects, inheritance, polymorphism, scope, and iterators — these are the pillars of Object-Oriented Programming (OOP) in Python. At first glance, they may seem intimidating, but once broken into small, practical steps, they become powerful tools for writing clean, reusable, and structured code. In this post, I’ll share exactly how I learned these concepts, the code that made each of them click, and the project that brought them all together.”

Clear-cut learning objectives: Week 2

At first glance, this week’s OOP module intrigued me. Beyond the theory, there were clear-cut objectives designed to help me:

Understand the building blocks of Python OOP.
Solve practical core tasks to reinforce learning.
Attempt mini-projects and conceptual questions for deeper mastery.
Finally, bring it all together in a fun project — a Library Management System.

This publication documents my Week 2 journey: what I studied, the tasks I solved, the challenges I faced, and my reflections.

The How: Theory, Code, and Approach

Week 2 wasn’t just about memorizing Python concepts; it was about understanding them in a way that felt real and practical. There was a combination of theory, small coding exercises, and full projects. Each concept wasn’t just a definition, but something I tried to bring alive with code and practice.

Object-Oriented Programming (OOP) Basics

I came to see classes as blueprints, and objects as the actual things built from those blueprints. For example, a Car class is like the idea of a car, but my actual Toyota or Honda is the object.

Attributes felt like an object’s “identity card” — they describe its properties (like colour, name, brand).
Methods became the actions the object could perform, like drive() or stop().

This helped me stop seeing code as just commands and start viewing it as living objects that interact.

Inheritance in Python

This concept clicked for me when I realised it’s just like family traits. A child class inherits features from its parent class but can also have its own unique traits.

For example, a Dog class and a Cat class can both inherit from Animal but still make their own unique sounds.

It made me appreciate how inheritance saves time, avoids rewriting the same code, and keeps everything neat and structured — just like organising files into folders.

Scope and Encapsulation

This was tricky for me at first, because I kept mixing up which variable belonged where.

Scope taught me to respect boundaries:
- Local (inside a function),
- Global (outside everything),
- Non-local (inside nested functions).
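Encapsulation showed me that not every detail should be exposed. By using protected (_var) and private (__var) attributes, I learned how to “hide” sensitive parts of a class. In Python these are conventions rather than locks: a single underscore signals “internal use”, and a double underscore triggers name mangling instead of true access control.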

Getter and setter methods then gave me a controlled way to safely access or change those hidden details. It was like locking my valuables in a drawer and giving out a key only when necessary.
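
As a minimal sketch of that idea, the class below hides a balance behind a getter and setter. It uses Python's @property decorator, which is one common way to write getters and setters and not necessarily the form used in the course; the BankAccount class and its validation rule are assumptions for illustration.

```python
class BankAccount:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.__balance = balance  # name-mangled to _BankAccount__balance

    @property
    def balance(self):
        """Getter: read access to the hidden balance."""
        return self.__balance

    @balance.setter
    def balance(self, amount):
        """Setter: reject invalid values before storing them."""
        if amount < 0:
            raise ValueError("Balance cannot be negative")
        self.__balance = amount


account = BankAccount("Ezekiel", 100)
account.balance = 250
print(account.balance)  # 250
```

Assigning a negative value raises ValueError, so the class controls how its hidden state can change.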

Iterators in Python

At first, I just saw loops as something Python magically did. But iterators made me realize what’s happening under the hood.

They rely on two methods: __iter__() and __next__().
They are what make for loops work and allow us to create custom looping behavior.

The coolest part was creating my own iterator, like a countdown or even an infinite even-number generator. It showed me how flexible Python really is once you understand the mechanism.
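
An infinite even-number iterator can be sketched as follows. The class name EvenNumbers is an assumption; the key point is that __next__() never raises StopIteration, so the caller decides when to stop (here with itertools.islice).

```python
from itertools import islice


class EvenNumbers:
    """Iterator that yields 0, 2, 4, ... without ever stopping."""

    def __init__(self):
        self.current = 0

    def __iter__(self):
        return self

    def __next__(self):
        value = self.current
        self.current += 2
        return value


# Take only the first five values from the endless stream.
first_five = list(islice(EvenNumbers(), 5))
print(first_five)  # [0, 2, 4, 6, 8]
```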

Polymorphism

This one made me smile because it felt like the most “real-world” of them all.

Polymorphism means different classes can have the same method name, but each behaves differently.
For example:

A Car might have a .move() method that prints “Car drives”.
A Bicycle might also have .move(), but it prints “Bicycle pedals”.

The method name is the same, but the action depends on the object calling it. To me, it showed that code doesn’t have to be rigid — it can adapt to the situation, just like people do in different contexts.

Practical Application — Tasks & Solutions

1. Core Tasks (OOP, Inheritance, Scope, Iterators, Polymorphism)

OOP Basics – Car Class Example

What I Learnt: Classes are blueprints, objects are instances, and __init__ initializes them.

class Car:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def details(self):
        return f"{self.brand} {self.model}"

car1 = Car("Toyota", "Corolla")
car2 = Car("Honda", "Civic")

print(car1.details())
print(car2.details())

✅ Output:

Toyota Corolla
Honda Civic

Inheritance – Animal Example

What I Learnt: A child class can override its parent’s behavior.

# Parent class
class Animal:
    def __init__(self, name, sound):
        self.name = name
        self.sound = sound

    def make_sound(self):
        print(f"{self.name} makes a sound!")


# Child classes override make_sound()
class Dog(Animal):
    def make_sound(self):
        print(f"{self.name} says {self.sound}!")


class Cat(Animal):
    def make_sound(self):
        print(f"{self.name} says {self.sound}!")


dog = Dog("Ariel", "woof")   # creating objects
cat = Cat("Sophie", "Meow")

dog.make_sound()   # calling their methods
cat.make_sound()

✅ Output:

Ariel says woof!
Sophie says Meow!

Scope Example

What I Learnt: Variables may look the same but behave differently depending on scope.

x = 300
def myfunc():
    x = 200
myfunc()
print(x)  # prints 300 (global unchanged)

With global:

x = 300
def myfunc():
    global x
    x = 200
myfunc()
print(x)  # prints 200 (global changed)

Iterators – CountDown Example

What I Learnt: Both __iter__() and __next__() are needed to build custom loops.

class CountDown:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        current = self.n
        self.n -= 1
        return current

for num in CountDown(5):
    print(num)

✅ Output:

5
4
3
2
1

Polymorphism – Car & Bicycle

What I Learnt: Polymorphism gives flexibility — same method name, different results.

class Car:
    def move(self):
        return "Car drives"

class Bicycle:
    def move(self):
        return "Bicycle pedals"

vehicles = [Car(), Bicycle()]
for v in vehicles:
    print(v.move())

✅ Output:

Car drives
Bicycle pedals

2. Mini-Projects

Counter class → Counted object creations.
Shape hierarchy → Practiced inheritance with Rectangle and Circle.
BankAccount system → Explored methods, subclasses, and data encapsulation.
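
As an illustration of the first mini-project, a class that counts its own instances can be sketched like this. The details are assumptions, not the original solution: the count lives in a class attribute shared by every object.

```python
class Counter:
    count = 0  # class attribute shared by all instances

    def __init__(self):
        # Update the shared class attribute, not a per-object copy.
        Counter.count += 1


a, b, c = Counter(), Counter(), Counter()
print(Counter.count)  # 3
```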

3. Conceptual Assessments

I tested myself with theory questions:

Difference between attributes and methods.
Role of __init__ and __repr__.
Encapsulation’s role in OOP.
Python’s scope rules (LEGB rule).

Fun Project — Library Management System

Bringing OOP Concepts Together

This was the highlight of my week. I built a Library Management System that combined everything I learned:

Classes for Book, EBook, PrintedBook, Member, StudentMember, TeacherMember.
Inheritance, polymorphism, encapsulation, and iterators in action.

Key Features Implemented

Encapsulation: Protected _borrowed_books, private __library_name.
Inheritance: EBook and PrintedBook extend Book.
Polymorphism: Students vs teachers had different borrowing limits.
Iterators: Allowed iteration over books and borrowed items.

Lessons from the Project

OOP concepts don’t exist in isolation.
They come together naturally in real-world problems.
The project made me appreciate why OOP matters.

Challenges I Faced

Learning OOP was not smooth — I hit roadblocks. But each mistake taught me something valuable.

1. Confusion Between __init__ and __repr__

I thought both were for printing.
Later I realized: __init__ initializes, while __repr__ represents.

✅ Solution: Practiced small examples until clear.
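
A small example of the kind that makes the difference visible: __init__ runs once when the object is created, while __repr__ is called whenever Python needs a text representation of it. The Book class and its fields are assumptions for illustration.

```python
class Book:
    def __init__(self, title, author):
        # Runs at creation time and stores the object's data.
        self.title = title
        self.author = author

    def __repr__(self):
        # Returns the text shown by repr(), and by print() when
        # no __str__ is defined.
        return f"Book(title={self.title!r}, author={self.author!r})"


book = Book("Things Fall Apart", "Chinua Achebe")
print(book)  # Book(title='Things Fall Apart', author='Chinua Achebe')
```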

2. Scope and Global Keyword Errors

My variables didn’t update globally.
I was shadowing variables inside functions.

✅ Solution: Experimented with global and nonlocal until I understood how they worked.
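
For the nonlocal case, a sketch along these lines shows the difference. The make_counter function is a hypothetical example: without the nonlocal line, count += 1 would raise UnboundLocalError because Python would treat count as a new local variable.

```python
def make_counter():
    count = 0  # lives in the enclosing function's scope

    def increment():
        nonlocal count  # rebind the enclosing variable, not a new local
        count += 1
        return count

    return increment


counter = make_counter()
counter()
print(counter())  # 2
```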

3. Iterators Confusion

Got errors like:

  TypeError: 'CountDown' object is not iterable

Because I forgot __iter__().

✅ Solution: Learned both __iter__() and __next__() are required.

4. Inheritance and MRO (Method Resolution Order)

With multiple inheritance (Duck(Flyer, Swimmer)), I didn’t understand why super() only called one parent.
Printing Duck.__mro__ confused me.

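✅ Solution: Learned that super() does not mean “the parent”; it means “the next class in the MRO”. For Duck(Flyer, Swimmer) the order is Duck, Flyer, Swimmer, object, read left to right, and Swimmer’s method only runs if Flyer’s method also calls super().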
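
A sketch of the Duck(Flyer, Swimmer) case, assuming each class calls super().__init__() cooperatively. The calls list is only there to record the order in which the constructors run.

```python
calls = []  # records which __init__ methods ran, in order


class Flyer:
    def __init__(self):
        calls.append("Flyer")
        super().__init__()  # next class in the MRO, not "the parent"


class Swimmer:
    def __init__(self):
        calls.append("Swimmer")
        super().__init__()


class Duck(Flyer, Swimmer):
    def __init__(self):
        calls.append("Duck")
        super().__init__()


mro_names = [cls.__name__ for cls in Duck.__mro__]
print(mro_names)  # ['Duck', 'Flyer', 'Swimmer', 'object']

Duck()
print(calls)  # ['Duck', 'Flyer', 'Swimmer']
```

If Flyer.__init__ did not call super().__init__(), the chain would stop there and Swimmer would never run, which is what the “super() only called one parent” symptom looks like.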

How I Overcame My Challenges: Summary

Simplicity: Broke big problems into smaller steps.
Study Hours: Spent several hours reading books and watching YouTube to grasp key concepts.
Understanding first: I explained concepts in plain English before coding.
Practice: Wrote, tested, and rewrote code until it clicked.

My Perspective After Week 2

I now think in classes and objects, not just functions.
I understand how OOP principles connect to real systems.
I feel more confident tackling new problems.

My Key Takeaways

OOP organizes code and makes it reusable.
Encapsulation protects data and ensures safety.
Polymorphism and Iterators add flexibility.
Projects tie everything together and show the bigger picture.

✨ This is my Week 2 OOP journey, from learning the basics to insightful applications. This mix of theory, code, and practice helped me not just understand the words, but really see how Python’s OOP concepts come alive. I’m excited for the journey ahead.

Helpful Resources

Books: Python Programming Bible 2024 (3 in 1), Python All-in-One For Dummies
Peer-driven learning: Had a session with my group mate where we brainstormed and asked questions
Weekly class: Live session that explained the week’s module better and answered questions https://youtu.be/1dadlcXFNKY?si=EBFYp3f8YDLAs7Ie
YouTube: https://youtu.be/wUSDVGivd-8?si=d05zHtoyNTABnIj1

A Step-by-Step Look at My Data Science Journey

OGUNYEMI EZEKIEL TIMILEHIN — Fri, 12 Sep 2025 23:53:48 GMT

🌱 Introduction

Foundation is the backbone of any structure built to last.
Dataraflow thought it expedient to lay that foundation — from the very basics of data science, to setting up a community that thrives through peer-driven learning (luckily for me, I'm in Group 6 — Data Raiders), and eventually kick-starting the lessons with the same tone: THE BASICS.

We started with a Python crash course — from data types all the way to real applications.
From writing print("Hello World") to building simple projects, this first week has been quite a ride.

This article is my honest reflection on what I learned, the struggles I faced, how I overcame them, and my perspective on coding as a beginner.

🛠 What I Learned

1. Python Basics

At first, I learned the building blocks:

Data Types
Variables (storing information)
Arithmetic operations like sum, difference, and product

For example, learning how to take two numbers and find their sum, difference, and product gave me confidence that I could actually “instruct” the computer.

2. Control Flow (if-else, loops)

Next, I explored decision-making in Python:

Writing conditions with if, elif, and else
Using loops (for and while) to repeat tasks

A good example was the grading system program.
I created a program that accepts marks for 5 subjects, calculates the average, and assigns a grade (A, B, C, or F).
This exercise helped me see how conditions control the flow of logic.
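
A minimal sketch of that grading logic is shown below. The grade boundaries (70, 60, 50) and the sample marks are assumptions, since the original thresholds are not stated, and the marks are hard-coded instead of read with input() so the example runs on its own.

```python
def grade_for(marks):
    """Return (average, grade) for a list of subject marks."""
    average = sum(marks) / len(marks)
    # Assumed grade boundaries; adjust to the real grading scheme.
    if average >= 70:
        grade = "A"
    elif average >= 60:
        grade = "B"
    elif average >= 50:
        grade = "C"
    else:
        grade = "F"
    return average, grade


marks = [78, 65, 82, 59, 71]  # five subjects
average, grade = grade_for(marks)
print(f"Average: {average}, Grade: {grade}")  # Average: 71.0, Grade: A
```

The order of the conditions matters: Python checks them top to bottom and stops at the first one that is true.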

3. Data Structures (Lists & Dictionaries)

One of the most powerful things I discovered was how to organise data:

Lists helped me store multiple values, like marks.
Dictionaries were useful when I built a shopping cart program, where I stored item names and their prices.

These were my “aha!” moments — when I realised coding is really about structuring information.
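
The shopping cart idea can be sketched with a dictionary that maps item names to prices. The items and prices here are made up for illustration.

```python
# Item name -> price (made-up values)
cart = {"rice": 4500, "beans": 3200, "oil": 2800}

# .items() yields (name, price) pairs, one per entry.
for item, price in cart.items():
    print(f"{item}: {price}")

total = sum(cart.values())
print("Total:", total)  # Total: 10500
```

A list would hold the prices, but only the dictionary keeps each price attached to its item name.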

4. Fun Projects

Some of my favourite projects were small but exciting:

Fibonacci Sequence Generator → taught me logic and loops

  a, b = 0, 1
  fibonacci = [a, b]
  for _ in range(8):
      c = a + b
      fibonacci.append(c)
      a, b = b, c
  print(f"The first 10 numbers of the Fibonacci sequence are {fibonacci}")

Guessing Game → made coding interactive, with hints like “too high” or “too low”

  # 1. The computer randomly picks a number between 1 and 20
  import random
  secret_number = random.randint(1, 20)
  print('Welcome to the guessing game')
  print('I have picked a number between 1 and 20. Can you guess it?')

  # 2. Loop until the user guesses correctly
  while True:
      guess = int(input('Enter your guess (1-20): '))
      if guess < secret_number:
          print('Too low! Try again.')
      elif guess > secret_number:
          print('Too high! Try again.')
      else:
          print(f'Got it! The number was {secret_number}')
          break

⚡ Challenges I Faced

"Before this week, the code I’d most often write was SQL, where indentation, case, and spacing didn’t really matter. Then came Python… and suddenly I was in a whole new world!"
Errors everywhere 😅 — missing colons, wrong indentation, or mis-typed variables.
At first, they frustrated me, but later I realised errors are simply feedback.
Understanding loops and conditions — I often got wrong results because I didn’t fully grasp the logic.
Practising with small examples helped me fix this.

🎯 My Perspective After Week One

Learning data science is not just about writing code — it’s about developing problem-solving skills and learning to think logically.

Errors are part of the process — not a sign of failure
Every bug I fix makes me better at programming
Consistency matters — even writing a few lines of code daily keeps the momentum alive
Real application is where true learning happens — the tasks, take-home exercises, and assignments showed me just how important these basics are
Projects (no matter how small) make learning fun and practical

🛠 How I Overcame the Struggles

✅ Practice, Practice, Practice — I wrote the same code multiple times until I could explain it in my own words.
✅ Debugging by Printing — I used print() to check what my variables were doing at each step.
✅ Asking for Help — I didn’t struggle in silence. I asked questions, watched videos, read resource material, and got explanations that taught me the why behind the code.

📝 Conclusion

Today, I may not be an expert, but I can confidently say I understand the core concepts of Python — and I can keep building from here.

This is just the beginning of my story here at Dataraflow, and I’m excited for what’s ahead 🚀.