🧹 Module 01

Data Wrangling & Cleaning

Master data preparation with Pandas — handling dirty data, transformations, missing values, and normalization techniques.

🧼 Data Cleaning Simulator

Enter raw data with missing values, outliers, and invalid entries. See how it gets cleaned using IQR-based outlier removal and median imputation.

Raw dirty data (comma separated, use NaN or ? for missing)

python · data_cleaning.py

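import pandas as pd
import numpy as np

# Assumes `df` is an existing DataFrame holding the raw data.
# Replace common missing value placeholders with NaN
df = df.replace(['?', 'NA', ''], np.nan)

# Columns that held placeholders are still object dtype; convert a column
# only when every non-missing entry parses as a number
for col in df.select_dtypes(include='object').columns:
    converted = pd.to_numeric(df[col], errors='coerce')
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted

# Fill numeric missing values with median (robust to outliers)
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    # Assign back instead of using inplace=True on a column selection,
    # which may not modify df in recent pandas versions
    df[col] = df[col].fillna(df[col].median())

# Remove outliers using IQR method
def remove_outliers(df, column):
    """Keep rows whose value in `column` lies within 1.5 * IQR of the quartiles."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[column] >= lower) & (df[column] <= upper)]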

🔄 Data Transformation Explorer

Compare different mathematical transformations and their effect on data distribution.

Input data

Transformation

Log Transformation

y′ = log(y + 1)

Compresses large values, stabilizes variance, reduces right-skew. The +1 prevents log(0) errors.

Square Root

y′ = √y

Milder compression than log. Useful for count data following a Poisson distribution.
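As a rough illustration of the two transformations above, the sketch below applies both to a small right-skewed sample. The sample values are hypothetical and chosen only to make the compression visible; `np.log1p` computes log(y + 1) directly.

```python
import numpy as np

# Hypothetical right-skewed sample (illustration only)
y = np.array([0.0, 1.0, 3.0, 8.0, 99.0])

log_y = np.log1p(y)   # log(y + 1); defined at y = 0
sqrt_y = np.sqrt(y)   # milder compression than log

for raw, lg, sq in zip(y, log_y, sqrt_y):
    print(f"y={raw:6.1f}  log(y+1)={lg:.3f}  sqrt(y)={sq:.3f}")

# np.expm1 inverts the log transform when predictions need original units
restored = np.expm1(log_y)
```

The largest value shrinks far more under the log transform than under the square root, which is why the square root is described as the milder option.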

❓ Missing Values Handler

Data with missing values (use NaN)

Strategy

Mean

Replace with column mean. Fast but sensitive to outliers.

Median

Replace with median. Robust to skewed distributions.

Interpolation

Estimate from neighboring values. Best for time series.
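The three strategies can be compared side by side on one small series. This is a minimal sketch using pandas; the values are hypothetical, with one large entry included to show how the mean is pulled by an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps and one large value (illustration only)
s = pd.Series([10.0, np.nan, 30.0, np.nan, 500.0])

mean_filled = s.fillna(s.mean())       # mean of observed values
median_filled = s.fillna(s.median())   # median of observed values
interpolated = s.interpolate()         # linear interpolation between neighbors

print(pd.DataFrame({
    "raw": s,
    "mean": mean_filled,
    "median": median_filled,
    "interpolated": interpolated,
}))
```

Mean and median imputation put the same value in every gap, while interpolation gives each gap a value that depends on its neighbors, which is why it suits ordered data such as time series.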

📏 Normalization Comparison

Data to normalize

Min-Max Normalization

x′ = (x − min) / (max − min)

Scales to [0, 1]. Preserves the relative spacing of values but is sensitive to outliers, because a single extreme value sets min or max. Often used for neural network inputs.

Z-Score Standardization

x′ = (x − μ) / σ

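Centers around 0 with unit variance. Less affected by a single extreme value than min-max scaling, although μ and σ are still influenced by outliers. A common choice for statistical models and SVM.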

Decimal Scaling

x′ = x / 10ʲ

Divide by a power of 10. Simple, maintains sign. j is the smallest integer such that max|x′| < 1, i.e. j = ⌊log₁₀(max|x|)⌋ + 1.
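The three scalings can be sketched in a few lines of NumPy. The data is hypothetical. For decimal scaling, j is taken here as the smallest integer that brings every |x′| below 1, which matches ⌈log₁₀(max|x|)⌉ except when max|x| is an exact power of 10. The z-score uses the population standard deviation (`ddof=0`), which is an assumption; some tools use the sample version.

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([12.0, 45.0, 7.0, 310.0, 88.0])

# Min-max: map the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the mean, divide by the (population) standard deviation
z_score = (x - x.mean()) / x.std()

# Decimal scaling: smallest j with max|x / 10**j| < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal = x / 10**j

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
print("decimal (j=%d):" % j, decimal)
```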

📊 Module 02

EDA & Visualization

Explore data relationships and create insightful visualizations using correlation analysis, distributions, and KDE plots.

📈 Correlation Matrix Explorer

Variable 1

Variable 2

Variable 3

📊 Distribution Explorer

Data values

Plot type

⚖️ Group Comparison

Group A

Group B

Group C

🔍 KDE: Actual vs Predicted

Actual values

Predicted values

Kernel Density Estimation

f̂(x) = (1/nh) Σ K((x − xᵢ)/h)

Non-parametric density estimation. Bandwidth h controls smoothness. Gaussian kernel is most common.
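The estimator above can be written directly in NumPy. This is a minimal sketch with a Gaussian kernel; the function name `kde_gaussian`, the sample values, and the bandwidth h = 1.0 are all assumptions made for illustration.

```python
import numpy as np

def kde_gaussian(points, samples, h):
    """Evaluate (1 / (n*h)) * sum_i K((x - x_i) / h) with a Gaussian kernel K."""
    points = np.asarray(points, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # u[a, b] = (points[a] - samples[b]) / h
    u = (points[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * h)

# Hypothetical "actual" values (illustration only)
actual = np.array([2.0, 3.0, 3.5, 5.0, 7.0])
grid = np.linspace(-5.0, 15.0, 2001)
density = kde_gaussian(grid, actual, h=1.0)

# A density should integrate to about 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
print(f"approximate area under the curve: {area:.4f}")
```

Re-running with a smaller h gives a bumpier curve with a peak near each sample; a larger h smooths those peaks together.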

🤖 Module 03

Model Development & Evaluation

Build, fit, and evaluate machine learning models with Scikit-learn — from simple linear regression to regularization techniques.

📉 Linear Regression

X values

Y values

Linear Regression

y = β₀ + β₁x + ε

β₀ = intercept, β₁ = slope, ε = error term. Minimizes sum of squared residuals.

R² Score

R² = 1 − (SS_res / SS_tot)

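Proportion of variance explained by the model. For a least-squares fit with an intercept it ranges from 0 to 1 on the training data, and higher is better. On held-out data it can be negative when the model predicts worse than the mean.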
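To connect the two formulas, the sketch below fits β₀ and β₁ with the closed-form least-squares estimates and then computes R² from SS_res and SS_tot. It uses plain NumPy rather than Scikit-learn so that each term is visible; the data is hypothetical.

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"intercept={b0:.3f}, slope={b1:.3f}, R^2={r2:.4f}")
```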

📐 Polynomial Regression

X values

Y values

Polynomial Regression

y = β₀ + β₁x + β₂x² + ··· + βₙxⁿ

Higher degree captures complex patterns but risks overfitting. Use cross-validation to select degree.

Bias-Variance Tradeoff

Error = Bias² + Variance + Noise

Low degree → high bias, low variance (underfit). High degree → low bias, high variance (overfit).
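The tradeoff can be observed by fitting several degrees to the same data and comparing training error with error on points held out from the fit. This is a minimal sketch: the noisy sine data, the even/odd split, and the degrees 1, 3, and 9 are assumptions for illustration, and a single hold-out split stands in for full cross-validation.

```python
import numpy as np

# Hypothetical noisy data drawn around a sine curve (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Even-indexed points train the model, odd-indexed points validate it
x_train, y_train = x[::2], y[::2]
x_val, y_val = x[1::2], y[1::2]

def fit_and_score(degree):
    """Return (training MSE, validation MSE) for a polynomial of this degree."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train, val

scores = {d: fit_and_score(d) for d in (1, 3, 9)}
for d, (train, val) in scores.items():
    print(f"degree {d}: train MSE={train:.4f}, validation MSE={val:.4f}")
```

Training error can only go down as the degree rises, so it cannot be used to choose the degree; the validation column is the one to watch.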

🏔️ Ridge Regression (L2)

Drag the alpha slider to see how L2 regularization shrinks model coefficients.

Alpha (regularization strength)

1.0

Ridge Cost Function

J(β) = Σ(yᵢ − ŷᵢ)² + αΣβⱼ²

L2 penalty shrinks all coefficients toward zero. Higher α = more regularization = simpler model. Never zeros out features completely.
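The shrinkage effect can be reproduced with the closed-form ridge solution β = (XᵀX + αI)⁻¹Xᵀy. The sketch below is a NumPy illustration, not the Scikit-learn implementation; the helper name `ridge_coefficients` and the synthetic data are assumptions, and the data is centered so that no intercept is needed.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Solve (X^T X + alpha * I) beta = X^T y for centered X and y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Hypothetical synthetic data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
X = X - X.mean(axis=0)
y = y - y.mean()

betas = {alpha: ridge_coefficients(X, y, alpha) for alpha in (0.0, 1.0, 100.0)}
for alpha, beta in betas.items():
    print(f"alpha={alpha:6.1f}  beta={np.round(beta, 3)}  "
          f"norm={np.linalg.norm(beta):.3f}")
```

With α = 0 the result is ordinary least squares. As α grows the coefficient norm falls, but the individual coefficients stay nonzero.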

📊 Model Evaluation Metrics

Actual values

Predicted values

MSE / RMSE

MSE = (1/n) Σ(yᵢ − ŷᵢ)²

MSE penalizes large errors heavily. RMSE = √MSE and is in original units, more interpretable.

MAE

MAE = (1/n) Σ|yᵢ − ŷᵢ|

Mean Absolute Error — less sensitive to outliers than MSE, in original units, weights each error in proportion to its size.
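A short worked example shows how one large error affects the metrics differently. The values are hypothetical; the last prediction is deliberately off by 4.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 13.0])

errors = actual - predicted         # [1, 0, -1, -4]
mse = np.mean(errors ** 2)          # (1 + 0 + 1 + 16) / 4
rmse = np.sqrt(mse)                 # back in the original units
mae = np.mean(np.abs(errors))       # (1 + 0 + 1 + 4) / 4

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

The single error of 4 contributes 16 of the 18 squared-error units but only 4 of the 6 absolute-error units, which is why RMSE ends up above MAE here.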

📈 Module 04

Statistical Tests & Inference

Perform hypothesis tests and understand statistical significance with interactive Chi-Square, T-Test, ANOVA, and Pearson calculators.

χ² Chi-Square Test

Test for independence between two categorical variables. Enter a 2-row contingency table.

Row 1 (e.g. Male: Like, Dislike)

Row 2 (e.g. Female: Like, Dislike)

Chi-Square Statistic

χ² = Σ((Oᵢ − Eᵢ)² / Eᵢ)

Oᵢ = observed frequency, Eᵢ = expected. df = (rows−1)(cols−1). Reject H₀ if p < 0.05.

Expected Frequency

Eᵢ = (Row Total × Col Total) / Grand Total

Expected values under null hypothesis of independence between the two categorical variables.
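The two formulas combine as follows for a 2×2 table. This is a minimal sketch with hypothetical counts. It applies no continuity correction, so the result can differ from library routines that apply one by default, and the p-value shortcut via `math.erfc` is valid only for one degree of freedom.

```python
import math
import numpy as np

# Hypothetical 2x2 contingency table (illustration only)
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count in each cell under independence
expected = row_totals * col_totals / grand_total

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# For dof = 1 only: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2.0))

print("expected:\n", expected)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.6f}")
```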

📝 Independent T-Test

Group 1

Group 2

T-Statistic

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

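Compares two group means. As written, with separate variances s₁² and s₂², this is Welch's form, whose degrees of freedom come from the Welch–Satterthwaite approximation; df = n₁+n₂−2 applies to the pooled-variance Student's t-test. Reject H₀ if |t| exceeds the critical value.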
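The statistic above can be computed directly from the two samples. The sketch below also returns the Welch–Satterthwaite degrees of freedom, which always fall between min(n₁, n₂) − 1 and n₁ + n₂ − 2. The helper name `welch_t` and the group values are hypothetical.

```python
import numpy as np

def welch_t(a, b):
    """Return (t, df) using separate sample variances for each group."""
    va = a.var(ddof=1) / len(a)   # s1^2 / n1
    vb = b.var(ddof=1) / len(b)   # s2^2 / n2
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical groups (illustration only)
g1 = np.array([5.1, 4.9, 5.6, 5.8, 6.0])
g2 = np.array([4.2, 4.8, 4.5, 5.0, 4.4])

t_stat, dof = welch_t(g1, g2)
print(f"t={t_stat:.3f}, df={dof:.2f}")
```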

📊 One-Way ANOVA

Group 1

Group 2

Group 3

📐 Pearson Correlation

Variable X

Variable Y

📂 Module 05

Practice Datasets

Explore real-world datasets interactively — visualize different features and understand the data before running analysis.

💻 Laptop Pricing Dataset

Feature

Chart type

Practice Tasks

Suggested analyses:
1. Correlation between RAM and price
2. Average price by laptop category
3. Weight vs price scatter analysis
4. Build a price prediction model
5. Feature importance ranking

🏥 Medical Insurance Dataset

Analysis

🌺 Iris Dataset

Feature X

Feature Y

🚢 Titanic Dataset

Analysis

🏠 Housing Prices Dataset

Analysis

🛒 Sales Data

Analysis

🎯 Module 06

Practice Arena

Test your data analysis knowledge with concept questions and data challenges. Track your progress as you go.


1. Which Python library is primarily used for data manipulation and analysis?

2. What does the Chi-Square test determine?

3. Ridge Regression adds which penalty term to the cost function?

4. A p-value less than 0.05 typically indicates what?

5. Which metric is NOT appropriate for evaluating a regression model?
