v2.0 · Interactive Learning
🧹 Module 01

Data Wrangling
& Cleaning

Master data preparation with Pandas — handling dirty data, transformations, missing values, and normalization techniques.

🧼 Data Cleaning Simulator

Enter raw data with missing values, outliers, and invalid entries. See how it gets cleaned using IQR-based outlier removal and median imputation.

python · data_cleaning.py
import pandas as pd
import numpy as np

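# Assumes df is an existing DataFrame loaded earlier (e.g. with pd.read_csv)
# Replace common missing value placeholders with NaN
df = df.replace(['?', 'NA', ''], np.nan)

# Fill numeric missing values with median (robust to outliers)
# Note: columns that held placeholders stay object dtype until converted with pd.to_numeric
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    # Assign back; chained fillna(..., inplace=True) may not update df in recent pandas
    df[col] = df[col].fillna(df[col].median())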

# Remove outliers using IQR method
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[column] >= lower) & (df[column] <= upper)]
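
A minimal end-to-end sketch of the same steps on a tiny column. The data and the column name `age` are hypothetical. One assumption worth flagging: a column that contained `'?'` or `'NA'` is still object dtype after `replace`, so it is converted with `pd.to_numeric` before imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: '?' and 'NA' mark missing values, 950 is an outlier.
raw = pd.DataFrame({"age": ["23", "25", "?", "24", "26", "NA", "950", "22"]})

# Placeholders become NaN, then the column is coerced to a numeric dtype.
clean = raw.replace(["?", "NA", ""], np.nan)
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Median imputation: the median of 22, 23, 24, 25, 26, 950 is 24.5.
clean["age"] = clean["age"].fillna(clean["age"].median())

# IQR filter, the same 1.5 * IQR rule used by remove_outliers.
q1, q3 = clean["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = clean["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
result = clean[mask]

print(result["age"].tolist())
```

The outlier 950 barely moves the median used for imputation, and the IQR rule then drops that row.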
🔄 Data Transformation Explorer

Compare different mathematical transformations and their effect on data distribution.

Log Transformation
y′ = log(y + 1)
Compresses large values, stabilizes variance, reduces right-skew. The +1 keeps the transform defined at y = 0, where log(0) would be −∞. Requires y ≥ 0.
Square Root
y′ = √y
Milder compression than log. Useful for count data following a Poisson distribution.
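
A short sketch comparing the two transforms on a right-skewed sample. The values and the helper `spread_ratio` are hypothetical, chosen only to illustrate how strongly each transform compresses the tail.

```python
import numpy as np

# Hypothetical right-skewed sample (for example counts or prices).
y = np.array([0, 1, 3, 8, 15, 99, 999], dtype=float)

log_y = np.log1p(y)   # log(y + 1), defined at y = 0
sqrt_y = np.sqrt(y)   # requires y >= 0

def spread_ratio(values):
    """Largest value divided by the median: a crude gauge of tail heaviness."""
    return values.max() / np.median(values)

for name, values in [("raw", y), ("sqrt", sqrt_y), ("log", log_y)]:
    print(f"{name:>4}: max/median = {spread_ratio(values):.2f}")
```

On this sample the ratio falls from raw to square root to log, which matches the description of the square root as the milder of the two.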
❓ Missing Values Handler
Mean
Replace with column mean. Fast but sensitive to outliers.
Median
Replace with median. Robust to skewed distributions.
Interpolation
Estimate from neighboring values. Best for time series.
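
The three strategies can be sketched on one small series. The values are hypothetical, with 100.0 acting as an outlier so the difference between mean and median is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered series with two gaps and one extreme value.
s = pd.Series([10.0, np.nan, 14.0, np.nan, 100.0])

mean_filled = s.fillna(s.mean())      # mean of 10, 14, 100 is pulled up by 100
median_filled = s.fillna(s.median())  # median of 10, 14, 100 is 14
interp_filled = s.interpolate()       # linear estimate from the two neighbors

print(pd.DataFrame({"mean": mean_filled,
                    "median": median_filled,
                    "interpolate": interp_filled}))
```

Interpolation only makes sense when row order carries meaning, as in a time series.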
📏 Normalization Comparison
Min-Max Normalization
x′ = (x − min) / (max − min)
Scales to [0, 1]. Preserves the relative spacing of values but is sensitive to outliers, since a single extreme value sets min or max. Often used for neural network inputs.
Z-Score Standardization
x′ = (x − μ) / σ
Centers around 0 with unit variance. Less distorted by outliers than min-max, though μ and σ are still affected by them. Common for statistical models and SVM.
Decimal Scaling
x′ = x / 10ʲ
Divide by power of 10. Simple, maintains sign. j is the smallest integer such that max|x′| < 1, i.e. j = ⌊log₁₀(max|x|)⌋ + 1.
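
A NumPy sketch of the three scalings on one hypothetical array. Two assumptions: σ is the population standard deviation (NumPy's default, `ddof=0`), and j is taken as the smallest integer that brings every scaled value strictly inside (−1, 1).

```python
import numpy as np

# Hypothetical feature values, including a negative and a large one.
x = np.array([-120.0, 35.0, 48.0, 250.0, 999.0])

# Min-max normalization: maps min to 0 and max to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization with population sigma (ddof=0).
z_score = (x - x.mean()) / x.std()

# Decimal scaling: floor(log10(max|x|)) + 1 differs from the ceiling
# only when max|x| is an exact power of 10.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / 10 ** j

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
print("decimal:", decimal_scaled, "with j =", j)
```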
📊 Module 02

EDA & Visualization

Explore data relationships and create insightful visualizations using correlation analysis, distributions, and KDE plots.

📈 Correlation Matrix Explorer
📊 Distribution Explorer
⚖️ Group Comparison
🔍 KDE: Actual vs Predicted
Kernel Density Estimation
f̂(x) = (1/nh) Σ K((x − xᵢ)/h)
Non-parametric density estimation. Bandwidth h controls smoothness. Gaussian kernel is most common.
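
The estimator can be written directly from the formula. This is a sketch with a standard normal kernel; the function name `kde_gaussian`, the sample values, and the two bandwidths are hypothetical.

```python
import numpy as np

def kde_gaussian(x_grid, samples, h):
    """Evaluate 1/(n*h) * sum_i K((x - x_i)/h) with a standard normal kernel K."""
    u = (x_grid[:, None] - samples[None, :]) / h        # shape (grid, n)
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * h)

# Hypothetical sample with two clusters.
samples = np.array([1.0, 1.5, 2.0, 4.0, 4.2])
grid = np.linspace(-5.0, 10.0, 1501)
dx = grid[1] - grid[0]

narrow = kde_gaussian(grid, samples, h=0.2)   # small h: spiky estimate
wide = kde_gaussian(grid, samples, h=1.0)     # large h: smooth estimate

# Each estimate should integrate to roughly 1 (Riemann sum over the grid).
print("area (h=0.2):", round(narrow.sum() * dx, 4))
print("area (h=1.0):", round(wide.sum() * dx, 4))
print("peak heights:", round(narrow.max(), 3), round(wide.max(), 3))
```

The smaller bandwidth produces a taller, narrower peak around each cluster, while the larger one blends the clusters together.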
🤖 Module 03

Model Development
& Evaluation

Build, fit, and evaluate machine learning models with Scikit-learn — from simple linear regression to regularization techniques.

📉 Linear Regression
Linear Regression
y = β₀ + β₁x + ε
β₀ = intercept, β₁ = slope, ε = error term. Minimizes sum of squared residuals.
R² Score
R² = 1 − (SS_res / SS_tot)
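Proportion of variance explained by the model. At most 1, higher is better. It lies between 0 and 1 on the training data of a least-squares fit with an intercept, and can be negative on held-out data when the model predicts worse than the mean.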
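
A minimal Scikit-learn sketch that fits a line to synthetic data and checks R² against its definition. The true intercept 2.0, slope 3.0, and noise level are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y = 2 + 3x + noise (hypothetical parameters).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)

X = x.reshape(-1, 1)                 # scikit-learn expects a 2-D feature array
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R^2 from the definition: 1 - SS_res / SS_tot.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("intercept:", model.intercept_, "slope:", model.coef_[0])
print("R^2 manual:", r2_manual, "R^2 sklearn:", r2_score(y, y_hat))
```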
📐 Polynomial Regression
Polynomial Regression
y = β₀ + β₁x + β₂x² + ··· + βₙxⁿ
Higher degree captures complex patterns but risks overfitting. Use cross-validation to select degree.
Bias-Variance Tradeoff
Error = Bias² + Variance + Noise
Low degree → high bias, low variance (underfit). High degree → low bias, high variance (overfit).
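
One way to select the degree by cross-validation, sketched with Scikit-learn. The quadratic signal, noise level, fold count, and degree range are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic signal (hypothetical).
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-3, 3, size=60)).reshape(-1, 1)
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=0.5, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that higher is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
for degree, mse in cv_mse.items():
    print(f"degree {degree}: CV MSE = {mse:.3f}")
print("selected degree:", best_degree)
```

Degree 1 underfits the curved signal, so its cross-validated error is clearly higher than that of degree 2. Higher degrees add variance without reducing bias further.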
🏔️ Ridge Regression (L2)

Drag the alpha slider to see how L2 regularization shrinks model coefficients.

Ridge Cost Function
J(β) = Σ(yᵢ − ŷᵢ)² + αΣβⱼ²
L2 penalty shrinks all coefficients toward zero. Higher α = more regularization = simpler model. Coefficients get smaller but generally do not reach exactly zero, so no feature is dropped entirely.
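
The shrinkage can be seen without a slider by fitting Scikit-learn's `Ridge` at several α values. The synthetic data, true coefficients, and α grid below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with three features (hypothetical true coefficients).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

coefs = {}
norms = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    coefs[alpha] = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coefs[alpha])   # overall coefficient size

for alpha in coefs:
    print(f"alpha={alpha:>8}: coef={np.round(coefs[alpha], 4)}  norm={norms[alpha]:.4f}")
```

As α grows the coefficient norm decreases, yet even at the largest α the individual coefficients stay nonzero.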
📊 Model Evaluation Metrics
MSE / RMSE
MSE = (1/n) Σ(yᵢ − ŷᵢ)²
MSE penalizes large errors heavily. RMSE = √MSE and is in original units, more interpretable.
MAE
MAE = (1/n) Σ|yᵢ − ŷᵢ|
Mean Absolute Error. Less sensitive to outliers than MSE, in original units, and weights each error in proportion to its size.
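
A small worked example with NumPy. The values are hypothetical, with one prediction off by 10 to show how differently the metrics react to a single large error.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 19.0])   # last prediction is off by 10

errors = y_true - y_pred                    # [1, 0, -1, -10]
mse = np.mean(errors ** 2)                  # (1 + 0 + 1 + 100) / 4 = 25.5
rmse = np.sqrt(mse)                         # back in the units of y
mae = np.mean(np.abs(errors))               # (1 + 0 + 1 + 10) / 4 = 3.0

print("MSE:", mse, "RMSE:", rmse, "MAE:", mae)
```

The single large error accounts for 100 of the 102 squared-error units but only 10 of the 12 absolute-error units, which is why RMSE ends up well above MAE here.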
📈 Module 04

Statistical Tests
& Inference

Perform hypothesis tests and understand statistical significance with interactive Chi-Square, T-Test, ANOVA, and Pearson calculators.

χ² Chi-Square Test

Test for independence between two categorical variables. Enter a 2-row contingency table.

Chi-Square Statistic
χ² = Σ((Oᵢ − Eᵢ)² / Eᵢ)
Oᵢ = observed frequency, Eᵢ = expected. df = (rows−1)(cols−1). Reject H₀ if p < 0.05.
Expected Frequency
Eᵢ = (Row Total × Col Total) / Grand Total
Expected values under null hypothesis of independence between the two categorical variables.
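
A sketch that computes the expected frequencies and the statistic by hand, then compares them with `scipy.stats.chi2_contingency`. The 2×2 table is hypothetical. Yates' continuity correction is switched off so that SciPy matches the plain formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts.
observed = np.array([[30, 10],
                     [20, 40]])

# Expected frequency: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' correction, which SciPy applies to 2x2 tables by default.
chi2, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)

print("expected:\n", expected)
print("chi2 manual:", chi2_manual, "chi2 scipy:", chi2)
print("dof:", dof, "p-value:", p_value)
```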
📝 Independent T-Test
T-Statistic
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
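Compares two group means. The formula shown is the unequal-variance (Welch) form, whose degrees of freedom come from the Welch–Satterthwaite approximation; df = n₁+n₂−2 applies to the pooled-variance version, which assumes equal variances. Reject H₀ if |t| exceeds the critical value.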
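
A sketch that evaluates the statistic by hand and compares it with `scipy.stats.ttest_ind`. The two groups are hypothetical. `equal_var=False` requests the unequal-variance test that corresponds to the formula shown, and `equal_var=True` the pooled version.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical measurements for two independent groups.
group_a = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3])
group_b = np.array([6.2, 6.8, 7.1, 6.5, 6.9, 7.4])

# Manual statistic; sample variances use ddof=1.
se = np.sqrt(group_a.var(ddof=1) / group_a.size +
             group_b.var(ddof=1) / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)
t_pooled, p_pooled = ttest_ind(group_a, group_b, equal_var=True)

print("t manual:", t_manual)
print("Welch : t =", t_welch, "p =", p_welch)
print("Pooled: t =", t_pooled, "p =", p_pooled)
```

With equal group sizes the two versions give the same t value and differ only in the degrees of freedom used for the p-value.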
📊 One-Way ANOVA
📐 Pearson Correlation
📂 Module 05

Practice Datasets

Explore real-world datasets interactively — visualize different features and understand the data before running analysis.

💻 Laptop Pricing Dataset
Practice Tasks
Suggested analyses:
1. Correlation between RAM and price
2. Average price by laptop category
3. Weight vs price scatter analysis
4. Build a price prediction model
5. Feature importance ranking
🏥 Medical Insurance Dataset
🌺 Iris Dataset
🚢 Titanic Dataset
🏠 Housing Prices Dataset
🛒 Sales Data
🎯 Module 06

Practice Arena

Test your data analysis knowledge with concept questions and data challenges. Track your progress as you go.

Learning Progress 0 / 5 correct
1. Which Python library is primarily used for data manipulation and analysis?
2. What does the Chi-Square test determine?
3. Ridge Regression adds which penalty term to the cost function?
4. A p-value less than 0.05 typically indicates what?
5. Which metric is NOT appropriate for evaluating a regression model?