Data Preparation

The quality of your data determines the quality of your model. Data preparation is often the most important step in a Machine Learning project.

The Data Pipeline

flowchart LR
    A[Raw Data] --> B[Collection]
    B --> C[Cleaning]
    C --> D[Transformation]
    D --> E[Splitting]
    E --> F[Train set]
    E --> G[Test set]
    style F fill:#22c55e,color:#fff
    style G fill:#3b82f6,color:#fff

1. Collection — gather raw data from its sources
2. Cleaning — handle missing values and anomalies
3. Transformation — encode categorical variables, normalize scales, create new features
4. Splitting — divide the data into training and test sets
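
The four steps above can be sketched end to end as a minimal pipeline. The DataFrame here is hypothetical, standing in for a real collection step, and the column names ('age', 'income', 'target') are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Collection: hypothetical raw data standing in for a real source
df = pd.DataFrame({
    'age': [25, None, 40, 35, 28, 52],
    'income': [30000, 45000, None, 60000, 38000, 72000],
    'target': [0, 1, 0, 1, 0, 1],
})

# Cleaning: fill missing numeric values with each column's median
df = df.fillna(df.median(numeric_only=True))

# Splitting: separate features from the target, then train/test
X, y = df.drop(columns='target'), df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Transformation: fit the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Each step is covered in more detail in the sections that follow.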

Handling Missing Values

import pandas as pd

df = pd.read_csv('data.csv')

# Check missing values
print(df.isnull().sum())

# Replacement strategies
df['age'] = df['age'].fillna(df['age'].median())   # numeric: fill with the median
df['city'] = df['city'].fillna('Unknown')          # categorical: fill with a default value
df = df.dropna(subset=['target'])                  # drop rows where the target is missing
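
For larger pipelines, scikit-learn's SimpleImputer applies the same strategies and, unlike a one-off fillna, remembers the learned statistics so they can be reapplied to new data. A sketch with a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps, standing in for data.csv
df = pd.DataFrame({
    'age': [25.0, np.nan, 40.0],
    'income': [30000.0, 45000.0, np.nan],
})

# fit_transform learns each column's median and fills the gaps;
# the fitted imputer's transform() reuses those medians on new data
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Here the missing 'age' becomes 32.5 (the median of 25 and 40), and the same value would be used for any future rows with a missing 'age'.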

Feature Normalization

Many ML algorithms, such as k-nearest neighbors, SVMs, and anything trained by gradient descent, are sensitive to the scale of the features. Standardizing features to zero mean and unit variance puts them on a comparable footing.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
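
StandardScaler is not the only option: MinMaxScaler rescales each column into [0, 1] instead, which some practitioners prefer when features have hard bounds. A sketch comparing the two on a small hypothetical matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# StandardScaler: (x - mean) / std per column -> zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: (x - min) / (max - min) per column -> values in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
```

After scaling, both columns contribute comparably regardless of their original units.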

Train/Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
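
Two refinements are worth noting here. For classification, `stratify=y` preserves the class proportions in both splits; and when combining splitting with the normalization from the previous section, the scaler should be fit on the training split only, so no test-set statistics leak into training. A sketch with hypothetical balanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))   # hypothetical feature matrix
y = np.repeat([0, 1], 50)       # balanced binary labels

# stratify=y keeps the 50/50 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit scaling statistics on the training split only, then apply to both
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

With 100 samples and `test_size=0.2`, the stratified test set holds exactly 10 examples of each class.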

Summary

Good data preparation is the key to a performant model. Clean, transform, and split your data before any training.