# Data Preparation
The quality of your data determines the quality of your model. Data preparation is often the most important step in a machine learning project.
## The Data Pipeline
```mermaid
flowchart LR
    A[Raw Data] --> B[Collection]
    B --> C[Cleaning]
    C --> D[Transformation]
    D --> E[Splitting]
    E --> F[Train set]
    E --> G[Test set]
    style F fill:#22c55e,color:#fff
    style G fill:#3b82f6,color:#fff
```
- Collection — Gather raw data
- Cleaning — Handle missing values and anomalies
- Transformation — Encode, normalize, create new features
- Splitting — Divide into training and test sets
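The cleaning and transformation stages above can be chained so the exact same steps are applied to training and test data. A minimal sketch using scikit-learn's `Pipeline` (the toy array and the median-imputation strategy are illustrative choices, not from the text above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy single-feature matrix with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])

prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # cleaning: fill missing values
    ('scale', StandardScaler()),                   # transformation: normalize
])

# fit_transform learns the median and scaling statistics, then applies them
X_prep = prep.fit_transform(X)
print(X_prep.shape)  # (4, 1), with no NaN left
```

Fitting the pipeline once and reusing it (via `prep.transform`) on the test set avoids leaking test-set statistics into preprocessing.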
## Handling Missing Values
```python
import pandas as pd

df = pd.read_csv('data.csv')

# Check missing values per column
print(df.isnull().sum())

# Replacement strategies (assignment is preferred over inplace=True,
# which is deprecated for column-level fillna in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())  # by median
df['city'] = df['city'].fillna('Unknown')         # by default value
df = df.dropna(subset=['target'])                 # drop rows where the target is missing
```
## Feature Normalization

Many ML algorithms are sensitive to the scale of the features, so it is important to normalize them.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each feature: zero mean, unit variance
```
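A quick sanity check on toy data shows what `StandardScaler` does: after scaling, every column has mean ≈ 0 and standard deviation 1, regardless of its original scale (the small matrix below is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

In practice, fit the scaler on the training set only and reuse it with `scaler.transform` on the test set, so test statistics never leak into preprocessing.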
## Train/Test Split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% of the data for evaluation
    random_state=42,  # fixed seed for a reproducible split
)
```
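For classification with imbalanced labels, `train_test_split` also accepts `stratify=y`, which keeps the class ratio identical in both splits. A small sketch on made-up data (the arrays here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced: 80% class 0, 20% class 1

# stratify=y preserves the 80/20 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print(y_train.mean(), y_test.mean())  # both 0.2
```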
## Summary
Good data preparation is the key to a performant model: clean, transform, and split your data before any training.