Module 1: Understanding Raw Data
Types of data: categorical, numerical, time-series
Common data sources (CSV, Excel, APIs)
Real-world data issues (duplicates, missing values, outliers)
Module 2: Importing and Loading Data
Reading from files and databases
Initial inspection using pandas DataFrame methods: .head(), .info(), .describe()
Encoding formats and data types
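A minimal sketch of the Module 2 topics, assuming pandas is installed. The CSV content, column names, and the `orders` variable are made up for illustration; with a real file you would pass a path instead of the in-memory buffer and, where needed, an `encoding=` argument.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
raw_csv = io.StringIO(
    "order_id,region,amount,order_date\n"
    "1,north,120.5,2024-01-03\n"
    "2,south,,2024-01-04\n"
    "3,north,87.0,2024-01-04\n"
)

# Declaring dtypes and date columns up front avoids surprises from type inference.
orders = pd.read_csv(
    raw_csv,
    dtype={"order_id": "int64", "region": "string"},
    parse_dates=["order_date"],
)

print(orders.head())      # first rows of the table
orders.info()             # column dtypes and non-null counts
print(orders.describe())  # summary statistics for numeric columns
```

The empty `amount` field in the second row is read as a missing value, which `.info()` reveals through the non-null count.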
Module 3: Data Cleaning Essentials
Handling missing data (mean/median imputation, dropping rows or columns)
Correcting invalid or inconsistent entries
Detecting and dealing with outliers
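The Module 3 steps can be sketched roughly as follows. The data and column names are invented, and the 1.5 x IQR rule is only one common convention for flagging outliers, chosen here as an assumption rather than something the outline prescribes.

```python
import numpy as np
import pandas as pd

# Toy data: one duplicated row, one missing age, one extreme income.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 31, 29],
    "income": [42000, 51000, 48000, 1_000_000, 51000, 45000],
})

# Remove exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

# Fill the missing age with the median, which is less sensitive to outliers than the mean.
df["age"] = df["age"].fillna(df["age"].median())

# Flag incomes outside 1.5 * IQR of the quartiles instead of deleting them outright.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(df)
```

Flagging rather than dropping keeps the decision reversible: an extreme value may be a data-entry error or a genuine observation, and that usually needs domain knowledge to decide.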
Module 4: Feature Engineering Basics
Creating new columns from existing data
Label encoding, one-hot encoding
Binning and feature scaling (normalization, standardization)
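One possible illustration of the Module 4 techniques using only pandas. The columns, bin edges, and band labels are hypothetical; libraries such as scikit-learn offer dedicated encoders and scalers, but plain pandas keeps the arithmetic visible.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Lima", "Oslo", "Lima", "Kyiv"],
    "height_cm": [150.0, 180.0, 165.0, 170.0],
    "weight_kg": [50.0, 90.0, 60.0, 70.0],
})

# New column derived from existing ones: body mass index.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Label encoding: each category becomes an integer code (sorted alphabetically).
df["city_code"] = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Binning: cut a continuous column into labelled intervals (right edge inclusive).
df["height_band"] = pd.cut(
    df["height_cm"], bins=[0, 160, 175, 250], labels=["short", "medium", "tall"]
)

# Normalization (min-max) rescales to [0, 1]; standardization gives mean 0, std 1.
w = df["weight_kg"]
df["weight_minmax"] = (w - w.min()) / (w.max() - w.min())
df["weight_z"] = (w - w.mean()) / w.std(ddof=0)

print(df)
print(one_hot)
```

Label codes imply an ordering that does not exist for city names, so one-hot encoding is usually the safer choice for unordered categories.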
Module 5: Exploratory Data Analysis (EDA)
Distributions and central tendencies
Correlation matrices and pair plots
Visual exploration with matplotlib and seaborn
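A small numeric sketch of the first two Module 5 topics, using made-up columns `x`, `y`, and `z`. Plotting is left out to keep the example dependency-free; the same `corr` table is the kind of input typically passed to a seaborn heatmap.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # exactly 2 * x
    "z": [5, 3, 4, 1, 2],    # tends to fall as x rises
})

# Central tendency and shape of each column.
summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "skew": df.skew(),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

print(summary)
print(corr.round(2))
```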
Module 6: Data Transformation Techniques
Log transforms, aggregations, and pivot tables
Datetime parsing and time-series formatting
Combining multiple datasets
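The Module 6 topics might be combined as in the sketch below. The `sales` and `stores` tables and their columns are hypothetical. `np.log1p` (the log of 1 + x) is used in place of a plain log so that zero revenue does not produce negative infinity.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10"],
    "store_id": [1, 2, 1, 2],
    "revenue": [100.0, 1000.0, 10.0, 0.0],
})
stores = pd.DataFrame({"store_id": [1, 2], "region": ["north", "south"]})

# Datetime parsing: turn strings into timestamps, then derive a month label.
sales["date"] = pd.to_datetime(sales["date"])
sales["month"] = sales["date"].dt.to_period("M").astype(str)

# Log transform to compress a skewed, non-negative column.
sales["log_revenue"] = np.log1p(sales["revenue"])

# Combine datasets: a left join keeps every sale even if its store is unknown.
merged = sales.merge(stores, on="store_id", how="left")

# Aggregate and pivot: total revenue per region (rows) and month (columns).
monthly = merged.pivot_table(
    index="region", columns="month", values="revenue", aggfunc="sum"
)

print(monthly)
```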
Module 7: Data Integrity & Ethics
Avoiding data leakage
Bias in datasets and fairness
Best practices for clean, reproducible workflows
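As a hedged illustration of avoiding data leakage in a reproducible way: scaling statistics are computed on the training split only and then reused on the test split. The synthetic column `x`, the 80/20 split, and the seed value are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Fixed seeds make the split and the data reproducible from run to run.
rng = np.random.default_rng(seed=42)
df = pd.DataFrame({"x": rng.normal(50, 10, size=100)})

# Shuffle, then split 80/20 into train and test.
shuffled = df.sample(frac=1.0, random_state=42)
train = shuffled.iloc[:80]
test = shuffled.iloc[80:]

# Fit the scaling parameters on the training data ONLY.
mean = train["x"].mean()
std = train["x"].std(ddof=0)

# Apply the same parameters to both splits; the test set never influences them.
train_scaled = (train["x"] - mean) / std
test_scaled = (test["x"] - mean) / std

print(train_scaled.describe())
print(test_scaled.describe())
```

Computing `mean` and `std` on the full dataset before splitting would let information from the test rows seep into the features used for training, which tends to make evaluation results look better than they are.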
Module 8: Capstone Project - Real Data Prep
Choose a dataset (e.g., healthcare, finance, marketing)
Clean, transform, and visualize it
Document your pipeline with markdown and visuals
Prepare for modeling or presentation