Module 2: Python for Data Work#

CodeVision Python Training

Contents#

  • Part 1: Pandas Fundamentals (Sections 1–4)

  • Part 2: Data Manipulation (Sections 5–8)

  • Part 3: Numerical & Statistical Computing (Sections 9–10)

  • Part 4: Data Visualisation (Sections 11–13)

  • Part 5: Advanced Data Sources (Sections 14–15)


Welcome to Module 2#

In Module 1, you learned how Python works. In Module 2, Python starts to feel useful.

Real data is rarely neat. It arrives incomplete, inconsistent, and spread across files, spreadsheets, databases, and APIs. This module teaches you how to handle that reality calmly and systematically.

By the end of this module, you will be confident loading, cleaning, transforming, analysing, and visualising data using Python’s most powerful libraries.


Part 1: Pandas Fundamentals#

1. What Is Pandas and Why We Use It#

Pandas is the most widely used Python library for data analysis. It provides powerful, high-level abstractions for working with tabular data (rows and columns).

Think of Pandas as Excel on steroids—but programmable, reproducible, and capable of handling millions of rows.

Key benefits:

  • Load data from CSV, Excel, JSON, SQL, and more

  • Clean and transform messy data

  • Filter, sort, and aggregate with simple commands

  • Integrate seamlessly with visualisation libraries

# Import pandas with the standard alias 'pd'
import pandas as pd

print(f"Pandas version: {pd.__version__}")
print("Pandas is ready to use!")

2. Loading Data from Files#

Most data analysis starts with loading data from a file. Pandas makes this trivial with functions like read_csv() and read_excel().

Common file types:

  • .csv — Comma-separated values (most common)

  • .xlsx — Excel spreadsheets

  • .json — JavaScript Object Notation

  • .parquet — Efficient columnar storage

# Create sample data for demonstration
sample_data = """date,product,region,revenue,units
2024-01-01,Widget A,North,1500,30
2024-01-01,Widget B,South,2200,44
2024-01-02,Widget A,North,1800,36
2024-01-02,Widget B,South,1900,38
2024-01-03,Widget A,East,2100,42
2024-01-03,Widget B,West,,35
2024-01-04,Widget A,North,1600,32
2024-01-04,Widget B,East,2400,48"""

# Save to a CSV file
with open("sales.csv", "w") as f:
    f.write(sample_data)

# Load the CSV file into a DataFrame
df = pd.read_csv("sales.csv")
print("Data loaded successfully!")
print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")

3. First Look at a Dataset#

Before analysing any dataset, you must inspect it. This helps you understand:

  • What columns exist

  • What data types are present

  • Whether there are missing values

  • Basic statistics about the data

Essential inspection methods:

  • df.head() — First 5 rows

  • df.info() — Column types and missing values

  • df.describe() — Summary statistics

# View the first few rows
print("=== First 5 Rows ===")
print(df.head())
# Check data types and missing values
print("=== DataFrame Info ===")
df.info()
# Summary statistics for numerical columns
print("=== Summary Statistics ===")
df.describe()

4. DataFrames and Series#

Pandas has two core data structures:

DataFrame — A 2D table with rows and columns (like a spreadsheet)

Series — A single column of data (1D)

When you select a single column from a DataFrame, you get a Series. When you select multiple columns, you get a DataFrame.

# Selecting a single column returns a Series
revenue_series = df["revenue"]
print(f"Type: {type(revenue_series)}")
print(revenue_series)
# Selecting multiple columns returns a DataFrame
subset_df = df[["product", "revenue"]]
print(f"Type: {type(subset_df)}")
subset_df

Part 2: Data Manipulation#

5. Filtering Data#

Filtering allows you to select rows that meet certain conditions. This is one of the most common operations in data analysis.

Syntax: df[condition]

The condition is a boolean expression that evaluates to True or False for each row.

# Filter rows where revenue is greater than 2000
high_revenue = df[df["revenue"] > 2000]
print("=== High Revenue Sales (> 2000) ===")
print(high_revenue)
# Multiple conditions with & (and) or | (or)
north_high = df[(df["region"] == "North") & (df["revenue"] > 1500)]
print("=== North Region with Revenue > 1500 ===")
north_high
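
Sorting, mentioned alongside filtering at the start of this module, works the same way: sort_values() returns a new DataFrame ordered by one or more columns. A quick sketch with the same data:

# Sort by revenue, highest first (missing values are placed last by default)
top_sales = df.sort_values("revenue", ascending=False)
print("=== Sales Sorted by Revenue ===")
print(top_sales.head(3))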

6. Handling Missing Data#

Real-world data often has missing values. Pandas represents these as NaN (Not a Number).

Common strategies:

  • df.isna() — Detect missing values

  • df.fillna(value) — Replace missing values

  • df.dropna() — Remove rows with missing values

# Check for missing values
print("=== Missing Values per Column ===")
print(df.isna().sum())
# Fill missing revenue with the column mean
df_filled = df.copy()
df_filled["revenue"] = df_filled["revenue"].fillna(df_filled["revenue"].mean())
print("=== After Filling Missing Values ===")
print(df_filled.isna().sum())
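
The other strategy listed above, dropping incomplete rows, is a single call. A quick sketch for comparison:

# Dropping instead of filling: removes any row with at least one missing value
df_dropped = df.dropna()
print(f"Rows before dropna: {len(df)}, rows after: {len(df_dropped)}")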

7. Transforming Data#

Data transformation means creating new columns or modifying existing ones. This is essential for feature engineering — creating useful variables for analysis or machine learning.

Common transformations:

  • Calculating derived values (profit = revenue - cost)

  • Converting data types

  • Applying functions to columns

# Create a new column: revenue per unit
df_filled["revenue_per_unit"] = df_filled["revenue"] / df_filled["units"]

# Convert date string to datetime
df_filled["date"] = pd.to_datetime(df_filled["date"])

# Extract day of week
df_filled["day_of_week"] = df_filled["date"].dt.day_name()

print("=== Transformed Data ===")
df_filled.head()
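
The list above also mentions applying functions to columns. For logic that goes beyond simple arithmetic, .apply() passes each value of a column to a function you write. A small sketch, using a labelling rule invented for this example:

# Label each sale with a custom function applied to the revenue column
def label_sale(revenue):
    return "High" if revenue >= 2000 else "Standard"

df_filled["sale_size"] = df_filled["revenue"].apply(label_sale)
print(df_filled[["product", "revenue", "sale_size"]].head())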

8. Grouping and Aggregation#

Grouping allows you to split data into categories and calculate summary statistics for each group.

Syntax: df.groupby(column)[value_column].aggregate_function()

Common aggregations: sum(), mean(), count(), min(), max()

# Total revenue by region
print("=== Total Revenue by Region ===")
revenue_by_region = df_filled.groupby("region")["revenue"].sum()
print(revenue_by_region)
# Multiple aggregations
print("\n=== Summary by Product ===")
product_summary = df_filled.groupby("product").agg({
    "revenue": ["sum", "mean"],
    "units": "sum"
})
print(product_summary)

Part 3: Numerical & Statistical Computing#

9. NumPy for Numerical Work#

NumPy is the foundation of scientific computing in Python. It provides fast, memory-efficient arrays and mathematical operations.

Pandas is actually built on top of NumPy, so understanding NumPy helps you work more effectively with data.

import numpy as np

# Create a NumPy array
revenues = np.array([1500, 2200, 1800, 1900, 2100, 1750, 1600, 2400])

# Basic statistics
print(f"Mean: {np.mean(revenues):.2f}")
print(f"Median: {np.median(revenues):.2f}")
print(f"Std Dev: {np.std(revenues):.2f}")
print(f"Min: {np.min(revenues)}, Max: {np.max(revenues)}")
# NumPy with Pandas
# Get the underlying NumPy array from a Series
revenue_array = df_filled["revenue"].values
print(f"Type: {type(revenue_array)}")
print(f"Sum using NumPy: {np.sum(revenue_array):.2f}")

10. Statistics with SciPy#

SciPy extends NumPy with advanced statistical functions, including hypothesis tests, distributions, and correlation analysis.

Key function: stats.pearsonr(x, y) — Calculates the Pearson correlation coefficient and p-value.

from scipy import stats

# Calculate correlation between revenue and units
correlation, p_value = stats.pearsonr(df_filled["revenue"], df_filled["units"])

print(f"Pearson Correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

if correlation > 0.7:
    print("Strong positive correlation!")
elif correlation > 0.3:
    print("Moderate positive correlation.")
else:
    print("Weak or no correlation.")

Part 4: Data Visualisation#

11. Visualisation with Matplotlib#

Matplotlib is Python’s foundational plotting library. It gives you complete control over every aspect of a chart.

While it can be verbose, understanding Matplotlib helps you customise any visualisation.

import matplotlib.pyplot as plt

# Line chart of revenue over time
plt.figure(figsize=(10, 5))
plt.plot(df_filled["date"], df_filled["revenue"], marker="o", linewidth=2)
plt.title("Revenue Over Time")
plt.xlabel("Date")
plt.ylabel("Revenue ($)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Bar chart of revenue by region
plt.figure(figsize=(8, 5))
revenue_by_region.plot(kind="bar", color="steelblue", edgecolor="black")
plt.title("Total Revenue by Region")
plt.xlabel("Region")
plt.ylabel("Total Revenue ($)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

12. Visualisation with Seaborn#

Seaborn is built on Matplotlib but provides a higher-level interface for statistical visualisations. It produces beautiful charts with less code.

Seaborn excels at showing relationships and distributions in data.

import seaborn as sns

# Set the style
sns.set_style("whitegrid")

# Histogram of revenue distribution
plt.figure(figsize=(8, 5))
sns.histplot(df_filled["revenue"], bins=10, kde=True, color="coral")
plt.title("Distribution of Revenue")
plt.xlabel("Revenue ($)")
plt.tight_layout()
plt.show()
# Box plot by product
plt.figure(figsize=(8, 5))
sns.boxplot(data=df_filled, x="product", y="revenue", palette="Set2")
plt.title("Revenue Distribution by Product")
plt.tight_layout()
plt.show()
# Scatter plot with regression line
plt.figure(figsize=(8, 5))
sns.regplot(data=df_filled, x="units", y="revenue", scatter_kws={"alpha": 0.7})
plt.title("Revenue vs Units Sold")
plt.tight_layout()
plt.show()

13. Interactive Charts with Plotly#

Plotly creates interactive visualisations that users can zoom, pan, and hover over to see details. This is especially useful for dashboards and presentations.

Plotly Express (px) provides a simple interface similar to Seaborn.

import plotly.express as px

# Interactive bar chart
fig = px.bar(
    df_filled, 
    x="region", 
    y="revenue", 
    color="product",
    title="Revenue by Region and Product",
    barmode="group"
)
fig.show()
# Interactive scatter plot
fig = px.scatter(
    df_filled,
    x="units",
    y="revenue",
    color="region",
    size="revenue",
    hover_data=["product", "date"],
    title="Revenue vs Units (Hover for Details)"
)
fig.show()

Part 5: Advanced Data Sources#

14. Automated Data Profiling#

For quick exploratory analysis, ydata-profiling (formerly pandas-profiling) generates comprehensive HTML reports automatically.

This is incredibly useful when you first receive a new dataset and need to understand it quickly.

# Note: ydata-profiling may need to be installed
# !pip install ydata-profiling

# For large datasets, use minimal mode
# from ydata_profiling import ProfileReport
# report = ProfileReport(df_filled, title="Sales Data Report", minimal=True)
# report.to_notebook_iframe()

# For now, let's create a simple profile manually
print("=== Quick Data Profile ===")
print(f"Rows: {len(df_filled)}")
print(f"Columns: {len(df_filled.columns)}")
print(f"\nColumn Types:")
print(df_filled.dtypes)
print(f"\nMissing Values:")
print(df_filled.isna().sum())

15. Data from APIs and Databases#

Real-world data often comes from web APIs or databases rather than files.

APIs — Use the requests library to fetch data from web services

Databases — Use sqlite3 (built-in) or sqlalchemy for database connections

import requests

# Example: Fetching data from a public API
# Note: This is a demonstration - actual API may vary

# Simulated API response
api_response = {
    "status": "success",
    "data": [
        {"symbol": "AAPL", "price": 178.50},
        {"symbol": "GOOGL", "price": 141.25},
        {"symbol": "MSFT", "price": 378.90}
    ]
}

# Convert API response to DataFrame
stocks_df = pd.DataFrame(api_response["data"])
print("=== Stock Data from API ===")
stocks_df
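
Against a live service, the usual pattern is a requests.get() call followed by .json(). A minimal sketch with a placeholder URL, left commented out because the endpoint is hypothetical and a real API's response shape will differ:

# Hypothetical live request -- replace the URL with a real endpoint before running
# response = requests.get("https://api.example.com/stocks", timeout=10)
# response.raise_for_status()                  # stop early on a bad status code
# live_df = pd.DataFrame(response.json()["data"])
# print(live_df.head())
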
import sqlite3

# Create an in-memory SQLite database
conn = sqlite3.connect(":memory:")

# Save our DataFrame to a SQL table
df_filled.to_sql("sales", conn, index=False)

# Query the database using SQL
query = """
SELECT region, SUM(revenue) as total_revenue
FROM sales
GROUP BY region
ORDER BY total_revenue DESC
"""

result = pd.read_sql_query(query, conn)
print("=== SQL Query Result ===")
result
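
One housekeeping step not shown above: close the connection once you have finished querying.

conn.close()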

Summary#

In this module, you learned to:

  1. Load and inspect data using Pandas

  2. Filter, clean, and transform datasets

  3. Aggregate data with groupby operations

  4. Perform numerical analysis with NumPy and SciPy

  5. Create visualisations with Matplotlib, Seaborn, and Plotly

  6. Work with APIs and databases as data sources

These skills form the foundation for machine learning and AI work in later modules.


Next Steps#

  1. Complete the Module 2 Quiz to test your understanding

  2. Work through the Module 2 Assessment using real financial data

  3. Experiment with your own datasets!

Launch Online: Click the Rocketship Icon at the top of this page to launch this notebook in JupyterLab, Jupyter Notebook, or Google Colab.