Automated Data Pipeline — Python ETL Showcase

1

Raw CSV Input

Input

⚠️ Source file contains inconsistent formatting, missing values, and duplicate entries that must be cleaned before analysis.

date,region,product,units_sold,revenue,sales_rep 2024-01-15,Northeast,Widget A,150,4500.00,J. Smith 2024-01-15,Southeast,Widget B,,NaN,R. Jones 2024/01/16,northeast,Widget A,200,6000.00,J. Smith 2024-01-16,West, Widget C,75,3750.00, 2024-01-17,Northeast,widget a,-10,-300.00,M. Lee 01/18/2024,Southeast,Widget B,320,9600,R. Jones

Issue	Column	Rows Affected	Severity
Missing values	units_sold, revenue	23	Medium
Inconsistent date format	date	1,247	Medium
Case inconsistencies	region, product	312	Low
Negative values	units_sold, revenue	8	Medium
Missing sales rep	sales_rep	45	Low

2

Python Processing (pandas)

ETL

        pipeline.py
        import pandas as pd
import numpy as np
from datetime import datetime

# Stage 2a: Load raw data
df = pd.read_csv("sales_raw.csv", parse_dates=["date"])

# Stage 2b: Clean missing values
df = df.dropna(subset=["units_sold", "revenue"])
df["sales_rep"] = df["sales_rep"].fillna("Unassigned")

# Stage 2c: Normalize and convert types
df["region"] = df["region"].str.strip().str.title()
df["product"] = df["product"].str.strip().str.title()
df["units_sold"] = df["units_sold"].astype(int)

# Stage 2d: Filter invalid records
df = df[df["units_sold"] > 0]
df = df[df["revenue"] > 0]

# Stage 2e: Load rep metadata and merge
reps = pd.read_csv("sales_reps.csv")
df = df.merge(reps, on="sales_rep", how="left")

# Stage 2f: Aggregate with groupby
summary = df.groupby(["region", "product"]).agg(
    total_units=("units_sold", "sum"),
    total_revenue=("revenue", "sum"),
    avg_price=("revenue", "mean"),
    transaction_count=("units_sold", "count")
).reset_index()
      

Processing Log

PASS Loaded 12,847 rows from CSV
WARN Dropped 23 rows (missing values)
PASS Normalized 312 case inconsistencies
WARN Removed 8 negative-value records
PASS Merged rep metadata (98.2% match)
PASS Grouped into 12 region-product combos

Data Quality

Input rows	12,847
Rows dropped	-31
Output rows	12,816
Completeness	99.8%
Processing time	1.8s

3

Clean Output

Output

        output
        # Export cleaned data
summary.to_csv("sales_clean.csv", index=False)
summary.to_excel("sales_clean.xlsx", index=False)
print(summary.to_string(index=False))
      

Region	Product	Total Units	Total Revenue	Avg Price	Transactions
Northeast	Widget A	4,280	$128,400	$30.00	856
Northeast	Widget B	2,150	$86,000	$40.00	430
Southeast	Widget A	3,640	$109,200	$30.00	728
Southeast	Widget B	3,920	$156,800	$40.00	784
West	Widget A	1,890	$56,700	$30.00	378
West	Widget B	1,420	$56,800	$40.00	284
West	Widget C	2,310	$115,500	$50.00	462

✅ All data validated: no nulls, no duplicates, consistent formatting. Exported to CSV, Excel, and JSON.

4

Visualization (matplotlib)

Charts

        charts.py
        import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Revenue by Region - Grouped Bar Chart
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Chart 1: Revenue by region and product
pivot = summary.pivot(
    index="region",
    columns="product",
    values="total_revenue"
)
pivot.plot(kind="bar", ax=axes[0], color=["#6366f1", "#0ea5e9", "#10b981"])
axes[0].set_title("Revenue by Region & Product", fontweight="bold")
axes[0].yaxis.set_major_formatter(mticker.StrMethodFormatter("${x:,.0f}"))

# Chart 2: Units sold distribution
summary.groupby("region")["total_units"].sum().plot(
    kind="pie", ax=axes[1], autopct="%1.1f%%",
    colors=["#6366f1", "#0ea5e9", "#10b981"]
)
plt.tight_layout()
plt.savefig("charts/revenue_analysis.png", dpi=150)
      

Revenue by Region & Product

NE

SE

West

Widget A

Widget B

Widget C

5

PDF Report Generation

Report

        report.py
        from fpdf import FPDF
from datetime import datetime

class SalesReport(FPDF):
    def header(self):
        self.set_font("Helvetica", "B", 16)
        self.cell(0, 10, "Q1 2024 Sales Report", ln=True)
        self.set_font("Helvetica", "", 9)
        self.cell(0, 6, f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")

    def add_summary_table(self, data):
        self.set_font("Helvetica", "B", 10)
        for col in data.columns:
            self.cell(32, 8, col, border=1)
        self.ln()
        self.set_font("Helvetica", "", 9)
        for _, row in data.iterrows():
            for val in row:
                self.cell(32, 7, str(val), border=1)
            self.ln()

# Generate the report
pdf = SalesReport()
pdf.add_page()
pdf.add_summary_table(summary)
pdf.image("charts/revenue_analysis.png", x=10, w=190)
pdf.output("reports/Q1_2024_Sales_Report.pdf")
      

Report Preview

Q1 2024 Sales Report

Generated: 2024-04-01 09:15 | Ironshore Analytics LLC

Executive Summary

Total revenue of $709,400 across 3 regions and 3 product lines, representing a +12.3% increase over Q4 2023.

$709,400 Revenue 19,610 Units 3,922 Transactions 3 Regions

Regional Performance

Southeast region leads with $265,800 in revenue (37.5%), followed by Northeast at $214,400 (30.2%) and West at $229,000 (32.3%).

Attached Charts

1. Revenue by Region & Product (Bar) 2. Units Sold Distribution (Pie)

Automated report — generated by Python pipeline in 4.2s — 3 pages

📋 PDF report saved to reports/Q1_2024_Sales_Report.pdf — ready for executive distribution. Runs unattended via cron at 6:00 AM daily.