Python-Powered ETL: From Raw Data to Executive Reports
| Issue | Column | Rows Affected | Severity |
|---|---|---|---|
| Missing values | units_sold, revenue | 23 | Medium |
| Inconsistent date format | date | 1,247 | Medium |
| Case inconsistencies | region, product | 312 | Low |
| Negative values | units_sold, revenue | 8 | Medium |
| Missing sales rep | sales_rep | 45 | Low |
import pandas as pd
import numpy as np
from datetime import datetime
# Stage 2a: Load raw data
df = pd.read_csv("sales_raw.csv", parse_dates=["date"])
# Stage 2b: Clean missing values
df = df.dropna(subset=["units_sold", "revenue"])
df["sales_rep"] = df["sales_rep"].fillna("Unassigned")
# Stage 2c: Normalize and convert types
df["region"] = df["region"].str.strip().str.title()
df["product"] = df["product"].str.strip().str.title()
df["units_sold"] = df["units_sold"].astype(int)
# Stage 2d: Filter invalid records
df = df[df["units_sold"] > 0]
df = df[df["revenue"] > 0]
# Stage 2e: Load rep metadata and merge
reps = pd.read_csv("sales_reps.csv")
df = df.merge(reps, on="sales_rep", how="left")
# Stage 2f: Aggregate with groupby
summary = df.groupby(["region", "product"]).agg(
total_units=("units_sold", "sum"),
total_revenue=("revenue", "sum"),
avg_price=("revenue", "mean"),
transaction_count=("units_sold", "count")
).reset_index()
| Input rows | 12,847 |
| Rows dropped | -31 |
| Output rows | 12,816 |
| Completeness | 99.8% |
| Processing time | 1.8s |
# Export cleaned data
summary.to_csv("sales_clean.csv", index=False)
summary.to_excel("sales_clean.xlsx", index=False)
print(summary.to_string(index=False))
| Region | Product | Total Units | Total Revenue | Avg Price | Transactions |
|---|---|---|---|---|---|
| Northeast | Widget A | 4,280 | $128,400 | $30.00 | 856 |
| Northeast | Widget B | 2,150 | $86,000 | $40.00 | 430 |
| Southeast | Widget A | 3,640 | $109,200 | $30.00 | 728 |
| Southeast | Widget B | 3,920 | $156,800 | $40.00 | 784 |
| West | Widget A | 1,890 | $56,700 | $30.00 | 378 |
| West | Widget B | 1,420 | $56,800 | $40.00 | 284 |
| West | Widget C | 2,310 | $115,500 | $50.00 | 462 |
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
# Revenue by Region - Grouped Bar Chart
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Chart 1: Revenue by region and product
pivot = summary.pivot(
index="region",
columns="product",
values="total_revenue"
)
pivot.plot(kind="bar", ax=axes[0], color=["#6366f1", "#0ea5e9", "#10b981"])
axes[0].set_title("Revenue by Region & Product", fontweight="bold")
axes[0].yaxis.set_major_formatter(mticker.StrMethodFormatter("${x:,.0f}"))
# Chart 2: Units sold distribution
summary.groupby("region")["total_units"].sum().plot(
kind="pie", ax=axes[1], autopct="%1.1f%%",
colors=["#6366f1", "#0ea5e9", "#10b981"]
)
plt.tight_layout()
plt.savefig("charts/revenue_analysis.png", dpi=150)
from fpdf import FPDF
from datetime import datetime
class SalesReport(FPDF):
def header(self):
self.set_font("Helvetica", "B", 16)
self.cell(0, 10, "Q1 2024 Sales Report", ln=True)
self.set_font("Helvetica", "", 9)
self.cell(0, 6, f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
def add_summary_table(self, data):
self.set_font("Helvetica", "B", 10)
for col in data.columns:
self.cell(32, 8, col, border=1)
self.ln()
self.set_font("Helvetica", "", 9)
for _, row in data.iterrows():
for val in row:
self.cell(32, 7, str(val), border=1)
self.ln()
# Generate the report
pdf = SalesReport()
pdf.add_page()
pdf.add_summary_table(summary)
pdf.image("charts/revenue_analysis.png", x=10, w=190)
pdf.output("reports/Q1_2024_Sales_Report.pdf")
Generated: 2024-04-01 09:15 | Ironshore Analytics LLC
Total revenue of $709,400 across 3 regions and 3 product lines, representing a +12.3% increase over Q4 2023.
Southeast region leads with $265,800 in revenue (37.5%), followed by Northeast at $214,400 (30.2%) and West at $229,000 (32.3%).
1. Revenue by Region & Product (Bar) 2. Units Sold Distribution (Pie)
reports/Q1_2024_Sales_Report.pdf — ready for executive distribution. Runs unattended via cron at 6:00 AM daily.