An end-to-end retail analytics case study β from raw data to an interactive Power BI dashboard.
Maintainer: Siddharth Kori
This repository documents my hands-on implementation of a retail customer-behavior analytics project, built as part of my learning journey toward a Data Analyst / Business Intelligence role.
I did not design the original business case or collect the dataset β both originate from a public learning project created by Siddharth Kori (see Credits & Attribution). What you'll find here is my own walkthrough: I cleaned and transformed the data myself in Python, wrote and ran every SQL query independently to validate the results, built the Power BI dashboard from scratch, and recorded the insights and recommendations in my own words based on what the numbers actually showed. I'm sharing it as a transparent, reproducible record of that process rather than as original research.
The project simulates a real-world analytics engagement for a retail business, structured around the three tools most commonly expected of a Data Analyst:
| Stage | Tool | What I Did |
|---|---|---|
| Data Preparation | Python (Pandas) | Cleaned, validated, and engineered new features in a 3,900-row retail transactions dataset |
| Business Analysis | SQL (PostgreSQL) | Loaded the cleaned data into a relational database and wrote 10 queries to answer specific business questions |
| Visualization | Power BI | Built a single-page interactive dashboard with KPI cards, slicers, and category/age-group breakdowns |
The goal was to go through the full analyst workflow β not just analyze data, but also interpret it and turn it into something a business stakeholder could act on.
"How can the company leverage consumer shopping data to identify trends, improve customer engagement, and optimize marketing and product strategies?"
A retail business wants to understand how its customers shop β across demographics, product categories, and purchase channels β in order to:
- Identify which factors (discounts, reviews, season, shipping preference, subscription status) actually influence spending and repeat purchases
- Segment customers meaningfully for targeted marketing
- Decide where discounting helps revenue versus where it simply erodes margin
- Surface which products and age groups drive the most revenue
The full brief is preserved in reports/business_problem_statement.pdf.
| Detail | Value |
|---|---|
| Source | Customer Shopping Trends Dataset (Kaggle, synthetic retail data) |
| Rows | 3,900 customer transactions |
| Columns | 18 |
| Missing values | 37 (in Review Rating only) |
Key fields:
- Demographics: Age, Gender, Location
- Transaction details: Item Purchased, Category, Purchase Amount (USD), Size, Color, Season
- Behavioral signals: Review Rating, Subscription Status, Discount Applied, Promo Code Used, Previous Purchases, Frequency of Purchases
- Fulfillment: Shipping Type, Payment Method
Raw data lives in data/customer_shopping_behavior.csv.
| Tool | Purpose |
|---|---|
| Python (Pandas, SQLAlchemy) | Data cleaning, feature engineering, and loading data into a SQL database |
| SQL (PostgreSQL) | Structured querying to answer 10 specific business questions |
| Power BI | Interactive dashboard for stakeholder-facing visualization |
| Jupyter Notebook | Documenting the cleaning and EDA process step by step |
All cleaning steps are documented in notebooks/customer_shopping_behavior_analysis.ipynb. Here's what I actually did and why:
- Initial inspection β used
.info()and.describe(include='all')to check data types, ranges, and overall structure before touching anything. - Missing value handling β found 37 nulls, all in
Review Rating. Rather than dropping rows or filling with a single global average (which would mask category-level quality differences), I imputed each missing rating with the median rating of that product's category. - Column standardization β renamed all columns to
snake_case(e.g.Purchase Amount (USD)βpurchase_amount) so the schema would behave consistently once it moved into SQL and Power BI. - Feature engineering β
age_groupβ bucketedageinto four quartile-based segments (Young Adult,Adult,Middle-aged,Senior) usingpd.qcut, so age-based revenue analysis didn't depend on arbitrary fixed cutoffs. - Feature engineering β
purchase_frequency_daysβ converted the categoricalfrequency_of_purchasesfield (e.g. "Weekly", "Fortnightly", "Quarterly") into a numeric day-interval equivalent, making it usable in quantitative comparisons rather than just as a label. - Redundancy check β tested whether
discount_appliedandpromo_code_usedwere duplicate signals. They matched on every single row, so I droppedpromo_code_usedto avoid double-counting the same effect in later analysis. - Database load β connected the cleaned DataFrame to a PostgreSQL database via SQLAlchemy so the SQL stage could run against a real relational table instead of a flat file.
Before jumping into business questions, I profiled the dataset to understand its shape and any biases I'd need to account for:
- Gender split is uneven: 2,652 male records vs. 1,248 female β important context, since raw revenue totals by gender are not directly comparable per customer.
- Category mix: Clothing (44.5%) and Accessories (31.8%) dominate transaction volume; Footwear (15.4%) and Outerwear (8.3%) make up the rest.
- Season and payment method are both fairly evenly distributed (no single season or payment type dominates), which told me seasonality and payment preference were unlikely to be major revenue drivers on their own.
- Average review rating is nearly identical across genders (3.74 vs. 3.75) β product satisfaction doesn't appear to differ by gender in this data.
- Subscription is a minority behavior: only ~27% of customers (1,053 of 3,900) are subscribed.
This profiling step shaped which SQL questions were worth digging into further.
These are the findings I drew directly from running the SQL queries in sql/customer_behavior_queries.sql against the cleaned dataset:
- Higher male revenue is a volume effect, not a spending-power effect. Male customers generated $157,890 vs. $75,191 from female customers β but once you normalize for the gender imbalance in the data, the average spend per customer is almost identical (~$59.5 male vs. ~$60.3 female). The takeaway isn't "men spend more," it's "there are more male customers in this dataset."
- Discount usage isn't a reliable signal of price sensitivity. 839 customers used a discount and still spent above the average purchase amount ($59.76) β a meaningful chunk of "discount shoppers" are not actually budget-constrained.
- Subscribers don't outspend non-subscribers. Average spend was virtually flat between subscribers ($59.49) and non-subscribers ($59.87). Subscription drives recurring engagement, not bigger basket sizes β which changes how a subscription program should be pitched internally.
- Customer segmentation is heavily skewed toward "Loyal." Using the New (1 purchase) / Returning (2β10) / Loyal (10+) split, 79.9% of customers already fall into "Loyal," while only 2.1% are "New." That's a strong signal the acquisition funnel is thin relative to the retained base β worth flagging as a possible new-customer acquisition gap rather than a loyalty success story.
- Repeat purchase behavior and subscription aren't strongly linked. Among customers with more than 5 previous purchases, only ~27.6% are subscribed β almost identical to the ~27% subscription rate across the entire customer base. Being a repeat buyer doesn't meaningfully predict subscription.
- Express shipping customers spend modestly more. Average purchase amount is $60.48 for Express vs. $58.46 for Standard shipping β a small but consistent ~3.5% premium associated with faster delivery.
- The highest-rated products are also some of the most discount-dependent. "Hat" tops the discount-rate list at 50% of its purchases involving a discount, yet still ranks in the top 5 by average rating (3.80). High satisfaction and heavy discounting aren't mutually exclusive β but it raises the question of whether the rating is partly a function of perceived deal value.
- Revenue is fairly evenly spread across age groups, with Young Adults contributing the most ($62,143) and Seniors the least ($55,763) β a gap of roughly 10%, not a dramatic skew.
The Power BI dashboard (dashboard/customer_behavior_dashboard.pbix) is a single, densely-informative page built around quick stakeholder scanning:
- KPI cards β Number of Customers, Average Purchase Amount, and Average Review Rating at a glance
- Subscription donut chart β share of customers by subscription status
- Revenue by Category and Sales (volume) by Category β clustered column charts side by side, so revenue and order count can be compared directly
- Revenue by Age Group and Sales by Age Group β clustered bar charts using the engineered
age_groupfield - Interactive slicers for Subscription Status, Gender, Category, and Shipping Type β so any stakeholder can filter the entire page down to a specific customer slice without touching the underlying data
Based on the insights above, here's what I'd recommend to the business:
- Don't read raw gender revenue totals at face value. Report per-customer averages alongside totals so leadership doesn't draw the wrong conclusion about gender-based spending power.
- Re-evaluate the discount strategy. Since a large share of discount users would likely have purchased anyway, consider tightening discount eligibility (e.g. first-time buyers, cart abandoners) rather than applying it broadly.
- Reframe the subscription pitch around retention, not basket size. Since subscribers don't spend more per order, marketing should emphasize convenience and exclusive access rather than implying bigger savings or higher-value purchases.
- Invest more deliberately in top-of-funnel acquisition. With "New" customers making up only ~2% of the base, the business may be retaining well but acquiring slowly β worth validating against actual marketing spend and channel data.
- Promote Express shipping at checkout. The modest spend premium associated with Express orders suggests upselling faster delivery could lift average order value with minimal friction.
- Use top-rated, low-discount-dependency products in campaigns (e.g. Sandals, Boots, Skirt) where customer satisfaction isn't being propped up by price cuts.
Things I'd like to explore if I extend this project further:
- Add cohort or time-series analysis β this dataset has no transaction date field, so all "frequency" analysis is categorical rather than truly time-based. A version with timestamps would allow real retention-curve analysis.
- Build a customer lifetime value (CLV) estimate combining purchase amount, frequency, and previous purchases into a single score.
- Add a second Power BI page dedicated to product-level performance (top products per category, discount-rate vs. rating scatter, etc.) instead of compressing everything into one page.
- Automate the Python β SQL load with a small script/Makefile instead of manual notebook cells, so the pipeline can be re-run end-to-end.
- Re-test the New/Returning/Loyal segmentation thresholds β the current cutoffs produce a heavily Loyal-skewed split, and revisiting them against actual business definitions of "loyalty" would make the segmentation more useful.
Customer-Trends-Analysis-SQL-Python-PowerBI/
βββ README.md
βββ LICENSE
βββ data/
β βββ customer_shopping_behavior.csv
βββ notebooks/
β βββ customer_shopping_behavior_analysis.ipynb # Data cleaning, feature engineering, EDA, DB load
βββ sql/
β βββ customer_behavior_queries.sql # 10 business-question queries
βββ dashboard/
β βββ customer_behavior_dashboard.pbix # Power BI dashboard
βββ reports/
βββ business_problem_statement.pdf
βββ customer_shopping_behavior_analysis_report.pdf
βββ customer_shopping_behavior_analysis_presentation.pptx
-
Clone the repository
git clone https://github.com/<your-username>/Customer-Trends-Analysis-SQL-Python-PowerBI.git cd Customer-Trends-Analysis-SQL-Python-PowerBI
-
Run the notebook β open
notebooks/customer_shopping_behavior_analysis.ipynbto walk through data loading, cleaning, feature engineering, and the database load step. Update the database credentials in the connection cell to match your local PostgreSQL/MySQL/SQL Server setup. -
Run the SQL queries β once the cleaned data is loaded into your database, open
sql/customer_behavior_queries.sqland run each query against thecustomertable to reproduce the business-question answers above. -
Open the dashboard β load
dashboard/customer_behavior_dashboard.pbixin Power BI Desktop and point the data source to your database (or refresh from the CSV directly) to explore the visuals interactively.
Siddharth Kori Data Analyst in training β building a portfolio in SQL, Python, and Power BI.
If you're also learning data analytics, feel free to fork this, try it on the same dataset yourself, and compare notes β that's exactly how I used it.