Project Portfolio

End-to-end analytics projects solving real healthcare business problems — from raw data to executive dashboards and measurable outcomes.

$2M+

Revenue Leakage Found

18%

Denial Rate Reduction

$500K+

Fraud Claims Flagged

120+

Providers Benchmarked

Case Study 01

Healthcare Data Pipeline on Azure

Medallion Architecture — ADLS Gen2 + Azure Data Factory

GitHub

3 | 2 | Parquet

The Problem

Healthcare claims arrive continuously as raw, untyped files that aren't query-ready. Teams need a pipeline that safely lands raw data, transforms it into a clean and trustworthy layer, and produces business-facing aggregates — reliably, on a schedule, and with no servers to manage.

Approach

▸Built a medallion data lake on ADLS Gen2 (hierarchical namespace) with bronze, silver, and gold zones
▸Authored an Azure Data Factory pipeline orchestrating two Mapping Data Flows on managed Spark
▸bronze→silver: cast and standardize raw CSV into typed, compressed Parquet; silver→gold: aggregate claims and paid amounts by provider
▸Chained the flows with a success dependency and a daily schedule trigger, with full run history in ADF Monitor
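
The two transformations can be sketched in plain Python. This is only an illustration of the bronze→silver cast and the silver→gold aggregation, not the ADF Mapping Data Flows themselves (those are authored in Data Factory and run on managed Spark). The column names and sample rows are assumptions made up for the example.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical raw (bronze) extract: every value arrives as an untyped string.
RAW_CSV = """claim_id,provider_id,service_date,paid_amount
C001, p-100 ,2024-01-05,125.50
C002,P-100,2024-01-06,74.50
C003,p-200,2024-01-06,300.00
"""


def bronze_to_silver(raw_text):
    """Cast and standardize raw CSV rows into typed records."""
    typed = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        typed.append({
            "claim_id": row["claim_id"].strip(),
            # Standardize provider IDs: trim whitespace, upper-case.
            "provider_id": row["provider_id"].strip().upper(),
            "service_date": date.fromisoformat(row["service_date"].strip()),
            "paid_amount": float(row["paid_amount"]),
        })
    return typed


def silver_to_gold(silver_rows):
    """Aggregate claim counts and paid amounts by provider."""
    totals = defaultdict(lambda: {"claim_count": 0, "total_paid": 0.0})
    for row in silver_rows:
        bucket = totals[row["provider_id"]]
        bucket["claim_count"] += 1
        bucket["total_paid"] += row["paid_amount"]
    return dict(totals)


silver = bronze_to_silver(RAW_CSV)
gold = silver_to_gold(silver)
```

Here `gold` maps each standardized provider ID to its claim count and total paid amount, which is the shape of the provider-level mart described above.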

Results

✓End-to-end pipeline runs verified: raw CSV → typed Parquet (silver) → curated marts (gold)
✓Serverless transforms on managed Spark — no clusters to provision or babysit
✓Columnar Parquet shrank the dataset by ~65% vs. raw CSV for faster, cheaper queries
✓Reproducible from source control — all ADF artifacts exported as JSON

Azure | Data Factory | ADLS Gen2 | Mapping Data Flows | Parquet | Spark

Case Study 02

Medicare Claims Data Warehouse

Snowflake Cloud Warehouse — Star Schema, VARIANT (FHIR), Native CDC

GitHub

500K+ | $1.26B | 10

The Problem

Payers and providers sit on huge volumes of structured claims alongside deeply nested FHIR/JSON, arriving continuously. Traditional warehouses force a trade-off between performance and flexibility, require brittle pre-flattening of semi-structured data, and need heavy orchestration for incremental loads.

Approach

▸Architected a layered RAW → STAGING → MARTS warehouse in Snowflake with compute separated from storage (dedicated load vs. analytics warehouses)
▸Modeled 500K+ Medicare claims into a star schema (fact_claims with provider and drug dimensions) for fast, intuitive analytics
▸Ingested semi-structured FHIR/JSON natively using VARIANT and path queries — eliminating a pre-flattening pipeline
▸Built a serverless change-data-capture pipeline with Streams + Tasks, plus clustering keys, zero-copy clones, and Time Travel
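
To show what a path query over nested JSON buys you, here is a small Python sketch. In Snowflake the same idea is expressed with path notation on a VARIANT column; the `get_path` helper, the dotted path syntax, and the trimmed FHIR-style payload below are all hypothetical stand-ins for illustration, not the warehouse code.

```python
import json

# Hypothetical, heavily trimmed FHIR-style Claim resource.
RAW_PAYLOAD = """
{
  "resourceType": "Claim",
  "id": "claim-001",
  "patient": {"reference": "Patient/123"},
  "item": [
    {"sequence": 1, "net": {"value": 120.0, "currency": "USD"}},
    {"sequence": 2, "net": {"value": 80.0, "currency": "USD"}}
  ]
}
"""


def get_path(document, path):
    """Walk a dotted path such as 'item.0.net.value'; return None if absent."""
    current = document
    for key in path.split("."):
        if isinstance(current, list):
            # Numeric path segments index into arrays.
            if not key.isdigit() or int(key) >= len(current):
                return None
            current = current[int(key)]
        elif isinstance(current, dict):
            if key not in current:
                return None
            current = current[key]
        else:
            return None
    return current


claim = json.loads(RAW_PAYLOAD)
patient_ref = get_path(claim, "patient.reference")
first_line_amount = get_path(claim, "item.0.net.value")
missing = get_path(claim, "provider.reference")  # absent path yields None
total_net = sum(line["net"]["value"] for line in get_path(claim, "item"))
```

The nested document is stored as-is and fields are pulled out at query time, so no flattening step has to be maintained when the payload gains or loses attributes.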

Results

✓500K+ claims modeled; ~$1.26B in paid amounts analyzed
✓Native CDC pipeline loads only new records — no full refresh, no external orchestrator
✓Date-clustered fact table prunes micro-partitions for fast filtered queries
✓Zero-copy clones provide instant, prod-size dev environments at no extra storage cost

Snowflake | SQL | Star Schema | VARIANT | Streams & Tasks | Clustering

Case Study 03

Healthcare Fraud Risk Analytics

Anomaly Detection & Provider Risk Profiling

GitHub Tableau

10,000+ | $500K+ | 8

The Problem

Fraud costs the US healthcare system over $100B annually. Our organization needed a scalable approach to identify suspicious billing patterns across 10,000+ claims and flag high-risk providers before payments were issued.

Approach

▸Built an end-to-end fraud detection pipeline analyzing claims across 8 provider specialties and 5 insurance types
▸Developed a composite fraud risk score using statistical anomaly detection (Z-scores, IQR methods) on billing amounts, claim frequency, and service patterns
▸Created provider risk profiles comparing individual billing patterns against specialty benchmarks
▸Performed chi-square and t-test analyses to validate statistically significant fraud indicators
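
A minimal sketch of the Z-score and IQR checks, using only the Python standard library rather than the pandas/SciPy stack from the project. The sample amounts, the cutoff of 2.0, and the "count the methods that agree" composite are assumptions for illustration; the real score combines more factors.

```python
import statistics

# Hypothetical average billed amount per provider within one specialty.
billed = {
    "P1": 100.0, "P2": 110.0, "P3": 95.0, "P4": 105.0,
    "P5": 98.0, "P6": 102.0, "P7": 107.0, "P8": 400.0,
}

Z_THRESHOLD = 2.0      # assumed cutoff, chosen for this small sample
IQR_MULTIPLIER = 1.5   # conventional Tukey fence multiplier


def z_score_flags(values, threshold=Z_THRESHOLD):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [abs((v - mean) / stdev) > threshold for v in values]


def iqr_flags(values, k=IQR_MULTIPLIER):
    """Flag values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


providers = list(billed)
amounts = [billed[p] for p in providers]

# Composite score here is simply the number of methods that flag a provider.
risk_score = {
    p: int(z) + int(i)
    for p, z, i in zip(providers, z_score_flags(amounts), iqr_flags(amounts))
}
high_risk = [p for p, score in risk_score.items() if score == 2]
```

Requiring both methods to agree before flagging is one simple form of multi-factor scoring: a provider that trips only one test gets a lower score and is not sent to review.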

Results

✓$500K+ in suspicious claims flagged for review
✓Identified 3 provider specialties with fraud rates at 2x the baseline
✓Reduced false-positive rate by 35% through multi-factor scoring
✓Built executive dashboard enabling real-time fraud monitoring

Python | SQL | Tableau | pandas | SciPy | Jupyter

Case Study 04

Revenue Integrity Analytics

CMS Medicare Charge-to-Payment Variance Analysis

GitHub

250K+ | $2M+ | 12

The Problem

Revenue leakage from undercoded procedures, payer underpayments, and charge capture gaps was costing the organization millions annually. Leadership needed visibility into exactly where revenue was being lost and which provider contracts needed renegotiation.

Approach

▸Analyzed 250,000+ CMS Medicare provider-service records to identify charge-to-payment variance patterns
▸Built charge-to-payment ratio models comparing submitted charges against allowed and paid amounts across specialties and geographies
▸Identified high-variance HCPCS codes where reimbursement fell significantly below submitted charges
▸Segmented analysis by provider type, state, and service category to pinpoint systemic underpayment patterns
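
The core ratio is paid amount divided by submitted charge, rolled up by whichever dimension is under review. The sketch below is a dependency-free approximation of that calculation; the rows, dollar amounts, field names, and the 0.25 cutoff are invented for the example and are not CMS figures.

```python
from collections import defaultdict

# Hypothetical provider-service rows; amounts are illustrative only.
rows = [
    {"hcpcs": "99213", "state": "TX", "submitted": 200.0, "paid": 80.0},
    {"hcpcs": "99213", "state": "CA", "submitted": 220.0, "paid": 88.0},
    {"hcpcs": "93000", "state": "TX", "submitted": 100.0, "paid": 20.0},
    {"hcpcs": "93000", "state": "CA", "submitted": 150.0, "paid": 30.0},
    {"hcpcs": "36415", "state": "TX", "submitted": 30.0, "paid": 3.0},
]

LOW_RATIO = 0.25  # assumed cutoff for "reimbursed far below submitted charges"


def payment_ratio_by(rows, key):
    """Return total paid / total submitted for each value of `key`."""
    submitted = defaultdict(float)
    paid = defaultdict(float)
    for row in rows:
        submitted[row[key]] += row["submitted"]
        paid[row[key]] += row["paid"]
    return {k: paid[k] / submitted[k] for k in submitted if submitted[k] > 0}


by_code = payment_ratio_by(rows, "hcpcs")
by_state = payment_ratio_by(rows, "state")
low_paying_codes = sorted(code for code, ratio in by_code.items() if ratio < LOW_RATIO)
```

Summing before dividing weights each group by dollars rather than by row count, so a handful of small claims cannot dominate a code's ratio.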

Results

✓$2M+ in annual revenue leakage identified
✓12 underpaying payer agreements flagged for renegotiation
✓40% reduction in charge capture errors through automated flagging
✓Executive dashboard adopted by VP-level stakeholders for quarterly reviews

Python | SQL | Tableau | pandas | NumPy | Jupyter

Case Study 05

Claims Efficiency Analysis

Denial Root-Cause Analysis & Payer Benchmarking

GitHub

8,500+ | 18% | 90%

The Problem

Claim denial rates were trending upward, costing the organization in rework hours and delayed revenue. The team lacked visibility into which denial reasons were most prevalent, which payers were underperforming, and what the true cost of rework was.

Approach

▸Analyzed 8,500+ claims across 6 service lines and 4 payer types to build a comprehensive denial analytics framework
▸Performed root-cause analysis on denied claims, categorizing by denial reason (missing documentation, coding errors, authorization issues, timely filing)
▸Built payer benchmarking models comparing approval rates, processing times, and paid-to-billed ratios
▸Calculated rework cost impact including touch counts, processing days, and staff time allocation
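
Two of the building blocks above, denial rate per payer and the share of each denial reason, reduce to simple counting. This is a toy sketch with invented claims and field names, written with the standard library instead of pandas.

```python
from collections import Counter

# Hypothetical claims; denial_reason is None when the claim was approved.
claims = [
    {"payer": "Medicare", "denial_reason": None},
    {"payer": "Medicare", "denial_reason": None},
    {"payer": "Medicare", "denial_reason": "Missing Documentation"},
    {"payer": "Commercial", "denial_reason": "Coding Error"},
    {"payer": "Commercial", "denial_reason": "Missing Documentation"},
    {"payer": "Commercial", "denial_reason": None},
    {"payer": "Medicaid", "denial_reason": "Timely Filing"},
    {"payer": "Medicaid", "denial_reason": None},
]


def denial_rate_by_payer(claims):
    """Denied claims divided by all claims, per payer."""
    total = Counter(c["payer"] for c in claims)
    denied = Counter(c["payer"] for c in claims if c["denial_reason"])
    return {payer: denied[payer] / total[payer] for payer in total}


def denial_reason_share(claims):
    """Each denial reason as a fraction of all denied claims."""
    reasons = Counter(c["denial_reason"] for c in claims if c["denial_reason"])
    denied_total = sum(reasons.values())
    return {reason: count / denied_total for reason, count in reasons.items()}


rates = denial_rate_by_payer(claims)
shares = denial_reason_share(claims)
top_reason = max(shares, key=shares.get)
```

The same two tables, computed over real claims, are what feed the payer scorecards and the ranking of denial drivers.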

Results

✓18% reduction in claim denial rate through automated pre-submission flagging
✓Identified 'Missing Documentation' as #1 denial driver (38% of all denials)
✓Report generation time cut from 8 hours to 45 minutes via automated pipelines
✓Payer scorecards adopted for annual contract negotiations

Python | SQL | Tableau | pandas | matplotlib | Jupyter

Case Study 06

Provider Performance Analytics

Quality-Cost Benchmarking & Network Optimization

GitHub

120+ | 1,800+ | 2

The Problem

The provider network included 120+ providers but lacked standardized performance measurement. Network strategy decisions were based on anecdotal feedback rather than data, leading to inconsistent quality and cost outcomes.

Approach

▸Built provider scorecards tracking quality scores, cost efficiency, utilization metrics, and approval rates across 1,800+ monthly records
▸Developed peer percentile ranking (25th/50th/75th) within each specialty to enable fair benchmarking
▸Analyzed quality-cost correlation to identify providers delivering high quality at lower cost
▸Created network tier analysis comparing Standard vs. Premium tier outcomes
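
Ranking within a specialty means computing the 25th/50th/75th percentile cutoffs per peer group and placing each provider against their own group's cutoffs. The sketch below shows that idea with invented providers and quality scores; the band labels and the choice of the "inclusive" quantile method are assumptions.

```python
import statistics
from collections import defaultdict

# Hypothetical (provider_id, specialty, quality_score) records.
providers = [
    ("A", "Cardiology", 70), ("B", "Cardiology", 80), ("C", "Cardiology", 85),
    ("D", "Cardiology", 90), ("E", "Cardiology", 95),
    ("F", "Orthopedics", 60), ("G", "Orthopedics", 65), ("H", "Orthopedics", 70),
    ("I", "Orthopedics", 75), ("J", "Orthopedics", 85),
]


def specialty_cutoffs(providers):
    """Return [p25, p50, p75] quality-score cutoffs for each specialty."""
    scores = defaultdict(list)
    for _provider_id, specialty, score in providers:
        scores[specialty].append(score)
    return {
        specialty: statistics.quantiles(values, n=4, method="inclusive")
        for specialty, values in scores.items()
    }


def band(score, cutoffs):
    """Place a score into a quartile band relative to peer cutoffs."""
    p25, p50, p75 = cutoffs
    if score >= p75:
        return "top quartile"
    if score >= p50:
        return "above median"
    if score >= p25:
        return "below median"
    return "bottom quartile"


cutoffs = specialty_cutoffs(providers)
bands = {pid: band(score, cutoffs[spec]) for pid, spec, score in providers}
```

In this sample, providers C and J both score 85, yet C lands above the median in Cardiology while J is top quartile in Orthopedics, which is the reason for benchmarking against specialty peers rather than the whole network.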

Results

✓120+ providers benchmarked with standardized scorecards
✓Identified top-decile providers with 95+ quality scores and below-median costs
✓Network strategy decisions now data-driven, informing contract renewals
✓Monthly automated reporting replaced quarterly manual reviews

Python | SQL | Tableau | pandas | SciPy | Jupyter

Case Study 07

End-to-End Healthcare Pipeline

Data Engineering: Ingestion to Executive Reporting

GitHub

50K+ | 60% | 0%

The Problem

Healthcare data arrived from multiple sources in inconsistent formats with quality issues — missing values, duplicates, and schema drift. Analysts spent 60%+ of their time cleaning data rather than generating insights.

Approach

▸Designed a complete ETL pipeline: ingestion, validation, transformation, KPI modeling, and reporting
▸Built data quality checks including null detection, duplicate removal, schema validation, and referential integrity
▸Created transformation layer standardizing provider, claims, and patient data into analytics-ready formats
▸Automated KPI calculation and report generation with scheduling capabilities
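
The validation stage can be pictured as a single pass that either accepts a record or rejects it with a reason, while keeping counts for the audit log. This is a simplified sketch under assumed field names and an assumed provider reference set, not the production framework.

```python
from collections import Counter

# Assumed schema and reference data for the illustration.
EXPECTED_FIELDS = {"claim_id", "provider_id", "paid_amount"}
KNOWN_PROVIDERS = {"P-100", "P-200"}

records = [
    {"claim_id": "C1", "provider_id": "P-100", "paid_amount": 50.0},
    {"claim_id": "C1", "provider_id": "P-100", "paid_amount": 50.0},  # duplicate
    {"claim_id": "C2", "provider_id": "P-200", "paid_amount": None},  # null value
    {"claim_id": "C3", "provider_id": "P-999", "paid_amount": 10.0},  # unknown provider
    {"claim_id": "C4", "provider": "P-200", "paid_amount": 20.0},     # schema drift
    {"claim_id": "C5", "provider_id": "P-200", "paid_amount": 75.0},
]


def run_quality_checks(records):
    """Return (clean_records, audit_counts) after applying each check in order."""
    audit = Counter()
    seen_ids = set()
    clean = []
    for rec in records:
        audit["received"] += 1
        if set(rec) != EXPECTED_FIELDS:
            audit["rejected_schema"] += 1          # schema validation
        elif any(value is None for value in rec.values()):
            audit["rejected_null"] += 1            # null detection
        elif rec["claim_id"] in seen_ids:
            audit["rejected_duplicate"] += 1       # duplicate removal
        elif rec["provider_id"] not in KNOWN_PROVIDERS:
            audit["rejected_orphan"] += 1          # referential integrity
        else:
            seen_ids.add(rec["claim_id"])
            clean.append(rec)
            audit["accepted"] += 1
    return clean, dict(audit)


clean_records, audit_counts = run_quality_checks(records)
```

Because every record increments exactly one outcome counter, the audit counts always reconcile to the number received, which makes silent data loss easy to spot.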

Results

✓60% reduction in analyst data prep time
✓Zero manual data entry errors (previously 3-5% error rate)
✓Pipeline processes 50K+ records with full audit logging
✓Reusable framework adopted across 3 additional analytics projects

Python | SQL | ETL | Azure | pandas | Jupyter

Case Study 08

Healthcare Data Warehouse (dbt)

Analytics Engineering with dbt + PostgreSQL

GitHub

3 | 100% | 10x

The Problem

Analytics queries were running against raw, unnormalized tables — slow, error-prone, and inconsistent across teams. The organization needed a proper warehouse with staging, intermediate, and mart layers following analytics engineering best practices.

Approach

▸Built a dbt project with PostgreSQL implementing a three-layer warehouse: staging (cleaned sources), intermediate (business logic), and marts (consumption-ready)
▸Defined dbt models with proper materializations (views for staging, tables for marts) and incremental refresh strategies
▸Implemented data tests (unique, not_null, accepted_values, relationships) across all layers
▸Created documentation with dbt docs for full data lineage visibility
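
An incremental refresh processes only rows newer than what the target already holds, then upserts them by key. In the dbt project this is written as SQL models; the Python below is just a sketch of the logic, with a hypothetical `incremental_refresh` helper, invented rows, and `updated_at` assumed as the cursor column.

```python
def incremental_refresh(target, source_rows, unique_key="claim_id", cursor="updated_at"):
    """Upsert only source rows newer than the target's high-water mark.

    `target` is a dict keyed by `unique_key`; returns the number of rows processed.
    """
    high_water_mark = max((row[cursor] for row in target.values()), default=None)
    changed = [
        row for row in source_rows
        if high_water_mark is None or row[cursor] > high_water_mark
    ]
    for row in changed:
        target[row[unique_key]] = row  # insert new keys, overwrite existing ones
    return len(changed)


mart = {}

# ISO-8601 date strings compare correctly as plain strings.
day_one = [
    {"claim_id": "C1", "status": "pending", "updated_at": "2024-01-01"},
    {"claim_id": "C2", "status": "paid", "updated_at": "2024-01-01"},
]
day_two = day_one + [
    {"claim_id": "C1", "status": "paid", "updated_at": "2024-01-02"},
    {"claim_id": "C3", "status": "pending", "updated_at": "2024-01-02"},
]

first_run = incremental_refresh(mart, day_one)   # empty target: loads everything
second_run = incremental_refresh(mart, day_two)  # skips the unchanged day-one rows
```

The second run touches two rows instead of four, and claim C1 ends up with its latest status rather than a duplicate row.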

Results

✓Query performance improved 10x on mart tables vs. raw sources
✓100% test coverage across all models with automated CI checks
✓Self-serve analytics enabled — business users query marts directly
✓Full data lineage from source to dashboard via dbt docs

dbt | PostgreSQL | Python | SQL | Data Modeling

Want to see more?

Check out my interactive dashboards on Tableau Public or explore the full code on GitHub.

Tableau Public GitHub Profile View Resume