Author: Ozkan Gelincik

  • From Raw CMS Claims to an Explainable Benchmarking Engine

    From Raw CMS Claims to an Explainable Benchmarking Engine

    Building a Medicare Provider Benchmarking Engine for expected-cost estimation, anomaly surfacing, provider tiering, and explainability

    Healthcare claims data is inherently adversarial: it is massive, noisy, and clinical variation often masks billing irregularities. For payers, the hard part isn’t finding outliers. It’s building a scalable, defensible way to separate expected clinical variation from patterns worth manual review. My capstone addresses this by transforming raw CMS claims into a statistically grounded benchmarking engine that surfaces over-expected behavior relative to a provider’s peer group.

    The result was the Medicare Provider Benchmarking Engine, an end-to-end system built on four years of real CMS Medicare provider-service data at HCPCS-level grain. For this version of the engine, I focused the cohort on oncology-related rendering provider specialties: Hematology-Oncology, Medical Oncology, Radiation Oncology, Gynecological Oncology, and Surgical Oncology.

    It combines:

    • robust exploratory data analysis,
    • supervised modeling for expected-cost estimation,
    • post-model guardrails,
    • row-level and provider-level anomaly surfacing,
    • provider tiering through clustering,
    • an interpretable logistic explainer model,
    • and an interactive Shiny for Python dashboard for navigating results.

    Project at a Glance:

    • The Problem: Healthcare payers need scalable ways to distinguish expected clinical variation from billing patterns requiring investigation.
    • The Solution: A routed machine-learning benchmarking engine using 4 years of CMS Medicare data to estimate expected costs at the provider-service level.
    • The Output: Four distinct, support-aware anomaly triage watchlists delivered through an interactive Shiny for Python dashboard.
    • Key Techniques: Supervised expected-cost estimation, unsupervised provider clustering, logistic explainer models, and strict post-model guardrails.

    Ultimately, this is a decision-support workflow: estimate expected cost, compare to peers, and surface over‑expected patterns in a cautious, explainable way. Importantly, this system does not label providers as fraudulent; instead, it builds a reproducible pipeline to flag over-expected behavior that warrants deeper investigation.

    Project goal

    The goal was to build a benchmarking system that could:

    • estimate expected standardized Medicare amount at the provider-service level,
    • compare observed versus expected behavior in a peer-relative framework,
    • surface rows and providers that repeatedly appear unusually over-expected,
    • group providers into defensible risk-style tiers based on anomaly burden,
    • and expose the full workflow in an interactive application.

    From the start, I optimized for downstream triage: stable scoring, defensible comparisons, and transparent reasons a row got flagged. 

    Table of contents

    The data foundation

    I built the project using CMS Medicare data from 2020 through 2023, scoped to oncology-related rendering provider specialties: Hematology-Oncology, Medical Oncology, Radiation Oncology, Gynecological Oncology, and Surgical Oncology, and engineered the final modeling dataset at the following grain: (Rndrng_NPI, HCPCS_Cd, Place_Of_Srvc, Year). This kept the benchmarking problem clinically focused while still preserving enough scale and diversity across provider types, HCPCS codes, and care settings to support robust modeling and anomaly surfacing.

    That modeling grain mattered for both prediction quality and anomaly interpretability. While I initially explored the broader RBCS (Restructured Betos Classification System) family level, the inherent clinical variance within those groupings was too high for precise benchmarking. Transitioning to the HCPCS-code level provided the clinical specificity required for true “apples-to-apples” comparisons, significantly reducing the likelihood of flagging complex legitimate case mixes as mere anomalies.

    Final HCPCS-level dataset

    The final dataset contained:

    • 1,210,275 rows
    • 21,591 unique NPIs
    • 1,742 HCPCS codes
    • 2 place-of-service categories
    • 4 years of data (2020 to 2023)
    Table-1: High-level HCPCS-level dataset summary
    Row grain*Total rowsUnique providers (NPI)Unique HCPCSUnique POSUnique Years
    (Rndrng_NPI, HCPCS_Cd, Place_Of_Srvc, Year)121027521591174224

    *Each row represented a provider’s behavior for a specific procedure or drug code, in a specific care setting, during a specific year.

    Exploratory data analysis: Why a one-size-fits-all model fails

    The first major lesson from EDA was that this problem could not be handled well by a single blunt modeling strategy. The data was large enough to be meaningful, but uneven enough to be dangerous. Support varied sharply across slices, cost behavior was highly skewed, and historical lag signal was strong when present but absent for many rows. Those three findings shaped the routed modeling architecture that followed.

    1. Coverage and mix

    The first thing I wanted to understand was the dataset’s scope, time coverage, and structural balance across years, place of service, and geography.

    Figure-1: rows per year
    Figure-2: Providers per year
    Figure-3: Total services per year
    Figure-4: Place-of-service distribution
    Figure-5: Place-of-service distribution by year
    Figure-6: Top-state by row
    Figure-7: Top-state by row

    A few patterns stood out immediately:

    • The dataset remained large and operationally meaningful across all four years.
    • Provider count rose modestly over time, even as row count and total services declined somewhat.
    • The data was dominated by office-based rows, with facility rows making up a smaller but still meaningful share.
    • A handful of states contributed a disproportionate share of rows and services.

    These patterns aren’t just descriptive. They determine who you can fairly compare, and where the statistics are trustworthy.

    1. Exposure and reliability

    A second lesson from EDA was that not all rows deserve equal trust. Some provider-service observations are backed by meaningful service volume and beneficiary count. Others are much thinner.

    Figure-8: Raw beneficiaries distribution
    Figure-9: log-transformed beneficiaries distribution
    Figure-10: Raw services distribution
    Figure-11: log-transformed services distribution
    Figure-12: Provider exposure distribution
    Figure-13: HCPCS slice support distribution for (HCPCS_Cd, Year)
    1. Cost behavior and lag structure

    The EDA also showed that standardized amount distributions were highly right-skewed and varied materially by HCPCS code. That made intuitive sense in oncology-related Medicare data, where some codes reflect relatively routine administration while others reflect expensive infused therapies or complex treatment activity.

    Bottom line: anomaly surfacing had to be support-aware. Extreme values in tiny peer slices shouldn’t carry the same weight as extremes in well-supported slices.

    Figure-14: Raw observed cost distribution
    Figure-15: log-transformed observed cost distribution
    Figure-16: Observed cost by top HCPCS codes
    1. Historical coverage structure

    A final structural feature of the dataset was historical coverage. Some provider-service-place-year rows had usable prior-year history, while others were true cold-start rows. That mattered because prior cost is a well-established predictor in Medicare spending and utilization modeling1,2, but it is not available uniformly across the full scoring universe. In parallel with this benchmarking work, I also spent substantial time experimenting with forecasting-style approaches built around lag structure, which reinforced how informative prior history can be when it is available. I discuss that forecasting work in more detail in the GitHub repository and plan to cover it more fully in future blog posts.

    Figure-17: Lag coverage by year

    The key EDA takeaway was structural: the benchmark universe mixes very different information environments (support depth, lag availability, and heavy skew). Some rows had rich prior-year history while others were true cold-starts. Some sat in massive, well-supported peer slices while others were sparse. And the target remained aggressively right-skewed throughout. That heterogeneity made it clear that a monolithic one-size-fits-all model would be a poor fit for the problem.

    Regression modeling and routed expected-cost estimation

    The modeling objective was to estimate the row-level expected standardized Medicare amount for each provider-service-place-year slice. But the deeper challenge was not simply to fit the best global model. It was to build a benchmarking engine that would remain useful for downstream anomaly surfacing and generalize to new providers the model had never seen before.

    That design choice shaped the evaluation framework from the start. I used provider-disjoint train/test splits and provider-aware cross-validation so the same NPI could not appear in both training and evaluation.

    To remain useful for downstream anomaly surfacing, the model had to be evaluated not just by global RMSE or R², but by how it behaved in the tail (i.e., services whose per unit cost is within the top 1% of all services provided in that specific year).

    Why tail behavior mattered

    The rows that drive action are the ones with big positive observed‑over‑expected gaps. If the model systematically underpredicted the upper tail, it would manufacture misleading anomalies, but if it over-smoothed true variation, it would hide useful signals.

    So, I evaluated models using both standard metrics and tail-aware diagnostics, including:

    • weighted RMSE,
    • tail prediction-to-actual ratio,
    • and tail bias.

    That requirement changed how I evaluated models: I cared as much about tail behavior as aggregate RMSE. 

    Baselines

    I compared several non-ML baselines and found that prior-year lag cost for the same provider, place of service, and HCPCS slice was one of the strongest simple predictors. That was not surprising. In healthcare cost forecasting and administrative claims settings, prior history is often one of the most informative signals available, so the baseline exercise served as both a benchmark and a structural diagnostic.

    The baseline results mattered because they confirmed something the EDA had only hinted at structurally: rows with usable prior history and rows without it were not equally easy prediction problems. Prior-year lag was a powerful signal when available, but a real benchmarking engine also had to cover true cold-start rows where no such anchor existed. That became one of the main reasons I treated lag-present and lag-absent rows differently in the final modeling architecture.

    Baseline A. Lag only

    • Predict the current row’s avg_mdcr_stdzd_amt using only lag1_avg_amt.
    • Evaluated only on rows where lag exists, because this baseline cannot score true cold-start rows.
    • Conceptually, this is the pure carry-forward baseline: “assume this provider-service-place slice looks like last year.”

    What it tests: how much predictive power is already present in simple historical carry-forward.

    Baseline B. Lag + backoff mean

    • Predict with lag1_avg_amt when lag exists.
    • If lag is missing, back off to the training-set mean at (provider_type, HCPCS_Cd, Place_Of_Srvc).
    • If that is unavailable, back off again to the HCPCS-only training mean, then finally to the global training mean.

    What it tests: whether a very simple hybrid of lag plus hierarchical means can provide broad coverage without machine learning.

    Baseline C. Provider-type + POS + state mean

    • Ignore lag entirely.
    • Predict using the training-set mean at (provider_type, Place_Of_Srvc, state).
    • If that mean is unavailable, fall back to the global training mean.

    What it tests: whether a coarse peer-group average can serve as a reasonable cold-start baseline.

    Baseline D. Full backoff ladder

    • Ignore lag entirely.
    • Predict using a more detailed hierarchy of training-only means:
      • (provider_type, HCPCS_Cd, Place_Of_Srvc, state)
      • then (provider_type, HCPCS_Cd, Place_Of_Srvc)
      • then HCPCS_Cd
      • then (provider_type, Place_Of_Srvc, state)
      • then (provider_type, Place_Of_Srvc)
      • then provider_type
      • then global mean

    What it tests: how far a carefully designed, no-ML hierarchical benchmark can go for true cold-start scoring.

    Table-2: Non-ML baseline summary 
    LabelTypeDescriptionR2 testMAERMSER2 test weightedMAE weightedRMSE weightedTail pred / yTail bias
    Baseline ABaselineLag only0.9425655.70792738.9465480.92943317.683063109.3791021.078064111.758282
    Baseline BBaselineLag + backoff0.9304218.61911055.0141370.94115627.599187136.1091300.975051-42.206414
    Baseline CBaselinePT + POS + state mean0.05886672.994691202.329669-0.007190187.663518563.1070990.087527-1543.616931
    Baseline DBaselineFull backoff ladder0.9319789.88992554.3948930.95068930.051029124.5966170.920057-135.238878

    The strongest baselines performed surprisingly well. That was a useful finding, not an inconvenience. It meant the final model architecture had to justify itself against serious alternatives, especially lag-only and lag-plus-backoff strategies.

    XGBoost regression families

    After establishing the non-ML reference points, I moved to supervised regression models. I initially tested ordinary least squares as a simple modeling baseline, but the structure of the data quickly made a tree-based approach more appropriate. The relationships between provider type, HCPCS code, place of service, patient risk mix, service volume, geography, prior history, and expected standardized amount were nonlinear and highly interactive.

    I used XGBoost regression because it could capture those nonlinear relationships while still fitting naturally into a reproducible scikit-learn-style pipeline with preprocessing, cross-validation, sample weighting, and artifact persistence.

    Across the XGBoost model packages, I varied several design choices:

    • whether lag features were included,
    • whether top-tail rows received higher sample weight,
    • whether the model predicted the expected amount directly or predicted a delta relative to lag,
    • and whether hyperparameters were selected with grid search or randomized search.

    All model packages were evaluated in a provider-disjoint new-provider framework. I used GroupShuffleSplit for the train/test split and GroupKFold for cross-validation so the same Rndrng_NPI could not appear in both training and evaluation.

    Direct level prediction

    The first XGBoost family predicted the expected standardized Medicare amount directly:

    y=avg_mdcr_stdzd_amty = \text{avg\_mdcr\_stdzd\_amt}

    These models asked the algorithm to learn the expected dollar-level amount from the available row features. I tested both with-lag and no-lag variants.

    The no-lag direct models were especially important because they represented the cold-start path. If a provider-service-place row had no valid prior-year history, the engine still needed a defensible way to estimate expected cost.

    This is where Model D became important. It did not use lag features, but it used service code, place of service, provider type, geography, case-mix proxies, and volume features to make a direct expected-cost prediction for true cold-start rows.

    Delta-target modeling

    For lag-available rows, I tested a second XGBoost family that reframed the prediction task. Instead of predicting the raw expected amount directly, the model predicted the log-relative change from the prior-year amount:

    ytarget=log(1+Current Amount)log(1+Lagged Amount)y_{target} = \log(1 + \text{Current Amount}) – \log(1 + \text{Lagged Amount})

    This changed the question from ‘What is the expected absolute amount?‘ to ‘How much should this row move relative to its own recent history?

    That formulation helped because it anchored prediction to a meaningful prior value. The model no longer had to relearn the entire static cost structure of the Medicare system from scratch. Instead, it could focus on explaining deviations from prior behavior using current row context.

    For lag-available rows, this was a better formulation than naive level prediction. It stabilized the right-skewed target and improved tail calibration, which mattered because downstream anomaly surfacing depended heavily on the model’s behavior in the upper tail.

    Model G emerged from this family. It used lag features, top-tail sample weighting, and a delta target. Its final dollar-level prediction was reconstructed by applying the predicted log-delta back to the prior-year lag amount.

    Table-3: XGBoost regressor model summary
    LabelTypeDescriptionR2 trainR2 testMAERMSER2 test weightedMAE weightedRMSE weightedTail pred / yTail
    Model AModelwt1 with lags, level, grid0.7150320.76897912.333052100.2442150.76897912.333052100.2442150.810185-321.107964
    Model BModelwt1 without lags, level, grid0.7156380.76092914.993705101.9759450.76092914.993705101.9759450.753707-416.650689
    Model CModelwt10 with lags, level, grid0.7597820.80476017.68895292.1548140.79816141.802460252.0799670.868736-222.057183
    Model DModelwt10 without lags, level, grid0.8830250.88689913.86793070.1401760.92030330.880678158.4005880.922783-130.626763
    Model EModelwt10 with lags, level, rs500.6695140.72595821.165937109.1798820.70824050.903769303.0736710.788339-358.064019
    Model FModelwt10 without lags, level, rs500.7488830.78928519.29445395.7373800.80422743.422870248.2627410.864591-229.068779
    Model GModelwt10 with lags, delta, grid0.9955630.9334163.49394043.0064210.9750156.75206667.7238530.995876-5.894139
    Model HModelwt10 with lags, delta, rs500.9868650.9302693.94118544.0108770.9729977.39314470.4052450.992898-10.149489
    Model IModelwt10 without lags, delta, rs500.9881220.9040794.63264351.6184340.9702158.52729573.9424160.999560-0.628872

    The model comparison made the final architecture clearer. No single standalone model solved every problem. The best lag-informed delta model was excellent where history existed, while the best no-lag direct model provided the cold-start path needed for full-universe scoring. That is what led to the final D/G routed architecture.

    Validation stability and why slightly higher test R² was not a red flag

    You might notice something odd: a few packages scored slightly higher on holdout R² than on training R². In a provider-disjoint setup, that can happen when the holdout provider mix is simply easier. Some held-out providers may have more stable HCPCS patterns, less extreme tail behavior, stronger representation in common service-code contexts, or fewer sparse slice combinations.

    Because the packages were evaluated in a provider-disjoint new-provider framework, the same NPI could not appear in both training and holdout data. That means the holdout set was not just a random sample of rows. It was a distinct provider subset. So I treated the pattern as a split-difficulty question rather than a leakage conclusion, then checked it against provider-disjoint cross-validation fold performance.

    The result was reassuring. For packages A, C, and E, the holdout test R² fell within the observed cross-validation fold range. For B, D, and F, the holdout score was only modestly above the best validation fold, which is more consistent with a somewhat easier holdout provider mix than with leakage.

    This strengthened the credibility of the package-level comparison. I did not rely on one split and assume the scores were stable. I explicitly checked whether holdout behavior was consistent with grouped validation behavior, which made the later selection of Models D and G more defensible.

    Table-4: Test vs CV R² summary
    PackageCV mean  CV minCV maxHoldout test R²Within CV fold range?Interpretation
    A0.7287820.6011760.8614770.768979YesConsistent with ordinary grouped split variation
    B0.6879190.6500200.7449880.760929NoHoldout likely slightly easier than average CV fold
    C0.7452220.6589630.8241300.804760YesConsistent with ordinary grouped split variation
    D0.8511900.8160090.8744990.886899NoHoldout likely slightly easier than average CV fold
    E0.6876020.5646340.8144120.725958YesConsistent with ordinary grouped split variation
    F0.7195880.6853130.7611110.789285NoHoldout likely slightly easier than average CV fold

    The final winning architecture

    The final winner was not a single standalone model. It was a routed architecture designed to solve two prediction problems inside the same scoring engine.

    The turning point was realizing that the best model on the lag-available subset could not be the whole system. A payer workflow still needs to score true cold-start rows, where no prior-year anchor exists. That is where the final D/G strategy emerged:

    • Model D for true cold-start rows through direct expected-cost prediction without lag features.
    • Model G for lag-available rows through lag-informed delta-target prediction, followed by reconstruction back to the expected dollar level.

    This routing gave the engine something a pure lag model could not: full-universe operational coverage. Hot-start and cold-start rows were not just different slices of the same problem. They had different information depth, so they needed different prediction strategies.

    This also changed the evaluation question. The individual D and G packages were developed and stress-tested in a provider-disjoint new-provider framework using GroupShuffleSplit holdouts and GroupKFold cross-validation. The final D/G system, however, was evaluated as a routed operational engine on the full evaluation universe. At that point, I was no longer asking only, “Which standalone model wins?” I was asking, “How well does the complete scoring engine behave once every row is routed to the appropriate prediction path?

    Table-5: D/G routing architecture
    ScenarioTrigger ConditionAssigned RouteModel DeployedPrediction Methodology
    Hot StartPrevious year cost data (Lag) is present and ≥ 0.hot_start_model_GModel G (_wt10_with_lags_delta_trgt)Predicts the log delta (change in cost). The final expected cost is reconstructed by applying this predicted change to the actual historical lag.
    Cold StartNo valid historical cost data exists.cold_start_model_DModel D (_wt10_without_lags)Predicts the level cost directly without utilizing lag features. Output is hard-clipped at 0 to prevent negative cost predictions.

    Leakage audit suite for the winning models D and G

    Because this project feeds downstream anomaly worklists, I did not want to rely on strong metrics alone. I wanted to stress-test the winning component models against the most plausible leakage pathways and confirm that their performance patterns still made sense under grouped validation and negative-control checks.

    Importantly, the leakage audits were run on the winning standalone packages Models D and G, because those were the learned components inside the final routed engine. The final D/G routing layer itself was deterministic and operationally transparent, but the learned parts of the system still needed to be audited carefully.

    For Model D, the leakage audit focused on classic new-provider leakage risks. I verified that the train/test split was fully provider-disjoint, that cross-validation was also provider-disjoint, that Model D excluded lag features and obvious target-adjacent payment features, and that preprocessing was fit using training-only information rather than full-dataset statistics. I also ran stronger sanity checks. Removing HCPCS_Cd materially degraded performance, and shuffling the training target caused model performance to collapse. That pattern supports the conclusion that Model D was learning real service-level structure rather than benefiting from hidden leakage.

    For Model G, the audit had to be stricter because it uses historical lag information by design. I again confirmed fully provider-disjoint train/test and cross-validation splits. Then I audited lag lineage by recomputing the lag feature from the full original EDA source and comparing it row by row to the persisted modeling feature. I also recomputed the delta target itself and verified that it matched the modeled target construction exactly. On top of that, I ran negative-control checks that should fail if the model is learning real signal. When I shuffled the delta target, delta-space performance collapsed. When I intentionally broke lag alignment, performance deteriorated materially.

    The leakage audit did more than check for contamination. It tested whether the observed performance patterns were logically consistent with genuine predictive structure. Lag-only was already a strong baseline, but Model G improved materially on tail calibration, tail bias, and tail error behavior, which is the pattern I would expect from real value-add rather than a shortcut created by leakage. Taken together, these checks support a more defensible conclusion: the D/G architecture performed well for structural reasons that survived provider-disjoint evaluation, target-lineage checks, preprocessing inspection, and negative-control stress tests.

    Table-6: Leakage audit summary
    ModelCheckResultInterpretation
    Model DTrain/test provider overlap0 overlapping providersPASS. GroupShuffleSplit train/test split is fully provider-disjoint.
    Model DCV provider overlap0 overlapping providers in every foldPASS. GroupKFold validation is fully provider-disjoint.
    Model DFeature auditNo lag features, no payment-side target-adjacent variablesPASS. Direct-cost model excludes the obvious leakage-prone feature families.
    Model DPreprocessing fit scopePackage-aligned imputer statistics match training dataPASS. Numeric and categorical preprocessing were fit on training data only.
    Model DHCPCS ablationR2 test dropped from 0.751685 to 0.584979Removing HCPCS_Cd from features leads to strong decline in model performance. 
    Model DNegative control with shuffled y_trainR2 test collapsed from 0.751685 to -0.054324Very strong anti-leakage evidence. Destroying target structure collapses predictive power.
    Model GTrain/test provider overlap0 overlapping providersPASS. GroupShuffleSplit train/test split is fully provider-disjoint.
    Model GCV provider overlap0 overlapping providers in every foldPASS. GroupKFold validation is fully provider-disjoint.
    Model GLag lineage check748,680 / 748,680 rows matched, 100.0%PASS. lag1_avg_amt exactly matches prior-row values recomputed from full original EDA source.
    Model GDelta target construction748,680 / 748,680 rows matched, 100.0%PASS. log_delta_cost is constructed exactly as log1p(current) - log1p(lag).
    Model GFeature auditIntentional lag features only, no payment-side target-adjacent variablesPASS. Uses historical lag by design, not current-row target proxies.
    Model GLag-only baselineR2 test = 0.857768, RMSE = 62.856022Strong baseline. Confirms real historical structure already exists in the problem.
    Model GModel G replicaR2 test level = 0.933416, RMSE level = 43.006421, R2 test delta = 0.565856Substantial gain beyond lag-only baseline.
    Model GNegative control, shuffled y_train_deltaR2 test collapsed from 0.565856 to -0.004571Very strong anti-leakage evidence. Destroying target structure collapses delta predictive power.
    Model GTail calibration versus lag-onlyTail ratio improved 1.089586 -> 0.995876, tail RMSE improved 402.46 -> 193.25Model G materially improves tail calibration and tail error behavior.
    Model GWrong-lag sanity testR2 test level dropped 0.933416 -> 0.896634, RMSE worsened 43.01 -> 53.58Strong evidence that the model relies on real lag alignment rather than leakage.

    Post-model failure analysis, guardrails and scoring governance

    Statistical excellence is the starting point, not the finish line. Even high-performing models can produce “pathological” signals in sparse or highly volatile slices. To move from a research-grade model to a production engine, I implemented a governance layer of auditable guardrails. These act as a conservative safety net for policy-sensitive edge cases, ensuring the system remains stable and defensible when scores propagate to actual investigation worklists.

    This was where the project matured from a good model into a more usable engine.

    Identifying modes of failure

    Table-7: Key thresholds 
    MetricTreshold Value
    abs_resid_q99108.230
    oe_q991.777
    resid_pos_q9988.954
    resid_neg_q01-35.166
    obs_q99918.900
    exp_q991009.589
    oe_underprediction_cut3.000
    oe_overprediction_cut0.333

    Using key thresholds to define failure flags

    Table-8: Statistical extremes (pure math triggers)
    Flag NameTrigger ConditionAnalytical Focus
    cat_abs_residual_q99abs_residual ≥ 99th percentileThe top 1% largest raw dollar errors (misses in either direction).
    cat_oe_q99oe_ratio ≥ 99th percentileThe top 1% largest relative mark-ups (Observed was massively higher than Expected).
    cat_large_positive_residualresidual ≥ 99th percentileThe top 1% largest raw dollar underpredictions (Model predicted too low).
    cat_large_negative_residualresidual ≤ 1st percentileThe top 1% largest raw dollar overpredictions (Model predicted too high).
    Table-9: Contextual & Pipeline failures
    Flag NameTrigger ConditionAnalytical Focus
    cat_hot_failurehas_lagTrue AND abs_residual ≥ 99th percentile“Model G Miss:” The provider had historical data, but the model still suffered a massive top-1% error.
    cat_cold_failurehas_lagFalse AND abs_residual ≥ 99th percentile“Model D Miss:” The provider had no history, and the cold-start prediction suffered a massive top-1% error.
    cat_cold_low_support_failureCold Row AND Support Tier is medium or low AND abs_residual ≥ 99th percentile“The Context Void:” A massive error on a new provider where we also lacked tight peer group data (relied on broad state/national averages).
    cat_large_error_high_supportSupport Tier is high or medium_high AND abs_residual ≥ 99th percentile“The Confident Miss:” A massive error despite having strong historical or tight peer-group data to anchor the prediction.
    Table-10: Operational extremes and narrative flags
    Flag NameTrigger ConditionAnalytical Focus
    cat_large_error_tail_rowRow is in the true top 1% of all costs AND abs_residual ≥ 99th percentile“The Whale Miss:” Massive prediction errors specifically occurring on the absolute most expensive claims in the system.
    cat_underpredictionobserved_cost ≥ 99th percentile AND oe_ratio ≥ 3.0“The Sneak Attack:” The final cost was astronomically high, but the model expected them to cost less than a third of that amount.
    cat_overpredictionexpected_cost ≥ 99th percentile AND oe_ratio ≤ 0.333“The False Alarm:” The model predicted a massive top-1% cost, but the actual cost came in at less than a third of expectations.
    cat_high_conf_anomalyhigh_confidence_anomaly_candidateTrue“The Investigation Target:” Inherits the primary anomaly flag (High Support + High Pos. Residual + High O/E).
    Table-11: The master filter
    Flag NameTrigger ConditionAnalytical Focus
    is_any_catastrophicIf ANY of the 12 flags above are TrueThe master Boolean used to filter the dataset down to only the problematic rows for failure analysis.

    Failure analysis summary

    Table-12: Failure bucket counts 
    BucketRow CountShare of All Rows
    is_any_catastrophic321860.026594
    cat_abs_residual_q99121030.010000
    cat_oe_q99121030.010000
    cat_large_positive_residual121030.010000
    cat_large_negative_residual121030.010000
    cat_large_error_high_support119950.009911
    cat_cold_failure110250.009109
    cat_high_conf_anomaly40120.003315
    cat_large_error_tail_row39750.003284
    cat_hot_failure10780.000891
    cat_overprediction7110.000587
    cat_cold_low_support_failure1080.000089
    cat_underprediction480.000040
    Table-13: Failure route bucket counts
    RouteCold StartHot Start
    cat_abs_residual_q99110251078
    cat_oe_q99100152088
    cat_large_positive_residual11316787
    cat_large_negative_residual91772926
    cat_large_error_high_support109171078
    cat_large_error_tail_row3555420
    cat_underprediction408
    cat_overprediction69615
    cat_high_conf_anomaly3860152
    cat_cold_low_support_failure1080
    cat_hot_failure01078
    cat_cold_failure110250

    Why guardrails were necessary

    Even strong models can fail in narrow but important ways:

    • under-supported cold-start slices,
    • code-specific uplift phenomena,
    • extreme observed-over-expected spikes,
    • or small, policy-sensitive regions where model error can explode.

    Rather than expecting the model to solve every edge case alone, I treated guardrails as a narrow, auditable governance layer.

    V1, V2, and V3

    The guardrail workflow evolved across multiple versions:

    • V1 introduced bounded corrections for specific underprediction problems.
    • V2 expanded targeted interventions for identifiable slice-level pathologies.
    • V3 consolidated the final policy set into a governed, frozen scoring engine.

    These targeted interventions included:

    • Global floor logic for fragile cold-start contexts.
    • Code-specific residual caps to contain extreme prediction errors.
    • Multiplicative uplift corrections where model underestimation was systematic and slice-specific.
    • Family-level adjustments (e.g., radiation planning cluster).
    • Bounded-by-observed logic to prevent the creation of mathematically impossible negative-residual behavior.

    One of the clearest examples was the C4a J2469 uplift fix, where a tightly scoped post-model correction reduced pathological observed-over-expected inflation in a known failure slice.

    Table-14: Final V3 guardrail component summary
    Policy NameTarget Scope (The “Who”)Mathematical Trigger (The “When”)The Fix / Math (The “What”)
    A1c (V0): Global Cold-Start FloorAll 2023 cold-start rows with no historical lag, especially fragile 2023-only code contexts.Fires when expected_cost falls below a tested low trigger, such as < 0.1, < 0.5, or < 1.0.Bounded floor override: raise expected cost toward a historical cold-start floor, using
    expected_guard=min(observed_cost,max(expected_cost,floor_prior))expected\_guard = \min(observed\_cost, \max(expected\_cost, floor\_prior))
    B2.2b: Top-Burden Dynamic Floor2023 cold-start rows for Q5126 and Q5127, the highest-burden volatile 2023-only codes.Fires when oe_ratiooe_q99, meaning the row is in the extreme O/E tail.Dynamic escape floor: compute the exact minimum expected cost needed to escape multiple catastrophic thresholds at once, then raise prediction to that level, bounded by observed cost.
    B2.2c: Q5127 Residual Cap2023 cold-start rows for Q5127 only that are still catastrophic after B2.2b.Fires only when the row remains flagged for cat_large_positive_residual after B2.2b.Residual cap: set the floor needed to keep residual just below the positive residual tail threshold, effectively,
    need_for_resid_cap=observed_costR_CAPneed\_for\_resid\_cap = observed\_cost – R\_CAP
    then expected_guard=min(observed_cost,max(current_expected,need_for_resid_cap))expected\_guard = \min(observed\_cost, \max(current\_expected, need\_for\_resid\_cap))
    C4a: J2469 Shock Fix (Multiplicative Uplift)Year 2020, J2469, medium_high tier, cold-start rows.Fires when oe_ratiooe_q99 inside that narrow slice.Multiplicative uplift: multiply the original expected cost by a fixed historical correction factor, about 2.010610, then cap at observed cost: expected_guard=min(observed_cost,expected_costuplift_factor)expected\_guard = \min(observed\_cost, expected\_cost * uplift\_factor)
    C4b: J2505 Volatility Clamp (Two-Sided Clamp)Years 2020 and 2021, J2505, medium_high tier, cold-start rows.Fires when the row hits any severe tail condition, including top 1% abs residual, top 1% positive residual, or extreme negative residual / overprediction behavior.Two-sided clamp: expected_guard=min(observed_cost,max(expected_cost,observed_costcap))expected\_guard = \min(observed\_cost, \max(expected\_cost, observed\_cost – cap)) This pulls severe underpredictions up and severe overpredictions down.
    C5.v2: Radiation Planning Cluster Residual CapCold-start rows for radiation planning family codes 77280, 77290, 77295, 77301, 77338, restricted to stable_all_years + medium_high tier.Fires when the row fails any of 5 severe metrics: abs residual tail, positive residual tail, negative residual tail, O/E tail, or high-confidence anomaly candidate.Family-level two-sided residual clamp: expected_guard=min(observed_cost,max(expected_cost,observed_costcap))expected\_guard = \min(observed\_cost, \max(expected\_cost, observed\_cost – cap))

    Note: All guardrails exercise surgical control and safety by masking (target isolation) and forced ceiling (min(obs,max(exp,raised))\min(obs, \max(exp, raised)))

    Before-vs-after guardrail diagnostic for J2469: C.4.a guardrail policy effectiveness

    Table-15: J2469 failure bucket mix by year (only buckets > 0%)
    YearHCPCS CodeBucketRate (%)
    2020J2469cat_oe_q9987.916344
    2021J2469cat_oe_q991.145038
    Table-16: Policy C4a results (J2469 2020 within medium_high cold_start cold)
    MetricValue
    n_rows2582
    baseline_cat_rate_%87.916344
    guard_cat_rate_%0.0
    delta_pp-87.916344
    new_cat_%0.0

    Final V3 guardrail diagnostics summary 

    Table-17: Overall summary
    Modelis_any_catastrophic Rate (%)Delta (% Change)
    D/G2.6593960.000000
    D/G+V32.002685-0.65671
    Table-18: Overall failure bucket snapshot (%, baseline vs V3) | ALL rows
    BucketRow CountBaseline Rate (%)V3 Rate (%)Delta (% Change)
    is_any_catastrophic12102752.6593962.002685-0.656710
    cat_large_positive_residual12102751.0000210.443370-0.556650
    cat_abs_residual_q9912102751.0000210.468654-0.531367
    cat_cold_failure12102750.9109500.379583-0.531367
    cat_large_error_high_support12102750.9910970.461961-0.529136
    cat_oe_q9912102751.0000210.624569-0.375452
    cat_large_error_tail_row12102750.3284380.109066-0.219372
    cat_large_negative_residual12102751.0000210.906736-0.093285
    cat_overprediction12102750.0587470.003222-0.055525
    cat_cold_low_support_failure12102750.0089240.006693-0.002231
    cat_underprediction12102750.0039660.0039660.000000
    cat_high_conf_anomaly12102750.3314950.3314950.000000
    cat_hot_failure12102750.0890710.0890710.000000
    Table-19: Master failure flag (is_any_catastrophic) delta (pp) | ALL years
    YearHCPCS Stability GroupRow CountBaseline Rate (%)V3 Rate (%)Delta (% Change)
    2020other463837.8180257.869771-29.948254
    2020stable_all_years3066135.5744543.820777-1.753676
    2021other546715.74904012.895555-2.853485
    2021stable_all_years3016291.2747451.126218-0.148527
    2022other56095.1702625.1702620.000000
    2022stable_all_years2943101.1270431.022731-0.104312
    20232023_only107617.10037211.059480-6.040892
    2023other36095.3754505.3754500.000000
    2023stable_all_years2873241.6180341.546338-0.071696

    The key principle throughout was conservatism:

    • guardrails were narrow in scope,
    • explicit in policy,
    • auditable,
    • and frozen as artifacts.

    That made the final engine much more defensible than simply trusting raw model output.

    Why I chose D/G + V3 over simpler baselines

    The baseline suite was intentionally strong. I did not want the final system to win simply because it was more complex. In fact, one of the clearest findings from the baseline work was that simple historical carry-forward plus hierarchical backoff could go surprisingly far.

    That made the final design choice more demanding, not less. The D/G + V3 architecture had to justify itself by doing more than fitting well in aggregate. It had to provide full-universe coverage, handle cold-start rows realistically, improve tail behavior where anomaly surfacing is most sensitive, and support downstream worklists without obvious pathological slices.

    The final D/G + V3 architecture met those requirements by combining three layers:

    • Model G for lag-available rows, where prior history was real and highly informative.
    • Model D for true cold-start rows, where lag was unavailable and direct expected-cost prediction was needed.
    • V3 guardrails for narrow, auditable corrections to known failure modes.

    The deciding factors were not just headline R² or RMSE. They were tail-weighted error behavior, bias control, cold-start coverage, support-aware anomaly surfacing, and the ability to produce governed worklists that a stakeholder could actually investigate.

    Table-20: D/G operational performance summary
    LabelTypeDescriptionR2 testMAERMSER2 test weightedMAE weightedRMSE weightedTail pred/yTail bias
    D/GEngine2-stage final engine0.9303278.15687460.3850030.94056219.510174152.5446670.935782-109.17795
    figure-18: Trimmed residual distribution
    figure-19: residual variance by lag presence
    figure-20: log observed to expected cost distribution

    From predictions to investigation worklists

    Once the scoring engine was stable, the next challenge was turning predictions into worklists that a stakeholder could actually use. A payer does not act on RMSE alone. A payer acts on prioritized review candidates.

    To ensure the engine provides actionable intelligence, I designed a dual-watchlist structure for distinct operational personas: Triage Analysts utilize row-level worklists (ANOM.2.a.1) to prioritize immediate, high-confidence review candidates. Program Integrity Leads utilize provider-level watchlists (ANOM.3.a.1) to monitor systemic, repeated over-expected behavior across multiple years and codes.

    Row-level products (For immediate triage):

    1. ANOM.2.a.1. Robust + size-aware row worklist

    This was the primary row-level product.

    It asks: ‘Which rows are unusually over-expected relative to comparable peers, while also clearing support-aware confidence filters?

    This product relied on:

    • row-level observed-over-expected behavior,
    • within-slice percentile ranking,
    • high-confidence gating,
    • and slice support awareness.

    The goal was to create a defensible worklist, not a sensational one.

    1. ANOM.2.b.1. Magnitude + size-aware row worklist

    This alternative row-level product emphasized severity. It answered a different operational question: ‘Which rows reflect the largest over-expected shocks, even if they are not the best repeat-offender examples?

    Provider-level products

    1. ANOM.3.a.1. Repeat-offender provider watchlist

    This provider-level product rolls up robust row-level anomaly events to the provider profile grain: (Rndrng_NPI, provider_type, state)

    It effectively asks: ‘Which provider identities repeatedly show up as robust row-level tail events across codes and years?’ This is the “who consistently pops?” view.

    1. ANOM.3.b.1. Shock provider watchlist

    This alternative provider-level product emphasizes magnitude-driven shock behavior rather than repeat frequency alone.

    That makes it useful for identifying providers whose anomaly burden may be concentrated in particularly severe events.

    Why I separated the anomaly products

    I did not want one ranking to carry too many jobs at once. A row-level anomaly and a provider-level pattern are useful for different reasons. A single extreme row might deserve immediate review, while a provider who appears repeatedly across multiple codes or years may deserve broader monitoring.

    That is why I split the outputs into four products. The row-level lists help identify specific provider-service-year observations that look unusually over-expected. The provider-level lists ask whether those signals accumulate into a repeated pattern for a provider profile.

    I also separated robust signal from shock severity. Some findings are interesting because they are consistent and well-supported. Others are interesting because the magnitude is unusually large. Keeping those lenses separate made the worklists easier to interpret and reduced the risk of flattening different kinds of evidence into one overloaded score.

    As throughout the project, these products are not fraud labels. They are structured review queues designed to help an analyst decide where to look next.

    Table-21: Row-level anomaly surfacing approach
    VariantRow Extremeness DescriptionScore TypeDirectional filterDenominator HygieneSlice-size Awareness
    ANOM.2.aSlice-relative tail within slice = [HCPCS_Cd, Year]: uses log_oe_pct_in_slice (pct-rank of log_oe) plus resid_pct_in_slicePercentile-weighted (0.5log_oe_pct+0.4resid_pct+0.1is_high_conf0.5 * log\_oe\_pct + 0.4 * resid\_pct + 0.1 * is\_high\_conf)log_oe > 0 AND residual > 0Optional expected_cost >= MIN_EXPECTEDNo
    ANOM.2.a.1 Slice-relative tail + slice validity:log_oe_pct_in_slice + resid_pct_in_slice, both within [HCPCS_Cd, Year], but only for sufficiently large [HCPCS_Cd, Year] slicesPercentile-weighted (0.5log_oe_pct+0.4resid_pct+0.1is_high_conf0.5 * log\_oe\_pct + 0.4 * resid\_pct + 0.1 * is\_high\_conf)log_oe > 0 AND residual > 0Optional expected_cost >= MIN_EXPECTED✅ slice_n >= 50
    ANOM.2.bSlice-relative tail + magnitude severity: keeps log_oe_pct_in_slice + resid_pct_in_slice (within [HCPCS_Cd, Year]) and adds a capped magnitude term log_mag = clip(log_oe, 0, LOG_CLIP_MAX) / LOG_CLIP_MAXMagnitude-aware(0.45log_oe_pct+0.35resid_pct+0.1log_mag+0.1is_high_conf0.45 * log\_oe\_pct + 0.35 * resid\_pct + 0.1 * log\_mag + 0.1 * is\_high\_conf)log_oe > 0 AND residual > 0Optional expected_cost >= MIN_EXPECTEDNo
    ANOM.2.b.1 Slice-relative tail + magnitude severity + slice validity: log_oe_pct_in_slice + resid_pct_in_slice (within [HCPCS_Cd, Year]) and adds a capped magnitude term log_mag = clip(log_oe, 0, LOG_CLIP_MAX) / LOG_CLIP_MAX, but only for sufficiently large [HCPCS_Cd, Year] slicesMagnitude-aware(0.45log_oe_pct+0.35resid_pct+0.1log_mag+0.1is_high_conf0.45 * log\_oe\_pct + 0.35 * resid\_pct + 0.1 * log\_mag + 0.1 * is\_high\_conf)log_oe > 0 AND residual > 0Optional expected_cost >= MIN_EXPECTED✅ slice_n >= 50
    Table-22: Provider-level anomaly surfacing approach
    VariantPeer Extremeness DefinitionScore TypeDirectional FilterConfidence RuleDenominator HygieneSlice-size Awareness
    ANOM.3.aoe_pct_in_slice >= 0.99Count + breadth: n_anom_rows+0.25n_unique_codes+0.25n_unique_yearsn\_anom\_rows + 0.25 * n\_unique\_codes + 0.25 * n\_unique\_yearslog_oe > 0 AND residual > 0is_high_conf OR legacy fast-trackOptional expected_cost >= MIN_EXPECTEDNo
    ANOM.3.a.1 oe_pct_in_slice >= 0.99, but only for sufficiently large [HCPCS_Cd, Year] slicesCount + breadth: n_anom_rows+0.25n_unique_codes+0.25n_unique_yearsn\_anom\_rows + 0.25 * n\_unique\_codes + 0.25 * n\_unique\_yearslog_oe > 0 AND residual > 0is_high_conf OR legacy fast-trackOptional expected_cost >= MIN_EXPECTED✅ slice_n >= 50
    ANOM.3.boe_pct_in_slice >= 0.99 + log_oe >= q(.995)Count + breadth + severity bump: n_anom_rows+0.25n_unique_codes+0.25n_unique_years+0.10p95_log_oe_magn\_anom\_rows + 0.25 * n\_unique\_codes + 0.25 * n\_unique\_years + 0.10 * p95\_log\_oe\_maglog_oe > 0 AND residual > 0is_high_conf OR legacy fast-trackOptional expected_cost >= MIN_EXPECTEDNo
    ANOM.3.b.1 log_oe_pct_in_slice >= 0.99log_oe >= q(.995), but only for sufficiently large [HCPCS_Cd, Year] slicesCount + breadth + severity bump: n_anom_rows+0.25n_unique_codes+0.25n_unique_years+0.10p95_log_oe_magn\_anom\_rows + 0.25 * n\_unique\_codes + 0.25 * n\_unique\_years + 0.10 * p95\_log\_oe\_maglog_oe > 0 AND residual > 0is_high_conf OR legacy fast-trackOptional expected_cost >= MIN_EXPECTED✅ slice_n >= 50

    Provider tiering through clustering

    Once the provider-level anomaly summaries were built, the next step was to move from ranking to segmentation.

    The question became: ‘Can we group providers into interpretable anomaly-burden profiles?

    Why cluster at all?

    A ranked list is useful, but it flattens everything into one dimension. Clustering lets us ask whether providers naturally separate into broader behavioral types.

    To achieve this, I built a provider feature layer utilizing:

    • Robust anomaly burden and breadth (tracking consistency across codes and years)
    • Shock severity metrics (capturing the magnitude of the anomalies)
    • Confidence and support features (ensuring statistical backing)
    • Service scale (contextualizing the volume)

    Important design choice: Not clustering by sheer size

    One of the most important clustering choices was to prevent Euclidean distance from being dominated by sheer provider size (i.e., the service volume). A naive clustering setup would simply separate large providers from small providers, which is not what I wanted.

    So, the feature engineering and scaling choices were designed to reduce clustering by size alone and instead emphasize behavioral burden.

    I also gated eligibility for clustering using minimum evidence depth:

    • enough provider rows,
    • enough services,
    • enough beneficiaries,
    • and a defensible feature set after winsorization and robust scaling.

    Final cluster interpretation

    The clustering yielded a compact segmentation that was easy to interpret:

    • cluster_0: typical cost behavior, no robust tail events
    • cluster_1: elevated anomaly burden, robust tail events present
    • ineligible: insufficient evidence for clustering
    Table-23: Cluster sizes (eligible only):
    Cluster IDCount
    05654
    11834
    no_cluster (ineligible)15878
    clustered7488
    Table-24: Distance-to-centroid (smaller = more typical within cluster):
    MetricValue
    count7488.000000
    mean1.051452
    std0.800772
    min0.019477
    50%0.872612
    90%1.881404
    95%2.396743
    99%3.758979
    max16.778678
    Figure-21: ECDF robust extreme-event counts by cluster
    Figure-22: Robust anomaly distribution rate by cluster
    Table-25: Cluster definitions and key feature contrasts
    cluster_labeln_providersmedian_n_anom_rows_robustmedian_anom_rate_pct_robustmedian_p90_log_oepct_with_any_shock_eventmedian_total_services_robcluster_definition
    cluster_056540.00.00.0870.0016371.0Typical cost behavior (no robust tail events)
    cluster_118341.01.00.1046.92154184.5Elevated anomaly burden (robust tail events present)

    What distance to centroid means

    For each provider in a cluster, distance to centroid represents how far that provider sits from the typical provider profile of that cluster in the transformed clustering feature space.

    That matters because providers deep inside cluster_1 are not just assigned there. They are strong representatives of the anomaly-burdened profile.

    How to read these cluster separation plots

    The ECDF plot shows that cluster_0 was essentially clean by construction: providers in this group had 0 robust anomaly rows. In other words, 100% of cluster_0 providers were at zero robust extreme events.

    Cluster_1 showed a very different pattern. This group represented providers with recurring robust anomaly burden. Roughly 70% of cluster_1 providers had at least 1 robust anomaly row, and roughly 90% had 2 or fewer robust anomaly rows. A smaller subset had higher repeated-event counts, extending out toward 10 robust anomaly rows.

    The anomaly-rate boxplot tells the same story from a rate perspective rather than a count perspective. For cluster_0, the robust anomaly rate was 0%, reflecting no robust tail events among eligible clustered providers. For cluster_1, the robust anomaly rate was meaningfully above zero. Most providers in this group had anomaly rates around roughly 0.5% to 1.6%, with the upper range extending above 3%. In practical terms, some cluster_1 providers had more than 3% of their provider-service rows showing robust over-expected anomaly patterns.

    Taken together, the two plots show that the clustering was not simply separating large providers from small providers. It was separating providers with no robust anomaly burden from providers with measurable recurring anomaly burden, using both event counts and event rates.

    Explaining the tiers with an interpretable classifier

    I did not want the clustering layer to remain a black box. After assigning the labels, I built a logistic regression explainer model designed purely for interpretation rather than production prediction. Its specific goal was to answer: What provider characteristics are strongly associated with membership in the anomaly-burdened cluster? This separation of concerns is vital—the unsupervised clustering created the segmentation, while the supervised logistic model explained it.

    I used a deliberately interpretable setup:

    • logistic regression,
    • selected provider-level covariates,
    • explicit reference categories,
    • and odds-ratio interpretation.

    I also checked coefficient stability using resampling logic rather than taking a single coefficient table at face value.

    Table-26: Logistic explainer performance summary
    ClassPrecisionRecallf1-score
    cluster_00.8610.6210.722
    cluster_10.3720.6910.484
    Figure-23: Confusion matrix
    Figure-24: ROC curve
    Figure-25: PR curve
    Table-27: Top positive drivers (higher => more likely cluster_1)
    FeatureCoeff (Log Odds)Odds Ratio
    log_total_services_base0.7464242.109444
    ruca_bucket_Rural0.1667731.181487
    p_diabetes0.1393601.149538
    ruca_bucket_Suburban0.1318401.140926
    provider_type_Medical Oncology0.1145111.121325
    p_htn0.0180321.018196
    p_copd-0.1132100.892963
    bene_avg_risk_score-0.1149050.891451
    p_ckd-0.2072710.812799
    p_cancer6-0.2074490.812655
    years_since_enumeration-0.2258050.797874
    provider_type_Radiation Oncology-0.2904320.747940
    provider_type_Gynecological Oncology-0.7756510.460404
    provider_type_Surgical Oncology-1.5642880.209237
    Table-28: Top negative drivers (lower => less likely cluster_1)
    FeatureCoeff (Log Odds)Odds Ratio
    provider_type_Surgical Oncology-1.5642880.209237
    provider_type_Gynecological Oncology-0.7756510.460404
    provider_type_Radiation Oncology-0.2904320.747940
    years_since_enumeration-0.2258050.797874
    p_cancer6-0.2074490.812655
    p_ckd-0.2072710.812799
    bene_avg_risk_score-0.1149050.891451
    p_copd-0.1132100.892963
    p_htn0.0180321.018196
    provider_type_Medical Oncology0.1145111.121325
    ruca_bucket_Suburban0.1318401.140926
    p_diabetes0.1393601.149538
    ruca_bucket_Rural0.1667731.181487
    log_total_services_base0.7464242.109444

    Baselines | ruca_bucket: Urban; provider_type: Hematology-Oncology

    The result was not intended as a production classifier. Instead, it answered a narrower and more useful question: Which provider attributes and contextual signals appear directionally associated with the anomaly-burdened cluster?

    That added interpretability and made the clustering outputs easier to discuss with stakeholders.

    From notebook to product: The interactive Shiny dashboard

    Finally, I wrapped the project into an interactive Shiny for Python application.

    The app is not just a demo. It is the interface that turns a notebook-heavy workflow into something navigable, auditable, and operationally usable.

    What the app does

    The dashboard allows users to explore:

    • the benchmarked scored universe,
    • row-level anomalies,
    • provider-level watchlists,
    • provider tiers,
    • explainer outputs,
    • provider profiles,
    • and artifact provenance.
    Figure-26: Overview tab
    Figure-27: Anomaly surfacing tab
    Figure-28: Provider profile tab

    One thing I cared about was provenance. The app surfaces the specific run folders and artifacts being used, so the output remains tied to concrete frozen objects rather than hidden notebook state.

    That is a small detail, but it signals something important: this project was built with reproducibility and transparency in mind.

    Note: To view a specific provider’s anomalous rows, navigate to the Provider Profile tab, enter the exact provider NPI, click Refresh, and scroll to the Provider’s anomalous rows table. For example, the screenshot below shows the provider profile workflow for NPI 1346215746.

    Figure-29: Example view showing ‘Provider’s anomalous rows’ table

    What this project demonstrates

    This project reflects the kind of data science work I want to do professionally: work that combines statistical modeling, operational judgment, governance, and product thinking. It also shows that I can move from raw administrative healthcare data to a reproducible, explainable decision-support workflow.

    It demonstrates my ability to execute across the full data product lifecycle:

    • Data Engineering: Extracting and engineering modeling datasets from large, messy, raw administrative healthcare sources.
    • Machine Learning & Governance: Comparing models using business-relevant evaluation criteria, segmenting entities with unsupervised learning, and explaining that structure with interpretable supervised models.
    • Operational Analytics: Designing guarded scoring systems (rather than raw prediction-only pipelines) and building anomaly surfacing logic that is support-aware and cautious.
    • Productization: Freezing and versioning artifacts, and deploying the final engine into a working, interactive application.

    Just as importantly, it shows something about how I think:

    • I do not stop at “best model wins.”
    • I care about how model behavior propagates downstream.
    • I treat edge cases seriously.
    • I prefer transparent systems over opaque hero numbers.
    • I try to design workflows that other people can inspect, rerun, and use.

    Final thoughts

    The Medicare Provider Benchmarking Engine started as a modeling problem, but it became a broader exercise in building trustworthy analytical infrastructure for decision support.

    In real-world settings, the challenge is rarely just prediction. It is deciding how predictions should be translated into action, how much confidence we should place in them, where to be conservative, and how to make the full workflow interpretable.

    That is the standard I wanted this project to meet, and that is the kind of data science and analytical product work I will continue building.

    If you are working on healthcare fraud detection, provider benchmarking, or machine learning governance, I’d love to connect.

    References

    1. Improving Health Care Cost Projections for the Medicare Population: Summary of a Workshop.
    2. Estimation and Prediction of Hospitalization and Medical Care Costs Using Regression in Machine Learning

    CMS data

  • From Legacy to Pro: How I Rebuilt My Finance Analytics App to Scale with DuckDB, S3, and Python Shiny

    From Legacy to Pro: How I Rebuilt My Finance Analytics App to Scale with DuckDB, S3, and Python Shiny

    My original Python Shiny finance app worked well locally. But as soon as I tried to deploy it with a larger, more realistic dataset, it broke. The reason was simple: the app assumed the full dataset should live in memory. That assumption became the bottleneck for everything, from multi-year backtests to event studies.

    The first version was a classic working prototype. It loaded a Parquet dataset into pandas, did the heavy lifting in memory, and powered the UI from one large DataFrame across three workflows:

    • Portfolio Simulator for buy-and-hold backtests
    • Sector Explorer for equal-weight sector indices
    • Event Study Studio for market reaction around SEC filings and split events

    This post is the story of how I turned that prototype into a “pro” app that can handle more years of data, answer richer user questions (like seasonality), and run an event study over a much larger set of events. The core driver behind everything: solving the dataset size problem by switching from pandas-first to DuckDB + S3 + query-on-demand.

    🚀 Try the Portfolio, Sector & Event Lab Pro Now!

    Legacy vs Pro at a glance

    Legacy: pandas in memory → UI

    Pro: UI → DuckDB → S3 Parquet

    FeatureLegacy AppPro App
    BackendpandasDuckDB
    Main data accessLocal / lighter runtime assumptionsS3-hosted Parquet
    Dataset strategySmaller in-memory sliceThin app-optimized dataset
    History window1 year3 years
    Event study robustnessLowHigh
    ScalabilityModerateStrong
    Deployment resilienceFragile with larger dataRobust

    The legacy architecture (why it was great. And why it hit the wall)

    The legacy version was simple by design:

    • store a Parquet dataset locally in outputs/
    • load it into pandas at startup
    • compute helper columns and UI choices from the full DataFrame
    • slice and pivot in pandas for the portfolio and sector workflows
    • build event windows in Python for the event study

    That design was great for prototyping. It was easy to reason about, easy to debug, and fast enough on a local machine. But it had a hard ceiling: memory.

    Bigger data meant massive RAM usage, sluggish cold starts, and inevitable crashes. The practical outcome was painful: deploying with a realistic, multi-year dataset caused fatal memory bottlenecks, artificially limiting the scope of analysis I could offer users.

    That last point became the real product problem.

    The user question that made the limitation obvious

    One of the clearest signals that a system needs to evolve is when a user asks a reasonable question that it cannot answer.

    A user told me he had vested META shares and asked:

    “Can you show more years of returns so I can see if there’s a pattern? I want to time my sale.”

    That was really a seasonality question. He was not asking for a prediction model. He wanted a longer historical view so he could inspect whether certain periods tended to be stronger or weaker.

    With the legacy app, I could not support that reliably. Expanding the dataset enough to answer the question made the deployed app unstable. That was the moment the dataset-size issue stopped being a technical annoyance and became the main reason to redesign the architecture.

    What changed in the pro rebuild (the new principles)

    I set a few goals that directly addressed the memory and scale bottleneck:

    • stop loading the full dataset into pandas at startup
    • host the dataset remotely so deployment is not tied to local files
    • query only what the user asks for instead of “load everything”
    • keep more years of data so users can explore seasonality and long-term patterns
    • make the event study more statistically robust by pulling from more events over longer history

    These goals pushed the pro version toward DuckDB + S3 and away from a load-everything design. That shift did more than improve performance. It expanded the range of questions the app could answer.

    What the pro dataset includes

    The pro dataset keeps only the fields the app actually needs: date, ticker, close, logret, sector, filing_form, is_split_day, is_reverse_split_day, market_cap, tidx (per-ticker trading-day index for ±k windows), year.  

    It is stored as Parquet in Amazon S3 and queried at runtime through DuckDB. 

    Why S3 matters (beyond deployment)

    Hosting the dataset on S3 did more than “make deployment possible.”

    It also made the app architecture clean:

    • the app is UI + query logic
    • the dataset is a durable remote artifact
    • updating the dataset becomes a pipeline problem, not an app problem

    It also made the app configurable via environment variables:

    • AE_S3_URI points to the S3 Parquet file
    • if not set, the app uses a safe default S3 URI

    So local dev and deployment share the same code. Only configuration changes.

    The core mechanism: query-on-demand with DuckDB

    UI input → DuckDB query → S3 parquet → filtered result → Plotly chart

    Once DuckDB is connected, the app creates a view:

    CREATE OR REPLACE VIEW ae AS
    SELECT * FROM read_parquet('s3://...');

    From there, each workflow is powered by a small, targeted query:

    • Portfolio Simulator: selected tickers + date range
    • Sector Explorer: selected sectors + date range
    • Event Study: selected event types + date range + window ±k

    The pro pattern is simple:

    • user selections become SQL filters
    • DuckDB returns only the slice needed
    • pandas handles light transformations
    • Plotly renders the result

    That is what eliminated the memory crashes. The app no longer has to load the full dataset before it can do anything useful. In other words, the app no longer starts by loading everything. It starts by asking, “What does the user need right now?”

    Deploying the pro app

    Moving to DuckDB + S3 solved the main memory bottleneck, but deployment still required some production-minded cleanup: excluding large local artifacts from the deployment bundle, keeping only required CSV and UI assets in the app directory, making the S3-hosted parquet readable at runtime, and deploying the pro version as a separate app instead of overwriting the legacy one.

    The “more years” payoff: seasonality and long-horizon questions

    Now the app can actually answer the META vesting question in a meaningful way.

    With more years of data, users can explore:

    • seasonality (does the stock tend to rally in certain months?)
    • year-over-year behavior (does the stock exhibit recurring cycles?)
    • drawdowns and recoveries across multiple regimes (bear, bull, rate hikes)

    Even without adding a dedicated “seasonality panel,” simply allowing multi-year return series makes those patterns visible and testable (as shown below for META). 

    Event Study gets better when the dataset gets bigger

    The event study studio benefits directly from longer history.

    In an event study, sample size means events. With only one year of data, you get fewer filings, fewer split events, fewer observations per event type, and noisier averages.

    With more years of data, the event study becomes more useful because it can:

    • pull from a larger pool of events
    • produce more stable averages
    • narrow confidence intervals, all else equal
    • make rarer event types more analyzable

    So, the same architecture change that solved memory also improved statistical quality. This improvement is not abstract. With the expanded dataset, the app captured 10,590 Form 144/A events, giving the Event Study workflow a much stronger empirical base for estimating cumulative return patterns in selected sector (see below).

    The “pro” event-window technique that made it reliable

    The pro event study uses tidx, a per-ticker trading-day index, that lets the app build ±k trading-day windows without worrying about weekends and market holidays.

    In SQL, the method is:

    • find each event date’s tidx
    • generate offsets from -k to +k
    • join back to the panel on (ticker, tidx)

    During debugging, I hit a DuckDB-specific column naming gotcha around range(). The fix that stabilized everything was:

    offsets AS (
      SELECT * EXCLUDE (range), range AS rel_day
      FROM range(-{k}, {k} + 1)
    )

    That small change made the offsets table produce the exact column shape the rest of the query expected. After this, the event study workflow became stable.

    What improved in the pro version (the outcomes)

    After the rebuild, the app is fundamentally transformed. By decoupling the UI from the dataset, the app is faster, deployment is perfectly stable, and most importantly, the statistical power of the event studies and multi-year backtests is vastly improved.

    Closing thought

    Scaling from pandas to DuckDB and S3 wasn’t just an infrastructure upgrade—it was a product upgrade. Overcoming the memory bottleneck transformed ‘more data’ from a deployment risk into a core feature, ultimately scaling the complexity of the questions the app can answer.

    🚀 Try the Portfolio, Sector & Event Lab Pro Now!

  • Mental Health in Tech: Which Workplace Policies Work?

    Mental Health in Tech: Which Workplace Policies Work?

    This post examines mental health in tech and which workplace policies are most associated with higher treatment-seeking among employees.

    TL;DR  

    When it comes to mental health, tech workers report a big trust gap: they expect negative consequences for discussing mental health far more than physical health.  As a result, they don’t even look into what the policy at their company is. That is a major problem because awareness is as critical as access. Across benefits, care options, resources, anonymity, and wellness talks, “Don’t know” responses track almost as poorly as “No.”

    A higher support score (visible benefits + clear options + resources + anonymity + wellness dialogue) is strongly associated with treatment-seeking, with big gains up to ~3 policies understood/visible.

    The business case is real: new research estimates burnout costs U.S. employers $4,000–$21,000 per employee per year(≈ $5M/yr for a 1,000-employee firm). Meanwhile, ~40%+ of tech workers have a high risk of burnout.

    The tech industry runs on attention and problem-solving—exactly what chronic stress erodes. I analyzed the Open Sourcing Mental Illness (OSMI) Tech Survey to quantify how visible, trustworthy workplace policies relate to help-seeking for mental health. The goal: turn anecdotes into actionable, ROI-aware guidance for teams and leaders.

    Data & Method (quick tour)

    • Dataset: OSMI Mental Health in Tech Survey (2014), 1,259 respondents, 27 variables (e.g., demographics; workplace policies: benefits, care options, wellness talks; perceived anonymity; and outcomes like treatment-seeking, work interference).
    • Key construct — Support Score: For each policy question, Yes = 1; No/Don’t know/Not sure = 0; sum across items → per-person support score.
    • Stats: Welch one-way ANOVA + pairwise Welch t-tests to compare treatment-seeking across policy categories and risk factors (age, gender, family history, work interference).

    What I found

    1. Big stigma & trust gap

    Employees fear negative consequences far more when discussing mental vs physical health with employers. That effectively suppresses early help-seeking and exacerbates presenteeism.

    (PH: Physical Health; MH: Mental Health)

    1. Awareness is leverage (and often missing)

    For benefitscare optionsresourcesanonymity, and wellness programs, large shares answered “No” or “Don’t know.” Those “Don’t know” groups repeatedly underperform “Yes”—often similar to “No.” In other words: communication gaps erase ROI.

    1. Support drives action

    Higher support scores correlate with higher treatment-seeking, with strong gains up to ~3 understood/visible policies (then a plateau). Practical takeaway: while you may not need every program, it is critical to make a few core policies visible and trusted. 

    One-way ANOVA followed by pairwise Welch’s t-test:

    Employees with support score ≥1 were more likely to seek mental health treatment than those with support score 0.

    • Benefits present → higher treatment-seeking. Lack of awareness can be worse than not having benefits.

    One-way ANOVA followed by pairwise Welch’s t-test:

    • Know your options → action spikes. When employees know what care options are available to them, they’re far more likely to seek help.

    One-way ANOVA followed by pairwise Welch’s t-test:

    • Resources provided → highest treatment-seeking (~59%). Practical guides matter.

    One-way ANOVA followed by pairwise Welch’s t-test:

    • Wellness talks → higher help-seeking. Silence signals risk; conversation signals safety.

    One-way ANOVA followed by pairwise Welch’s t-test:

    • Anonymity clarity beats ambiguity. “Not sure if I’m protected” depresses help-seeking more than a clear “No.”

    One-way ANOVA followed by pairwise Welch’s t-test:

    1. Who’s at risk?
    • Younger men (<25 and 25–34) were least likely to seek treatment. Rates among men improve after 45, approaching women’s rates. Targeted manager training and tailored messaging can help ameliorate the age and gender gap.

    One-way ANOVA:

    Age-group differences in treatment-seeking did not reach conventional statistical significance (Welch one-way ANOVA, p = 0.058), but the near-threshold p-value provides weak evidence consistent with a small age effect in this sample.

    • Family history of mental illness is linked to higher treatment-seeking—likely via awareness and recognition.

    Welch’s t-test:

    • Work interference is a strong signal of demand. Those reporting frequent interference seek care much more often(significant Welch tests). Capacity planning should account for this.

    Welch’s t-test:

    Why this matters (for teams & CFOs)

    • Human lens: The Yerbo Burnout Index reported ~42% of tech employees at high risk of burnout,a problem that predates and outlasts any hype cycle.
    • Business lens: A 2025 American Journal of Preventive Medicine study estimates $4,000–$21,000 per employee per year in burnout costs, indicating a significant loss of ~$5M/yr for a 1,000-person company. That’s real money you can reinvest in policy visibility, manager training, and privacy-preserving access to care.

    4 recommendations for tech business leaders

    1. Make 3 policies unmistakably visible. Prioritize benefitshow-to-access care, and anonymity protections. Repeat the message in onboarding, manager 1:1s, and quarterly refreshers.
    2. Train managers for safety signals. Equip them to open conversations without prying; normalize early help-seeking; route to resources.
    3. Measure trust, not just availability. Track “Don’t know” rates and anonymity confidence; treat them as leading indicators.
    4. Plan capacity where interference is high. Where work interference is common, expect higher demand for counseling/EAP and adjust bandwidth accordingly.

    Limitations & future work

    • Single-year, self-reported data (2014): The data is representative of its sample/time; not a full longitudinal view. Extending to later OSMI waves can validate trend stability and expand subgroup analysis.
    • Associational (not causal): Strong correlations guide where to experiment; organizations should A/B their policy visibility and communication strategies to test causal lift.

    References

    1. State of Burnout in Tech (2022, Yerbo): Estimates ~42% of tech employees at high risk of burnout.
    2. American Journal of Preventive Medicine (2025): “The Health and Economic Burden of Employee Burnout to U.S. Employers — estimates $4,000–$21,000 per employee per year; ≈ $5M/yr for a 1,000-employee firm.

    Footnotes

    This post summarizes the methods, figures, and results presented in my talk, “Mind Over Machines: Exploring Mental Health in Tech Workers — Exploratory Data Analysis with Python.”

  • Mental Health in Tech: 2014–2023 Trends That Matter

    Mental Health in Tech: 2014–2023 Trends That Matter

    This post extends my Mental Health in Tech analysis by harmonizing the 2014 OSMI baseline with waves from 2016–2023.

    What changes when we add in 2014?

    The central question this seeks to answer is: Do visible workplace supports- policy pillars (benefits, care_options, seek_help, anonymity, wellness_program)-still correlate with a higher likelihood of seeking treatment when we combine later OSMI waves with the 2014 baseline?

    TL;DR

    • Pooled OSMI waves (2014, 2016-2023) confirm and strengthen the Part-1 finding that visibility of supports correlates with higher treatment-seeking.
    • Bigger sample size tightens error bars and surfaces new significant contrasts between support groups.
    • The story didn’t change. Visibility is leverage, though the multi-year lens makes it louder and clearer.
    • If you raise support and make it impossible to miss, more people get help. That’s good for people and productivity.

    Quick observations from the pooled dataset

    • Demographics stable; age/gender mix comparable to 2014. 
    • Treatment-seeking higher than baseline in several later waves (2019 omitted).
    • Work interference grew (Sometimes/Often), but within-group treatment rates look comparable to the baseline.
    • Family history: “Don’t know” ~60% seeking treatment vs No ~30%, Yes ~80% -a practical flag.
    • Support score curve: steadier rise, plateau after ~3; more significant contrasts (0 vs >=1; 1 vs >=3; 2 vs >=3).

    Data & harmonization

    Single pipeline maps later-wave headers to 2014 semantics, recodes to Yes/No/Don’t-know, and builds the 0-5 support score. Data for 2019 lacks the treatment item. (Methods as in Part-1.). Stats: for >2 group comparisons, one-way Welch’s ANOVA followed by pairwise Welch’s t-test, and for 2 group comparisons Welch’s t-test was used. 

    What I found

    1. Treatment-seeking over time: The multi-year picture shows higher treatment-seeking rates pre-2019 than post-2019. It’s important to note the apparent “dip” in 2019 is not a measured dip. In 2019 the treatment question was omitted, so that year’s rate is missing rather than lower.
    1. Support score distribution: The distribution looks comparable to the one for 2014, but with a wider gap between scores 4 and 5. Again, most employees report receiving no or very little support.
    1. Support score → treatment (dose-response): The pooled curve shows a steadier increase in treatment-seeking as support score rises, then plateaus after ~3. With many more responses, the error bars are smaller, so we can confidently see differences between more support-score groups. Stats reveal new significant pairwise differences:

    Stats reveal new significant pairwise differences:

    • Previously: 0 vs 0 vs 0 vs 0 vs ≥ ≥ ≥ ≥1 1 1 1, 1 vs 3
    • Now: 0 vs ≥ 1, 1 vs ≥ 3, 2 vs ≥ 3 (plus several neighbors trending)
    1. Policy awareness proportions (five questions): Awareness is still low. Aside from a modest improvement in the first question (proportion of “Yes” increased), most either worsened or stayed flat. Here, worse means more “No” and “Don’t know” responses. 
    1. Policy pillars → treatment (five questions) The pooled data reproduces 2014’s shape and adds a few new significant gaps:
    • Benefits vs. treatment. Proportions show slight improvement toward “Yes”. However, treatment-seeking rates remain comparable to the baseline. Having mental healthcare access is a major driver of treatment-seeking.
    • Care options awareness vs. treatment. Proportions and treatment-seeking rates remain comparable to baseline. Knowing how to access care represents the largest policy lift in the pooled data.
    • Seek-help resources vs. treatment. Proportions remain comparable to baseline. Treatment-seeking rates reveal a new significant gap: “No” ≠ “Don’t know.” (In 2014 this difference wasn’t significant). Uncertainty is its own risk. Providing practical guides to seek help can make a crucial difference for those who would pursue the help they need if they knew how to go about it. 
    • Anonymity protection vs. treatment. Proportions are comparable to baseline. Treatment-seeking rates for “Yes” differ from “No.” (In 2014 this difference wasn’t significant). Privacy clarity seems to carry significant weight here.
    • Wellness program (discussion) vs. treatment. Proportions and treatment-seeking rates are comparable to baseline. Providing mental healthcare in combination with wellness programs drives treatment-seeking.
    1. Family history vs. treatment The pooled data surfaced an “I don’t know” family-history group that wasn’t visible in 2014. Notably, ~60% of this group reported having sought treatment-about 2x the “No” group (~30%) yet below “Yes” (~80%). Uncertainty is an elevated-risk signal and should prompt proactive outreach (education, screening, benefits navigation) rather than being treated like “No.”
    1. Age & gender vs. treatment The monotonic age trend (older → higher treatment-seeking) persists; in the pooled data. However, age groups are now significantly different overall, elevating age as a potential risk factor to manage.
    1. Work interference vs. treatment The proportion of respondents reporting “Sometimes/Often” interference grew substantially vs. 2014; within category treatment-seeking rates (e.g., among “Sometimes/Often”) stayed comparable to the baseline.

    5 recommendations for tech business leaders

    1. Make support unmissable. The top gains still come from visibility. Accordingly, it is important to put benefits, access steps, and anonymity in onboarding, on the company’s intranet, and in managers’ 1:1 playbooks.
    2. Don’t let “Don’t know” persist. It tends to behave like “No” for outcomes. If you only fix one thing, fix discoverability.
    3. Train managers. They are the distribution channel for seek-help instructions and privacy assurances.
    4. Watch work interference. The Sometimes/Often group is larger than in 2014. In order not to shortchange people who need help, budget EAP/counseling capacity to meet that demand. 
    5. Mind the age gradient. Younger employees (especially men) continue to seek treatment less-target communication and manager nudges accordingly. 

    Limitations

    • Observational data. Associations ≠ causation.
    • Instrument differences. 2019 lacks the treatment item; later waves have smaller N.
    • Self-report bias. Outcomes and exposures are reported by respondents. 

    Where this could go next (subject to time & data quality):

    • Try a multivariable model to see which factors most strongly contribute to treatment.
    • Test policy importance by recomputing the support score while dropping one policy at a time.

    Acknowledgments

    Thanks to OSMI for maintaining and sharing the longitudinal survey, and to the NYC Data Science Academy community for feedback on the harmonization and plotting pipeline.

    References

    1. Mental Health in Tech: Which Workplace Policies Work?
    2. Tech’s ongoing mental health crisis
    3. Asana Anatomy of Work Index 2022
    4. The Health and Economic Burden of Employee Burnout to U.S. Employers

  • Learn About Markets with a Python Shiny Stock App

    Learn About Markets with a Python Shiny Stock App

    In this post, I’ll walk through a python shiny stock app I built to learn about markets hands-on. It bundles three tools-a Portfolio Simulator, Sector Explorer, and Event Study-so you can experiment with real NASDAQ & NYSE data without writing new code each time. 

    Note that this is intended for educational purposes only and should not be taken as financial advice.

    Live app: Portfolio, Sector & Event Lab

    Why I built this

    I built this python shiny stock app to give investors a single place where they can quickly explore their ideas and make more informed buying and selling decisions. In one workflow, you can backtest simple buy-and-hold strategies, compare sector performance side by side, and quantify how stock prices react to real events like earnings or SEC filings.

    The whole design leans into learning-by-doing: minimal clicks, assumptions stated up front, and exportable outputs you can take with you for deeper follow-up analysis.

    Some real questions

    • “I’m holding a small basket of stocks—say AAPL, NVDA, and PLTR—and I want to see how a simple buy-and-hold over a recent window would have actually played out. What would my path and final return have looked like?” (See chart below.)
      • With the Portfolio Simulator, I enter the tickers, set a date window, choose Equal vs. Inverse-price weighting, and hit Run. The app buys once at t₀ (the start of the selected window), then shows portfolio wealth, total return, and per-ticker normalized lines. There’s no rebalancing and no transaction costs-just a clean view of how that specific buy-and-hold position would have played out over the chosen period. 

    So, the answer to that first “real question” is: over this specific recent window, a simple equal-dollar buy-and-hold in AAPL/NVDA/PLTR would have more than doubled your money, with PLTR doing most of the heavy lifting.

    • “I’m thinking about trading AAPL around 10-Q filings. Historically, have AAPL’s returns tended to react positively or negatively in the days around those filings?” (See chart below.)
      • With the Event Study, I choose the filing type (e.g., 10-Q), enter the ticker, set a date range and an event window in relative days, and hit Run. The app calculates and plots cumulative abnormal returns around each event, giving a compact view of how the stock has historically behaved in the days before and after those filings.

    So, the answer to the second “real question” is: in this sample, AAPL has on average drifted slightly down or sideways before 10-Q filings, then shows a modest positive abnormal move-around +5-6%-in the days following the filing (out to day +5), albeit with plenty of variability across events.

    What you can do (fast)

    1. Portfolio Simulator
      • Set tickers and dates; choose Equal (1/N) or Inverse-price (1/price₀); run.
      • Outputs: wealth curve, total return, optional daily returns, per-ticker normalized lines, CSV export.
    2. Sector Explorer
      • Pick sectors and a window; build equal-weight sector indices (base=1) and view returns/rolling stats.
      • Outputs: indexed curves + simple table of window returns (cap-weighting on the roadmap).
    3. Event Study
      • Enter one or more dates and a ± window (trading days).
      • Outputs: AR/CAR by event and an average path when multiple dates are provided; day-0 aligned plot; CSV. (Expected return = 0 → ARₜ = rₜ)

    Under the hood (transparent by design)

    I built the dataset myself to avoid rate-limit surprises and keep the python shiny stock app feeling snappy even when working with thousands of tickers. The notebook pulls NASDAQ + NYSE tickers, filters out non-common stocks, downloads price history via yfinance, and writes a tidy daily panel suitable for quick analysis.

    The pipeline saves both Parquet and CSV and is safe to re-run. It merges fundamentals, computes returns, and writes analysis_enriched.parquet/.csv. For the live app, I ship a compact outputs/sample.parquet slice (1-year window). That way the UI stays snappy, while the full 3-year panel remains in outputs/analysis_enriched.parquet.

    The presentation summarizes scope and scale-~3.8M+ rows across 31 columns-covering thousands of U.S. tickers with filing/split annotations. Visit my GitHub repo for more details on the dataset builder pipeline, sample builder and app scripts.

    Try it yourself

    1. Portfolio: AAPL MSFT PLTR QUBT (last 12 months) → compare Equal vs Inverse-price.
    2. Sectors: Technology, Healthcare, ETF/Fund → normalize to base=1; scan drawdowns.
    3. Events: Forms 144 and 144/A, Technology sector, 2024/09/30-2025/09/30, k=3 → inspect AR/CAR around day 0.

    Roadmap for the python shiny stock app

    • Market-cap and other weighting schemes. 
    • Event baselines (constant-mean / market-model / Fama-French).

  • The Moneyball of Real Estate: Ames housing price prediction

    The Moneyball of Real Estate: Ames housing price prediction

    This post walks through my Ames housing price prediction project, where I built and deployed a machine learning ensemble to estimate sale prices.

    I. Introduction

    The Flipper’s Dilemma In real estate, the line between a “deal” and a “money pit” is often invisible. Most investors rely on gut feel or “back-of-the-napkin” math to estimate a renovation’s return on investment (ROI). “If I add a garage, this house will surely sell for $30k more, right?”

    I wondered: What if I didn’t have to guess? How could I be assured that the improvement I make to the home will increase its sale value by the amount I spent or more? My goal was to move beyond the standard Kaggle-style objective of simply predicting a price. I wanted to build a Decision Support System an end-to-end machine learning pipeline that not only predicts fair market value but mathematically identifies the highest-ROI renovation opportunities in Ames, Iowa. Using a tech stack of Python, Scikit-Learn, XGBoost, CatBoost, and Shiny, I turned a static dataset into a dynamic investment engine. 

    II. Data Engineering

    The “None” Strategy Real-world data is messy, and the Ames dataset was no exception. Faced with over 80 columns of mixed quality data points, my first task was defining an imputation strategy that was robust enough for production.

    The dataset documentation suggested a manual approach -mapping specific NaN values to specific meanings (e.g., assuming a missing Pool Quality score meant “No Pool”). However, I decided to deviate from these manual rules to build a more generalized pipeline. Hard-coding specific assumptions for 80+ columns creates ‘technical brittleness’-a system prone to breaking if data schemas shift. I wanted a scalable architecture that applied consistent logic across the board, reducing the risk of human error and making the pipeline easier to maintain in production. 

    • The Strategy: Instead of manually hard-coding definitions for every column, I adopted a systematic imputation strategy.
      • Categorical Data: I filled all missing values with the string “None”. This treated “missingness” as its own distinct category, allowing the model to learn the signal behind the absence of data without me imposing assumptions.
      • Numerical Data: I imputed missing values with the Median.

    This decision paid off. Letting the model interpret the “None” category resulted in it learning that missing values in key columns like BsmtExposure were strong signals of lower value. While the data dictionary instructs us to manually map these NaNs to “No Basement,” my approach allowed the model to mathematically validate this relationship independently-effectively capturing the “lack of feature” penalty without requiring complex manual mapping scripts.

    To ensure the model remained lean, I performed a “Redundancy Audit.” Using a Correlation Threshold, I identified features that were essentially saying the same thing, By cutting the fat and removing multicollinear variables, I reduced the risk of overfitting and ensured that each coefficient in my final model had a clear, independent meaning.

    III. The Model Battle Royale

    Glass Box vs. Black Box I approached modeling as a competition between two philosophies: the interpretable “Glass Box” and the high-performance “Black Box.”

    Phase 1: The Glass Box (Linear Models)

    I started with OLS, Lasso, Ridge, and ElasticNet. These models are the “strict accountants” of machine learning. They are highly interpretable, allowing me to see exactly how many dollars a fireplace adds to the price tag. But they struggle with nuance.

    • The Finding: While ElasticNet tried to be the “pragmatist” by taming the Living Area coefficient (reducing it from Lasso’s >0.12 to ~0.10), Lasso emerged as the clear winner, achieving the highest mean Cross-Validation R2 (0.8899) of all the linear models. Lasso proved to be “Size Obsessed,” assigning massive value to Ground Living Area, but it also provided the most sensible prioritization of the fundamentals: Overall Quality (fit and finish), Overall Condition (maintenance), and Year Built (age). It effectively zeroed out “noise” like Pool Area, telling me: “Ignore the fluff. The safest way to add value is to maintain the property and make it bigger.”
    • The Lesson: Trust the coefficients, not just the score. My OLS baseline produced a competitive R2, but inspecting the coefficients revealed it was “fitting the noise.” Due to multicollinearity, OLS made illogical trade-offs. For example, it assigned a positive value to YearBuilt but a negative penalty to SaleType_New, despite these features describing the same “newness.” 

      Regularization (Lasso/Ridge) didn’t just tune the model; it acted as a “Logic Filter,” dampening these contradictions to ensure the model’s advice was not just statistically accurate, but practically sound.

    Phase 2: The Black Box (Tree Models & SVR)

    Next, I moved to the non-linear powerhouses: SVR, Random Forest, XGBoost, and CatBoost. This wasn’t just about trying different algorithms; it was about finding the right architectural fit.

    • The Honorable Mentions (SVR & Random Forest):
      • SVR (Linear Kernel): While stable and reliable in its feature attributions, SVR hit a performance ceiling with a CV score of ~0.87. It simply lacked the capacity to capture the complex, non-linear interactions required to break the 0.90 barrier.
      • Random Forest: This model revealed a fatal flaw I call “The Monolith.” It assigned a staggering 55% feature importance to a single variable-Overall Qual. This made the model unbalanced and dangerously sensitive to a subjective rating, failing to capture the nuance of the rest of the dataset.
    • The Champions (XGBoost & CatBoost): The Gradient Boosters changed the game, with both models exceeding 0.91 CV scores-the highest of any individual models tested.
      • The Finding: Unlike Random Forest, they offered a balanced worldview. XGBoost emerged as the “Amenities Flipper,” ranking GarageCars as a top driver (even higher than living area), proving that premium features can outweigh raw square footage.
    • The Synergy: I selected both for the final ensemble because they are strong learners that “think” differently. XGBoost grows asymmetrically (leaf-wise) to capture specific, deep interaction rules, while CatBoost grows symmetrically (oblivious trees) to act as a regularizer. This structural diversity meant they could cover each other’s blind spots rather than repeating the same errors.
    • The Lesson: Don’t flatten the hierarchy. One-Hot Encoding destroys the natural order of real estate data (e.g., Excellent > Good > Poor), treating them as unrelated buckets. By switching to Ordinal Encoding, I preserved this mathematical rank, enabling the trees to make logical splits based on “Quality Thresholds” (e.g., Condition ≥7) rather than memorizing isolated features. This allowed the model to capture the “tipping points” where value accelerates.

    Phase 3: The “Super Model” (Voting Regressor)

    Why choose one worldview? I built a Voting Regressor that combines the disciplined baseline of Lasso (Weight: 1) with the aggressive precision of XGBoost and CatBoost (Weight: 2 each).

    • The Strategy: This hybrid architecture was designed to balance the “Square Footage” fundamentals (Lasso) with the “Luxury Amenity” premiums (Trees). The linear model provided a stable pricing floor, while the boosters captured the non-linear value ceilings that simple math missed. 
    • The Pipeline Feat: The real engineering challenge was data routing. I built a custom pipeline that simultaneously fed One-Hot Encoded data to the Lasso branch and Ordinal Encoded data to the Tree branches. This ensured every model received the data in its optimal format-preventing the “dilution” of the trees while satisfying the “math” of the linear model-ultimately driving the ensemble to a Test R2 of 0.933.

    IV. Productionizing

    From Lab to Factory A model in a notebook is a prototype; a model in an app is a product. The transition required solving the “Well, it works on my machine” problem. To productionize the Ames housing price prediction pipeline, I packaged preprocessing and inference behind a Flask REST API and a Shiny for Python dashboard.

    • The Unified Pipeline: I wrapped my entire preprocessing, imputation, and modeling logic into a single Scikit-Learn Pipeline. This means raw user input-JSON data-goes in, and a prediction comes out, without any manual data prep. 
    • Technical Spotlight: Custom Feature Engineering: Tο ensure the pipeline was robust and portable, I built custom classes-Feature Engineer and Correlation Threshold-that inherit directly from Scikit-Learn’s BaseEstimator and TransformerMixin. This architecture allowed me to bake complex logic directly into the pipeline object: 
      • Automated Engineering: My Feature Engineer class standardizes the creation of high-signal features like TotalSqFt, HouseAge, and TotalBath.
      • Geospatial Intelligence: If specific neighborhood data is provided, the class leverages the geopy package to fetch precise Latitude and Longitude coordinates, capturing micro-location value that simple labels might miss.
    • Artifact Management: I didn’t just pickle the model. I created a synchronized “Artifact Factory” that exports ames_model_defaults.pkl (to auto-fill missing user inputs) and ames_model_options.pkl (to populate the dashboard’s dropdowns dynamically). 
    • The Testing Suite: I built a three-tier testing framework to ensure reliability:
      • The Mega-Mansion Test (Safety): I fed the API a 200,000 sq ft house. Instead of crashing or predicting $500 Million, my tree-clamping logic kept the prediction grounded.
      • The Ghost House (Stability): I tested inputs with 0 sq ft to ensure the model threw appropriate errors rather than nonsensical prices.
    • The Renovation Check (Logic): I wrote scripts to verify that Price(Renovated) > Price(Base). If adding a garage didn’t increase the price estimate, the model failed the logic test.

    V. The “House Flipper Pro” Dashboard (Shiny)

    Finally, I visualized my engine using Python Shiny. I didn’t want a static report; I wanted a Deal Simulator. 

    Try the Live App Here: House Flipper Pro

    • The Control Center: I built an interface that lets me tweak every variable-Neighborhood, Quality, Square Footage to hunt for “Unicorn” deals in real-time.
    • The Renovation Toggles: I added one-click buttons to simulate specific upgrades (e.g., “Add Central Air,” “Finish Basement”). This allows me to instantly see if a $20k garage adds $20k in value (Spoiler: usually not).
    • The Deal Logic: I included a “Purchase Discount” slider and a dynamic Waterfall Chart. This enforces the discipline of the model: showing visually that true profitability usually comes from the buy, not just the fix.

    VI. Strategic Takeaways: The Renovation Reality

    The most exciting part of this project wasn’t the code; it was the real estate strategy the model revealed. 

    • The “AC Arbitrage”: My model identified a consistent arbitrage opportunity in “Old Town.” Older homes there are heavily penalized for lacking Central Air. But I discovered a hidden multiplier: Square Footage. The data reveals that the market punishes the lack of central air far more severely in large homes than in small ones, creating a non-linear opportunity for value creation. 
      • The “Unicorn”: A large (2,500+ sq ft), Pre-1960s home without AC is the most profitable renovation target in the dataset. The market punishes these “obsolete” giants severely, so bringing them up to modern standards creates an outsized value lift compared to the fixed renovation cost.
    • The Garage Trap (vs. The Quality Leap): Conversely, the model warned me against adding garages without doing the math first. While many flippers assume a garage is a guaranteed “value-add,” the data tells a more complicated story.
      • The Findings: As the simulation above shows, adding a 2-car garage in a top-tier neighborhood does increase value for larger, high-quality homes—generating a $7,769 profit (2.8% ROI) on a $20,000 investment. However, in lower-tier neighborhoods, this math often flips to a net loss. This surprise finding highlights the danger of “Over-Improvement.” Budget-conscious neighborhoods often have invisible price ceilings, meaning buyers simply cannot pay the premium for a brand-new garage, causing the fixed cost of construction to exceed the value added.
      • The Better Play: My “Super Model” (see SHAP summary below) revealed that Overall Quality is often a stronger, safer driver than amenities. Instead of risking $20k on a garage for a 2.8% return, the data suggests a higher and safer ROI comes from cosmetic upgrades (flooring, paint, fixtures) that bump a house from “Average” to “Good” condition—capitalizing on the model’s heavy weighting of the OverallQual feature.

    VII. Future Roadmap

    While the application is currently live and stable, the engineering journey continues. The next major step is Dockerization-containerizing the application to ensure it runs identically in any environment, eliminating the remaining “system dependency” risks and making the deployment truly cloud-agnostic.

    VIII. Conclusion

    The Golden Rule: Perhaps the most humbling finding from my Shiny app was this: Renovations rarely pay for themselves if you pay full price for the house. The “Purchase Discount” slider suggested that the best ROI doesn’t come from granite countertops; it comes from finding properties that are underpriced to begin with. My model confirms the old real estate adage: You make your money when you buy, not when you sell.

    By combining rigorous data engineering with a “Super Model” ensemble, I turned a static housing dataset into a dynamic tool for finding value. I didn’t just predict prices; I decoded the market’s behavior, proving that with the right data, you really can “Moneyball” real estate. 

    Launch the App