ML Portfolio · Eduardo Rdgz-Á

Catch 8 out of every 10 frauds by reviewing only the model’s alerts.

A ranking model that helps the manual review team inspect the most suspicious transactions first. Trained on 1M anonymized bank transactions.

See the live demo → Technical memo (PDF)

Dataset: BAF (Bank Account Fraud) · NeurIPS 2022

Image: abstract wave-particle visualization representing data. Credit: Jonakoh · Unsplash

The problem

Why fraud detection isn’t an ordinary classification problem

Why it matters

Banks process millions of transactions every day. Only about 1% are fraudulent, but the cost of missing them — chargebacks, regulatory fines, lost customer trust — adds up fast. Every fraud caught in time is money that stays inside the bank.

Why it’s hard

When fraud is this rare, traditional metrics are misleading. A model that called every transaction "not fraud" would score 99% accuracy and be completely useless. The real challenge isn’t black-and-white classification; it’s ranking transactions by risk so the manual review team knows what to look at first.
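To make the accuracy trap concrete, here is a minimal sketch in Python. The 1% fraud rate comes from the text above; the 10,000-transaction sample and the variable names are illustrative assumptions, not the project's code.

```python
# Illustrative labels: 10,000 transactions, 1% fraud (1 = fraud, 0 = legitimate).
n_total = 10_000
n_fraud = 100
labels = [1] * n_fraud + [0] * (n_total - n_fraud)

# A "model" that calls every transaction legitimate.
predictions = [0] * n_total

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

accuracy = correct / n_total  # share of all predictions that are right
recall = caught / n_fraud     # share of fraud actually caught

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

Accuracy looks excellent while recall is zero, which is why the rest of this page reports ranking metrics instead.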

How this model tackles it

Each transaction gets a fraud probability between 0 and 1, and the review team inspects the highest ones. At the default decision threshold of 0.50, the model catches around 80% of fraud by flagging only the transactions that score above it. You can tune that balance live in Section 02.
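The review workflow described above amounts to a filter followed by a sort. A minimal sketch, assuming hypothetical transaction IDs and made-up probabilities (the real scores come from the deployed model):

```python
# Hypothetical scored transactions: (transaction_id, fraud_probability).
scored = [
    ("tx-001", 0.03),
    ("tx-002", 0.91),
    ("tx-003", 0.47),
    ("tx-004", 0.66),
    ("tx-005", 0.12),
]

THRESHOLD = 0.50  # default decision threshold named on this page

# Keep only the alerts at or above the threshold, most suspicious first.
# Treating a score exactly equal to the threshold as an alert is a
# convention chosen for this sketch.
review_queue = sorted(
    (item for item in scored if item[1] >= THRESHOLD),
    key=lambda item: item[1],
    reverse=True,
)

for tx_id, prob in review_queue:
    print(f"{tx_id}: {prob:.2f}")
```

Only two of the five transactions reach the review team, and the riskiest one is at the top of the queue.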

Next: how we measure whether the model keeps that promise, how it behaves as you move the decision threshold, which features carry the most weight, and three real cases scored by the deployed model.

Section 01 · How good it is

The tuned model catches more fraud with fewer mistakes

Evaluated on 200,000 transactions the model never saw during training. Two versions of the same algorithm are compared: an untuned baseline and the final version tuned with Optuna.

Headline metrics, each reported for the untuned baseline (before) and the tuned model (after):

Discrimination power: AUC-ROC

Ranking quality: AUC-PR

Sensitivity–specificity curve (ROC curve)

The closer the curve hugs the top-left corner, the better the model tells legitimate and fraudulent transactions apart. The tuned model improves the area under this curve (AUC-ROC) by 0.055 over the untuned baseline.
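One way to read the area under the ROC curve: it is the probability that a randomly chosen fraud scores higher than a randomly chosen legitimate transaction. The sketch below computes it that way on toy data. The function name and the six example scores are assumptions for illustration; this is not the evaluation code behind the chart.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the share of (fraud, legitimate) pairs ranked correctly.

    Ties count as half a correct pair. The double loop costs
    len(pos) * len(neg) comparisons, which is fine for a demo.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one fraud and one legitimate example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, not the project's test set.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
print(auc_roc(labels, scores))  # 7 of 8 pairs ranked correctly -> 0.875
```

A value of 0.5 means the ranking is no better than chance, and 1.0 means every fraud outranks every legitimate transaction.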

Source: 200,000 test transactions, BAF dataset (NeurIPS 2022).

Precision–sensitivity curve (precision-recall curve)

Shows how precise the model stays as its fraud coverage grows. The curve declines gradually — the model keeps useful precision even while capturing a larger share of total fraud.
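The curve can be traced by walking down the ranking one alert at a time and recording precision and recall after each step. The sketch below does that on toy data and summarizes the curve with average precision, one common estimate of the area under it. Function names and data are illustrative assumptions, and the sketch assumes all scores are distinct.

```python
def precision_recall_points(labels, scores):
    """Return (recall, precision) after each alert, highest score first."""
    total_fraud = sum(labels)
    if total_fraud == 0:
        raise ValueError("need at least one fraud example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = 0
    for rank, (_, y) in enumerate(ranked, start=1):
        true_pos += y
        points.append((true_pos / total_fraud, true_pos / rank))
    return points


def average_precision(points):
    """Sum of precision weighted by each step's gain in recall."""
    ap = 0.0
    prev_recall = 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy data, not the project's test set.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]

points = precision_recall_points(labels, scores)
for recall, precision in points:
    print(f"recall={recall:.2f}  precision={precision:.2f}")
print(f"average precision = {average_precision(points):.3f}")
```

Unlike AUC-ROC, this summary ignores true negatives entirely, which makes it more informative when legitimate transactions vastly outnumber fraud.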

Source: 200,000 test transactions, BAF dataset (NeurIPS 2022).

Section 02 · Product decision

Tune the balance between coverage and workload

The threshold is the point at which the model declares a transaction suspicious. There’s no "right" value — it depends on how much manual review capacity the team has. Raising it cuts false alarms but lets fraud slip through; lowering it catches more fraud but creates more work. Move the slider to see the effect live across all 200,000 test transactions.

Threshold 0.50

Error matrix (confusion matrix)

Derived metrics

Recall: of all fraud, how much did the model catch?

Precision: of every alert, how many were actual fraud?

F1 score: balance between precision and coverage.

False positive rate: % of legitimate transactions flagged as suspicious.

Source: 200,000 test transactions, BAF dataset (NeurIPS 2022). Default threshold: 0.50.
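For readers who want to see what the slider recomputes, here is a minimal sketch of the confusion matrix and the four derived metrics at a given threshold. The function name, the ten toy transactions, and the choice to flag scores at or above the threshold are assumptions for illustration.

```python
def metrics_at(labels, scores, threshold):
    """Confusion-matrix counts and derived metrics for one threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1  # fraud caught
        elif flagged and y == 0:
            fp += 1  # false alarm
        elif not flagged and y == 1:
            fn += 1  # fraud missed
        else:
            tn += 1  # legitimate, left alone
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "recall": recall, "precision": precision, "f1": f1, "fpr": fpr,
    }


# Toy data: 3 frauds and 7 legitimate transactions.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.1, 0.05, 0.02]

for threshold in (0.25, 0.50, 0.75):
    m = metrics_at(labels, scores, threshold)
    print(
        f"threshold={threshold:.2f}  recall={m['recall']:.2f}  "
        f"precision={m['precision']:.2f}  f1={m['f1']:.2f}  fpr={m['fpr']:.2f}"
    )
```

On this toy data, lowering the threshold from 0.75 to 0.25 lifts recall from one fraud in three to all three, at the price of one extra false alarm.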

Note: the default threshold (0.50) is a neutral reference, not an optimal value. In production, the threshold depends on the relative cost of a missed fraud versus a false alarm, which is a business decision, not a modeling one.
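A sketch of how that business decision could be turned into a threshold: assign a cost to each kind of error and pick the candidate with the lowest total. The cost figures, the candidate thresholds, and the toy data are all hypothetical.

```python
def total_cost(labels, scores, threshold, cost_missed, cost_alarm):
    """Cost of missed frauds plus cost of false alarms at one threshold."""
    missed = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    alarms = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return missed * cost_missed + alarms * cost_alarm


def best_threshold(labels, scores, candidates, cost_missed, cost_alarm):
    """Candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda t: total_cost(labels, scores, t, cost_missed, cost_alarm),
    )


# Toy data: 3 frauds and 7 legitimate transactions.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.8, 0.4, 0.2, 0.1, 0.1, 0.05, 0.02]
candidates = [0.25, 0.50, 0.75]

# Hypothetical case 1: a missed fraud costs far more than a manual review.
print(best_threshold(labels, scores, candidates, cost_missed=500, cost_alarm=5))

# Hypothetical case 2: reviews are expensive, fraud losses are small.
print(best_threshold(labels, scores, candidates, cost_missed=5, cost_alarm=500))
```

The same model and the same scores yield different thresholds (0.25 in the first case, 0.50 in the second) once the costs change.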

Section 03 · Why it predicts what it predicts

The model pinpoints exactly what triggered the suspicion

A model that only outputs a probability isn’t enough for the review team: they need to know what raised the flag in order to investigate. SHAP attributes to each variable how much it pushed the decision toward "fraud" or "legitimate." Shown here are the 20 variables that weigh the most on average across 2,000 test transactions.

Global feature importance (mean SHAP value)

Housing status and the device operating system are the most decisive signals. Recent-behavior variables — request frequency, branch visits — weigh more than static demographics like the customer’s age.

Source: SHAP TreeExplainer over 2,000 test transactions from the BAF dataset.
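To show what a SHAP attribution means, the sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function. It is a toy under stated assumptions: the scoring function, the feature names, and the single baseline row are hypothetical, and the enumeration over feature subsets is not the algorithm SHAP's TreeExplainer uses.

```python
from itertools import combinations
from math import factorial


def toy_score(row):
    """Hypothetical risk score with one interaction between two features."""
    score = 0.05
    score += 0.30 * row["new_device"]
    score += 0.02 * row["requests_last_hour"]
    score += 0.20 * row["new_device"] * (row["requests_last_hour"] > 5)
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values of one prediction against a single baseline row.

    Features outside a coalition take their baseline value.
    Costs on the order of 2^n model calls, so only usable for tiny n.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        row = {k: instance[k] if k in coalition else baseline[k] for k in names}
        return model(row)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


baseline = {"new_device": 0, "requests_last_hour": 1, "customer_age": 40}
instance = {"new_device": 1, "requests_last_hour": 10, "customer_age": 40}

phi = shapley_values(toy_score, instance, baseline)
print(phi)

# Attributions add up to the gap between this score and the baseline score.
gap = toy_score(instance) - toy_score(baseline)
print(f"sum of attributions = {sum(phi.values()):.2f}, gap = {gap:.2f}")

# Global importance: mean absolute attribution over several rows.
rows = [instance, {"new_device": 0, "requests_last_hour": 8, "customer_age": 25}]
per_row = [shapley_values(toy_score, r, baseline) for r in rows]
global_importance = {
    name: sum(abs(p[name]) for p in per_row) / len(per_row) for name in baseline
}
print(global_importance)
```

The unused feature receives an attribution of zero, and the interaction term is split evenly between the two features that trigger it. The chart above applies the same averaging idea across 2,000 test transactions.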

Section 04 · The model in production

Run a transaction through the model and watch the detection

Pick a case to send to the model deployed on this server. Each case is a real transaction from the test set, so the model never saw it during training.

Source: BAF dataset (NeurIPS 2022). Prediction and SHAP attributions computed live by the model deployed on this server.

Section 05

Project resources

Documentation written for three different audiences: the full source code, the design decisions with their rationale, and the step-by-step exploratory analysis.

Tech stack

Python 3.12 · scikit-learn · Optuna · FastAPI · Astro · TypeScript · Plotly.js · Docker · GitHub Actions · Quarto · XGBoost · SHAP