
Risk Scoring

TL;DR
  • Risk score = 0-100 number indicating fraud likelihood; use for auto-approve / review / auto-decline tiers
  • Rules: Email domain, velocity, geo mismatch—transparent, fast to deploy, catches known patterns
  • ML: Complex patterns, novel fraud—needs data science expertise and labeled training data
  • Best approach: Combine both—rules for known fraud + ML for subtle patterns
  • Threshold tuning: Run A/B tests; calculate cost of false positives vs. cost of fraud; adjust quarterly

A risk score is just a number. The question is: does it help you make better decisions?

Your thresholds are bets. You're trading false positives (blocking good customers) for true positives (blocking fraud). The "right" threshold depends on your margins, your fraud rate, and your tolerance for customer complaints.

Experiment to Run: Score Threshold Sweep

Run 3 cutoffs in parallel on small slices of traffic:

  • Segment A: Score under 30 auto-approve, 30-60 review, over 60 auto-block
  • Segment B: Score under 40 auto-approve, 40-70 review, over 70 auto-block
  • Segment C: Score under 50 auto-approve, 50-80 review, over 80 auto-block

Metrics: Fraud loss + review cost + estimated false positive cost (use average order value × block rate × estimated good customer %)

Run length: 4 weeks (you need time for chargebacks to materialize)

Decision: Pick the cutoff with lowest total cost. Probably not the tightest one.
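
To make the cost comparison concrete, here is a minimal sketch of the total-cost calculation per segment. The field names, per-review cost, and sample numbers are placeholders for illustration, not figures from this page:

REVIEW_COST = 3.00          # assumed cost per manual review, in dollars
AOV = 100.00                # average order value (example figure)
GOOD_SHARE_OF_BLOCKS = 0.5  # estimated share of blocked orders that were actually good

def total_cost(segment):
    fraud_loss = segment["fraud_loss"]                   # chargebacks that materialized
    review_cost = segment["review_count"] * REVIEW_COST  # manual review labor
    false_positive_cost = segment["blocked_count"] * GOOD_SHARE_OF_BLOCKS * AOV
    return fraud_loss + review_cost + false_positive_cost

segments = {  # illustrative outcomes after the 4-week run
    "A (30/60)": {"fraud_loss": 4200, "review_count": 900, "blocked_count": 310},
    "B (40/70)": {"fraud_loss": 5100, "review_count": 620, "blocked_count": 190},
    "C (50/80)": {"fraud_loss": 6800, "review_count": 400, "blocked_count": 110},
}
for name, seg in segments.items():
    print(name, round(total_cost(seg), 2))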

What Is a Risk Score?

A risk score is a number assigned to each transaction indicating the likelihood that it's fraudulent. Higher scores mean higher risk.

Common scales:

  • 0-100 (higher = riskier)
  • 0-1000 (more granular)
  • 0-1 (a direct fraud probability)

How it's used:

Score 0-30:    Auto-approve
Score 31-70:   Manual review
Score 71-100:  Auto-decline

The thresholds depend on your risk tolerance, margins, and operational capacity. These aren't magic numbers - they represent a trade-off between catching fraud (true positives) and wrongly declining good customers (false positives).
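
The tier mapping itself is trivial; a minimal sketch (the threshold values mirror the example above and should come from your own tuning, not from this page):

def decide(score, approve_below=31, decline_from=71):
    """Map a 0-100 risk score to an action. Thresholds are illustrative."""
    if score < approve_below:
        return "approve"
    if score >= decline_from:
        return "decline"
    return "review"

print(decide(12), decide(55), decide(88))  # approve review decline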

Rules-Based Scoring

Rules are explicit conditions that add or subtract from a transaction's risk score.

How Rules Work

Each rule evaluates a condition and applies a score adjustment:

IF email_domain = "tempmail.com" THEN +30
IF shipping_country != billing_country THEN +15
IF customer_has_previous_orders > 5 THEN -10
IF device_seen_on_fraud_before = true THEN +50
IF amount > $500 THEN +10

Final score = base score + sum of all triggered rules.
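
A minimal rules engine is just a list of conditions with score adjustments. A sketch in Python that mirrors the pseudocode above (field names and point values are illustrative):

# Each rule is (description, predicate, score adjustment).
RULES = [
    ("disposable email domain", lambda t: t["email_domain"] in {"tempmail.com"}, +30),
    ("shipping/billing country mismatch", lambda t: t["ship_country"] != t["bill_country"], +15),
    ("established customer", lambda t: t["previous_orders"] > 5, -10),
    ("device seen on prior fraud", lambda t: t["device_flagged"], +50),
    ("high order amount", lambda t: t["amount"] > 500, +10),
]

def score_transaction(txn, base_score=0):
    score = base_score
    reasons = []
    for name, predicate, adjustment in RULES:
        if predicate(txn):
            score += adjustment
            reasons.append((name, adjustment))
    return score, reasons

txn = {"email_domain": "tempmail.com", "ship_country": "US", "bill_country": "CA",
       "previous_orders": 0, "device_flagged": False, "amount": 620}
print(score_transaction(txn))  # 30 + 15 + 10 = 55, plus the triggered rules

Returning the triggered rules alongside the score is what gives you the explainability discussed below.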

Types of Rules

Identity rules:

  • Email validity (deliverable, disposable domain, recently created)
  • Phone number validation
  • Name consistency across data points

Transaction rules:

  • Order amount (high value = higher risk)
  • Product category (high-fraud categories)
  • Shipping method (expedited = higher risk)
  • Billing/shipping mismatch

Behavioral rules:

  • Time to complete checkout (too fast = bot)
  • Session behavior (copy-paste vs. typing)
  • Multiple failed attempts before success

Velocity rules:

  • Orders per IP per hour
  • Cards per email per day
  • Shipping addresses per card per week
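
Velocity rules need state: a count of events per key (IP, email, card) over a rolling time window. A minimal in-memory sketch of that counting (production systems usually back this with something like Redis; the threshold is illustrative):

import time
from collections import defaultdict, deque

class VelocityCounter:
    """Counts events per key within a sliding time window."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> recent timestamps

    def hit(self, key, now=None):
        now = now if now is not None else time.time()
        q = self.events[key]
        q.append(now)
        while q and now - q[0] > self.window:  # drop events outside the window
            q.popleft()
        return len(q)                           # current count within the window

orders_per_ip_hour = VelocityCounter(window_seconds=3600)
if orders_per_ip_hour.hit("203.0.113.7") > 5:   # illustrative threshold
    print("velocity rule triggered: add points to the risk score")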

Device/network rules:

  • Device fingerprint previously linked to fraudulent orders
  • IP reputation (proxy, VPN, Tor, or datacenter ranges)
  • IP geolocation vs. billing address mismatch

Rules: Pros and Cons

Pros:

  • Transparent - you know exactly why a transaction was flagged
  • Controllable - adjust instantly for new patterns
  • Explainable - easy to justify decisions to customers, banks, auditors
  • No training data required

Cons:

  • Reactive - must manually add rules for new fraud patterns
  • Brittle - fraudsters learn and adapt to your rules
  • Maintenance burden - rule sets grow unwieldy over time
  • Limited pattern recognition - can't catch subtle correlations

Machine Learning Scoring

ML models analyze historical transaction data to identify patterns that predict fraud, including patterns too complex for humans to define as rules.

How ML Scoring Works

  1. Training: Model is fed historical transactions labeled as fraud/legitimate
  2. Learning: Model identifies features and patterns correlated with fraud
  3. Scoring: For new transactions, model outputs fraud probability
  4. Feedback loop: New fraud outcomes are fed back to improve the model
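
A minimal supervised-learning sketch of that loop using scikit-learn. The file name, feature list, and choice of gradient boosting are assumptions for illustration, not a recommended production setup:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 1. Training data: historical transactions labeled 1 = fraud, 0 = legitimate.
df = pd.read_csv("labeled_transactions.csv")  # hypothetical file
features = ["amount", "account_age_days", "orders_last_24h", "geo_mismatch"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["is_fraud"], test_size=0.2, stratify=df["is_fraud"]
)

# 2. Learning: fit a model that finds patterns correlated with fraud.
model = GradientBoostingClassifier().fit(X_train, y_train)

# 3. Scoring: output a fraud probability for new transactions.
probabilities = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probabilities))

# 4. Feedback loop: periodically retrain as chargeback outcomes arrive.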

Types of ML Models

Supervised learning:

  • Learns from labeled examples (this was fraud, this wasn't)
  • Most common approach for fraud scoring
  • Requires clean, labeled historical data

Unsupervised learning:

  • Identifies anomalies without labels
  • Useful for catching new fraud types
  • Higher false positive rate

Neural networks:

  • Can find complex, non-linear patterns
  • "Black box" - harder to explain why a score was assigned
  • Requires large amounts of data

ML: Pros and Cons

Pros:

  • Adaptive - learns from new fraud patterns automatically
  • Scalable - handles millions of transactions without manual rule updates
  • Pattern recognition - catches subtle correlations humans miss
  • Continuous improvement - gets better with more data

Cons:

  • Black box - hard to explain individual decisions
  • Data requirements - needs substantial labeled data to train
  • Cold start problem - poor performance until enough data is collected
  • Can learn biases from historical data

Combining Rules and ML

The best fraud prevention systems use both approaches:

Transaction arrives
        ↓
Rules evaluate (known patterns)
        ↓
ML model evaluates (complex patterns)
        ↓
Scores combined
        ↓
Decision + explanation

Why both?

  • Rules catch known, obvious fraud patterns instantly
  • ML catches emerging patterns and subtle signals
  • Rules provide explainability when ML triggers
  • ML reduces rule maintenance burden

Cold Start Strategy

When launching or with limited data:

  1. Use rules more heavily at launch - they work immediately without training data
  2. Slowly let ML carry more weight as you accumulate labeled outcomes
  3. Don't turn off rules just because you add ML - use your rule hits as labeled inputs to the model
  4. Feed chargeback/fraud outcomes back to improve ML over time

Example Combined System

Rule: Shipping to known fraud address    → +70 points
Rule: Email domain is disposable         → +20 points
Rule: Customer has 3+ successful orders  → -15 points
ML score: 0.35 (35% fraud probability)   → +35 points
                                           ___________
Final score:                               110 points → DECLINE
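
A sketch of one way to combine the two signals. The ×100 scaling of the ML probability and the ml_weight knob (useful for the cold-start weighting above) are assumptions, not a standard formula:

def combined_score(rule_points, ml_probability, ml_weight=1.0):
    """Add rule adjustments to a scaled ML probability.

    rule_points:    sum of triggered rule adjustments (can be negative)
    ml_probability: model output in [0, 1]
    ml_weight:      start low at launch (cold start), raise as labels accumulate
    """
    return rule_points + ml_weight * (ml_probability * 100)

score = combined_score(rule_points=70 + 20 - 15, ml_probability=0.35)
decision = "DECLINE" if score >= 80 else "REVIEW" if score >= 40 else "APPROVE"
print(score, decision)  # 110.0 DECLINE (thresholds are illustrative)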

Setting Thresholds

Your threshold strategy depends on:

Factor              | Lower thresholds (stricter)     | Higher thresholds (looser)
Margin              | Low margin (can't absorb fraud) | High margin (can absorb some fraud)
Product             | Physical goods (lost forever)   | Digital (can revoke access)
Chargeback ratio    | Near network thresholds         | Comfortable buffer
Customer experience | Less important                  | Critical to business
Review capacity     | Large review team               | Limited/no review team

The Trade-Off Curve

Your thresholds represent a choice of operating point on the ROC (Receiver Operating Characteristic) curve:

  • Lower threshold = catch more fraud, but also decline more good customers
  • Higher threshold = approve more good customers, but also let through more fraud

There's no "correct" threshold - it depends on what your business can tolerate. Metrics like precision, recall, and F1 help you compare operating points (AUC compares models independently of any threshold), but ultimately it's a business decision.

Three-Tier Strategy

Tier 1: Auto-approve (low scores)

  • Fast customer experience
  • No manual intervention
  • Accept some fraud slippage

Tier 2: Manual review (middle scores)

  • Human evaluates ambiguous cases
  • Can request additional verification
  • Higher operational cost

Tier 3: Auto-decline (high scores)

  • Block obvious fraud
  • Some false positives (lost good customers)
  • Can offer alternative payment methods

Finding Your Thresholds

Those "approve below 40, decline above 70" recommendations are someone else's guess. Here's how to find yours:

1. Calculate your cost of false positive:

Average order value × Gross margin × Probability customer never returns

If your AOV is $100, margin is 30%, and 50% of blocked customers never return: $100 × 0.3 × 0.5 = $15 per false positive

2. Calculate your cost of fraud:

Average fraud amount + Chargeback fee + Operational cost

If average fraud is $150, CB fee is $25, ops cost is $10: $185 per fraud

3. Find the break-even: At what threshold does the cost of false positives equal the cost of fraud prevented?

4. Test your hypothesis: Set thresholds based on your calculation. Run for 30 days. Measure actual costs. Adjust.
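
One way to work step 3, treating a well-calibrated score as a fraud probability: blocking pays off when the expected fraud prevented exceeds the expected good revenue lost. Using the example numbers above:

cost_false_positive = 100 * 0.30 * 0.50  # AOV x margin x P(never returns) = $15
cost_fraud = 150 + 25 + 10               # avg fraud + chargeback fee + ops cost = $185

# Blocking is worth it when: p * cost_fraud > (1 - p) * cost_false_positive.
# Solving for p gives the break-even fraud probability:
break_even_p = cost_false_positive / (cost_false_positive + cost_fraud)
print(round(break_even_p, 3))  # ~0.075 -> with these numbers, blocking breaks even around 7.5%

With these example numbers the break-even probability is surprisingly low, which is why the calibration check in the next section matters.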

Where This Can Fool You
  • Score calibration: A score of 80 should mean roughly 80% of those transactions turn out to be fraud. Check whether yours does (a quick check is sketched after this list). Many vendor scores aren't well-calibrated.
  • Score drift: Model performance degrades over time. Re-test quarterly.
  • Feedback loops: If you never tell the model what was actually fraud, it gets stale. Make sure chargeback outcomes flow back.
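
A quick calibration check: bucket historical scored transactions and compare each bucket's average score to its observed fraud rate. A sketch (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv("scored_outcomes.csv")  # hypothetical: columns score (0-100), is_fraud (0/1)
df["bucket"] = (df["score"] // 10) * 10  # 0-9, 10-19, ..., 90-99

calibration = df.groupby("bucket").agg(
    mean_score=("score", "mean"),
    observed_fraud_rate=("is_fraud", "mean"),
    n=("is_fraud", "size"),
)
# Well calibrated: observed_fraud_rate ~= mean_score / 100 in each bucket.
print(calibration)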

Key Metrics

Fraud detection rate (True Positive Rate / Recall): What percentage of actual fraud did you catch?

Fraud detected / Total fraud × 100

False positive rate: What percentage of good transactions were wrongly declined?

Good transactions declined / Total good transactions × 100

Precision: Of transactions you flagged as fraud, how many actually were?

True fraud flagged / All transactions flagged × 100

Review rate: What percentage of transactions go to manual review?

Transactions in review / Total transactions × 100

Ideal: High detection rate, low false positive rate, manageable review rate.
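
All four metrics fall out of a simple confusion-count tally; a sketch with illustrative counts:

def fraud_metrics(tp, fp, fn, tn, reviewed, total):
    """tp/fp/fn/tn = fraud caught, good blocked, fraud missed, good approved."""
    return {
        "detection_rate_pct": 100 * tp / (tp + fn),       # recall / true positive rate
        "false_positive_rate_pct": 100 * fp / (fp + tn),  # good transactions wrongly declined
        "precision_pct": 100 * tp / (tp + fp),            # flagged transactions that were fraud
        "review_rate_pct": 100 * reviewed / total,
    }

print(fraud_metrics(tp=90, fp=40, fn=10, tn=9860, reviewed=300, total=10000))
# detection 90.0%, false positive rate ~0.4%, precision ~69.2%, review rate 3.0%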

Building vs. Buying

Build your own:

  • Full control over rules and models
  • Can optimize for your specific fraud patterns
  • Requires data science expertise
  • Ongoing maintenance burden

Buy a solution:

  • Faster to implement
  • Vendor has consortium data (sees fraud across many merchants)
  • Less control over scoring logic
  • Per-transaction costs

Hybrid:

  • Use vendor for ML/consortium data
  • Layer your own rules on top
  • Best of both worlds for many merchants

Vendor Landscape

Note: This space evolves constantly. Evaluate vendors based on your specific stack, geography, and risk profile.

Category                   | Examples
Standalone fraud platforms | Forter, Riskified, Signifyd, SEON
Processor-integrated       | Stripe Radar, Adyen Risk, Checkout.com FDP
Identity/device            | Kount, ThreatMetrix, BioCatch
Rules engines              | Splunk, Datadog (DIY)

Next Steps

Just getting started with scoring?

  1. Use your processor's built-in scoring → Stripe Radar, Adyen Risk, etc.
  2. Define three buckets → Auto-approve, review, auto-decline
  3. Track your false positive rate → Customer complaints are the signal

Tuning your thresholds?

  1. Run the threshold sweep experiment (see top of page) → Data beats intuition
  2. Segment by transaction type → Different thresholds for different products
  3. Track fraud rate AND false positive rate → Optimize the tradeoff, not just one metric

Building custom scoring?

  1. Review rules vs. ML tradeoffs → Know when to use which
  2. Start with rules on known patterns → ML for novel detection
  3. Invest in feature engineering → Good features beat complex models

See Also