Inside a BNPL fraud score: the 300 ms budget and where it goes

A BNPL checkout approval has one constraint that shapes every architectural decision: the latency budget. The user has tapped “Pay Later,” and their thumb is already moving toward the confirm button. Somewhere between 150 and 300 milliseconds, a spinner stops feeling like “loading” and starts feeling like “something is wrong.”

Inside that window the risk engine must fetch a feature vector, score it, apply rule vetoes, and commit a decision your compliance team can defend twelve months later.
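The four steps in that sentence can be sketched as one function. This is a minimal sketch, not the production engine: `Decision`, `decide`, `velocity_rule`, the `velocity_1h` feature, and the 0.8 threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical types for illustration; none of these names come from a real library.
Features = dict[str, float]
Rule = Callable[[Features], Optional[str]]  # returns a veto reason, or None


@dataclass(frozen=True)
class Decision:
    approved: bool
    score: float
    vetoes: list[str] = field(default_factory=list)


def decide(features: Features,
           model: Callable[[Features], float],
           rules: list[Rule],
           threshold: float = 0.8) -> Decision:
    """Score the vector, then let any rule veto an otherwise approved order."""
    score = model(features)
    vetoes = [reason for rule in rules if (reason := rule(features)) is not None]
    # Approve only if the score is below the (assumed) threshold and no rule vetoed.
    approved = score < threshold and not vetoes
    return Decision(approved=approved, score=score, vetoes=vetoes)


def velocity_rule(f: Features) -> Optional[str]:
    # Assumed rule: more than 5 orders in the last hour is an automatic veto.
    return "velocity_1h > 5" if f.get("velocity_1h", 0) > 5 else None
```

Keeping the rules as plain callables that return a reason string means the list of vetoes can go straight into the audit record.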

#Where the 300 ms goes

| Step | Budget | Bottleneck |
| --- | --- | --- |
| Feature fetch (Redis) | 5–12 ms | Network RTT + serialisation |
| Feature fetch (Postgres) | 8–25 ms | Index scan + connection pool |
| Model inference | 2–8 ms | Serialised feature vector |
| Rule engine | < 1 ms | In-memory |
| Audit write (async) | off critical path | Kafka producer |
| Network + overhead | 10–20 ms | Service mesh, TLS |
| Total p99 budget | < 300 ms | |

The model is not the bottleneck. The feature fetch is.
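One way to confirm where the time goes is to time each stage inside the request. The sketch below uses only the standard library; `StageTimer` and the stage names are hypothetical, and a real service would more likely export these numbers to its metrics system than keep them in a dict.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Records per-stage wall-clock time against a total budget (hypothetical helper)."""

    def __init__(self, budget_ms: float):
        self.budget_ms = budget_ms
        self.stages: dict[str, float] = {}
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, name: str):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.stages[name] = (time.perf_counter() - t0) * 1000.0

    def remaining_ms(self) -> float:
        elapsed = (time.perf_counter() - self._start) * 1000.0
        return self.budget_ms - elapsed
```

Usage would look like `with timer.stage("feature_fetch"): ...`, after which `timer.remaining_ms()` tells the next stage how much of the 300 ms is left.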

#The feature staleness tradeoff

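A Redis lookup costs ~1–3 ms on the server (5–12 ms in the budget above once network round trip and serialisation are included) and returns a feature vector that is only as fresh as the upstream stream processor. If the user made a transaction 4 seconds ago that would change their velocity feature, and the Kafka consumer lag is 6 seconds, the model scores a stale vector that does not yet include that transaction.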

A Postgres lookup costs 8–25 ms and returns fresh data — but at p99, under connection pool pressure, it can spike to 80 ms and breach the SLO.

Choosing which features are read from Redis (fast, slightly stale) and which from Postgres (slow, fresh) is the primary engineering decision in a real-time risk system. It is not a data science decision. It is a latency-vs-staleness tradeoff that only an engineer who has seen both p99s can make correctly.
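One way to make that choice explicit is to declare, per feature, how much staleness it tolerates and route on the current stream lag. This is a sketch under assumptions: `FeatureSpec`, `route`, the feature names, and the tolerance values are hypothetical, and a real system might fix the routing at deploy time rather than per request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    name: str
    max_staleness_s: float  # how stale this feature may be before the score degrades


def route(specs: list[FeatureSpec], stream_lag_s: float) -> tuple[list[str], list[str]]:
    """Split features into Redis (lag is tolerable) and Postgres (needs a fresh read)."""
    redis = [s.name for s in specs if s.max_staleness_s >= stream_lag_s]
    postgres = [s.name for s in specs if s.max_staleness_s < stream_lag_s]
    return redis, postgres


# Hypothetical features: account age barely moves, short-window velocity does.
SPECS = [
    FeatureSpec("account_age_days", max_staleness_s=86_400),
    FeatureSpec("velocity_1m", max_staleness_s=2),
]
```

With a 6-second consumer lag, `route(SPECS, 6.0)` sends `account_age_days` to Redis and `velocity_1m` to Postgres, so only the features that actually need freshness pay the 8–25 ms.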

The math is not complicated:

$$\text{expected loss from staleness} = P(\text{fraud} \mid \text{stale feature}) \times \text{loss per fraud} - P(\text{fraud} \mid \text{fresh feature}) \times \text{loss per fraud}$$

If the staleness window is 6 seconds and the velocity feature changes meaningfully within 6 seconds for fewer than 0.1% of sessions, the two probabilities are nearly equal, and the expected loss from staleness is likely smaller than the cost of the p99 latency added by going to Postgres.
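The comparison can be written down directly from the formula above. This is a sketch: the function names are hypothetical, and `latency_cost` (the per-session cost of the extra Postgres latency, in the same currency as the fraud loss) is an input you would have to estimate yourself, since the note does not define it.

```python
def staleness_loss(p_fraud_stale: float, p_fraud_fresh: float, loss_per_fraud: float) -> float:
    """Expected extra loss per session from scoring a stale feature instead of a fresh one."""
    return (p_fraud_stale - p_fraud_fresh) * loss_per_fraud


def prefer_redis(p_fraud_stale: float, p_fraud_fresh: float,
                 loss_per_fraud: float, latency_cost: float) -> bool:
    """True when the staleness loss is smaller than the assumed cost of the added latency."""
    return staleness_loss(p_fraud_stale, p_fraud_fresh, loss_per_fraud) < latency_cost
```

With made-up inputs, a fraud probability of 0.0021 with the stale feature against 0.0020 with the fresh one and a loss of 150 per fraud gives roughly 0.015 per session. Redis wins whenever the latency cost per session is estimated above that figure.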

#What the audit trail requires

Every decision must be auditable: which model version, which features, which rules fired, what the score was, what the outcome was. This sounds obvious until you realise two things: once you store the feature vector at decision time, you own it, and if a regulator asks eighteen months later why a user was declined, you need to reconstruct the exact state of the world at that moment.

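The audit write is off the critical path: the Kafka producer sends asynchronously and the request does not wait on it, while acknowledgements and retries give at-least-once delivery, so the consumer has to tolerate duplicate records. The consumer writes to an append-only Postgres table. The feature snapshot is stored as JSONB alongside the decision.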
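A minimal sketch of the record that would travel through that pipeline, using only the standard library. The `AuditRecord` class and its field names are assumptions for illustration, not the author's data model. The `decision_id` is included so a consumer can discard duplicates that at-least-once delivery may produce.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit payload: everything needed to explain one decision later."""
    model_version: str
    features: dict          # snapshot of the exact vector that was scored
    rules_fired: list
    score: float
    outcome: str
    # Unique key so a consumer can deduplicate redelivered messages.
    decision_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    decided_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Sorted keys keep the serialised form stable across producers.
        return json.dumps(asdict(self), sort_keys=True)
```

The JSON string is what the producer would send; the `features` object maps naturally onto a JSONB column.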

This is a working note. A longer post with the full data model, the rule engine architecture, and the false-positive economics is in progress.

