We ship transaction categorization that hits 99% accuracy on production data. This post explains how it works under the hood. Fair warning: it is written for engineers and curious power users.
The naive approach (and why it fails)
The obvious solution is to take a bank transaction's merchant name, send it to GPT-4 with "What category is this?", and store the answer. We tried that. It works for about 85% of transactions and falls apart on the long tail.
Generic merchant names like "PAYPAL *MERCHANT" or "SQ *CORNER STORE" carry almost no signal. Regional chains like HEB, Wawa, and Publix are often missed by a model that has not seen enough data on US regional chains. Recurring charges like "ZELLE TRANSFER TO J SMITH" need context the model does not have on its own.
So we built three layers.
Layer 1: Merchant alias cache
The first time a user re-categorizes a transaction, we store a row in MerchantAlias:
```json
{
  "merchantPattern": "PAYPAL *FOO BAR",
  "categoryId": "food-delivery",
  "userId": "..."
}
```
The next time any transaction matches that pattern (regex or prefix), we skip the AI entirely. After a few months of normal usage, more than 70% of transactions never touch the AI. They hit the cache.
That single optimization made per-user AI cost economical at the price points we wanted to charge.
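To make the lookup concrete, here is a minimal TypeScript sketch of a cache check that runs before any AI call. It is an illustration, not the production code: the `matchType` field and the `lookupAlias` function are hypothetical, and scoping aliases to the user who created them is an assumption based on the `userId` column.

```typescript
// Hypothetical shape of a MerchantAlias row. `matchType` is an assumed field
// recording whether the pattern is a literal prefix or a regular expression.
interface MerchantAlias {
  merchantPattern: string;
  categoryId: string;
  userId: string;
  matchType: "prefix" | "regex";
}

// Returns the cached category for a merchant string, or null on a cache miss.
// A null result means the transaction would fall through to the AI layers.
function lookupAlias(
  aliases: MerchantAlias[],
  userId: string,
  merchantName: string
): string | null {
  for (const alias of aliases) {
    // Assumption: aliases only apply to the user who created them.
    if (alias.userId !== userId) continue;

    const matched =
      alias.matchType === "prefix"
        ? // Literal comparison, so "*" in "PAYPAL *FOO BAR" is not a wildcard.
          merchantName.slice(0, alias.merchantPattern.length) === alias.merchantPattern
        : // Compiled per call for brevity; a real cache would precompile.
          new RegExp(alias.merchantPattern).test(merchantName);

    if (matched) return alias.categoryId;
  }
  return null;
}
```

Keeping prefix and regex patterns distinct matters here because bank strings often contain characters such as `*` that have a special meaning in a regular expression.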
Layer 2: Multi-provider routing
We run three engines and pick one per transaction.
Claude Sonnet handles nuanced merchant names where reasoning matters. "Trader Joe's #347 ANYTOWN" should map to Groceries even though "347" looks like a random code.
GPT-4o covers the long tail with broad world knowledge for regional chains.
DeepSeek is the fast, cheap fallback when the other two are slow or rate-limited.
The router picks based on provider availability (health checks every 30 seconds), merchant name length (very short goes to DeepSeek; very long goes to Claude), and historical accuracy for similar patterns.
When all three are available, we A/B-test confidence scores in the background to keep the router calibrated.
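The routing rules above can be sketched in a few lines of TypeScript. Everything specific here is an assumption made for illustration: the length cut-offs (the post only says "very short" and "very long"), the `ProviderState` shape, and the choice to let the length heuristic take priority over historical accuracy.

```typescript
type Provider = "claude" | "gpt-4o" | "deepseek";

interface ProviderState {
  healthy: boolean; // result of the most recent health check
  historicalAccuracy: number; // 0..1, accuracy on similar merchant patterns
}

// Assumed thresholds, in characters. Real values would need tuning.
const SHORT_NAME_MAX = 8;
const LONG_NAME_MIN = 30;

// Picks a provider for one merchant name, or null if none is available.
function pickProvider(
  merchantName: string,
  state: Record<Provider, ProviderState>
): Provider | null {
  const all: Provider[] = ["claude", "gpt-4o", "deepseek"];
  const healthy = all.filter((p) => state[p].healthy);
  if (healthy.length === 0) return null;

  // Length heuristic: very short names go to DeepSeek, very long to Claude.
  const len = merchantName.trim().length;
  const preferred: Provider | null =
    len <= SHORT_NAME_MAX ? "deepseek" : len >= LONG_NAME_MIN ? "claude" : null;
  if (preferred !== null && state[preferred].healthy) return preferred;

  // Otherwise use the healthy provider with the best historical accuracy.
  // Ties keep the provider that appears first in `all`.
  return healthy.reduce((best, p) =>
    state[p].historicalAccuracy > state[best].historicalAccuracy ? p : best
  );
}
```

Filtering on health first means an unhealthy provider is never selected, even when the length heuristic would otherwise prefer it.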
Layer 3: Confidence scoring and bulk review wizard
The AI returns a category and a confidence score from 0 to 1. We sort each score into one of four buckets:
| Confidence | What we do |
|---|---|
| 0.9 and above | Auto-apply and surface in the audit log |
| 0.7 to below 0.9 | Auto-apply and prompt for review on the next session |
| 0.4 to below 0.7 | Stage for the bulk review wizard |
| Below 0.4 | Mark "needs human" and do not categorize |
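
As a rough illustration, the bucketing can be written as a single function. The action names are hypothetical labels. The sketch treats each lower bound as inclusive so that a score such as 0.895 cannot fall between two buckets; that boundary handling is an assumption.

```typescript
// Hypothetical labels for the four outcomes in the table above.
type ReviewAction =
  | "auto-apply-audit" // auto-apply and surface in the audit log
  | "auto-apply-review" // auto-apply and prompt for review next session
  | "stage-for-wizard" // stage for the bulk review wizard
  | "needs-human"; // leave uncategorized

function actionForConfidence(confidence: number): ReviewAction {
  // Rejects NaN and out-of-range scores instead of silently bucketing them.
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError("confidence must be between 0 and 1");
  }
  if (confidence >= 0.9) return "auto-apply-audit";
  if (confidence >= 0.7) return "auto-apply-review";
  if (confidence >= 0.4) return "stage-for-wizard";
  return "needs-human";
}
```

Checking the thresholds from highest to lowest keeps the ranges contiguous, so every valid score maps to exactly one action.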
The bulk review wizard is a four-step UI.
Step 1 is pre-flight: show what is about to be processed. Step 2 is the high-confidence batch: skim and override if needed. Step 3 is the low-confidence batch: this is where attention pays off. Step 4 is confirm and commit.
Most users move quickly through steps 1 and 2 and spend real time on step 3. That is exactly where human judgment matters most.
What went wrong (lessons)
Do not trust raw merchant strings. Banks format merchant names wildly differently. Normalize aggressively before sending to the AI: strip leading zeros, collapse whitespace, lowercase everything.
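A minimal sketch of those three normalization steps, assuming a hypothetical `normalizeMerchant` helper. The leading-zero rule here only touches digit runs that start a token (for example a store number after `#` or a space), which is one reasonable reading of "strip leading zeros" rather than a confirmed detail.

```typescript
// Normalizes a raw bank merchant string before it is sent to the AI.
function normalizeMerchant(raw: string): string {
  return raw
    .trim()
    // Collapse runs of spaces, tabs, and newlines into a single space.
    .replace(/\s+/g, " ")
    // Strip leading zeros from digit runs at the start of a token.
    // Digits after a letter or a "." are left alone, and a lone "0" is kept.
    .replace(/(^|[^\w.])0+(?=\d)/g, "$1")
    .toLowerCase();
}
```

If alias patterns are stored from raw strings, they would need the same normalization to keep matching.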
Cache aggressively but invalidate carefully. When a user re-categorizes a merchant alias, all of their historical transactions matching that alias should be re-evaluated. We learned that the hard way.
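One way to picture that re-evaluation step is a pass over the user's history that reassigns every matching transaction. This is a sketch under assumptions: `BankTransaction`, `AliasUpdate`, and `reapplyAlias` are hypothetical names, the pattern is treated as a literal prefix, and "re-evaluated" is interpreted as "moved to the alias's new category".

```typescript
interface BankTransaction {
  id: string;
  userId: string;
  merchantName: string;
  categoryId: string | null;
}

interface AliasUpdate {
  userId: string;
  merchantPattern: string; // treated as a literal prefix in this sketch
  categoryId: string;
}

// Returns updated copies of the transactions plus the ids that changed.
// The input array and its objects are not mutated.
function reapplyAlias(
  transactions: BankTransaction[],
  alias: AliasUpdate
): { updated: BankTransaction[]; changedIds: string[] } {
  const changedIds: string[] = [];
  const updated = transactions.map((tx) => {
    const matches =
      tx.userId === alias.userId &&
      tx.merchantName.slice(0, alias.merchantPattern.length) === alias.merchantPattern;
    // Skip other users, non-matching merchants, and rows already up to date.
    if (!matches || tx.categoryId === alias.categoryId) return tx;
    changedIds.push(tx.id);
    return { ...tx, categoryId: alias.categoryId };
  });
  return { updated, changedIds };
}
```

Returning the changed ids separately makes it straightforward to log or audit exactly which historical rows an alias change touched.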
Multi-provider matters more for uptime than for accuracy. One strong model is usually about as accurate as three models voting. The real win is that an outage at a single provider never takes categorization down.
Cost
Per-user monthly AI cost at full usage stays under $0.30. Merchant alias caching brings real AI calls down to about 30% of total transactions.
What this looks like to the user
See it in the demo. You will land on Transactions; click "AI Categorize" to run the pipeline described above. Suggestions come back in two to three seconds for a typical batch.
Built on Claude (Anthropic), GPT-4o (OpenAI), and DeepSeek. Prisma and Postgres for the alias cache. Next.js for the wizard UI.
Try WIMM today
The demo loads with realistic data and no signup. See what this article describes in action.