Why expense tracking is an AI-first category now
Expense tracking used to be a manual chore. With vision models, a user snaps a receipt and gets back the merchant, total, date, and category in under two seconds. This single flow is the entire retention loop — if the scan works, the app gets used. If the scan is slow or wrong, the app dies.
I have shipped vision-based input in another category (menu parsing) and the reliability tricks transfer directly.
What you actually need to build
- Camera capture: Snap a receipt. Autocrop to the receipt edges — this alone improves OCR accuracy a lot.
- OCR + parsing: One API call to a vision model returning structured JSON: merchant, amount, date, category. Use structured output mode, not free text.
- Expense list: Reverse chronological. Filter by month and category.
- Auto-category with override: The LLM picks a category. User taps to change. Store the override and weight the next call toward that category for the same merchant.
- Monthly summary: Total spent, per-category breakdown, a bar chart.
- Manual entry: For the 10% of cases where the receipt is unreadable.
No bank sync in v1. Plaid is a different business.
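The OCR + parsing step above returns structured JSON, but you should still validate it before writing rows. A minimal sketch of that validator — the field names, category list, and `validateReceipt` helper are my assumptions for illustration, not a fixed API:

```typescript
// Shape we ask the vision model to return via structured output mode.
// Field names here are illustrative assumptions.
interface ReceiptParse {
  merchant: string;
  amount: number; // receipt total
  date: string;   // ISO 8601, e.g. "2024-05-01"
  category: string;
}

const CATEGORIES = ["food", "transport", "travel", "shopping", "bills", "other"];

// Validate the model's JSON before trusting it. A hard failure routes the
// user to manual entry instead of writing a bad row.
function validateReceipt(raw: unknown): ReceiptParse | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.merchant !== "string" || r.merchant.trim() === "") return null;
  if (typeof r.amount !== "number" || !(r.amount > 0)) return null;
  if (typeof r.date !== "string" || isNaN(Date.parse(r.date))) return null;
  // An unknown category is recoverable — don't reject the whole scan.
  const category =
    typeof r.category === "string" && CATEGORIES.includes(r.category)
      ? r.category
      : "other";
  return { merchant: r.merchant.trim(), amount: r.amount, date: r.date, category };
}
```

The asymmetry is deliberate: a missing amount kills the scan, but a weird category just falls back to "other" and the user's tap-to-override fixes it.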
The stack I use
- React Native + Expo.
- expo-image-picker with autocrop.
- Supabase — expenses table, merchants table (for the learned-category cache), monthly summaries view.
- NestJS — the OCR endpoint. Takes the image, sends it to the vision model, returns structured JSON. Per-user quota, cached on image hash so the same receipt uploaded twice only costs once.
- Claude Code + 11 AI agents — scaffold the camera + expense list screens.
Real build time
With the boilerplate, 3 weekends.
- Camera + autocrop + upload: ~6 hours.
- OCR endpoint + structured output parsing: ~6 hours.
- Expense list + filter + manual entry: ~6 hours.
- Category override + merchant cache: ~4 hours.
- Monthly summary + chart: ~4 hours.
- Store submission: ~4 hours.
About 30 hours.
Where people get stuck
- Receipt photos at weird angles: If you skip autocrop and perspective correction, OCR accuracy drops to about 60%. With autocrop it is 90%+ on my tests. Use an open-source autocrop lib before the OCR call.
- Vision model costs: Each receipt scan costs real money. Cache aggressively on image hash — a user scanning the same receipt twice should not cost twice. Compress to 1024px on the long edge before upload; accuracy is unchanged and the cost drops to roughly a third.
- Category drift: An LLM will put "Uber" in transport on Monday and travel on Tuesday. Store the last human override per merchant and prepend it to the prompt. Drift drops hard.
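The category-drift fix above can be sketched as a prompt builder. The override map, table wording, and `buildCategoryPrompt` helper are illustrative assumptions — the point is only that the user's last override gets prepended:

```typescript
// Last human override per merchant, loaded from the merchants table
// (the learned-category cache). Seeded here for illustration.
const overrides = new Map<string, string>([["Uber", "transport"]]);

function buildCategoryPrompt(merchant: string, categories: string[]): string {
  const lines = [
    `Categorize this expense from "${merchant}".`,
    `Allowed categories: ${categories.join(", ")}.`,
  ];
  const prior = overrides.get(merchant);
  if (prior !== undefined) {
    // Prepend the user's prior choice so the model stays consistent
    // across scans instead of re-deciding from scratch each time.
    lines.unshift(
      `The user previously filed "${merchant}" under "${prior}"; prefer that unless the receipt clearly contradicts it.`
    );
  }
  return lines.join("\n");
}
```

One override per merchant is enough; you are not training anything, just pinning the model's prior to the user's last decision.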
Skip the setup
Camera + autocrop + upload pipeline, OCR endpoint with structured output and hash cache, expenses schema — pre-wired. The 11 AI agents scaffold the capture and list screens.
See pricing