Build a Podcast App with AI — Transcripts + AI Highlights

Native podcast app built with AI help. React Native + react-native-track-player, background audio, cached ASR transcripts, AI highlights. Real build time from a developer who shipped 4 apps in a month.


Stack highlights

React Native + Expo · react-native-track-player · Global ASR cache · Episode-hash invalidation · AI highlights

Why a podcast app needs ASR before it needs an algorithm

The AI feature that matters in a podcast app is not recommendations. It is the transcript. A searchable transcript with AI-generated chapter highlights turns a 60-minute episode into a 2-minute skim, and that is the feature people pay for. Everything else is a standard audio player.

I have already shipped audio playback inside a meditation-adjacent experiment, and the background audio plumbing carries over. ASR is the new piece, and it is the cost center.

What you actually need to build

  • Podcast feed parsing: Import via RSS. Same RSS story as a news reader, different endpoint.
  • Audio player: Play, pause, seek, speed, background mode, lock-screen controls.
  • Transcript panel: Show transcript synced with playback. Tap a line to seek.
  • AI highlights: 5 bullet-point highlights per episode, generated once from the transcript, cached globally per episode.
  • Download for offline: Users listen on planes and the subway. Background download.
  • Subscribe + new episode push: One push per subscribed show when a new episode drops.

Do not build a discovery algorithm in v1. A search bar and a "trending" list derived from episode counts beat a homegrown recommender.
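The transcript panel's sync-and-seek behavior comes down to one lookup: given the current playback position, find the transcript segment that contains it. A minimal sketch, assuming a hypothetical `Segment` shape with start/end times in seconds (the names here are illustrative, not a fixed API):

```typescript
// Hypothetical shape for one line of a cached transcript.
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Binary search over time-sorted segments: find the segment containing
// the playback position. The transcript panel highlights this line on
// each progress tick; tap-to-seek is just player.seekTo(segment.start).
function activeSegment(segments: Segment[], positionSec: number): Segment | null {
  let lo = 0;
  let hi = segments.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const s = segments[mid];
    if (positionSec < s.start) hi = mid - 1;
    else if (positionSec >= s.end) lo = mid + 1;
    else return s;
  }
  return null; // position falls outside every segment (e.g. past the end)
}
```

Binary search keeps the per-tick cost negligible even for a 3-hour episode with thousands of segments, so you can run it on every progress event without throttling.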

The stack I use

  • React Native + Expo.
  • react-native-track-player for audio playback with full lock-screen controls.
  • Backend RSS ingest + audio URL extraction.
  • ASR: Whisper API (OpenAI) or a cheaper provider. One call per episode, globally cached. Batch so that any repeat request hits cache, not the ASR.
  • NestJS — transcript endpoint, highlights endpoint. Both cached on episode_hash.
  • Claude Code + 11 AI agents — scaffold the player, transcript, and highlights screens.

Real build time

With the boilerplate, 4 weekends.

  • RSS ingest + episodes table: ~6 hours.
  • Player + background audio + lock screen: ~10 hours.
  • Transcript panel + tap-to-seek: ~6 hours.
  • ASR pipeline + highlights generation + cache: ~8 hours.
  • Download-for-offline: ~4 hours.
  • Store submission: ~4 hours.

About 38 hours.

Where people get stuck

  • ASR cost runaway: Transcribing every episode on demand scales with DAU. Batch ASR calls so the same episode is only transcribed once globally. At Whisper API prices that is roughly $0.36 per hour of audio, paid once per episode, not per listener. The cache hit ratio on popular shows trends toward 99%.
  • Lock-screen controls showing nothing: If you use expo-av alone you get barebones lock-screen controls. react-native-track-player gives you proper artwork, title, and scrubbing.
  • Transcript drift after edits: Podcasters re-upload edited episodes with the same URL. Store a hash of the audio file and invalidate the transcript when the hash changes.
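The hash-based invalidation above can be sketched in a few lines. This assumes you store the audio hash alongside the cached transcript; the function names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Hash the downloaded audio bytes. If a podcaster re-uploads an edited
// file at the same URL, the bytes change, the hash changes, and the
// stale transcript (and its highlights) get regenerated.
function episodeHash(audio: Buffer): string {
  return createHash("sha256").update(audio).digest("hex");
}

// Compare the freshly downloaded audio against the hash stored with
// the cached transcript. No stored hash means no transcript yet,
// which we also treat as stale.
function transcriptIsStale(storedHash: string | null, audio: Buffer): boolean {
  return storedHash === null || storedHash !== episodeHash(audio);
}
```

Hashing the audio bytes rather than trusting the RSS `<pubDate>` or enclosure URL is the point: edited re-uploads routinely keep both unchanged.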

Skip the setup

RSS ingest, player + lock-screen scaffold, ASR pipeline with global cache, transcript and highlights endpoints — pre-wired. The 11 AI agents scaffold the listening and transcript screens.

See pricing

Skip the setup. Start shipping.

Every piece of the stack above is pre-configured in Shippen. 11 AI agents scaffold the rest.
