Give an AI agent a browser and your tokens disappear instantly.
One button click via Playwright MCP: 12,891 characters consumed. One screenshot: 15,000 tokens gone. There are reports of a 5-hour token allocation wiped out after just a few automation steps.
Why does this happen? And how did Vercel solve it?
Here's the backstory on Vercel's open-source tool agent-browser — why it exists, what philosophy drives it, and how it works.
agent-browser is an open-source CLI from Vercel. It's built for AI agents to control a browser directly.
Traditional tools like Playwright and Puppeteer are designed for humans writing scripts. CSS selectors, XPath, DOM structure — the human understands the page and writes the code.
agent-browser has a different premise entirely. The user isn't a human — it's an AI agent. Claude Code, Cursor, GitHub Copilot, OpenAI Codex — AI coding agents that control a browser with a single shell command.
One line summary: a browser remote control built for AI agents.
| | Legacy (Playwright, Puppeteer) | Agent-First (agent-browser) |
|---|---|---|
| User | Human (developer) | AI agent |
| Approach | Write scripts manually with CSS selectors, XPath | Control the browser directly with a single shell command |
When an AI agent builds a frontend, someone needs to verify the result. Open a browser, click around, check it works. If a human does this, it's a bottleneck.
That's where the concept of a self-verifying loop came from. AI writes the code, opens a browser to test it, fixes problems on its own. A closed loop.
To run that loop, the AI agent needs a browser automation tool. The first attempt was Playwright MCP.
The problem was severe.
Playwright MCP returns the entire accessibility tree of the page for every action — click, type, scroll. A typical webpage has 3,000+ DOM nodes. Click one button and all 3,000 nodes with every attribute come back in the response.
The numbers are clear.
GitHub issue #889 also reported a 6x token usage increase between Playwright MCP versions.
An AI agent's context window is finite. If browser automation eats up the context, there's no room left for actual reasoning. The self-verifying loop burns through tokens in one or two cycles.
Root cause: taking a tool built for humans and forcing it on AI.
Before solving this, Vercel ran an interesting internal experiment.
While building D0, a text-to-SQL agent, they tested the relationship between number of tools and performance. The result was counterintuitive.
| | More Tools | Less Is More |
|---|---|---|
| Toolset | 17 specialized tools | 2 general-purpose tools |
| Success rate | 80% | 100% |
| Avg tokens | 102K | 61K |
Cut tools from 17 to 2. Success rate went from 80% to 100%. Token usage dropped about 40%, from 102K to 61K.
Vercel's conclusion was clear.
"We were constraining reasoning because we didn't trust the model to reason."
The assumption that more tools means better AI performance — wrong. More tools means AI spends tokens deciding "which tool should I use?" instead of actually solving the problem.
This philosophy runs through agent-browser's entire design.
The token efficiency isn't magic. The principle makes complete sense.
When a browser renders a page, it builds two trees internally.
The DOM Tree is what we know — the full HTML structure.
```html
<div class="nav-wrapper mx-4 flex items-center">
  <button class="btn-primary-v2 px-6 py-3 rounded-lg
                 text-white font-semibold hover:bg-blue-600"
          id="sign-in-btn-2024"
          data-testid="auth-cta"
          aria-label="Sign In">
    Sign In
  </button>
</div>
```
Class names, IDs, data attributes, style info — everything needed for visual rendering.
The Accessibility Tree is the tree used by screen readers for visually impaired users. It extracts only meaning from the DOM.
```
button "Sign In"
```
16 characters. Class names, ID, styles all gone. Just "this is a button and it says Sign In."
From an AI agent's perspective this difference is decisive. To click "the login button," you don't need to read 200 characters of DOM. Role (button) and name (Sign In) is enough.
There's another important advantage. CSS selectors break when the UI changes; the accessibility tree is meaning-based, so it doesn't. If `btn-primary-v2` gets renamed to `button-main`, the accessibility tree still shows `button "Sign In"`.
Originally built for the visually impaired. Turns out it's also the optimal web representation for AI agents. Both can't see the screen — same situation, same solution.
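The selector-breakage point can be made concrete with a toy sketch. The element shape below is invented for illustration (a real DOM node is far richer); it only shows why a class-based lookup breaks on a rename while a role-plus-name lookup survives:

```typescript
// Toy stand-in for a DOM element; invented shape, for illustration only.
interface El {
  tag: string;       // maps to the accessibility role for simple elements
  className: string; // presentation detail, free to change in a refactor
  text: string;      // maps to the accessible name
}

// The same button before and after a CSS refactor.
const before: El = { tag: "button", className: "btn-primary-v2", text: "Sign In" };
const after: El  = { tag: "button", className: "button-main",    text: "Sign In" };

// CSS-style lookup: coupled to presentation.
const byClass = (el: El, cls: string): boolean => el.className === cls;

// Accessibility-style lookup: coupled to meaning (role + name).
const byRole = (el: El, role: string, name: string): boolean =>
  el.tag === role && el.text === name;
```

`byClass(before, "btn-primary-v2")` finds the button, but the same call against `after` fails; `byRole(el, "button", "Sign In")` matches both versions.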
agent-browser doesn't use the accessibility tree as-is. It compresses one step further. That's the Snapshot + Refs system.
Step 1: Extract Accessibility Tree
Call Playwright's `page.accessibility.snapshot()` API to pull the browser's internal accessibility tree.
Step 2: Filter interactive elements only
From the full tree, keep only elements that can be clicked or typed into: elements with `onclick` handlers, `cursor: pointer` styles, or `tabindex` attributes.
Step 3: Assign refs
Give each filtered element a short reference ID: @e1, @e2, @e3...
```
- button "Sign In" [ref=e1]
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- link "Forgot Password" [ref=e4]
```
Step 4: Cache the ref map
Store this mapping in the daemon's memory (`BrowserManager.refMap`).
```
@e1 → { role: "button", name: "Sign In" }
@e2 → { role: "textbox", name: "Email" }
```
When `click @e1` arrives, the daemon looks up `@e1` in refMap, converts it to Playwright's `getByRole("button", { name: "Sign In" })`, and clicks the element.
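The four steps can be sketched in a few lines of TypeScript. This is a simplified model, not agent-browser's actual code; the node shape, the interactive-role list, and the function names are assumptions:

```typescript
// Simplified accessibility-tree node; the real tree carries more fields.
interface A11yNode {
  role: string;
  name?: string;
  children?: A11yNode[];
}

// Illustrative subset of roles treated as interactive.
const INTERACTIVE = new Set(["button", "textbox", "link", "checkbox", "combobox"]);

// Steps 2-4: filter the tree, assign @eN refs in order, cache the ref map.
function buildSnapshot(tree: A11yNode): { lines: string[]; refMap: Map<string, A11yNode> } {
  const lines: string[] = [];
  const refMap = new Map<string, A11yNode>();
  let n = 0;
  const walk = (node: A11yNode): void => {
    if (INTERACTIVE.has(node.role)) {
      const ref = `e${++n}`;
      refMap.set(`@${ref}`, node);
      lines.push(`- ${node.role} "${node.name ?? ""}" [ref=${ref}]`);
    }
    node.children?.forEach(walk);
  };
  walk(tree);
  return { lines, refMap };
}

// Resolving "click @e1": look up the cached node; in the real tool the
// role + name pair then becomes a Playwright getByRole() locator.
function resolve(refMap: Map<string, A11yNode>, ref: string): A11yNode {
  const node = refMap.get(ref);
  if (!node) throw new Error(`stale ref ${ref}; take a fresh snapshot`);
  return node;
}
```

Only the short `lines` output ever enters the agent's context; the full node objects stay server-side in `refMap`.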
The pipeline at a glance:

```
A11y Tree Extract      (page.accessibility.snapshot())
  → Interactive Filter (onclick, pointer, tabindex detection)
  → Ref Assignment     (@e1, @e2, @e3... assigned in order)
  → RefMap Cache       (stored in BrowserManager memory)
  → click @e1 → Done
```
Result: instead of the full page DOM, only a 4-line snapshot goes into context. Click response is one word: Done. That's where the 93% token reduction versus Playwright MCP comes from.
The full architecture splits into three layers.
- Rust CLI (~1ms): command parsing, flag extraction, IPC connection
- Node.js daemon (persistent process): command validation (Zod), ref resolution, session/state management
- Chromium (browser engine): page rendering, JS execution, a11y tree generation
AI agents run hundreds of browser commands per task. Launching a fresh browser each time costs 2–5 seconds. 100 commands = 200–500 seconds just on browser startup.
So the daemon keeps the browser alive. The CLI only handles command parsing. First command starts the daemon automatically. Subsequent commands connect to the already-running daemon.
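The daemon-plus-thin-client pattern can be sketched with Node's `net` module over a Unix socket. The socket path, message shape, and counter below are illustrative assumptions; agent-browser's real protocol and daemon lifecycle are more involved:

```typescript
import * as net from "net";
import * as os from "os";
import * as path from "path";

// Illustrative socket path, not agent-browser's actual one.
const SOCK = path.join(os.tmpdir(), `ab-demo-${process.pid}.sock`);

// Daemon side: a long-lived process. State declared here (like a ref map)
// survives across commands; the counter just makes that visible.
function startDaemon(): Promise<net.Server> {
  let commandCount = 0; // persists between client connections
  return new Promise((resolve) => {
    const server = net.createServer((conn) => {
      conn.on("data", (buf) => {
        // Assumes the tiny command fits in one packet (fine for a sketch).
        JSON.parse(buf.toString());
        commandCount += 1;
        conn.end(JSON.stringify({ result: "Done", seq: commandCount }));
      });
    });
    server.listen(SOCK, () => resolve(server));
  });
}

// Client side, the role the Rust CLI plays: connect, send one JSON
// command, read one JSON reply, disconnect.
function sendCommand(action: string, selector: string): Promise<{ result: string; seq: number }> {
  return new Promise((resolve, reject) => {
    let data = "";
    const conn = net.createConnection(SOCK, () => {
      conn.write(JSON.stringify({ action, selector, options: {} }));
    });
    conn.on("data", (chunk) => { data += chunk.toString(); });
    conn.on("end", () => resolve(JSON.parse(data)));
    conn.on("error", reject);
  });
}
```

Because the browser (here, just the counter) lives in the daemon, each command pays only socket round-trip cost instead of a 2-5 second browser launch.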
Run `agent-browser click @e1` and here's what happens internally.
Step 1: Rust CLI parsing (~1ms)
A native binary written in Rust parses the command. Extracts `click` as the action and `@e1` as the selector, connects to the daemon via Unix socket (macOS/Linux) or TCP (Windows).
Step 2: JSON message sent
Converts the parsed command to JSON and sends it through the socket.
```json
{
  "action": "click",
  "selector": "@e1",
  "options": {}
}
```
Step 3: Daemon processing
The Node.js daemon receives the JSON, validates it with a Zod schema, looks up @e1 in refMap to get the Playwright locator, and runs the actual click.
Step 4: Response returned
Done
4 characters. Playwright MCP would retransmit the entire accessibility tree — 12,891 characters. agent-browser doesn't do that. If the agent wants to see the current state, it runs snapshot explicitly. Fetch only when you need it.
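The daemon's handling of that message can be sketched as below. agent-browser validates with Zod; to keep this sketch dependency-free the schema check is hand-rolled, and the command set and response strings are simplified assumptions:

```typescript
// Commands this sketch understands; a subset, for illustration.
type Action = "click" | "type" | "snapshot";

interface Command {
  action: Action;
  selector?: string;
  options?: Record<string, unknown>;
}

// Stand-in for the Zod schema: reject malformed input before acting on it.
function parseCommand(raw: string): Command {
  const msg = JSON.parse(raw);
  if (!["click", "type", "snapshot"].includes(msg.action)) {
    throw new Error(`unknown action: ${msg.action}`);
  }
  if (msg.action !== "snapshot" && typeof msg.selector !== "string") {
    throw new Error(`${msg.action} requires a selector`);
  }
  return msg as Command;
}

// Dispatch: resolve the ref and act. The response carries no page state;
// the agent runs an explicit snapshot when it wants to look again.
function handle(raw: string, refMap: Map<string, string>): string {
  const cmd = parseCommand(raw);
  if (cmd.action === "snapshot") {
    return [...refMap].map(([ref, desc]) => `- ${desc} [ref=${ref.slice(1)}]`).join("\n");
  }
  if (!refMap.has(cmd.selector!)) {
    return `Error: stale ref ${cmd.selector}; take a fresh snapshot`;
  }
  // Real tool: refMap lookup → Playwright getByRole() → click/type.
  return "Done";
}
```

Note the asymmetry: actions return a one-word result, and only `snapshot` pays the cost of serializing page state.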
```
$ agent-browser click @e1
  → Rust CLI parsing      (extract click + @e1, connect socket)
  → JSON sent             ({ action: "click", selector: "@e1" })
  → Daemon processing     (Zod validate → refMap lookup → Playwright click)
  → Response returned     (result only, no full tree retransmission)
Done
```
Browser automation frequently needs to handle logged-in state. agent-browser provides three levels of state persistence.
| Level | Method | Storage scope |
|---|---|---|
| Ephemeral | Default, no storage | Gone when command ends |
| Session | `--session-name` option | Cookies, localStorage saved |
| Profile | `--profile` option | IndexedDB, service workers, everything saved |
Log into Google once. Next command, still logged in. That's the critical difference in real usage.
The real value isn't the technical spec. It's what you can actually do. Everything a human did in a browser, an AI agent can now do instead.
agent-browser has a built-in dogfood skill. An AI agent navigates your web app like a real user, finds bugs, takes screenshots, and produces a markdown report with reproduction steps.
Exploratory testing that takes a human half a day — the agent does it autonomously.
Session and profile features mean you can scrape sites that require login. Authenticate once, state persists. Pull data from admin dashboards, access content that's only visible after login.
Writing Gmail emails, checking Slack messages, pulling reports from internal tools — repetitive browser tasks you did manually can be delegated to an agent.
agent-browser's slack skill lets you browse Slack channels and read messages without an API token — purely browser-based. The electron skill extends automation to desktop apps like VS Code and Figma.
The core principle is simple. If a human can do it in a browser, an AI agent can do it.
Looking at agent-browser, one thing becomes clear.
AI's interface isn't a screen — it's text. AI doesn't need button colors or layout. Just role and name. Meaning. Yet we've been wrapping human UIs with automation layers and handing them to AI.
What agent-browser shows: tools for AI need to be designed for AI from the start. Accessibility tree instead of DOM. Compact text instead of full JSON. Fetch on demand instead of returning everything. That design produced 93% token reduction.
This won't stop at browser automation. As AI agents interact with more services, services that skip the human UI and connect via API directly will become more common. MCP (Model Context Protocol) is already emerging as a standard, and various services are starting to offer AI agent-native interfaces.
Humans get the screen. AI gets text. Same service, different interface. That era is coming.