Give an AI agent a browser and your tokens disappear instantly.
One button click via Playwright MCP: 12,891 characters consumed. One screenshot: 15,000 tokens gone. There are reports of a 5-hour token allocation wiped out after just a few automation steps.
Why does this happen? And how did Vercel solve it?
Here's the backstory on Vercel's open-source tool agent-browser — why it exists, what philosophy drives it, and how it works.
agent-browser is an open-source CLI from Vercel. It's built for AI agents to control a browser directly.
Traditional tools like Playwright and Puppeteer are designed for humans writing scripts. CSS selectors, XPath, DOM structure — the human understands the page and writes the code.
agent-browser has a different premise entirely. The user isn't a human — it's an AI agent. Claude Code, Cursor, GitHub Copilot, OpenAI Codex — AI coding agents that control a browser with a single shell command.
One line summary: a browser remote control built for AI agents.
| | Legacy (Playwright, Puppeteer) | Agent-First (agent-browser) |
|---|---|---|
| User | Human (developer) | AI agent |
| Approach | Write scripts manually with CSS selectors, XPath | Control the browser directly with a single shell command |
When an AI agent builds a frontend, someone needs to verify the result. Open a browser, click around, check it works. If a human does this, it's a bottleneck.
That's where the concept of a self-verifying loop came from. AI writes the code, opens a browser to test it, fixes problems on its own. A closed loop.
To run that loop, the AI agent needs a browser automation tool. The first attempt was Playwright MCP.
The problem was severe.
Playwright MCP returns the entire accessibility tree of the page for every action — click, type, scroll. A typical webpage has 3,000+ DOM nodes. Click one button and all 3,000 nodes with every attribute come back in the response.
The numbers are clear.
GitHub issue #889 also reported a 6x token usage increase between Playwright MCP versions.
An AI agent's context window is finite. If browser automation eats up the context, there's no room left for actual reasoning. The self-verifying loop burns through tokens in one or two cycles.
Root cause: taking a tool built for humans and forcing it on AI.
Before solving this, Vercel ran an interesting internal experiment.
While building D0, a text-to-SQL agent, they tested the relationship between number of tools and performance. The result was counterintuitive.
| | More Tools | Less Is More |
|---|---|---|
| Toolset | 17 specialized tools | 2 general-purpose tools |
| Success rate | 80% | 100% |
| Avg tokens | 102K | 61K |
Cut tools from 17 to 2. Success rate went from 80% to 100%. Token usage dropped about 40%, from 102K to 61K.
Vercel's conclusion was clear.
"We were constraining reasoning because we didn't trust the model to reason."
The assumption that more tools means better AI performance — wrong. More tools means AI spends tokens deciding "which tool should I use?" instead of actually solving the problem.
This philosophy runs through agent-browser's entire design.
The token efficiency isn't magic. The principle makes complete sense.
When a browser renders a page, it builds two trees internally.
The DOM Tree is what we know — the full HTML structure.
```html
<div class="nav-wrapper mx-4 flex items-center">
  <button class="btn-primary-v2 px-6 py-3 rounded-lg
                 text-white font-semibold hover:bg-blue-600"
          id="sign-in-btn-2024"
          data-testid="auth-cta"
          aria-label="Sign In">
    Sign In
  </button>
</div>
```
Class names, IDs, data attributes, style info — everything needed for visual rendering.
The Accessibility Tree is the tree used by screen readers for visually impaired users. It extracts only meaning from the DOM.
```
button "Sign In"
```
16 characters. Class names, ID, styles all gone. Just "this is a button and it says Sign In."
From an AI agent's perspective this difference is decisive. To click "the login button," you don't need to read 200 characters of DOM. Role (button) and name (Sign In) is enough.
There's another important advantage. CSS selectors break when the UI changes; the accessibility tree is meaning-based, so it doesn't. If `btn-primary-v2` gets renamed to `button-main`, the accessibility tree still shows `button "Sign In"`.
Originally built for the visually impaired. Turns out it's also the optimal web representation for AI agents. Both can't see the screen — same situation, same solution.
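The selector-breakage point can be made concrete with a toy sketch. The element shape below is invented for illustration (a real DOM node is far richer); it only shows why a class-based lookup breaks on a rename while a role-plus-name lookup survives:

```typescript
// Toy stand-in for a DOM element; invented shape, for illustration only.
interface El {
  tag: string;       // maps to the accessibility role for simple elements
  className: string; // presentation detail, free to change in a refactor
  text: string;      // maps to the accessible name
}

// The same button before and after a CSS refactor.
const before: El = { tag: "button", className: "btn-primary-v2", text: "Sign In" };
const after: El  = { tag: "button", className: "button-main",    text: "Sign In" };

// CSS-style lookup: coupled to presentation.
const byClass = (el: El, cls: string): boolean => el.className === cls;

// Accessibility-style lookup: coupled to meaning (role + name).
const byRole = (el: El, role: string, name: string): boolean =>
  el.tag === role && el.text === name;
```

`byClass(before, "btn-primary-v2")` finds the button, but the same call against `after` fails; `byRole(el, "button", "Sign In")` matches both versions.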
agent-browser doesn't use the accessibility tree as-is. It compresses one step further. That's the Snapshot + Refs system.
Step 1: Extract Accessibility Tree
Call Playwright's `page.accessibility.snapshot()` API to pull the browser's internal accessibility tree.
Step 2: Filter interactive elements only
From the full tree, keep only elements that can be clicked or typed into: elements with `onclick` handlers, `cursor: pointer` styles, or `tabindex` attributes.
Step 3: Assign refs
Give each filtered element a short reference ID: @e1, @e2, @e3...
```
- button "Sign In" [ref=e1]
- textbox "Email" [ref=e2]
- textbox "Password" [ref=e3]
- link "Forgot Password" [ref=e4]
```
Step 4: Cache the ref map
Store this mapping in the daemon's memory (`BrowserManager.refMap`).
```
@e1 → { role: "button", name: "Sign In" }
@e2 → { role: "textbox", name: "Email" }
```
When `click @e1` arrives, the daemon looks up `@e1` in refMap, converts it to Playwright's `getByRole("button", { name: "Sign In" })`, and clicks the element.
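The four steps can be sketched in a few lines of TypeScript. This is a simplified model, not agent-browser's actual code; the node shape, the interactive-role list, and the function names are assumptions:

```typescript
// Simplified accessibility-tree node; the real tree carries more fields.
interface A11yNode {
  role: string;
  name?: string;
  children?: A11yNode[];
}

// Illustrative subset of roles treated as interactive.
const INTERACTIVE = new Set(["button", "textbox", "link", "checkbox", "combobox"]);

// Steps 2-4: filter the tree, assign @eN refs in order, cache the ref map.
function buildSnapshot(tree: A11yNode): { lines: string[]; refMap: Map<string, A11yNode> } {
  const lines: string[] = [];
  const refMap = new Map<string, A11yNode>();
  let n = 0;
  const walk = (node: A11yNode): void => {
    if (INTERACTIVE.has(node.role)) {
      const ref = `e${++n}`;
      refMap.set(`@${ref}`, node);
      lines.push(`- ${node.role} "${node.name ?? ""}" [ref=${ref}]`);
    }
    node.children?.forEach(walk);
  };
  walk(tree);
  return { lines, refMap };
}

// Resolving "click @e1": look up the cached node; in the real tool the
// role + name pair then becomes a Playwright getByRole() locator.
function resolve(refMap: Map<string, A11yNode>, ref: string): A11yNode {
  const node = refMap.get(ref);
  if (!node) throw new Error(`stale ref ${ref}; take a fresh snapshot`);
  return node;
}
```

Only the short `lines` output ever enters the agent's context; the full node objects stay server-side in `refMap`.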
The pipeline at a glance:

```
A11y Tree Extract      (page.accessibility.snapshot())
  → Interactive Filter (onclick, pointer, tabindex detection)
  → Ref Assignment     (@e1, @e2, @e3... assigned in order)
  → RefMap Cache       (stored in BrowserManager memory)
  → click @e1 → Done
```
Result: instead of the full page DOM, only a 4-line snapshot goes into context. Click response is one word: Done. That's where the 93% token reduction versus Playwright MCP comes from.
The full architecture splits into three layers.
- Rust CLI (~1ms): command parsing, flag extraction, IPC connection
- Node.js daemon (persistent process): command validation (Zod), ref resolution, session/state management
- Chromium (browser engine): page rendering, JS execution, a11y tree generation
AI agents run hundreds of browser commands per task. Launching a fresh browser each time costs 2–5 seconds. 100 commands = 200–500 seconds just on browser startup.
So the daemon keeps the browser alive. The CLI only handles command parsing. First command starts the daemon automatically. Subsequent commands connect to the already-running daemon.
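The daemon-plus-thin-client pattern can be sketched with Node's `net` module over a Unix socket. The socket path, message shape, and counter below are illustrative assumptions; agent-browser's real protocol and daemon lifecycle are more involved:

```typescript
import * as net from "net";
import * as os from "os";
import * as path from "path";

// Illustrative socket path, not agent-browser's actual one.
const SOCK = path.join(os.tmpdir(), `ab-demo-${process.pid}.sock`);

// Daemon side: a long-lived process. State declared here (like a ref map)
// survives across commands; the counter just makes that visible.
function startDaemon(): Promise<net.Server> {
  let commandCount = 0; // persists between client connections
  return new Promise((resolve) => {
    const server = net.createServer((conn) => {
      conn.on("data", (buf) => {
        // Assumes the tiny command fits in one packet (fine for a sketch).
        JSON.parse(buf.toString());
        commandCount += 1;
        conn.end(JSON.stringify({ result: "Done", seq: commandCount }));
      });
    });
    server.listen(SOCK, () => resolve(server));
  });
}

// Client side, the role the Rust CLI plays: connect, send one JSON
// command, read one JSON reply, disconnect.
function sendCommand(action: string, selector: string): Promise<{ result: string; seq: number }> {
  return new Promise((resolve, reject) => {
    let data = "";
    const conn = net.createConnection(SOCK, () => {
      conn.write(JSON.stringify({ action, selector, options: {} }));
    });
    conn.on("data", (chunk) => { data += chunk.toString(); });
    conn.on("end", () => resolve(JSON.parse(data)));
    conn.on("error", reject);
  });
}
```

Because the browser (here, just the counter) lives in the daemon, each command pays only socket round-trip cost instead of a 2-5 second browser launch.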
Run `agent-browser click @e1` and here's what happens internally.
Step 1: Rust CLI parsing (~1ms)
A native binary written in Rust parses the command. Extracts `click` as the action and `@e1` as the selector, connects to the daemon via Unix socket (macOS/Linux) or TCP (Windows).
Step 2: JSON message sent
Converts the parsed command to JSON and sends it through the socket.
```json
{
  "action": "click",
  "selector": "@e1",
  "options": {}
}
```
Step 3: Daemon processing
The Node.js daemon receives the JSON, validates it with a Zod schema, looks up @e1 in refMap to get the Playwright locator, and runs the actual click.
Step 4: Response returned
Done
4 characters. Playwright MCP would retransmit the entire accessibility tree — 12,891 characters. agent-browser doesn't do that. If the agent wants to see the current state, it runs snapshot explicitly. Fetch only when you need it.
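The daemon's handling of that message can be sketched as below. agent-browser validates with Zod; to keep this sketch dependency-free the schema check is hand-rolled, and the command set and response strings are simplified assumptions:

```typescript
// Commands this sketch understands; a subset, for illustration.
type Action = "click" | "type" | "snapshot";

interface Command {
  action: Action;
  selector?: string;
  options?: Record<string, unknown>;
}

// Stand-in for the Zod schema: reject malformed input before acting on it.
function parseCommand(raw: string): Command {
  const msg = JSON.parse(raw);
  if (!["click", "type", "snapshot"].includes(msg.action)) {
    throw new Error(`unknown action: ${msg.action}`);
  }
  if (msg.action !== "snapshot" && typeof msg.selector !== "string") {
    throw new Error(`${msg.action} requires a selector`);
  }
  return msg as Command;
}

// Dispatch: resolve the ref and act. The response carries no page state;
// the agent runs an explicit snapshot when it wants to look again.
function handle(raw: string, refMap: Map<string, string>): string {
  const cmd = parseCommand(raw);
  if (cmd.action === "snapshot") {
    return [...refMap].map(([ref, desc]) => `- ${desc} [ref=${ref.slice(1)}]`).join("\n");
  }
  if (!refMap.has(cmd.selector!)) {
    return `Error: stale ref ${cmd.selector}; take a fresh snapshot`;
  }
  // Real tool: refMap lookup → Playwright getByRole() → click/type.
  return "Done";
}
```

Note the asymmetry: actions return a one-word result, and only `snapshot` pays the cost of serializing page state.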
```
$ agent-browser click @e1
  → Rust CLI parsing      (extract click + @e1, connect socket)
  → JSON sent             ({ action: "click", selector: "@e1" })
  → Daemon processing     (Zod validate → refMap lookup → Playwright click)
  → Response returned     (result only, no full tree retransmission)
Done
```
Browser automation frequently needs to handle logged-in state. agent-browser provides three levels of state persistence.
| Level | Method | Storage scope |
|---|---|---|
| Ephemeral | Default, no storage | Gone when command ends |
| Session | `--session-name` option | Cookies, localStorage saved |
| Profile | `--profile` option | IndexedDB, service workers, everything saved |
Log into Google once. Next command, still logged in. That's the critical difference in real usage.
The real value isn't the technical spec. It's what you can actually do. Everything a human did in a browser, an AI agent can now do instead.
agent-browser has a built-in dogfood skill. An AI agent navigates your web app like a real user, finds bugs, takes screenshots, and produces a markdown report with reproduction steps.
Exploratory testing that takes a human half a day — the agent does it autonomously.
Session and profile features mean you can scrape sites that require login. Authenticate once, state persists. Pull data from admin dashboards, access content that's only visible after login.
Writing Gmail emails, checking Slack messages, pulling reports from internal tools — repetitive browser tasks you did manually can be delegated to an agent.
agent-browser's slack skill lets you browse Slack channels and read messages without an API token — purely browser-based. The electron skill extends automation to desktop apps like VS Code and Figma.
The core principle is simple. If a human can do it in a browser, an AI agent can do it.
Looking at agent-browser, one thing becomes clear.
AI's interface isn't a screen — it's text. AI doesn't need button colors or layout. Just role and name. Meaning. Yet we've been wrapping human UIs with automation layers and handing them to AI.
What agent-browser shows: tools for AI need to be designed for AI from the start. Accessibility tree instead of DOM. Compact text instead of full JSON. Fetch on demand instead of returning everything. That design produced 93% token reduction.
This won't stop at browser automation. As AI agents interact with more services, services that skip the human UI and connect via API directly will become more common. MCP (Model Context Protocol) is already emerging as a standard, and various services are starting to offer AI agent-native interfaces.
Humans get the screen. AI gets text. Same service, different interface. That era is coming.