Code review is visual now.
Percy, Applitools, and Chromatic each shipped review agents that read the rendered page instead of the diff. As agents write more of the UI, “code looks fine” stops meaning “page looks right” — and the new bottleneck is looking at it.
Filed · 13.V.MMXXVI · Accession 2026-006
The review agents grew eyes.
Filed under visual regression
Three of the long-running visual-testing platforms shipped something in the past two weeks, and taken together the three releases name a new category. BrowserStack's Percy Visual Review Agent reads the diff between two rendered builds, writes a natural-language summary of every change, and auto-groups changes that recur across pages — the same modal that broke in eighteen places gets one card, not eighteen. Applitools Autonomous accepts plain-English specs ("the header is sticky, the cart icon shows a count") and produces functional, visual, and accessibility tests against the rendered page, without anyone writing selectors. Chromatic, the Storybook-native player, now lets a coding agent generate the stories and the visual-test parameters from the component source — the agent that wrote the component also writes the harness that proves it.
From pixel diff to semantic diff.
The phrase the platforms keep reaching for is visual AI, and what they mean is a model that has been trained on enough rendered UI to know that a button is a button — and that "the button moved three pixels" is not the same event as "the button vanished." A 2025 round-up from Autonomy AI argues that the new generation handles four comparison modes — strict, layout, content, dynamic — and routes each diff to the right one based on what the model thinks it's looking at. Pixel-perfect diffs, the technique the field was built on, are now the fallback, not the headline.
Why now, and not a year ago.
Because the producer side caught up. Claude Opus 4.7 shipped in late April with a jump on design-related benchmarks; Figma's Dev Mode MCP server made tokens, components, and layout legible to whichever agent the developer happens to be using; Vercel's agent-browser made it cheap for an agent to spin up a real browser, take a screenshot of its own work, and decide whether it's done. The producer can now generate UI faster than a human reviewer can read the patch. So the platform agents that look at rendered pages are no longer a nice-to-have on top of a manual workflow — they're the only thing standing between the agent-written code and the user.
Looking with the right lens.
Method · rendered diff
A textual code review reads four things at once — intent, behaviour, syntax, side effect — and is bad at none of them. A visual review reads one thing very well: what this change did to the page. The difference matters because the failure modes of agent-written UI are mostly visual. Skipping the visual layer because the unit tests pass is the most expensive thing a team can do to itself this year.
Three diffs, not one.
The teams getting this right are running three different diffs on every change. A strict diff for the cases where pixel-equivalence is the spec — a logo, a charted result, a generated invoice. A layout diff for the cases where the rules are structural — a header should still be a header, a cart icon should still be in the corner. And a content diff for the cases where the strings are doing the work — a price, a date, a label that has to read in seven languages. Most platforms now expose all three; the discipline is choosing the right one per scenario, because asking strict of a layout change makes the agent shrug at every legitimate update, and asking layout of a price typo makes the agent miss it.
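That routing discipline can be written down as a small table. A minimal sketch: the scenario tags and the `DiffMode` union below are illustrative, not any platform's configuration API.

```typescript
// Per-scenario diff routing. The spec tags and mode names are
// illustrative; real platforms expose their own configuration for this.
type DiffMode = "strict" | "layout" | "content";

interface Scenario {
  name: string;
  // What the spec actually pins down for this surface.
  spec: "pixel-exact" | "structural" | "textual";
}

function routeDiffMode(s: Scenario): DiffMode {
  switch (s.spec) {
    case "pixel-exact": return "strict";  // a logo, a chart, an invoice
    case "structural":  return "layout";  // a header, a cart icon's corner
    case "textual":     return "content"; // a price, a date, an i18n label
  }
}
```

The point of the table is that the choice is made once, per surface, when the scenario is declared — not re-argued by the agent on every diff.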
Tell the agent what you changed.
The single biggest accuracy gain reported by every platform this month is letting the agent see the intent alongside the diff — a one-line description of what the patch was supposed to do. "Replaced the cart icon with the new outlined version" turns a panicky "the cart icon vanished from every page" alert into a quiet "the cart icon was replaced, as planned." Build the intent into the commit message and the PR title and feed them to the reviewer agent as context, the same way you'd hand them to a human teammate.
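Wiring that context in is mostly string assembly. A sketch, assuming a hypothetical `ChangeContext` shape; the field names are not any platform's schema.

```typescript
// Build the reviewer agent's context from the same metadata a human
// reviewer reads first: PR title, commit subject line, and the renders.
// The shape and field names are illustrative assumptions.
interface ChangeContext {
  prTitle: string;
  commitMessage: string;
  screenshots: string[]; // paths to the before/after renders
}

function buildReviewerPrompt(ctx: ChangeContext): string {
  return [
    `INTENT (PR title): ${ctx.prTitle}`,
    // Only the commit subject line; the body is noise for this purpose.
    `INTENT (commit): ${ctx.commitMessage.split("\n")[0]}`,
    `RENDERS: ${ctx.screenshots.join(", ")}`,
    "Judge the after-render against the stated intent, nothing else.",
  ].join("\n");
}
```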
A studio practice worth stealing.
Field tag · two-agent pipeline
The pattern that's working — a few teams reported variants of it in this week's UX Collective field-notes — is straightforward enough to copy by Friday. Two agents, one PR.
Producer.
The first agent — Claude Code, Cursor, Codex, take your pick — reads the ticket, edits the components, writes the unit tests, and opens the pull request. Nothing exotic. The only requirement is that the agent screenshots the relevant pages of the running app and attaches the images to the PR body, with a one-line intent for each. agent-browser or the equivalent makes this cheap. The PR is now reviewable both as code and as renders.
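The PR-body half of that requirement is plain string assembly. A sketch, with an assumed `Render` shape and markdown layout; the captures themselves would come from agent-browser or whichever headless browser the producer uses.

```typescript
// Assemble the producer's PR body: one screenshot per page, each with a
// one-line intent. The Render shape and markdown layout are illustrative.
interface Render {
  page: string;      // route that was screenshotted
  intent: string;    // one-line statement of the intended change
  imagePath: string; // where the producer saved the capture
}

function prBodyWithRenders(summary: string, renders: Render[]): string {
  const sections = renders.map(
    (r) => `### ${r.page}\nIntent: ${r.intent}\n![${r.page}](${r.imagePath})`
  );
  return [summary, ...sections].join("\n\n");
}
```

The inspector then reads the same body, so the intent line travels with the image it describes.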
Inspector.
A second agent, scoped only to reading, opens the PR, runs the visual-review platform (Percy, Applitools, Chromatic — whichever the team has standardised on), and writes a comment back to the PR. Crucially, the inspector is allowed to reject but not to edit. Its job is to file an objection. The team treats its comment as a binding code review, exactly as they would a human one.
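A CI step can make that binding by gating the merge on the grade in the inspector's comment. A sketch: the parsing convention is an assumption, though the grades mirror the specimen prompt later in the piece.

```typescript
// Turn the inspector's comment into a merge decision. The inspector may
// block, never edit; an ungraded comment fails closed.
type Verdict = "APPROVE" | "APPROVE-WITH-NOTES" | "REJECT";

function parseVerdict(comment: string): Verdict {
  // Longest grade first, so APPROVE-WITH-NOTES is not misread as APPROVE.
  for (const v of ["APPROVE-WITH-NOTES", "REJECT", "APPROVE"] as const) {
    if (comment.includes(v)) return v;
  }
  throw new Error("inspector comment carries no risk grade; failing closed");
}

function mergeAllowed(v: Verdict): boolean {
  return v !== "REJECT";
}
```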
Why the split holds.
Every team that has tried letting a single agent both produce and verify has reported the same drift: the agent talks itself into approving its own work because the same context that wrote the change also wrote the review. Splitting the roles introduces an adversarial seam — and adversarial seams are what make any review system work, not just AI ones. The producer optimises for shipping; the inspector optimises for catching. It's the same reason editors aren't writers and code-reviewers aren't authors.
A reviewer prompt that works.
Specimen · inspector-system-prompt-v3
The prompt below is a generalised version of one a small commerce team is running as the inspector half of the two-agent pipeline. They feed it the PR diff, the producer's intent line, and screenshots of the affected pages before and after. It's worth lifting and shaping to your own design system; the spine is the part to keep.
You are the visual reviewer on a two-agent pipeline. You did not write
the code you are reviewing. Your job is to read the rendered pages —
not the diff — and decide whether the change does what the producer
said it would do, and only that.
For each page, name three things:
1. INTENDED CHANGE — what the producer wrote in their intent line.
Confirm the change is present in the after-screenshot. If it is
not, fail closed and ask for a re-render.
2. UNINTENDED CHANGES — anything else that moved, shifted, recoloured,
or vanished. Be specific. "The cart icon shifted 14px left and lost
its badge count" is useful; "the header looks different" is not.
3. RISK GRADE — pick one of {APPROVE, APPROVE-WITH-NOTES, REJECT}.
REJECT if any unintended change is visible to a user. NOTES if the
change is visible only at edge breakpoints or hover states.
Do not edit the code. Do not propose patches. File an objection or sign
the slip.
Treat the producer as a colleague. Treat the screenshot as the source
of truth. The diff is not the artifact.
The line that does the heaviest lifting is the last one — the diff is not the artifact. Both agents start treating the rendered page as the thing that matters, and the producer agent quietly improves at writing changes that look like what was asked.
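A prompt that demands a fixed report shape also earns a cheap structural check, so a malformed review fails closed rather than slipping through as an implicit approval. The validator below is an illustrative addition, not part of the team's pipeline; its headings mirror the specimen.

```typescript
// Verify the inspector's report carries the three sections the specimen
// prompt demands. Returns the missing headings, if any.
const REQUIRED_SECTIONS = [
  "INTENDED CHANGE",
  "UNINTENDED CHANGES",
  "RISK GRADE",
];

function missingSections(report: string): string[] {
  return REQUIRED_SECTIONS.filter((h) => !report.includes(h));
}
```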
Pinned, labelled, looked at.
Editor's mark
The pattern of the past two weeks is this: the agent's surface has finally caught up with its output. We spent a year arguing about how agents should write web pages and we have, in the meantime, accidentally invented the only kind of reviewer that can keep up with them. Treat the rendered page as the specimen, the inspector agent as the herbarium-keeper, and the PR as the slip-tag underneath. Every change gets pinned, labelled, and looked at before it moves to the next plate.
Filed underneath this plate.
Verified · 13 May 26
- Percy Visual Review Agent — the AI assistant that triages flagged diffs Percy / BrowserStack
- Applitools Autonomous — natural-language visual, functional, and accessibility testing Applitools
- Chromatic visual testing for Storybook — component-level review for agent-written UI Chromatic
- Visual regression testing tools compared — six options for teams shipping AI-generated UI Autonomy AI
- Code with Claude 2026 — Claude Code agent view, Claude Opus 4.7 release Anthropic
- Anthropic launches Claude Design — prompts into prototypes VentureBeat
- Introducing Figma's Dev Mode MCP server — bringing the design system into the agent workflow Figma
- vercel-labs / agent-browser — headless browser automation CLI built for AI agents Vercel Labs
- Designing with Claude Code and Codex CLI — agent-driven UI workflows UX Collective
- Visual regression testing in mobile QA — the 2026 guide Panto AI