Code review is visual now.
Percy, Applitools, and Chromatic each shipped review agents that read the rendered page instead of the diff. As agents write more of the UI, “code looks fine” stops meaning “page looks right” — and the new bottleneck is looking at it.
Filed · 13.V.MMXXVI · Accession 2026-006
The review agents grew eyes.
Filed under visual regression
Three of the long-running visual-testing platforms shipped something in the past two weeks, and taken together the three releases name a new category. BrowserStack's Percy Visual Review Agent reads the diff between two rendered builds, writes a natural-language summary of every change, and auto-groups changes that recur across pages — the same modal that broke in eighteen places gets one card, not eighteen. Applitools Autonomous accepts plain-English specs ("the header is sticky, the cart icon shows a count") and produces functional, visual, and accessibility tests against the rendered page, without anyone writing selectors. Chromatic, the Storybook-native player, now lets a coding agent generate the stories and the visual-test parameters from the component source — the agent that wrote the component also writes the harness that proves it.
From pixel diff to semantic diff.
The phrase the platforms keep reaching for is visual AI, and what they mean is a model that has been trained on enough rendered UI to know that a button is a button — and that "the button moved three pixels" is not the same event as "the button vanished." A 2025 round-up from Autonomy AI argues that the new generation handles four comparison modes — strict, layout, content, dynamic — and routes each diff to the right one based on what the model thinks it's looking at. Pixel-perfect diffs, the technique the field was built on, are now the fallback, not the headline.
Why now, and not a year ago.
Because the producer side caught up. Claude Opus 4.7 shipped in late April with a jump on design-related benchmarks; Figma's Dev Mode MCP server made tokens, components, and layout legible to whichever agent the developer happens to be using; Vercel's agent-browser made it cheap for an agent to spin up a real browser, take a screenshot of its own work, and decide whether it's done. The producer can now generate UI faster than a human reviewer can read the patch. So the platform agents that look at rendered pages are no longer a nice-to-have on top of a manual workflow — they're the only thing standing between the agent-written code and the user.
Looking with the right lens.
Method · rendered diff
A textual code review reads four things at once — intent, behaviour, syntax, side effect — and is bad at none of them. A visual review reads one thing very well: what this change did to the page. The difference matters because the failure modes of agent-written UI are mostly visual. Skipping the visual layer because the unit tests pass is the most expensive thing a team can do to itself this year.
Three diffs, not one.
The teams getting this right are running three different diffs on every change. A strict diff for the cases where pixel-equivalence is the spec — a logo, a charted result, a generated invoice. A layout diff for the cases where the rules are structural — a header should still be a header, a cart icon should still be in the corner. And a content diff for the cases where the strings are doing the work — a price, a date, a label that has to read in seven languages. Most platforms now expose all three; the discipline is choosing the right one per scenario, because asking strict of a layout change makes the agent shrug at every legitimate update, and asking layout of a price typo makes the agent miss it.
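That routing discipline can be written down as a small table. A minimal sketch: the scenario tags and the `DiffMode` union below are illustrative, not any platform's configuration API.

```typescript
// Per-scenario diff routing. The spec tags and mode names are
// illustrative; real platforms expose their own configuration for this.
type DiffMode = "strict" | "layout" | "content";

interface Scenario {
  name: string;
  // What the spec actually pins down for this surface.
  spec: "pixel-exact" | "structural" | "textual";
}

function routeDiffMode(s: Scenario): DiffMode {
  switch (s.spec) {
    case "pixel-exact": return "strict";  // a logo, a chart, an invoice
    case "structural":  return "layout";  // a header, a cart icon's corner
    case "textual":     return "content"; // a price, a date, an i18n label
  }
}
```

The point of the table is that the choice is made once, per surface, when the scenario is declared — not re-argued by the agent on every diff.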
Tell the agent what you changed.
The single biggest accuracy gain reported by every platform this month is letting the agent see the intent alongside the diff — a one-line description of what the patch was supposed to do. "Replaced the cart icon with the new outlined version" turns a panicky "the cart icon vanished from every page" alert into a quiet "the cart icon was replaced, as planned." Build the intent into the commit message and the PR title and feed them to the reviewer agent as context, the same way you'd hand them to a human teammate.
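Wiring that context in is mostly string assembly. A sketch, assuming a hypothetical `ChangeContext` shape; the field names are not any platform's schema.

```typescript
// Build the reviewer agent's context from the same metadata a human
// reviewer reads first: PR title, commit subject line, and the renders.
// The shape and field names are illustrative assumptions.
interface ChangeContext {
  prTitle: string;
  commitMessage: string;
  screenshots: string[]; // paths to the before/after renders
}

function buildReviewerPrompt(ctx: ChangeContext): string {
  return [
    `INTENT (PR title): ${ctx.prTitle}`,
    // Only the commit subject line; the body is noise for this purpose.
    `INTENT (commit): ${ctx.commitMessage.split("\n")[0]}`,
    `RENDERS: ${ctx.screenshots.join(", ")}`,
    "Judge the after-render against the stated intent, nothing else.",
  ].join("\n");
}
```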
A studio practice worth stealing.
Field tag · two-agent pipeline
The pattern that's working — a few teams reported variants of it in this week's UX Collective field-notes — is straightforward enough to copy by Friday. Two agents, one PR.
Producer.
The first agent — Claude Code, Cursor, Codex, take your pick — reads the ticket, edits the components, writes the unit tests, and opens the pull request. Nothing exotic. The only requirement is that the agent screenshots the relevant pages of the running app and attaches the images to the PR body, with a one-line intent for each. agent-browser or the equivalent makes this cheap. The PR is now reviewable both as code and as renders.
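The PR-body half of that requirement is plain string assembly. A sketch, with an assumed `Render` shape and markdown layout; the captures themselves would come from agent-browser or whichever headless browser the producer uses.

```typescript
// Assemble the producer's PR body: one screenshot per page, each with a
// one-line intent. The Render shape and markdown layout are illustrative.
interface Render {
  page: string;      // route that was screenshotted
  intent: string;    // one-line statement of the intended change
  imagePath: string; // where the producer saved the capture
}

function prBodyWithRenders(summary: string, renders: Render[]): string {
  const sections = renders.map(
    (r) => `### ${r.page}\nIntent: ${r.intent}\n![${r.page}](${r.imagePath})`
  );
  return [summary, ...sections].join("\n\n");
}
```

The inspector then reads the same body, so the intent line travels with the image it describes.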
Inspector.
A second agent, scoped only to reading, opens the PR, runs the visual-review platform (Percy, Applitools, Chromatic — whichever the team has standardised on), and writes a comment back to the PR. Crucially, the inspector is allowed to reject but not to edit. Its job is to file an objection. The team treats its comment as a binding code review, exactly as they would a human one.
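A CI step can make that binding by gating the merge on the grade in the inspector's comment. A sketch: the parsing convention is an assumption, though the grades mirror the specimen prompt later in the piece.

```typescript
// Turn the inspector's comment into a merge decision. The inspector may
// block, never edit; an ungraded comment fails closed.
type Verdict = "APPROVE" | "APPROVE-WITH-NOTES" | "REJECT";

function parseVerdict(comment: string): Verdict {
  // Longest grade first, so APPROVE-WITH-NOTES is not misread as APPROVE.
  for (const v of ["APPROVE-WITH-NOTES", "REJECT", "APPROVE"] as const) {
    if (comment.includes(v)) return v;
  }
  throw new Error("inspector comment carries no risk grade; failing closed");
}

function mergeAllowed(v: Verdict): boolean {
  return v !== "REJECT";
}
```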
Why the split holds.
Every team that has tried letting a single agent both produce and verify has reported the same drift: the agent talks itself into approving its own work because the same context that wrote the change also wrote the review. Splitting the roles introduces an adversarial seam — and adversarial seams are what make any review system work, not just AI ones. The producer optimises for shipping; the inspector optimises for catching. It's the same reason editors aren't writers and code-reviewers aren't authors.
A reviewer prompt that works.
Specimen · inspector-system-prompt-v3
The prompt below is a generalised version of one a small commerce team is running as the inspector half of the two-agent pipeline. They feed it the PR diff, the producer's intent line, and screenshots of the affected pages before and after. It's worth lifting and shaping to your own design system; the spine is the part to keep.
You are the visual reviewer on a two-agent pipeline. You did not write
the code you are reviewing. Your job is to read the rendered pages —
not the diff — and decide whether the change does what the producer
said it would do, and only that.
For each page, name three things:
1. INTENDED CHANGE — what the producer wrote in their intent line.
Confirm the change is present in the after-screenshot. If it is
not, fail closed and ask for a re-render.
2. UNINTENDED CHANGES — anything else that moved, shifted, recoloured,
or vanished. Be specific. "The cart icon shifted 14px left and lost
its badge count" is useful; "the header looks different" is not.
3. RISK GRADE — pick one of {APPROVE, APPROVE-WITH-NOTES, REJECT}.
REJECT if any unintended change is visible to a user. NOTES if the
change is visible only at edge breakpoints or hover states.
Do not edit the code. Do not propose patches. File an objection or sign
the slip.
Treat the producer as a colleague. Treat the screenshot as the source
of truth. The diff is not the artifact.
The line that does the heaviest lifting is the last one — the diff is not the artifact. Both agents start treating the rendered page as the thing that matters, and the producer agent quietly improves at writing changes that look like what was asked.
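A prompt that demands a fixed report shape also earns a cheap structural check, so a malformed review fails closed rather than slipping through as an implicit approval. The validator below is an illustrative addition, not part of the team's pipeline; its headings mirror the specimen.

```typescript
// Verify the inspector's report carries the three sections the specimen
// prompt demands. Returns the missing headings, if any.
const REQUIRED_SECTIONS = [
  "INTENDED CHANGE",
  "UNINTENDED CHANGES",
  "RISK GRADE",
];

function missingSections(report: string): string[] {
  return REQUIRED_SECTIONS.filter((h) => !report.includes(h));
}
```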
Pinned, labelled, looked at.
Editor's mark
The pattern of the past two weeks is this: the agent's surface has finally caught up with its output. We spent a year arguing about how agents should write web pages and we have, in the meantime, accidentally invented the only kind of reviewer that can keep up with them. Treat the rendered page as the specimen, the inspector agent as the herbarium-keeper, and the PR as the slip-tag underneath. Every change gets pinned, labelled, and looked at before it moves to the next plate.
Filed underneath this plate.
Verified · 13 May 26
- Percy Visual Review Agent — the AI assistant that triages flagged diffs Percy / BrowserStack
- Applitools Autonomous — natural-language visual, functional, and accessibility testing Applitools
- Chromatic visual testing for Storybook — component-level review for agent-written UI Chromatic
- Visual regression testing tools compared — six options for teams shipping AI-generated UI Autonomy AI
- Code with Claude 2026 — Claude Code agent view, Claude Opus 4.7 release Anthropic
- Anthropic launches Claude Design — prompts into prototypes VentureBeat
- Introducing Figma's Dev Mode MCP server — bringing the design system into the agent workflow Figma
- vercel-labs / agent-browser — headless browser automation CLI built for AI agents Vercel Labs
- Designing with Claude Code and Codex CLI — agent-driven UI workflows UX Collective
- Visual regression testing in mobile QA — the 2026 guide Panto AI