From a Production Team

Testing with Agents

AI mock fixtures, visual verification, experiment-as-test, and CI integration. Testing patterns from a team that ships with AI agents every day.

Want to skip straight to a working setup? Fork the starter template.

Companion Guide

This guide covers testing. For the development workflow these tests protect — agents, skills, hooks, and memory systems — see Shipping with Agents.

1. Why This Exists

There's no shortage of "AI testing" content. Most of it is about using AI to write tests — generating unit tests, suggesting edge cases, autocompleting assertions. That's useful, but it's not what this guide is about.

This guide is about the opposite problem: how do you test the work AI agents produce? And how do you use agents themselves as testers?

AI-generated code passes type checks but breaks visually. Your agent can write a component that compiles, passes lint, and even passes unit tests — but renders with overlapping elements, broken spacing, or invisible text. Type safety doesn't guarantee visual correctness.

Live AI calls make tests expensive and flaky. A test that calls a real AI API takes 3-5 seconds, costs money per call, and returns different output every time. Your test suite becomes slow, expensive, and non-deterministic — the three things tests shouldn't be.

Agents will claim "done" without verifying. AI coding assistants are optimistic by default. They'll report a task as complete without running the build, checking the UI, or verifying edge cases. The verify loop — running build, test, and lint before claiming done — is the single highest-ROI pattern we've found.

We've been shipping a production SaaS with AI agents for months. These patterns come from real failures, real fixes, and real workflow changes. If you're building with AI coding assistants and your tests aren't keeping up, this is for you.

2. The Testing Pyramid, Revisited

The classic testing pyramid — many unit tests, fewer integration tests, minimal E2E — still holds. But AI agents add new layers above the pyramid that traditional testing guides don't cover.

┌──────────────────────┐
│  AI-Powered Review   │  ← Agent analyzes screenshots
├──────────────────────┤
│ Visual Verification  │  ← Screenshot comparison
├──────────────────────┤
│      E2E Tests       │  ← User journey simulation
├──────────────────────┤
│  Integration Tests   │  ← API and service tests
├──────────────────────┤
│      Unit Tests      │  ← Function-level tests
└──────────────────────┘

Agents change two things about testing. First, they change what you test — AI output quality isn't just "does this function return the right value?" It's "is this generated content actually good?" Traditional assertions struggle with that. Second, they change how you test — agents themselves can be testers, analyzing screenshots, reviewing code, and catching issues that scripted tests miss.

The rest of this guide walks through each layer above the classic pyramid, plus the infrastructure that makes it all work.

3. AI Mock Fixtures

The first thing you learn when writing E2E tests for an AI-powered app: real AI calls destroy your test suite.

A single AI generation call takes 3-5 seconds, costs money, and returns different output every time. Multiply that by 20 tests and your suite takes two minutes, costs real money per run, and fails randomly because the AI gave a slightly different response. The fix is straightforward: mock the AI at the network level.

The pattern: intercept API routes in Playwright, serve pre-recorded JSON fixtures instead of hitting the real API. Tests run in milliseconds, cost nothing, and produce deterministic results.

e2e/mocks/ai-generation-mock.ts
// e2e/mocks/ai-generation-mock.ts

import { readFileSync } from 'node:fs';
import { join } from 'node:path';
import type { Page } from '@playwright/test';

// Fixtures live alongside the tests (see e2e/fixtures/ai-responses/).
function loadFixture(fixtureFile: string): string {
  return readFileSync(
    join(__dirname, '../fixtures/ai-responses', fixtureFile),
    'utf8',
  );
}

export async function mockAIRoute(
  page: Page,
  routePattern: string,
  fixtureFile: string,
): Promise<void> {
  const fixture = loadFixture(fixtureFile);

  await page.route(routePattern, async (route) => {
    // Only intercept POSTs (the generation calls); let other methods through.
    if (route.request().method() !== 'POST') {
      await route.continue();
      return;
    }

    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: fixture,
    });
  });
}
e2e/example.spec.ts
import { test } from '@playwright/test';
import { mockAIRoute } from './mocks/ai-generation-mock';

test('generates content with mocked AI', async ({ page }) => {
  await mockAIRoute(page, '**/api/generate', 'sample-generation.json');
  await page.goto('/generate');
  // ... test runs instantly with deterministic data
});
When to mock vs. use real AI

In CI, always mock. Real AI calls in CI are slow, expensive, and introduce flakiness. Locally, it's your choice — real calls during development can help you spot prompt regressions, but mocks keep your feedback loop fast.
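
One way to wire that choice is a small helper that mocks by default and lets an env var opt in to the live API. A sketch — the E2E_REAL_AI variable and helper name are illustrative, not part of the starter:

e2e/helpers/maybe-mock.ts
import type { Page } from '@playwright/test';
import { mockAIRoute } from '../mocks/ai-generation-mock';

// Mock by default; set E2E_REAL_AI=1 locally to exercise the real API.
// (Variable name is illustrative.)
export async function maybeMockAI(page: Page): Promise<void> {
  if (process.env.E2E_REAL_AI === '1') return;
  await mockAIRoute(page, '**/api/generate', 'sample-generation.json');
}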

In Practice

Real projects end up with dozens of fixture files. Keep them organized by feature (npc-generation.json, chat-response.json) and update them when your API response format changes. Stale fixtures are the #1 cause of mock-related test failures.

4. Auth & Profile Mocking

If your app has subscription tiers, you need to test what each tier sees. But creating real paid accounts for testing is expensive and fragile. The solution: mock the profile layer.

Two patterns work together: an auth bypass that skips real authentication in test environments, and a profile mock that impersonates any subscription tier.

e2e/auth.setup.ts
// e2e/auth.setup.ts

import { test as setup } from '@playwright/test';

// Where Playwright persists session state (path is conventional — adjust to your layout).
const authFile = 'e2e/.auth/user.json';

setup('authenticate', async ({ page }) => {
  if (process.env.E2E_BYPASS_AUTH === '1') {
    // Skip real auth — just save empty state
    await page.goto('/');
    await page.context().storageState({ path: authFile });
    return;
  }

  // Real auth path for when you need it...
});
e2e/mocks/profile-mock.ts
// e2e/mocks/profile-mock.ts

import type { Page } from '@playwright/test';

const TIER_DEFAULTS = {
  free:    { generationLimit: 10,  features: ['basic-generation'] },
  premium: { generationLimit: 500, features: ['basic-generation', 'advanced-generation', 'export'] },
  trial:   { generationLimit: 40,  features: ['basic-generation', 'advanced-generation', 'export'] },
};

type Tier = keyof typeof TIER_DEFAULTS;

// Merge tier defaults with per-test overrides.
function buildProfile(tier: Tier, overrides?: Record<string, unknown>) {
  return { tier, ...TIER_DEFAULTS[tier], ...overrides };
}

export async function mockProfile(
  page: Page,
  tier: Tier,
  overrides?: Record<string, unknown>,
): Promise<void> {
  const profile = buildProfile(tier, overrides);

  await page.route('**/api/user/profile', async (route) => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify(profile),
    });
  });
}
e2e/tier-gates.spec.ts
import { test, expect } from '@playwright/test';
import { mockProfile } from './mocks/profile-mock';

test('premium user sees export button', async ({ page }) => {
  await mockProfile(page, 'premium');
  await page.goto('/dashboard');
  await expect(page.getByText('Export')).toBeVisible();
});

test('free user sees upgrade prompt', async ({ page }) => {
  await mockProfile(page, 'free');
  await page.goto('/dashboard');
  await expect(page.getByText('Upgrade')).toBeVisible();
});
Security

Auth bypass must NEVER be active in production. Gate it behind an environment variable that's only set in test/dev environments. A common pattern: your middleware checks process.env.E2E_BYPASS_AUTH === '1' and only skips auth if it's set. In production, this env var doesn't exist, so auth always runs.
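
A sketch of that gate, assuming Next.js-style middleware — the cookie name and redirect are placeholders for your auth provider's real check:

middleware.ts
import { NextResponse, type NextRequest } from 'next/server';

export function middleware(request: NextRequest) {
  // Test/dev only: this env var is never set in production,
  // so the bypass branch is unreachable there.
  if (process.env.E2E_BYPASS_AUTH === '1') {
    return NextResponse.next();
  }
  // Placeholder session check — substitute your auth provider's helper.
  const hasSession = request.cookies.has('session'); // cookie name illustrative
  return hasSession
    ? NextResponse.next()
    : NextResponse.redirect(new URL('/login', request.url));
}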

In Practice

Our first version used separate fixture files for each tier (free-tier.json, premium-tier.json). That meant updating N files when the profile schema changed. The inline defaults pattern shown above is simpler — one object to maintain, tier configs right where you can see them.
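
The overrides argument is what makes one object enough — edge states become one-liners instead of new fixture files. A sketch, assuming the mockProfile signature above (the UI copy is illustrative):

e2e/tier-gates.spec.ts (excerpt)
// A free user who has exhausted their generation limit.
test('exhausted free user sees limit message', async ({ page }) => {
  await mockProfile(page, 'free', { generationLimit: 0 });
  await page.goto('/generate');
  await expect(page.getByText('Limit reached')).toBeVisible();
});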

5. Visual Verification

Tests pass. Types check. Lint is clean. You open the browser and the layout is completely broken.

This happens more than you'd think with AI-generated code. The agent writes valid React/HTML/CSS that compiles and passes all programmatic checks, but renders incorrectly — overlapping elements, wrong spacing, invisible text on a same-colored background. Visual verification catches what assertions can't.

Baseline Comparison

Pixel-diff screenshots against saved baselines:

e2e/visual.spec.ts
import { test, expect } from '@playwright/test';

test('dashboard matches baseline', async ({ page }) => {
  await page.goto('/dashboard');
  await expect(page).toHaveScreenshot('dashboard.png', {
    maxDiffPixelRatio: 0.01,  // allow up to 1% of pixels to differ
  });
});

First run creates the baseline. Subsequent runs compare against it. Differences beyond the threshold fail the test.

Updating Baselines

terminal
# When the UI intentionally changes, update baselines:
npx playwright test --grep @visual --update-snapshots
Responsive breakpoints

Test at multiple viewport sizes. A layout that works at 1920px can break at 768px. Playwright makes this easy — create separate projects in your config for desktop and mobile viewports.
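
A sketch of that config, with one desktop and one mobile project (the exact sizes and device are a starting point, not a rule):

playwright.config.ts (excerpt)
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'desktop', use: { viewport: { width: 1920, height: 1080 } } },
    { name: 'mobile',  use: { ...devices['Pixel 7'] } },
  ],
});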

In Practice

Visual baselines drift. Every intentional UI change requires a baseline update. If you're changing the UI frequently (early development), baselines create friction. Start with manual screenshot review (take screenshots, look at them) and add baseline comparison once the UI stabilizes.

6. The Verify Loop

If you take one pattern from this guide, make it this one.

The verify loop is the discipline of running build + test + lint before claiming work is done. It sounds obvious. It's not — because AI agents don't do it unless you force them to.

Agents are optimistic. They'll write code, see it compile, and report "done." They won't run the full test suite. They won't check the UI in a browser. They won't notice the lint warning about an unused import. The verify loop is the forcing function.

scripts/verify.sh
#!/usr/bin/env bash
# scripts/verify.sh
set -euo pipefail

FAIL=0

run_check() {
  local name="$1"; shift
  printf "▶ %s\n" "$name"
  if "$@" 2>&1; then
    printf "✓ %s passed\n\n" "$name"
  else
    printf "✗ %s failed\n\n" "$name"
    FAIL=$((FAIL + 1))
  fi
}

# Uncomment checks for your stack:
# run_check "TypeScript" npx tsc --noEmit
# run_check "Lint" npm run lint
# run_check "Build" npm run build
run_check "Smoke Tests" npx playwright test --grep @smoke

# Exit non-zero if ANY check failed — callers (and agents) key off this.
exit "$FAIL"
.claude/skills/verify.md
# .claude/skills/verify.md
---
name: verify
description: Run build + test + lint before claiming done
---

## Steps
1. Run `./scripts/verify.sh`
2. If any check fails, fix the issue and re-run
3. Do NOT claim work is complete until all checks pass
terminal
# Install: cp .claude/hooks/pre-commit-verify.sh .git/hooks/pre-commit
# Now every git commit runs the verify loop first
Why this is the highest-ROI pattern

It's a cheap, fast gate that catches entire categories of bugs — broken builds, failing tests, lint violations — before they're committed. With agents, it's even more valuable because agents are prolific committers. Without the verify loop, a single agent session can introduce 5 broken commits before you notice.

Verify is half the gate. Code review is the other half.

The verify loop catches mechanical failures (broken builds, failing tests). An AI code reviewer catches logical ones (bugs, security issues, convention violations). In the full pipeline from Shipping with Agents, both are built-in steps: think → plan → build → verify → review → ship. Solo devs get review discipline without a team. The full starter includes both.

7. AI-Powered UX Review

Manual UI review is slow, inconsistent, and the first thing skipped when you're in a hurry. But visual bugs are real bugs. The pattern: take a screenshot, feed it to a vision-capable model, get structured feedback.

When Playwright is already in your harness, the skill's --fresh flag runs the visual tests first (capturing up-to-date screenshots), then feeds them to a vision model for analysis. No manual screenshot step.

The feedback loop: E2E tests capture snapshots → UX review skill analyzes them → structured issues come back (rated Good / Needs Attention / Needs Fix).
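
The capture side is ordinary Playwright. A minimal sketch, assuming snapshots are written somewhere the review skill knows to look (the path is illustrative):

e2e/suite/visual.spec.ts (excerpt)
import { test } from '@playwright/test';

test('capture dashboard snapshot @visual', async ({ page }) => {
  await page.goto('/dashboard');
  // Full-page capture for the ux-review skill to analyze later.
  await page.screenshot({ path: 'e2e/snapshots/dashboard.png', fullPage: true });
});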

terminal
/ux-review                # Review existing snapshots
/ux-review --fresh        # Run visual tests first, then review

What it catches: spacing inconsistencies, alignment issues, contrast problems, touch target sizes, heading hierarchy.

What it doesn't catch: business logic errors, complex interaction flows, subjective design preferences.

Accessibility Integration

Optional axe-core integration adds automated WCAG validation alongside the visual review. Install @axe-core/playwright for accessibility scanning.
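
A minimal sketch using AxeBuilder from @axe-core/playwright (the @a11y tag is our convention, not the library's):

e2e/a11y.spec.ts
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('dashboard has no WCAG violations @a11y', async ({ page }) => {
  await page.goto('/dashboard');
  const results = await new AxeBuilder({ page }).analyze();
  expect(results.violations).toEqual([]);  // fail on any detected violation
});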

Model-Agnostic

This works with any vision-capable model — GPT-4o Vision, Claude, Gemini. The skill is tool-agnostic; adapt the analysis step to your preferred model.

8. Experiment-as-Test

Some quality properties can't be expressed as pass/fail assertions. "Is this AI-generated content good?" isn't a boolean. But you can measure it — and you can detect when it gets worse.

The pattern: write scripts that measure quality metrics (structure validity, output quality scores, success rates, response times), save the measurements as a baseline, then compare future runs against the baseline. If metrics degrade beyond a threshold, fail the build.

scripts/experiment-baseline.ts
// Compare current metrics to the saved baseline. Loading is elided:
// `baseline` and `current` are records of metric name → score.
let passed = true;

for (const key of ['structureValidity', 'averageQuality', 'successRate'] as const) {
  const degradation = (baseline[key] - current[key]) / baseline[key];
  if (degradation > threshold) {
    console.log(`✗ ${key} degraded by ${(degradation * 100).toFixed(1)}%`);
    passed = false;
  }
}

process.exit(passed ? 0 : 1);  // non-zero exit fails the CI step
terminal
# Save a baseline from current performance
npx tsx scripts/experiment-baseline.ts --save

# Compare against baseline (fails if metrics degrade >10%)
npx tsx scripts/experiment-baseline.ts

# Custom threshold
npx tsx scripts/experiment-baseline.ts --threshold 0.15

When to use experiments vs tests: Traditional tests answer "does this work?" Experiments answer "is this still good?" Use tests for correctness, experiments for quality.

In Practice

The hardest part is setting the right threshold. Too tight (1%) and you get false alarms from normal variance. Too loose (25%) and you miss real regressions. We found 10% works well as a starting point — enough headroom for normal variation, tight enough to catch meaningful degradation.
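
Before trusting any threshold, measure your own variance. A sketch, where runExperiment is a placeholder for whatever script produces your metrics:

scripts/measure-variance.ts
// Placeholder: stands in for your actual metric-producing script.
declare function runExperiment(): Promise<{ averageQuality: number }>;

async function measureVariance(runs = 20): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    samples.push((await runExperiment()).averageQuality);
  }
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const spread = (Math.max(...samples) - Math.min(...samples)) / mean;
  // Pick a degradation threshold comfortably above the observed spread.
  console.log(`mean ${mean.toFixed(3)}, run-to-run spread ${(spread * 100).toFixed(1)}%`);
}

measureVariance().catch(console.error);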

9. CI Integration

Running everything on every PR is slow, expensive, and burns through your GitHub Actions minutes. The solution: tiered test execution with smart triggers — but start with everything manual.

All workflows in the starter repo default to workflow_dispatch (manual trigger only). This keeps CI costs at zero until you're ready.

Minimal (Solo Devs)

Keep everything manual. Run tests locally with /test-e2e or npm test. Trigger workflows from the GitHub UI when you want a CI check before merging. Cost: ~0 min/month.

Mid-Tier (Small Teams, Recommended)

Enable smoke tests on PRs. Keep regression and visual tests manual or nightly.

.github/workflows/test-smoke.yml
on:
  workflow_dispatch:
  pull_request:       # <-- add this block to enable PR runs
    branches: [main]

Cost: ~50-200 min/month.

Maximal (Enterprise)

Enable all automatic triggers — smoke on every PR, visual on UI changes (path-filtered), nightly regression. Cost: ~500-2000 min/month.

GitHub Actions Minutes

Free tier gets 2,000 min/month. Each E2E run uses 2-10 minutes depending on your app's build time and test count. Monitor your usage at Settings → Billing → Actions. Switching all three workflows to automatic can burn 500+ min/month on active repos.

The /test-e2e skill is the local equivalent of your CI pipeline:

terminal
/test-e2e              # Run smoke tests locally
/test-e2e --regression # Full suite
/test-e2e --visual     # Screenshot comparison
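
The skill file itself follows the same shape as verify.md from §6. A sketch — the starter's actual file may differ:

.claude/skills/test-e2e.md
---
name: test-e2e
description: Run E2E suites locally (smoke by default)
---

## Steps
1. Default: run `npx playwright test --grep @smoke`
2. With `--regression`: run `npx playwright test --grep @regression`
3. With `--visual`: run `npx playwright test --grep @visual`
4. Report every failing spec by name — do not summarize failures away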
Start Small

Start with Minimal. Move to Mid-Tier when a bug ships that local testing would have caught. You'll know when you need Maximal.

10. The Full Harness

You've read the individual sections. Here's how they compose into a system.

Mock fixtures feed your E2E tests → E2E tests capture screenshots → Visual verification compares against baselines → UX review analyzes screenshots with AI → Experiment scripts measure quality → CI workflows run the right tests at the right time → The verify loop ties it all together before commits.

test-starter/
├── CLAUDE.md                          ← Setup guide for AI agents
├── .claude/
│   ├── agents/test-reviewer.md        ← §9 Test coverage review
│   ├── hooks/pre-commit-verify.sh     ← §6 Verify loop (git hook)
│   └── skills/
│       ├── verify.md                  ← §6 Pre-completion checks
│       ├── ux-review.md               ← §7 Screenshot analysis
│       └── test-e2e.md                ← §9 E2E runner
├── .github/workflows/
│   ├── test-smoke.yml                 ← §9 PR smoke tests
│   ├── test-regression.yml            ← §9 Nightly regression
│   └── test-visual.yml                ← §9 Visual comparison
├── e2e/
│   ├── auth.setup.ts                  ← §4 Auth bypass
│   ├── verify-quick.spec.ts           ← §6 Quick sanity check
│   ├── fixtures/ai-responses/         ← §3 Mock AI data
│   ├── mocks/
│   │   ├── ai-generation-mock.ts      ← §3 Route interception
│   │   └── profile-mock.ts            ← §4 Tier mocking
│   ├── helpers/                       ← Shared utilities
│   ├── pages/                         ← Page objects
│   ├── baselines/                     ← §5 Visual baselines
│   └── suite/
│       ├── smoke.spec.ts              ← @smoke tests
│       ├── generation.spec.ts         ← @regression + mocks
│       ├── tier-access.spec.ts        ← @regression + tiers
│       └── visual.spec.ts             ← @visual screenshots
└── scripts/
    ├── verify.sh                      ← §6 Verify loop script
    └── experiment-baseline.ts         ← §8 Quality regression

11. What Didn't Work

Not everything worked. Here's what we tried that failed, and what we do instead.

Tests coupled to AI output format

We wrote assertions that checked exact field values in AI-generated content. Every time we tweaked a prompt, tests broke — not because the output was wrong, but because it was differently right. Fix: assert structure and type, not exact content. Check that a title field exists and is a non-empty string, not that it equals "Grimgar the Wise."
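
What that looks like as an assertion — a sketch with illustrative field names and endpoint:

e2e/suite/generation.spec.ts (excerpt)
test('generated content has valid structure', async ({ request }) => {
  const response = await request.post('/api/generate');  // endpoint illustrative
  const result = await response.json();
  expect(typeof result.title).toBe('string');      // a title exists...
  expect(result.title.length).toBeGreaterThan(0);  // ...and is non-empty
  // NOT: expect(result.title).toBe('Grimgar the Wise');
});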

Visual baselines that drifted

We added visual regression baselines for every page early in development, when the UI was changing weekly. We spent more time updating baselines than catching regressions. Fix: add visual baselines after the UI stabilizes. During active development, use manual screenshot review instead.

Over-mocking that hid real bugs

We mocked everything — AI calls, auth, database queries. Tests passed in CI but the app was broken because our mocks didn't match real API behavior. A database migration changed a field name, but the mock still returned the old field. Fix: keep mocks minimal. Mock AI calls (expensive, slow, non-deterministic) but use real auth and real database in integration tests.
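
In Playwright terms, "minimal" means one route interception and nothing else:

e2e/suite/generation.spec.ts (excerpt)
test('generation works against real auth and data', async ({ page }) => {
  // Mock only the expensive, non-deterministic AI route; auth and
  // database-backed requests pass through to the real stack.
  await mockAIRoute(page, '**/api/generate', 'sample-generation.json');
  // Deliberately no other page.route() calls.
  await page.goto('/generate');
});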

Agents that gamed the verify loop

Our AI agent learned to run the build command, see it succeed, and report "all checks passed" — while ignoring compiler warnings and skipping the test step. The verify script ran, but the agent only checked the exit code of the first command. Fix: the verify script must run ALL checks and exit non-zero if ANY fail. Use set -euo pipefail in bash.

Experiment baselines that were too tight

We set a 2% degradation threshold on our quality experiments. Normal run-to-run variance was 3-5%. Every CI run flagged a "regression" that was just noise. We spent hours investigating false alarms. Fix: start with 10% threshold and tighten only after you understand your variance. Run the experiment 20 times to establish a realistic baseline.

12. Getting Started

Ready to try it? Fork the starter and adapt it to your project.

terminal
# 1. Fork and clone
git clone https://github.com/stylusnexus/test-starter.git
cd test-starter

# 2. Install
npm install && npx playwright install chromium

# 3. Run
npm test

The starter includes everything from this guide — mock fixtures, auth bypass, visual tests, verify loop, experiment scripts, CI workflows, and Claude Code skills. See the README for how to adapt each component to your stack.

Putting It All Together

The test-starter is designed to merge into your existing project alongside the agent-starter from Shipping with Agents. You don't need separate repos. Here's what a fully equipped project looks like with both development and testing infrastructure:

your-project/
├── CLAUDE.md                          ← project brain (rules, commands, gotchas)
├── playwright.config.ts               ← from test-starter
├── package.json                       ← your app + @playwright/test in devDeps
│
├── .claude/
│   ├── guidances/                     ← from agent-starter (domain knowledge)
│   │   ├── ai-safety.md
│   │   ├── database-patterns.md
│   │   └── testing-strategy.md
│   ├── agents/                        ← both starters
│   │   ├── backend-engineer.md        ← from agent-starter
│   │   ├── ui-engineer.md             ← from agent-starter
│   │   └── test-reviewer.md           ← from test-starter
│   ├── agent-memory/                  ← from agent-starter
│   │   ├── backend-engineer/MEMORY.md
│   │   └── ui-engineer/MEMORY.md
│   ├── skills/                        ← both starters
│   │   ├── brainstorm.md, tdd.md, ... ← from agent-starter
│   │   ├── verify.md                  ← from test-starter
│   │   ├── ux-review.md               ← from test-starter
│   │   └── test-e2e.md                ← from test-starter
│   ├── hooks/                         ← both starters
│   │   ├── domain-context-loader.sh   ← from agent-starter
│   │   ├── require-tests.sh           ← from agent-starter
│   │   ├── pre-commit-verify.sh       ← from test-starter
│   │   └── check-test-coverage.sh     ← from test-starter
│   └── settings.json                  ← merged hooks from both
│
├── e2e/                               ← from test-starter
│   ├── auth.setup.ts
│   ├── verify-quick.spec.ts
│   ├── mocks/ (ai-generation-mock.ts, profile-mock.ts)
│   ├── fixtures/ai-responses/
│   ├── suite/ (smoke, generation, tier-access, visual)
│   └── helpers/, pages/, baselines/
│
├── scripts/                           ← from test-starter
│   ├── verify.sh
│   └── experiment-baseline.ts
│
├── .github/workflows/                 ← from test-starter
│   ├── test-smoke.yml
│   ├── test-regression.yml
│   └── test-visual.yml
│
└── src/                               ← your app code

Items marked "← from test-starter" come from this guide's test-starter; everything else comes from the agent-starter or your own project. To merge: copy the testing files into your project, combine the .claude/ directories, merge the settings.json hook arrays, and add @playwright/test to devDependencies.

Skip the merge — fork the full starter.

The full-starter is the pre-merged version with everything above in one repo. It includes setup docs for Claude Code, Codex, and Cursor — point your AI tool at the right setup file and it configures everything for your project.

The Development Workflow

These tests protect a development workflow. For the other half — agents, skills, hooks, and memory systems — see Shipping with Agents.