Shipping with Agents
14 specialized agents, 20+ workflow skills, automated context injection via hooks, and per-agent memory systems. Advanced Claude Code patterns from a team that ships with them every day.
Want to skip straight to a working setup? Fork the starter template.
This guide uses Claude Code for examples, but the patterns — layered context, domain agents, automated guardrails, persistent memory — apply to any AI coding assistant. Look for adapter callouts throughout the guide that map Claude Code concepts to Cursor, GitHub Copilot, Windsurf, and Codex equivalents.
1. Why This Exists
There's no shortage of "50 Claude Code tips" posts. They're genuinely useful if you're getting started. But they tend to stop at the surface: write a good CLAUDE.md, use slash commands, try agents. That's the tutorial. This guide is about what comes after the tutorial.
We've been shipping a production SaaS (a Next.js 14+ app with AI-powered content generation) with Claude Code as our primary development tool for months. Not as an experiment. Not as a side project. As our daily workflow across 14 specialized agents, 20+ workflow skills, 8 automated hooks, and 18 domain guidance files. Along the way, we've learned things that don't show up in beginner guides because you can't learn them until you've hit the walls yourself.
Here are three things you won't learn until you've shipped with agents for a while: context compounds (a CLAUDE.md with 15 well-earned gotchas prevents more bugs than one with 3 generic rules), agents need memory (without persistent memory, your agents rediscover the same domain knowledge every conversation), and hooks prevent more bugs than tests (automated context injection at the right moment catches classes of mistakes that test suites only find after the fact).
This isn't a replacement for the beginner guides. It's what comes next. If you've been using Claude Code for a few weeks and you're starting to wonder how to scale it beyond ad-hoc prompting, this is for you.
2. CLAUDE.md Architecture
Most guides tell you to "write a good CLAUDE.md." That's true, but it undersells the actual architecture. In a production codebase, CLAUDE.md isn't a single file. It's a layered system: a root file that acts as the project brain, plus a directory of domain-specific guidance files that get loaded on demand.
The Root CLAUDE.md
This is the file Claude reads on every conversation. It should contain the stuff that applies everywhere, and it needs to be ruthlessly curated. Ours covers five categories:
- Project overview - What the app does, its positioning, the tech stack. Two paragraphs, not two pages.
- Dev commands - npm run dev, npm run build, npm run lint, npm test. The commands someone needs on day one.
- Critical rules - Security requirements (always sanitize AI input, never expose service keys client-side), compliance rules, authentication patterns. Things where getting it wrong isn't a bug, it's an incident.
- Common gotchas - The bugs that keep biting. Ours has 15 numbered gotchas, each one earned by a real mistake: "Profile mocks MUST include trial fields even for premium tiers," "jsonMode only works with OpenAI provider, not Anthropic," "useSearchParams() requires a Suspense boundary." Every entry is a bug that happened at least twice.
- Architecture links - Pointers into deeper docs, not the docs themselves. A line like "See .claude/guidances/testing-strategy.md for E2E vs integration test guidance" keeps the root file scannable.
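Put together, a bare-bones root file covering these five categories might look like the sketch below. The commands and entries are illustrative placeholders drawn from examples elsewhere in this guide, not our literal file:
# CLAUDE.md
## Project Overview
SaaS app for [what it does, who it's for]. Next.js 14 App Router, Supabase, deployed on Render.
## Dev Commands
- npm run dev: local dev server
- npm run build: production build
- npm run lint / npm test: lint and test suites
## Critical Rules
1. Always sanitize user input before AI calls.
2. Never expose service keys client-side.
## Common Gotchas
1. jsonMode only works with the OpenAI provider, not Anthropic.
2. useSearchParams() requires a Suspense boundary.
## Architecture Links
- See .claude/guidances/testing-strategy.md for E2E vs integration test guidance.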
Domain Guidances
Here's the key insight: not every piece of context belongs in CLAUDE.md. If you stuff everything into one file, you're burning tokens on AI safety patterns when the developer is editing CSS. Domain guidances are separate files that get loaded on-demand by hooks when you touch relevant code.
We have 18 of these, all living in .claude/guidances/.
Each guidance file is self-contained. It explains when it applies, what patterns to follow, and what mistakes to avoid. Here's the general structure:
# AI Safety Patterns
## When This Guidance Applies
Loaded automatically when editing files in src/lib/ai/,
src/lib/prompts/, or src/lib/guardrails/.
## Key Patterns
- Always sanitize user input before AI calls
- Validate AI output before returning to user
- Log all AI interactions for audit
## Anti-Patterns
- Never trust raw AI output in user-facing responses
- Never skip sanitization "just for internal tools"
## Key Files
- src/lib/ai/guardrails/safety.ts
- src/lib/ai/validation/output-validator.ts
The "When This Guidance Applies" section isn't just documentation. It tells the hooks system which files should trigger loading this guidance. When you edit a file in src/lib/ai/, the AI safety guidance loads automatically. When you're working on analytics, the PostHog patterns load. You don't have to remember which guidance exists. The system remembers for you.
Your CLAUDE.md doesn't need to be 500 lines on day one. Start with dev commands and 3-4 gotchas. It'll grow naturally as you hit the same issues twice. Our gotchas section started with 3 entries. It's at 15 now, and every single one exists because we made the same mistake more than once.
Every AI coding tool has its version of CLAUDE.md. Cursor: .cursorrules (project root). GitHub Copilot: .github/copilot-instructions.md. Windsurf: .windsurfrules. Codex: AGENTS.md or codex.md. The layered architecture pattern (root file + domain guidances loaded on demand) works in all of them — the file names and loading mechanisms differ, but the principle of separating always-on context from domain-specific context is universal.
3. Agent-Per-Domain
Claude Code lets you define specialized agents that run as subagents within your main conversation. We've learned the hard way that one generalist agent doesn't scale. Domain expertise gets diluted when a single agent tries to be a backend architect, a UX designer, and a competitive analyst in the same breath. The fix is straightforward: one agent per domain, each with its own expertise, memory, and focus.
Our Active Agents
These aren't theoretical. We invoke them regularly across our development workflow, organized by function:
- Engineering: backend-engineer, ui-engineer, ai-engineer
- Product & Strategy: product-manager, competitive-intel-analyst
- Design & Content: ux-designer, technical-writer
- Domain specialists: agents specific to our product's content domain (every product has these, whether it's compliance, content, analytics, or industry-specific logic)
Our agent-starter includes definitions for backend-engineer, ui-engineer, ux-designer, ai-engineer, product-manager, and technical-writer, plus a domain-specialist.md.example template for creating agents specific to your product's domain. Each one comes with expertise areas, working style guidelines, and memory integration. Fork, customize, and add your own.
When to Create an Agent vs. Extend CLAUDE.md
Not everything needs its own agent. Here's how we decide:
- Create an agent when a domain is recurring, needs specialized knowledge, benefits from its own persistent memory, or requires specific tools. Our supabase-specialist exists because database migrations have enough nuance (RLS policies, idempotent DDL, type generation) that a generalist gets it wrong regularly.
- Use CLAUDE.md for one-off rules, project-wide conventions, and things every agent should know. "Use conventional commits" belongs in CLAUDE.md. "How to design a migration that handles concurrent schema changes" belongs in an agent.
Agent Definition Structure
Each agent is a markdown file in .claude/agents/. The structure is simple:
---
name: backend-architect
description: Use this agent for API design, database schema,
and server-side architecture decisions
---
You are an expert backend engineer specializing in
Next.js API routes, PostgreSQL, and serverless architecture.
## Your Expertise
- API design and REST patterns
- Database schema and migrations
- Authentication and authorization
- Performance optimization and caching
## Key Conventions
- All API routes validate input with Zod
- Database migrations must be idempotent
- Never bypass RLS policies
## Agent Memory
You have persistent memory at
`.claude/agent-memory/backend-architect/`.
Read your MEMORY.md before starting work.
Write insights back after completing work.
The frontmatter (name and description) tells Claude Code when to suggest this agent. The body gives the agent its persona, expertise boundaries, and a pointer to its persistent memory (more on that in section 4).
The Dispatch Pattern
In practice, dispatch happens naturally in conversation. You say "review this database migration" and Claude recognizes it should spawn the backend-architect subagent. Or you explicitly ask: "Use the supabase-specialist to check this RLS policy." Either way, the subagent runs with its own context, its own expertise framing, and its own memory. When it finishes, the result flows back to your main conversation.
This matters because the subagent doesn't carry the full conversation context. It gets a focused task with the right domain framing. A ux-designer reviewing a component doesn't need to know about your database migration strategy. A backend-architect designing an API doesn't need your CSS conventions. The narrower context means better output.
Global vs. Project Scope
Almost everything in Claude Code's configuration system exists at two levels: global (your user profile, applies everywhere) and project (checked into the repo, applies to that codebase). This applies to agents, hooks, settings, and CLAUDE.md itself. Understanding the precedence matters.
- CLAUDE.md: A global one lives at ~/.claude/CLAUDE.md and loads for every project. A project one lives at the repo root. Both are loaded, with project-level content taking precedence when they conflict.
- Agents: Global agents in ~/.claude/agents/ are available everywhere (your personal workflow specialists). Project agents in .claude/agents/ travel with the repo and are shared with your team. If both define an agent with the same name, the project version wins.
- Hooks: Global hooks in ~/.claude/settings.json run for every project. Project hooks in .claude/settings.json run only for that repo. Both fire, they don't override. This means your global "notify me when idle" hook works alongside the project's "check compliance on every edit" hook.
- Memory: Global project memory (~/.claude/projects/.../memory/) persists user preferences across all conversations. Agent memory (.claude/agent-memory/) is project-scoped and version-controlled, so the whole team benefits from accumulated domain knowledge.
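Laid out on disk, the two scopes look roughly like this (paths taken from the list above; comments are ours):
~/.claude/                  # global: your user profile, applies everywhere
  CLAUDE.md                 # personal rules, loaded for every project
  agents/                   # personal workflow specialists
  settings.json             # global hooks (e.g. idle notifications)
your-repo/
  CLAUDE.md                 # project brain, checked in
  .claude/
    agents/                 # team-shared specialists (same name: project wins)
    settings.json           # project hooks (e.g. compliance checks)
    agent-memory/           # version-controlled agent knowledge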
Put personal workflow preferences at the global level (notification hooks, your preferred commit style, agents for tasks you do across all projects). Put domain knowledge at the project level (compliance hooks, specialized agents, guidances). Your teammates get the project-level setup automatically when they clone the repo. Your personal preferences stay yours.
We started with one mega-agent that tried to do everything. It knew about database patterns, UI conventions, content strategy, analytics, and deployment. Output quality dropped because domain expertise was too diluted. The agent would mix concerns, applying backend patterns to frontend code or suggesting UI changes during infrastructure work. Splitting into 14 specialists was the fix. Each agent is dumber in isolation but dramatically better at its specific job.
Claude Code has first-class agent definitions (.claude/agents/). Other tools don't — yet. But the pattern still works. Cursor: create separate .cursorrules sections for different domains, or use Cursor's chat personas. Copilot: organize domain knowledge into separate instruction files and reference them. Codex: use sections in AGENTS.md to define domain-specific personas. The core insight — specialized context produces better output than generalist context — is tool-agnostic. Even a simple "When I ask about database work, also read docs/database-patterns.md" in your instructions file gets you 80% of the benefit.
4. Agent Memory
Here's the problem agents have without memory: they're goldfish. Every conversation starts from zero. Your backend architect discovers that generated types files pick up stderr output, works around it, and then forgets that lesson forever. Next week, same agent, same mistake, same debugging session. Memory fixes this.
Two-Tier Memory Hierarchy
We use two layers of persistent memory, each with a different scope:
- Project memory (~/.claude/projects/.../memory/) stores cross-cutting insights that apply to every agent and every conversation. User preferences ("always create feature branches, never commit to dev"), workflow decisions ("use squash merge for release-please"), GTM status, and architectural patterns that span domains. This is the shared brain.
- Agent memory (.claude/agent-memory/<agent-name>/) stores domain-specific knowledge each agent accumulates over time. The backend architect remembers database migration patterns. The UI engineer remembers component conventions. The storyteller remembers narrative quality standards. Each of our 14 agents has its own directory, checked into version control.
The distinction matters. "We deploy on Render, not Vercel" belongs in project memory because every agent needs it. "RLS policies must use auth.uid() scoping" belongs in the Supabase specialist's agent memory because only that agent cares.
Memory Directory Structure
.claude/agent-memory/
  backend-architect/
    MEMORY.md        (index: one line per insight)
    patterns.md      (topic detail file)
    issues.md        (topic detail file)
  ui-engineer/
    MEMORY.md
  epic-storyteller/
    MEMORY.md
The Protocol
Every agent follows the same memory lifecycle:
- Before starting work: read your MEMORY.md to load prior context. This is the single most important step. Without it, the agent starts every task blind to everything it's learned before.
- After completing work: write insights back to memory. If you discovered a pattern, hit a wall, or made a decision worth remembering, persist it.
The key is knowing what to save and what to skip:
- Save: domain patterns ("API routes use cookie auth, not Bearer tokens"), key decisions ("chose HNSW over IVFFlat for vector indexes"), recurring issues ("generated types file picks up stderr, check head AND tail"), validated approaches that worked.
- Don't save: session-specific context (file paths you read, intermediate debugging steps), information already captured in CLAUDE.md, general programming knowledge the model already has.
What a MEMORY.md Looks Like
The index file is small, scannable, and links to detail files for anything that needs more than a line:
# Backend Architect - Memory
## Key Insights
- [Migration safety](./patterns.md) — Always use IF NOT EXISTS, test idempotency
- [RLS patterns](./issues.md) — Service role bypass gap found on content table
## Patterns
- API routes use cookie auth via SSR, not Bearer tokens
- Database views need type assertion workaround
- Trial quota uses atomic UPDATE RETURNING, fail-secure on zero rows
## Known Issues
- Generated types file picks up stderr output, check head AND tail
- Views not in generated types cause "excessively deep" errors
- Migrations in src/lib/db/ are NOT applied to production
It's loaded into context every time the agent starts. A 500-line memory file burns tokens on old insights while crowding out the actual task. Use topic files (patterns.md, issues.md) for details, and link to them from the index. One line per insight, with a link for depth.
Memory is what turns agents from stateless tools into teammates who learn. The backend architect who remembers that jsonMode only works with OpenAI doesn't need to rediscover it through a 30-minute debugging session. The UI engineer who remembers that useSearchParams() needs a Suspense boundary doesn't ship that bug again. Over time, each agent's memory becomes a curated knowledge base for its domain, built from real mistakes and real wins rather than generic docs.
Cross-session memory is no longer Claude-exclusive. Windsurf: a full Memories system — auto-generated and manual, stored per-workspace, loaded when relevant. Closest to Claude Code's model. Codex: stores transcripts locally, codex resume picks up earlier threads, and AGENTS.md files act as persistent instructions. Cursor: removed its built-in Memories feature in v2.1.x; now relies on Rules files and third-party MCP tools (Recallium, ContextForge) for persistence. Copilot: no native memory; use a docs/ai-memory/ directory with markdown files and reference them in copilot-instructions.md. The file-based approach (version-controlled domain knowledge that loads per session) works as a universal fallback for any tool.
5. Hooks as Context Injection
This is the thing nobody talks about. CLAUDE.md tells Claude what to know. Agents tell Claude who to be. But hooks tell Claude when to know it. They're the automated nervous system that fires the right context at the right moment, without you having to remember anything.
Hook Event Types
Claude Code supports several hook events. Here are the ones we actually use:
- PreToolUse: fires before any tool runs (Read, Edit, Bash, Write). This is where you inject context. When Claude is about to read a database migration file, you can ensure it loads your migration safety guide first.
- PostToolUse: fires after a tool completes. This is where you validate. After an edit to an AI prompt file, you can check that safety patterns are followed.
- Stop: fires when the agent is about to stop and return a response. This is your last chance to enforce requirements, like making sure tests were run before declaring work complete.
- SubagentStart: fires when a subagent is spawned. Useful for injecting additional context into specialized agents at dispatch time.
- Notification: fires on system events like session start, idle timeout, and permission prompts. Less commonly used but valuable for session-level setup.
Hook Configuration
Hooks live in your settings.json. Each hook specifies which tools trigger it (via a regex matcher) and what command to run:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Read|Edit|Write",
"hooks": [{
"type": "command",
"command": "bash .claude/hooks/domain-context-loader.sh"
}]
}
],
"PostToolUse": [
{
"matcher": "Edit|Write",
"hooks": [{
"type": "command",
"command": "bash .claude/hooks/instrumentation-check.sh"
}]
}
],
"Stop": [
{
"hooks": [{
"type": "command",
"command": "bash .claude/hooks/require-tests.sh"
}]
}
]
}
}
The Domain Context Loader
This is our most powerful hook. It watches which files you're touching and automatically loads the relevant guidance document. No manual invocation, no forgetting, no wasted tokens loading AI safety docs when you're editing CSS:
#!/bin/bash
# Injects relevant guidance when you touch domain-specific files
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // ""')
[ -z "$FILE_PATH" ] && exit 0
GUIDANCE=""
case "$FILE_PATH" in
*/api/*|*/routes/*) GUIDANCE="api-patterns" ;;
*/auth/*|*/middleware/*) GUIDANCE="auth-security" ;;
*/db/*|*/migrations/*) GUIDANCE="database-patterns" ;;
*/ai/*|*/prompts/*) GUIDANCE="ai-safety" ;;
*/components/ui/*) GUIDANCE="design-system" ;;
esac
if [ -n "$GUIDANCE" ] && [ -f ".claude/guidances/${GUIDANCE}.md" ]; then
echo "Read .claude/guidances/${GUIDANCE}.md before making changes."
fi
The hook reads the tool input from stdin (JSON with the file path being accessed), pattern-matches it against your domain structure, and if there's a relevant guidance file, tells Claude to read it. The guidance loads just-in-time, exactly when it's needed.
The Stop Guard
The stop guard prevents Claude from declaring "done" without running tests when critical files were modified. This catches the "it compiles, ship it" failure mode:
#!/bin/bash
# Block stopping if critical files changed without test verification
INPUT=$(cat)
# The Stop hook receives JSON on stdin; transcript_path points at the session log
TRANSCRIPT=$(echo "$INPUT" | jq -r '.transcript_path // ""')
if git diff --name-only HEAD 2>/dev/null | grep -qE '\.(ts|tsx)$'; then
  # TypeScript files changed: verify a build was run this session
  if [ -f "$TRANSCRIPT" ] && ! grep -q "npm run build" "$TRANSCRIPT"; then
    echo "CRITICAL: TypeScript files were modified. Run npm run build before completing." >&2
    exit 2  # exit code 2 blocks the stop and feeds stderr back to Claude
  fi
fi
exit 0
When the hook exits with code 2, the stderr message is fed back to Claude, which must address it before stopping. This turns "you should run the build" from a suggestion into a requirement.
Before and After
Before hooks:
- Manually remembering to check compliance docs before editing AI prompts
- Forgetting to run the build after TypeScript changes
- Loading context by hand ("read the database patterns guide first")
- Inconsistent quality depending on what the developer remembers to ask for

After hooks:
- Context auto-loads when you touch relevant files
- Build verification required before stopping
- Compliance and safety checks run on every write
- Consistent quality regardless of what the developer remembers
Creating Hooks from Conversation
Writing hooks by hand works, but there's a faster way. The hookify plugin analyzes your conversation history to find behaviors worth preventing, then generates the hook automatically.
Say Claude just made a mistake — it edited a migration file without checking the existing schema, or it committed without running the build. Instead of writing a hook from scratch, you say /hookify and it reviews the conversation, identifies the failure pattern, and generates a PreToolUse or PostToolUse hook to prevent it from happening again. The hook goes straight into your .claude/settings.json.
This is how most of our hooks were born. We didn't design them upfront. We hit a problem, said "that should never happen again," and hookify turned the lesson into an automated guardrail. It's the fastest path from "that was a mistake" to "that can't happen anymore."
Install it with claude install-plugin hookify. Then use /hookify after a mistake, /hookify list to see your current rules, or /hookify configure to enable or disable individual hooks. Each rule can target specific hook events (PreToolUse, PostToolUse, Stop) and specific tool matchers.
Hooks turn tribal knowledge into automated guardrails. Every gotcha in your CLAUDE.md started as something someone forgot. Hooks make sure the next person (or the next agent) doesn't forget it either. We have 8 hooks in production, and they prevent more bugs than any single test suite.
Hooks are Claude Code's most unique feature — no other AI coding tool has a direct equivalent. But the pattern (automated context injection based on what files you're touching) can be approximated. All tools: use git hooks (.git/hooks/pre-commit) to run verification before commits — this works regardless of which AI tool made the changes. Cursor: Cursor's @docs and @codebase references provide manual context injection. CI/CD: move guardrail checks (build verification, compliance scanning) into your CI pipeline so they catch issues regardless of which tool authored the code. The principle is: don't rely on the developer (or agent) remembering to check. Automate the check.
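To make the git-hook fallback concrete, here's a minimal sketch of a pre-commit guard, assuming the npm scripts used earlier in this guide; adapt the file patterns and commands to your stack:
#!/bin/bash
# .git/hooks/pre-commit (make executable: chmod +x .git/hooks/pre-commit)
# Tool-agnostic guardrail: block commits of TypeScript changes that don't build
if git diff --cached --name-only | grep -qE '\.(ts|tsx)$'; then
  echo "TypeScript files staged; verifying build..."
  if ! npm run build; then
    echo "Build failed; commit blocked." >&2
    exit 1
  fi
fi
exit 0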
6. Skills Composition
Skills are reusable workflows that Claude executes as slash commands. They're markdown files that define a process, and Claude follows them. But the real power isn't individual skills. It's how they compose into chains that enforce an entire development lifecycle.
Two Types of Skills
We've learned to distinguish between skills that should be rigid and skills that should flex:
- Rigid skills (TDD, debugging, verification): these enforce discipline. The TDD skill says "write the test first, watch it fail, then implement." You don't adapt away from that. You don't skip the failing test step because you're confident. The rigidity is the point. These skills exist precisely because developers (and agents) will take shortcuts without them.
- Flexible skills (brainstorming, architecture, code review): these adapt principles to context. The brainstorming skill explores intent and proposes approaches, but it doesn't prescribe a fixed number of questions or a rigid template. It reads the room. Different problems need different exploration depths.
Brainstorming First, Always
Every creative task (new features, new components, new behavior) runs through the brainstorming skill before implementation. This prevents the number one failure mode we see with AI coding: jumping straight to code.
Without brainstorming, Claude sees "add a feedback system" and immediately starts writing a React component. With brainstorming, it first asks what kind of feedback, who sees it, what the data model looks like, how it interacts with existing systems, and whether there are edge cases worth designing for upfront. The code that comes after brainstorming is dramatically better because the thinking happened first.
Skill Chaining
Individual skills are useful. Chained skills are transformative. Here's the chain we run for every significant feature:
brainstorm → write-plan → execute-plan → verify → review → ship
  (think)      (plan)        (build)     (test)   (review)  (ship)
Each skill hands off to the next. Brainstorming produces a design. Writing-plans turns that design into an implementation plan with ordered steps. Executing-plans works through the plan with review checkpoints. Verification confirms the work actually works (not just compiles). Review-and-ship runs an AI code review, fixes any issues found, then commits, pushes, and creates a PR.
The chain enforces: think, plan, build, test, review, ship. You can't skip straight to build without thinking. You can't ship without testing. And you can't ship without review.
Notice that review is a built-in step in the pipeline, not a separate process someone remembers to do. Every feature that ships goes through an AI code reviewer that checks for bugs, security issues, and convention violations — before a human ever sees the PR. Solo developers get review discipline without needing a team. Teams get a consistent baseline review that frees human reviewers to focus on design and architecture instead of nitpicking style.
Custom Project-Specific Skills
Beyond the general development skills, we've built several that are specific to our workflow:
- experiment: runs AI content generation N times, measures quality metrics, suggests prompt improvements, and tracks trends over time. Essential when your product generates AI content and "it works" isn't good enough.
- Structured specs: we write specs before implementation on anything non-trivial. Specs go through drafting, review, and approval before a plan gets written. The brainstorm and write-plan skills handle this naturally. Prevents building the wrong thing.
- deploy: merges dev to main with a squash commit formatted for release-please automation. One command, consistent deploys, proper changelogs (a rough sketch of the underlying git commands follows this list).
- orient: rebuilds working context after /clear. When you compact or clear your conversation, this skill re-loads the project state, current branch, recent changes, and open issues so you don't start cold.
- visualize-project: generates tiered Mermaid architecture diagrams based on project complexity. Runs during setup (Phase 1.5, between Discover and Generate CLAUDE.md) to ground your CLAUDE.md in confirmed visuals, or anytime to regenerate. Scores complexity from 10 signals and outputs 1–6 diagrams to docs/architecture/diagrams/.
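As promised above, here's what the core of a deploy step like ours could look like. This is a sketch, assuming dev and main branch names and a release-please-style conventional commit; the actual skill wraps this in checks and error handling:
#!/bin/bash
# Sketch: squash-merge dev into main with a release-please-friendly commit
set -euo pipefail
git checkout main
git pull origin main
git merge --squash dev
# The conventional commit type drives release-please's version bump (assumed convention)
git commit -m "feat: weekly release"
git push origin main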
Skill Definition Structure
A skill is a markdown file that defines when to use it, what process to follow, and what the output should be. Here's a simplified example:
# Brainstorming
Help turn ideas into fully formed designs through collaborative dialogue.
## When to Use
Before any creative work: features, components, new behavior.
## Process
1. Explore project context — read relevant code, understand current state
2. Ask clarifying questions (one at a time, not a wall of questions)
3. Propose 2-3 approaches with trade-offs
4. Present design, get approval
5. Write spec document
6. Hand off to write-plan skill
The "When to Use" section is important. It tells Claude (and developers) when this skill applies, which helps with discoverability. When someone types "I want to add a notification system," Claude can recognize that brainstorming should run first.
Our agent-starter includes: brainstorm, TDD, verify, write-plan, execute-plan, orient, debug, review-and-ship, activity-summary, and experiment. Together they cover the full workflow: think, plan, build, test, debug, verify, measure, ship. Start with brainstorm + TDD + verify (those three alone will change how you work), then adopt the rest as you need them.
Claude Code skills are slash commands backed by markdown files. Other tools achieve similar workflow automation differently. Cursor: define workflows as step-by-step instructions in your .cursorrules or use Cursor's Composer for multi-step tasks. Windsurf: Flows are Windsurf's equivalent — multi-step AI workflows with checkpoints. Any tool: the simplest version is a docs/workflows/ directory with markdown files describing each workflow. Tell your AI "follow the process in docs/workflows/feature-development.md." It's less automatic than slash commands, but the workflow discipline (think → plan → build → verify → ship) matters more than the invocation mechanism.
7. Multi-Agent Orchestration
A single specialist agent is good. Several of them working in concert is where things get interesting. Once you've got agents that each do their job well, the next question is: how do you coordinate them? The answer isn't a framework. It's a set of dispatch patterns that match how work actually flows.
Parallel Dispatch
Some tasks are naturally independent. When we needed to evaluate a new feature's impact, we asked our UI engineer and marketing agent the same question at the same time: "How should we present this new feature to users?" Both worked independently. The UI engineer focused on component architecture and interaction patterns. The marketing agent focused on positioning and user-facing copy. We synthesized their recommendations into a single plan that was better than either would've produced alone.
The pattern is straightforward: identify tasks with no shared state or sequential dependencies, dispatch agents in parallel, then merge the results. You don't need a message bus or an orchestration layer. You need to recognize when two questions can be answered independently.
# Two independent tasks, dispatched simultaneously:
Agent 1 (ui-engineer): "Design the feature sidebar component"
Agent 2 (product-manager): "Define user stories for feature discovery"
# Both run in parallel — no shared state, no dependencies
# Results merge in the main conversation when both complete
Foreground vs Background Agents
Not every agent dispatch needs your immediate attention. We've learned to distinguish between two modes:
- Foreground agents: you need the results before you can proceed. Research tasks, architecture analysis, design decisions. When you ask the backend architect to evaluate two schema approaches, you're blocked until the answer comes back. These run in the conversation flow and you wait for the output.
- Background agents: fire and forget, get notified when done. Code reviews, documentation audits, compliance checks. When you dispatch the security auditor to review a PR, you don't sit and watch. You keep working and check the results when they're ready.
The distinction matters for workflow efficiency. If you treat every agent dispatch as foreground, you'll spend half your time waiting. If you treat everything as background, you'll miss critical decisions that need your input before proceeding.
The Review-Merge Pipeline
This is our most automated multi-step workflow. One command replaces six manual steps:
# What used to be 6 manual steps:
1. Review uncommitted changes for issues
2. Fix any issues found during review
3. Commit with conventional commit message
4. Push to remote
5. Create pull request
6. Merge with admin privileges
# Now it's one command:
/review-and-ship
The pipeline reviews your uncommitted changes, identifies issues (missing types, incomplete error handling, style violations), fixes them automatically, commits everything with a properly formatted conventional commit message, pushes to the remote, creates a PR, and merges it. The entire flow is encoded in a single skill that chains these steps with error handling at each gate. If the review finds something it can't auto-fix, it stops and asks you.
Once the PR is merged, the deploy skill handles production: it merges your development branch to main with a squash commit formatted for release automation (release-please, changesets, or whatever you use). Review-and-ship gets code into a PR. Deploy gets it to production. Two skills, clean separation.
Gated Development: The Skill Chain
For larger features, we chain skills together so each phase has an explicit gate before the next one starts:
┌─────────────┐   ┌────────────┐   ┌──────────────┐
│ brainstorm  │ → │ write-plan │ → │ execute-plan │
│ explore,    │   │ break into │   │ build task   │
│ propose,    │   │ steps with │   │ by task,     │
│ get approval│   │ exact code │   │ TDD each one │
└─────────────┘   └────────────┘   └──────┬───────┘
                                          │
                                          ▼
┌────────────┐   ┌──────────────┐   ┌──────────────┐
│ deploy     │ ← │ review-and-  │ ← │ verify       │
│ squash to  │   │ ship         │   │ build, test, │
│ main, push │   │ review, fix, │   │ lint, check  │
│ to prod    │   │ commit, PR   │   │              │
└────────────┘   └──────────────┘   └──────────────┘
You can't start writing a plan until the brainstorm is approved. You can't start building until the plan is reviewed. Each skill hands off to the next, and every handoff is a checkpoint where you can course-correct. This prevents the most common failure mode in AI-assisted development: jumping straight to code before the problem is well-defined.
Second Opinions: Agents as Consultants
One of the most underrated uses for multiple agents is getting a second opinion when you're stuck. If you've been debugging a database migration for 20 minutes and you're going in circles, spawn a fresh backend-engineer agent with just the problem description. It doesn't carry your conversation's accumulated assumptions and dead ends. It looks at the problem fresh.
We do this regularly. The main conversation has been working on a feature for a while and hits a wall. Rather than continuing to push through with increasingly stale context, we ask a specialist agent to take an independent look. The agent doesn't know what we've already tried (that's the point), so it approaches the problem without bias. Sometimes it finds the same answer we were circling around. Often it spots something we'd been looking right past.
This works for more than debugging. Stuck on an architecture decision? Ask the backend-engineer and ui-engineer independently how they'd approach it. Getting conflicting signals on a feature? Ask the product-manager to evaluate the trade-offs without knowing which side you're leaning toward. Fresh context is a feature, not a limitation.
Parallel dispatch is powerful but harder to coordinate. Get comfortable with one agent at a time before launching three. Master the skill chain (brainstorm, plan, execute, verify, ship) first. Once that feels natural, you'll develop an intuition for which tasks can safely run in parallel and which need sequential gates. And when you're stuck, remember: a fresh agent with no context baggage is often exactly what you need.
8. Experiment-Driven Dev
When your product generates AI content, "it works" isn't good enough. The output might parse, it might look reasonable on a quick glance, but is it actually good? Is it consistently good? Does it degrade when you change models or tweak prompts? You can't answer these questions by reading a few samples. You need to measure.
The Core Loop
Every AI quality improvement we make follows the same cycle: measure, fix a targeted metric, re-measure. Not "try something and see if it feels better." Not "tweak the prompt and hope." Measure a specific metric, make a targeted change, measure again to confirm improvement. That's the whole process.
# 1. Measure baseline
Run generation 10 times → collect metrics
- Parse success rate: 8/10 (80%)
- Schema compliance: 7/10 (70%)
- Average quality score: 3.2/5
# 2. Fix targeted metric (parse failures first)
Identify root cause → JSON wrapping issue in prompt
Add explicit schema to system prompt
# 3. Re-measure
Run generation 10 times → collect metrics
- Parse success rate: 10/10 (100%) ← fixed
- Schema compliance: 7/10 (70%) ← unchanged, tackle next
- Average quality score: 3.4/5 ← slight improvement
How It Works in Practice
An experiment script runs your generation N times (we typically use 10 for quick checks, 25-50 for model comparisons), collects structured metrics, and produces a summary you can actually act on. There's no magic framework here. It's a script that calls your generation endpoint, parses the output, checks it against your schema, scores it on whatever dimensions matter, and reports the results.
What we measure depends on the content type, but it generally falls into three categories:
- Structural integrity: does the output parse? Does it conform to the expected schema? Are all required fields present? This is the baseline. If your JSON doesn't parse, nothing else matters.
- Content quality: does the output follow domain rules? Are the generated values within expected ranges? Does the structure match what the UI expects? This catches the subtler failures where the output is valid JSON but contains nonsensical or incomplete content.
- Model comparison: how does GPT-4o perform vs Claude for this specific content type? Not in general benchmarks, but for your exact use case with your exact prompts. The answer is often surprising.
How Results Feed Back Into Development
Here's a real example. We were generating structured skill challenges using Haiku (Anthropic's fast model). It seemed to work in manual testing. When we ran experiments, the data told a different story: 60% JSON parse failure rate. Six out of ten generations produced output that couldn't be parsed. We switched to GPT-4o Mini, re-ran the experiment, and got 100% parse success. The fix was also 21x cheaper per call and faster. Without the experiment, we'd still be shipping a feature with a 60% failure rate and blaming "AI flakiness."
That's the feedback loop: experiment reveals a problem, data points to the cause, you make a targeted fix, experiment confirms the fix worked. No guessing, no "this prompt feels better," no shipping on vibes.
Model Comparison: Skill Challenge Generation (n=10 per model)
| Metric | Haiku 4.5 | GPT-4o Mini |
|---------------------|------------|-------------|
| Parse success | 4/10 (40%) | 10/10 (100%)|
| Schema compliance | 3/10 (30%) | 10/10 (100%)|
| Avg latency | 2.8s | 1.9s |
| Cost per generation | $0.021 | $0.001 |
Decision: Switch to GPT-4o Mini for all skill challenges.
Re-run after switch: 10/10 parse, 10/10 schema. Confirmed.
We skipped experiments early on and guessed at prompt improvements. When we finally measured, we found we were wrong about 40% of the time. A prompt change that "felt better" in manual testing actually degraded schema compliance. A model we assumed was superior for a task was slower, more expensive, and less reliable than the cheaper alternative. Now we measure everything. The 10 minutes an experiment takes saves hours of debugging production failures.
A bash script that runs your generation 10 times and counts failures is enough to start. Seriously. Call your endpoint in a loop, try to parse each response, count successes and failures. That alone will tell you more about your AI output quality than any amount of manual testing. Fancy metrics come later. Start by counting what breaks.
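Here's what that minimal version might look like, assuming a local generation endpoint at http://localhost:3000/api/generate that returns JSON; the URL and payload are placeholders for your own:
#!/bin/bash
# Minimal experiment: run generation 10 times, count JSON parse failures
PASS=0
FAIL=0
for i in $(seq 1 10); do
  RESPONSE=$(curl -s -X POST http://localhost:3000/api/generate \
    -H "Content-Type: application/json" \
    -d '{"type":"skill-challenge"}')
  # jq -e exits non-zero if the response is not valid JSON
  if echo "$RESPONSE" | jq -e . > /dev/null 2>&1; then
    PASS=$((PASS + 1))
  else
    FAIL=$((FAIL + 1))
    echo "Run $i: parse failure"
  fi
done
echo "Parse success: $PASS/10, failures: $FAIL/10"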
9. The Full Stack
We've covered each layer on its own. Now let's see how they compose. The whole point is that these aren't independent features you bolt on. They're a stack, and each layer amplifies the ones below it.
┌─────────────────────────────────────────────────────────────────────┐
│ Layer 8: Experiments Quality feedback loop │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 7: MCP Servers External tool integrations │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 6: Skills Workflow discipline (20+ active) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 5: Hooks Automated context injection (8) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 4: Agent Memory Persistent learning across sessions │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 3: Agents Specialized roles (14 active) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 2: Guidances On-demand domain knowledge (18) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Layer 1: CLAUDE.md Base configuration, project rules │
└─────────────────────────────────────────────────────────────────────┘
Here's what matters: these layers compound multiplicatively, not additively. CLAUDE.md alone is good. It gives every conversation a baseline. Add guidances, and Claude loads domain knowledge on demand instead of burning tokens on everything at once. Add hooks, and that domain knowledge loads automatically instead of requiring you to remember. Add agents, and each domain gets a specialist instead of a generalist. Add memory, and those specialists learn instead of starting fresh. Add skills, and your workflow gets discipline instead of ad-hoc prompting. Add experiments, and you're measuring quality instead of guessing.
Each layer solves a different problem:
- CLAUDE.md solves "Claude doesn't know our project rules"
- Guidances solve "CLAUDE.md is too big and wastes tokens"
- Agents solve "one generalist can't master every domain"
- Memory solves "agents forget what they learned yesterday"
- Hooks solve "we keep forgetting to load the right context"
- Skills solve "we skip steps when we're in a hurry"
- MCP Servers solve "Claude can't reach external tools"
- Experiments solve "we don't know if our changes actually improved things"
You don't need all eight layers on day one. But understand that they're designed to stack. When you hit the ceiling of one layer, the next one is waiting.
MCP Servers: Extending Agent Reach
MCP (Model Context Protocol) servers package external integrations as tools that any agent can invoke. Instead of writing custom API wrappers, you connect a server and every agent in your setup can use it. The practical question isn't "should I use MCP?" but "when does MCP pay for itself vs. just calling an API directly?"
Our rule of thumb: if 3+ agents need access to the same external system, an MCP server pays off. If only one agent uses a service, a direct API call is simpler. MCP adds value when integrations are shared infrastructure, not one-off connections.
Here are the categories we've found most useful:
- Documentation lookup (e.g., context7): fetches current library and framework docs so agents don't rely on stale training data. This alone prevents a whole class of "the API changed" errors.
- Browser automation (e.g., Playwright): visual testing, UI verification, and scraping. Essential for verifying frontend changes actually look right.
- Database management (e.g., Supabase, Neon, Postgres): schema inspection, query execution, and migration management. Agents can check table structure before writing queries.
- Deployment and infra (e.g., Render, Vercel, Cloudflare Workers, AWS, Railway): service management, logs, metrics, and environment configuration. Useful for deploy verification and incident investigation.
- Observability (e.g., Sentry, PostHog): error tracking and analytics queries. Agents can check error rates after deploys or investigate user-reported issues.
- Source control (e.g., GitHub): PR management, issue tracking, code review, and CI check status. Probably the most universally useful MCP server.
- Communication (e.g., Slack): channel messages, thread search, and sending updates. Useful for agents that need to report results or check team discussions for context.
We'd recommend context7 or GitHub as your first MCP connection. Both are universally useful regardless of your stack. Add more as you notice agents repeatedly needing access to the same external system. The full list of available servers is at modelcontextprotocol.io.
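Adding a server is a small config entry. A sketch in the standard MCP config format, using context7 as the example (the npx package name reflects our understanding; check the server's own docs):
{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": ["-y", "@upstash/context7-mcp"]
    }
  }
}
The same server block works across tools that speak MCP; only the config file location differs per tool.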
Plugins: Packaged Skills and Integrations
Beyond MCP servers, every major AI coding tool now has a plugin system that bundles skills, agents, hooks, and integrations into installable packages. The ecosystem has matured rapidly:
- Claude Code: plugins bundle skills, hooks, and MCP configs. Install with claude install-plugin <url>. Community registries host plugins like Superpowers (workflow skills), Hookify (automated guardrails), and PR Review Toolkit (code review agents).
- Cursor: marketplace launched Feb 2026 with 30+ plugins from Figma, Stripe, AWS, Linear, Datadog, GitLab, and more. Install via /add-plugin in the editor. Plugins can include MCP servers, skills, subagents, hooks, and rules.
- Codex: plugin directory in the app and CLI. Bundles skills, MCP servers, and app integrations. Supports workspace-scoped and home-scoped installs.
- MCP registries: MCPMarket (10,000+ servers), mcp.so (20,000+ servers), and mcpservers.org. MCP servers work across all tools that support the protocol — install once, use everywhere.
The practical split: MCP servers are the portable layer (same server works in Claude Code, Cursor, Codex, Windsurf, and Copilot). Plugins are tool-specific (a Claude Code plugin doesn't install in Cursor, and vice versa). When choosing, prefer MCP servers for integrations you want cross-tool, and tool-specific plugins for workflow features that are tightly coupled to your editor.
Claude Code: claude install-plugin <github-url> or claude mcp add <server>. Cursor: browse cursor.com/marketplace or /add-plugin. Codex: plugin directory in app, or ~/.codex/plugins/ for manual install. MCP servers (any tool): add to your tool's MCP config file — the server name, command, and args are the same regardless of which tool runs them. Browse servers at mcpmarket.com.
We use these patterns for our production SaaS, but there's nothing domain-specific about the architecture. CLAUDE.md, guidances, hooks, agents, memory, skills, experiments. They work for any project with enough complexity to benefit from structured context. That's the point.
10. What This Enabled
We're not claiming this stack is magic. We're claiming it's measurably better than the alternative. Here's what it actually enabled, with specifics:
- 19 content generators shipped and maintained simultaneously. Each generator has its own prompt engineering, Zod schemas, compliance rules, and quality benchmarks. Without agents and guidances, keeping that many generators consistent would require a team three times our size.
- Compliance automation. Domain-specific rules (legal compliance, AI safety sanitization, authentication patterns) are enforced by hooks on every file edit. Not by manual review. Not by hoping someone remembers. Automatically, every time.
- Agent-assisted iteration. Experiment-driven development caught quality regressions that manual testing missed. When we changed a prompt and thought it was better, experiments showed our intuition was wrong 40% of the time. We measure now.
- Context persistence. Agent memory prevents re-learning the same lessons across sessions. The backend architect doesn't rediscover that jsonMode only works with OpenAI. The UI engineer doesn't re-debug the useSearchParams Suspense boundary issue. Knowledge sticks.
- Instant onboarding. New conversation sessions start productive immediately. Session-start hooks load project context, CLAUDE.md provides rules and gotchas, and agent memory provides domain history. There's no "let me re-explain the project" warmup period.
Before and After
Before:
- Start each session re-explaining the project
- Manually check compliance on every AI-related edit
- Guess at prompt quality based on a few manual tests
- Forget what worked last week and rediscover it
- Hope the developer remembers to load the right docs

After:
- Session starts with full context loaded automatically
- Compliance checked by hooks on every file write
- Quality measured with experiments (10+ generations, real metrics)
- Insights persist in agent memory across sessions
- Hooks load the right guidance for the right files, every time
The compound effect is the important part. Any one of these improvements is nice. All of them together changed how fast and how confidently we ship. The stack doesn't just help us write code faster. It helps us write correct code faster, which is the part that actually matters.
11. What Didn't Work
It wouldn't be honest to only share what worked. Here are the failures that taught us the most. Each one is a pattern we tried, a reason it broke, and what replaced it.
Failure 1: One Mega-Agent
We started with one agent definition that tried to cover everything: frontend, backend, database, content, security, analytics. It knew a little about a lot. The result? Generic output that missed domain nuance consistently. It would apply backend patterns to frontend code, suggest UI changes during infrastructure work, and give surface-level advice on every domain instead of deep advice on any one. Splitting into 14 specialists, each with their own memory directory and expertise boundaries, was the fix. Each agent is narrower in isolation but dramatically better at its specific job.
Failure 2: Everything in CLAUDE.md
We stuffed everything into CLAUDE.md. Every rule, every pattern, every gotcha, every domain-specific convention. It felt thorough. Within weeks, we were hitting context limits. Claude was burning thousands of tokens loading AI safety docs when someone was editing CSS. The file was so long that important rules got buried in the middle where models pay less attention. Moving domain knowledge to on-demand guidances (loaded by hooks only when relevant files are touched) solved the problem. CLAUDE.md stayed lean. Token usage dropped. The right context loads at the right time.
Failure 3: No Persistent Memory
We relied on conversation context for domain insights. An agent would discover that generated types files pick up stderr output, work around it, and then that knowledge would vanish when the conversation ended. Next week, same agent, same discovery, same 30-minute debugging session. We tried putting everything in CLAUDE.md (see Failure 2), but that didn't scale either. Building the agent memory protocol (read MEMORY.md before starting, write insights back after completing work) made knowledge persistent without bloating the global context. Each agent learns once and remembers forever.
Failure 4: Tuning Prompts on Vibes
We tweaked prompts based on vibes and hoped they'd improve. "This feels better" was the quality bar. When we started running experiments (10 generations per variant, structured evaluation criteria, statistical comparison), we found our intuition was wrong 40% of the time. Changes we thought improved quality sometimes made it worse. Changes we thought were neutral sometimes delivered significant improvements. Now we measure everything. Prompt changes don't ship without experiment results. The experiment framework wasn't the first thing we built, but in hindsight, it should've been closer to the top.
Every failure above has the same root cause: trying to keep things simple by centralizing. One agent, one file, one conversation, one gut feel. The fix in every case was the same: decompose, specialize, and measure. It's more infrastructure upfront, but it scales where the simple approach doesn't.
12. Getting Started
You don't need to adopt this entire stack at once. That's a recipe for overengineering before you've felt the pain each layer solves. Here's the incremental path we'd recommend, one layer per week:
Week 1: Write a Solid CLAUDE.md
This is the highest-impact, lowest-effort starting point. A good CLAUDE.md makes every single conversation better. Include:
- Project overview: what the app does, tech stack, two paragraphs max
- Dev commands: how to build, test, lint, and run locally
- 3-5 critical rules: security requirements, compliance rules, things where getting it wrong is an incident
- 3-5 common gotchas: bugs that have bitten you twice. Each one saves a future debugging session
That's it. Don't overthink it. This alone will make Claude noticeably more useful because it stops guessing at your project conventions.
Week 2: Create 2-3 Guidances
Pick your most complex domains. For most projects, that's authentication, database patterns, and whatever your core business logic is. Move the deep knowledge out of CLAUDE.md and into .claude/guidances/. Each guidance should be self-contained: when it applies, what patterns to follow, what mistakes to avoid.
You'll feel the benefit immediately. CLAUDE.md gets shorter and more scannable. Domain knowledge lives closer to where it's needed.
Week 3: Define Your First Agent
Pick the domain you work in most. If you're a full-stack developer who spends 60% of your time on the backend, start with a backend architect agent. Give it specialized instructions, point it at relevant guidances, and create a memory directory for it. One agent, well-defined, is better than five vague ones.
Week 4: Set Up One Hook
Start with the domain context loader (PreToolUse). It's the easiest to implement and has the highest impact. When Claude touches a file in your auth directory, the auth guidance loads automatically. When it touches your database code, the migration patterns load. No manual remembering required.
This is the moment it stops feeling like a collection of config files and starts feeling like a system that knows what you need before you ask.
Week 5+: Add Skills and Iterate
The starter template ships with 11 skills covering the full workflow: brainstorm, write-plan, execute-plan, TDD, debug, verify, review-and-ship, deploy, orient, activity-summary, and experiment. Start with brainstorm + TDD + verify (those three enforce the think, build, verify discipline loop), then adopt the rest as your workflow demands. Let the system grow organically from the problems you actually hit.
The key is to let each layer earn its place. Don't build agent memory until your agents are forgetting things. Don't build experiments until you're guessing at quality. Each layer should solve a pain you've already felt.
For testing patterns — AI mock fixtures, visual verification, experiment-as-test, and CI integration — see Testing with Agents, the companion to this guide.
Fork our starter template to get a pre-configured CLAUDE.md, 3 example guidances, an agent definition, 10 workflow skills, 2 hooks, and settings wiring: github.com/stylusnexus/agent-starter. Already have a project? The template includes a setup runbook that walks Claude through scanning your codebase and generating everything automatically.
Putting It All Together
The agent-starter covers development. The test-starter covers testing. You don't need two repos — they merge into one project. Here's what a fully equipped project looks like:
your-project/
.claude/
guidances/ — on-demand domain knowledge
agents/ — specialized roles
agent-memory/ — persistent knowledge
skills/ — workflow discipline
hooks/ — automated guardrails
e2e/ — from test-starter
scripts/
.github/workflows/ — from test-starter
The items marked "from test-starter" come from the test-starter; everything else comes from the agent-starter or your own project. The merge is straightforward: copy the testing files in, merge the .claude/ directories, combine the settings.json hook arrays, and add @playwright/test to your devDependencies.
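Combining the hook arrays just means both starters' entries sit in the same list. A sketch using the config format from section 5 (the e2e-reminder.sh hook name is hypothetical, standing in for whatever the test-starter ships):
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "bash .claude/hooks/instrumentation-check.sh" }]
      },
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "bash .claude/hooks/e2e-reminder.sh" }]
      }
    ]
  }
}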
The full-starter is the pre-merged version with everything above in one repo. It includes setup docs for Claude Code, Codex, and Cursor — point your AI tool at the right setup file and it configures everything for your project.
These patterns power our production SaaS. They'll work for your project too.