Building a Scalable Multi-Agent System for Test-Driven Development with Claude Code
Introduction
As AI-powered development tools like Claude Code become more sophisticated, teams are moving beyond simple code generation to complex, multi-agent workflows that can handle entire product development lifecycles. This post explores a practical architecture for implementing test-driven development (TDD) using Claude Code agents across multiple repositories, with a focus on maintaining scalability, auditability, and efficient token usage.
The challenge we're addressing is real: how do you maintain consistency between product requirements, implementation tasks, and test specifications across multiple repositories while ensuring that changes propagate efficiently without requiring full workflow reruns?
Background: The Multi-Spec Challenge
Traditional development workflows suffer from several critical issues when scaled across multiple repositories:
- Requirements drift: PRDs become outdated, leading to implementation misalignment
- Context explosion: AI tools lose effectiveness when context grows too large
- Change propagation: Small requirement changes trigger expensive full workflow reruns
- Cross-repo inconsistency: Different repositories implement conflicting interpretations of the same requirements
The solution we'll explore uses three interconnected specifications managed by specialized agents:
- PRDSpec: Product Requirements Definition as a computational model
- TaskSpec: Implementation specifications per repository
- TestSpec: Verification and testing specifications
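The three specs above can be thought of as typed, linked records rather than prose. A minimal data model might look like the following sketch; the class and field names are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Illustrative data model for the three linked specifications.
@dataclass
class PRDRule:
    rule_id: str            # e.g. "R-002-01"
    text: str

@dataclass
class Task:
    task_id: str            # e.g. "T-002-BE-01"
    repo: str               # e.g. "backend-api"
    implements: list = field(default_factory=list)  # PRD rule IDs

@dataclass
class TestCase:
    test_id: str            # e.g. "TC-002-01"
    verifies: list = field(default_factory=list)    # task IDs
```

The `implements` and `verifies` links are what later make change impact computable rather than a matter of rereading documents.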
Core Concepts: From Documents to Computational Models
The Fundamental Shift: Addressable, Diff-Aware Specifications
The key insight is treating specifications not as documents but as computational models with three critical properties:
- Stable Addressability: Every component has a persistent, unique identifier
- Change Locality: Modifications can be precisely located and scoped
- Impact Propagation: Changes automatically identify affected downstream components
This approach transforms the traditional workflow from:
PRD change → Full regeneration of Task + Test specs
To:
PRD Rule R-002-01 changed → Identify affected Tasks → Update only impacted Tests
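The propagation step above reduces to a traceability lookup. The mapping data below is hypothetical and exists only to show the shape of the computation:

```python
# Illustrative traceability maps; the IDs follow the post's naming
# scheme, but the specific entries are hypothetical.
RULE_TO_TASKS = {
    "R-002-01": ["T-002-BE-01", "T-002-BE-02"],
    "R-002-02": ["T-002-BE-03"],
}
TASK_TO_TESTS = {
    "T-002-BE-01": ["TC-002-01"],
    "T-002-BE-02": ["TC-002-02"],
    "T-002-BE-03": ["TC-002-03"],
}

def impacted(changed_rules):
    """Given changed PRD rule IDs, return the affected task and test IDs."""
    tasks = sorted({t for r in changed_rules for t in RULE_TO_TASKS.get(r, [])})
    tests = sorted({tc for t in tasks for tc in TASK_TO_TESTS.get(t, [])})
    return tasks, tests
```

A change to R-002-01 touches only T-002-BE-01/-02 and their tests; T-002-BE-03 and TC-002-03 are provably out of scope and need no regeneration.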
PRDSpec Architecture: Directory-Based, Not Monolithic
Instead of a single massive PRD file, we use a structured directory approach:
specs/prd/
├── _meta.md           # Version, ownership, scope
├── vision.md          # Long-term objectives (low-change frequency)
├── glossary.md        # Domain vocabulary (critical for consistency)
├── constraints.md     # Legal, compliance, technical constraints
├── non-functional.md  # SLA, performance, security requirements
├── features/
│   ├── F-001-user-auth.md
│   ├── F-002-payment.md
│   └── F-003-admin.md
└── future/
    └── F-900-xxx.md   # Explicitly future-scoped features
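Because the layout is conventional, it can be checked mechanically. The sketch below validates the required top-level files and the feature filename convention; the exact rules encoded here are assumptions for illustration, not part of any existing tool:

```python
import re
from pathlib import Path

# Required top-level files and the assumed F-XXX-name.md convention.
REQUIRED = ["_meta.md", "vision.md", "glossary.md",
            "constraints.md", "non-functional.md"]
FEATURE_NAME = re.compile(r"^F-\d{3}-[a-z0-9-]+\.md$")

def validate_prd_dir(root):
    """Return a list of problems found in a specs/prd directory."""
    root = Path(root)
    problems = [f"missing {name}" for name in REQUIRED
                if not (root / name).exists()]
    for f in (root / "features").glob("*.md"):
        if not FEATURE_NAME.match(f.name):
            problems.append(f"bad feature filename: {f.name}")
    return problems
```

Running a check like this in CI keeps the directory a reliable substrate for the agents that consume it.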
Feature-Level PRD Template
Each feature follows a strict template that supports computational processing:
# Feature: F-002 Payment Settlement
## 1. Business Goal
- Users can complete payment transactions
- System must guarantee transaction integrity
## 2. Scope
### In Scope
- Standard payment flows
- Failure recovery mechanisms
### Out of Scope
- Refund processing (handled in F-003)
## 3. Core Rules (Mandatory)
- R-002-01: Payments must be idempotent
- R-002-02: Failures must be auditable and recoverable
## 4. User Flow (Abstract)
- State transition model (text-based, not UI-specific)
## 5. Edge Cases
- Network interruption scenarios
- Retry mechanisms
## 6. Repository Impact (Interface Points Only)
- backend-api: settlement lifecycle management
- frontend-web: payment status display
## 7. Acceptance Criteria (Language-Level)
- When user submits duplicate payment request, system returns original transaction ID
- When payment fails, system logs failure reason and maintains transaction state
The critical element here is that the Acceptance Criteria live in the PRD: they serve as the "common ancestor" for both TaskSpec and TestSpec generation.
Analysis: Solving the Token Explosion Problem
The Differential Processing Challenge
The core problem is real: if a test reveals that a PRD point needs clarification, the entire agent workflow would traditionally need to rerun. This creates two major issues:
- Token cost explosion: Each iteration consumes full context
- Process friction: Teams avoid making necessary changes due to workflow overhead
Solution: Explicit Change Impact Modeling
The solution isn't making agents "smarter" but making specifications structurally diff-aware. Here's how:
1. Stable ID System
Every requirement, task, and test case gets a permanent identifier:
PRD Rules: R-002-01, R-002-02...
Task IDs: T-002-BE-01, T-002-FE-01...
Test Cases: TC-002-01, TC-002-02...
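These identifier families are trivial to validate and to parse back into their feature number, which is what makes cross-spec joins cheap. The regex patterns below simply encode the formats shown above:

```python
import re

# One pattern per ID family; group 1 is always the feature number.
ID_PATTERNS = {
    "rule": re.compile(r"^R-(\d{3})-(\d{2})$"),             # e.g. R-002-01
    "task": re.compile(r"^T-(\d{3})-([A-Z]{2})-(\d{2})$"),  # e.g. T-002-BE-01
    "test": re.compile(r"^TC-(\d{3})-(\d{2})$"),            # e.g. TC-002-01
}

def classify(spec_id):
    """Return (kind, feature_number) for a stable ID, or None if malformed."""
    for kind, pattern in ID_PATTERNS.items():
        m = pattern.match(spec_id)
        if m:
            return kind, m.group(1)
    return None
```

Rejecting malformed IDs at commit time keeps the mapping tables below machine-joinable.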
2. Explicit Mapping Tables
TaskSpec includes mandatory mapping sections:
## Acceptance Mapping
| PRD Rule | Task ID |
|---------|--------|
| R-002-01 | T-002-BE-01, T-002-BE-02 |
| R-002-02 | T-002-BE-03 |
3. Diff-Driven Agent Inputs
Instead of passing entire specifications, agents receive targeted diffs:
Input:
- Changed: R-002-01 (content diff)
- Affected Tasks: T-002-BE-01, T-002-BE-02
- Context: Only related rules and constraints
This reduces token consumption from O(entire_spec) to O(change_scope).
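Assuming the mapping tables above are available, assembling a diff-driven input is just a matter of selecting the affected slices. The field names and structure in this sketch are illustrative, not a fixed payload format:

```python
def build_agent_input(changed_rule_id, rule_diff, rule_to_tasks, rule_texts):
    """Assemble a minimal, diff-scoped payload for a downstream agent.

    rule_diff: textual diff of the changed rule
    rule_to_tasks: mapping like {"R-002-01": ["T-002-BE-01", ...]}
    rule_texts: full text per rule, used only for the change's local context
    """
    return {
        "changed": {"id": changed_rule_id, "diff": rule_diff},
        "affected_tasks": rule_to_tasks.get(changed_rule_id, []),
        # Context carries only the changed rule, never the whole PRD.
        "context": {changed_rule_id: rule_texts[changed_rule_id]},
    }
```

The payload size is bounded by the change scope, which is exactly the O(change_scope) property claimed above.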
Agent and Skills Architecture
Core Design Principle: Agents Execute, Skills Constrain
The fundamental principle is that agents handle execution while skills encode process constraints. This separation ensures:
- Predictable token usage
- Auditable decision-making
- Consistent behavior across iterations
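One way to realize "agents execute, skills constrain" is to assemble each agent's prompt from its skill files at invocation time, so process constraints are versioned once and shared. The loader below is a sketch of that idea under assumed file paths; it is not Claude Code's actual loading mechanism:

```python
from pathlib import Path

def compose_prompt(agent_md, skill_names, skills_root="skills"):
    """Concatenate an agent definition with its skill files into one prompt.

    Skills come first so their process constraints frame the agent's role.
    The skills_root layout mirrors the directory structure shown below.
    """
    sections = []
    for name in skill_names:
        matches = list(Path(skills_root).rglob(f"{name}.md"))
        if matches:
            sections.append(matches[0].read_text())
    sections.append(agent_md)
    return "\n\n---\n\n".join(sections)
```

Because skills are plain files, a change to a skill automatically constrains every agent that lists it, without editing individual agent prompts.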
Complete Agent + Skills Directory Structure
claude_code_agents/
├── README.md
├── agents/
│   ├── prd/
│   │   ├── prd_builder.md
│   │   └── prd_reviewer.md
│   ├── task/
│   │   ├── taskspec_generator.md
│   │   └── taskspec_reviewer.md
│   └── test/
│       ├── testspec_generator.md
│       └── test_executor.md
├── skills/
│   ├── core/
│   │   ├── diff_processor.md
│   │   ├── spec_validator.md
│   │   └── impact_analyzer.md
│   ├── prd/
│   │   ├── prd_templates.md
│   │   ├── requirement_patterns.md
│   │   └── acceptance_criteria_guide.md
│   ├── task/
│   │   ├── repo_metadata.md
│   │   ├── task_decomposition.md
│   │   └── cross_repo_coordination.md
│   └── test/
│       ├── test_case_patterns.md
│       ├── e2e_scenarios.md
│       └── coverage_analysis.md
└── workflows/
    ├── requirement_change.md
    ├── new_feature.md
    └── regression_testing.md
Detailed Agent Specifications
PRD Builder Agent
# PRD Builder Agent
## Role
Transform natural language requirements into structured PRDSpec format
## Inputs
- Natural language requirements
- Existing PRD context (diff-based)
- Domain glossary
- Constraint templates
## Outputs
- Feature-level PRDSpec
- Rule IDs with stable references
- Impact assessment on existing features
## Skills Required
- prd_templates
- requirement_patterns
- diff_processor
PRD Reviewer Agent
# PRD Reviewer Agent
## Role
Validate PRD completeness, testability, and consistency
## Inputs
- Draft PRDSpec
- Historical anti-patterns
- Cross-feature dependencies
## Outputs
- Completeness assessment
- Testability gaps
- Recommended clarifications
## Skills Required
- spec_validator
- acceptance_criteria_guide
- requirement_patterns
TaskSpec Generator Agent
# TaskSpec Generator Agent
## Role
Convert PRDSpec into repository-specific implementation tasks
## Inputs
- PRDSpec (rule-level)
- Repository metadata
- Technical constraints
- Cross-repo dependencies
## Outputs
- Repo-specific TaskSpec
- Implementation task breakdown
- Acceptance mapping tables
## Skills Required
- repo_metadata
- task_decomposition
- cross_repo_coordination
- diff_processor
Critical Skills Implementation
Diff Processor Skill
# Diff Processor Skill
## Purpose
Enable agents to work with change deltas rather than full specifications
## Core Functions
1. Parse specification diffs
2. Identify impact scope
3. Extract minimal context needed
4. Determine when full-spec analysis is required
## Usage Pattern
```python
# Pseudo-code for skill usage
diff = extract_spec_diff(old_spec, new_spec)
affected_items = identify_affected_items(diff)
context = build_minimal_context(affected_items)
# The agent then operates only on this minimal context
```
Implications: Real-World Benefits
Scalability Advantages
This architecture delivers several measurable improvements:
- Token Efficiency: Context consumption scales with change size, not total specification size
- Parallel Processing: Independent changes can be processed simultaneously
- Incremental Updates: Only affected tests and tasks require regeneration
- Human Review Focus: Reviews concentrate on change deltas rather than full specifications
Quality Assurance
The structured approach provides:
- Traceability: Every test case maps back to specific requirements
- Change Impact Visibility: Teams understand exactly what will be affected
- Regression Prevention: Automated detection of requirement-breaking changes
- Documentation Synchronization: Specifications automatically reflect implementation reality
Team Productivity
Real productivity gains come from:
- Reduced Context Switching: Developers receive precisely scoped task specifications
- Faster Onboarding: New team members can understand requirements through structured IDs and mappings
- Confident Refactoring: Clear mapping between requirements and implementation enables safe changes
- Audit Trail: Complete history of requirement evolution and implementation decisions
Conclusion
The multi-agent specification management system described here represents a significant evolution in how teams can use AI tools for complex product development. By treating specifications as computational models rather than documents, and by explicitly designing for differential processing, teams can achieve the benefits of comprehensive AI assistance without the traditional scalability limitations.
The key insights are:
- Structure enables scale: Well-designed specification formats allow AI agents to work efficiently with minimal context
- Skills constrain agents: Process intelligence belongs in reusable skills, not individual agent prompts
- Differential processing is mandatory: Without explicit change management, AI-assisted workflows become prohibitively expensive
This architecture is particularly valuable for teams managing multiple repositories with shared requirements, where consistency and change management are critical success factors. Teams adopting it should start with a single feature to validate the diff-processing mechanisms before scaling to the full product requirements.
The investment in upfront specification structure pays dividends through reduced AI token costs, faster iteration cycles, and higher confidence in requirement-implementation alignment across complex, multi-repository projects.
