Building a Scalable Multi-Agent System for Test-Driven Development with Claude Code
Introduction
As AI-powered development tools like Claude Code become more sophisticated, teams are moving beyond simple code generation to complex, multi-agent workflows that can handle entire product development lifecycles. This post explores a practical architecture for implementing test-driven development (TDD) using Claude Code agents across multiple repositories, with a focus on maintaining scalability, auditability, and efficient token usage.
The challenge we're addressing is real: how do you maintain consistency between product requirements, implementation tasks, and test specifications across multiple repositories while ensuring that changes propagate efficiently without requiring full workflow reruns?
Background: The Multi-Spec Challenge
Traditional development workflows suffer from several critical issues when scaled across multiple repositories:
- Requirements drift: PRDs become outdated, leading to implementation misalignment
- Context explosion: AI tools lose effectiveness when context grows too large
- Change propagation: Small requirement changes trigger expensive full workflow reruns
- Cross-repo inconsistency: Different repositories implement conflicting interpretations of the same requirements
The solution we'll explore uses three interconnected specifications managed by specialized agents:
- PRDSpec: Product Requirements Definition as a computational model
- TaskSpec: Implementation specifications per repository
- TestSpec: Verification and testing specifications
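The three specs above can be thought of as typed, linked records rather than prose. A minimal data model might look like the following sketch; the class and field names are assumptions for illustration, not a prescribed schema:

```python
from dataclasses import dataclass, field

# Illustrative data model for the three linked specifications.
@dataclass
class PRDRule:
    rule_id: str            # e.g. "R-002-01"
    text: str

@dataclass
class Task:
    task_id: str            # e.g. "T-002-BE-01"
    repo: str               # e.g. "backend-api"
    implements: list = field(default_factory=list)  # PRD rule IDs

@dataclass
class TestCase:
    test_id: str            # e.g. "TC-002-01"
    verifies: list = field(default_factory=list)    # task IDs
```

The `implements` and `verifies` links are what later make change impact computable rather than a matter of rereading documents.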
Core Concepts: From Documents to Computational Models
The Fundamental Shift: Addressable, Diff-Aware Specifications
The key insight is treating specifications not as documents but as computational models with three critical properties:
- Stable Addressability: Every component has a persistent, unique identifier
- Change Locality: Modifications can be precisely located and scoped
- Impact Propagation: Changes automatically identify affected downstream components
This approach transforms the traditional workflow from:
PRD change → Full regeneration of Task + Test specs
To:
PRD Rule R-002-01 changed → Identify affected Tasks → Update only impacted Tests
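The propagation step above reduces to a traceability lookup. The mapping data below is hypothetical and exists only to show the shape of the computation:

```python
# Illustrative traceability maps; the IDs follow the post's naming
# scheme, but the specific entries are hypothetical.
RULE_TO_TASKS = {
    "R-002-01": ["T-002-BE-01", "T-002-BE-02"],
    "R-002-02": ["T-002-BE-03"],
}
TASK_TO_TESTS = {
    "T-002-BE-01": ["TC-002-01"],
    "T-002-BE-02": ["TC-002-02"],
    "T-002-BE-03": ["TC-002-03"],
}

def impacted(changed_rules):
    """Given changed PRD rule IDs, return the affected task and test IDs."""
    tasks = sorted({t for r in changed_rules for t in RULE_TO_TASKS.get(r, [])})
    tests = sorted({tc for t in tasks for tc in TASK_TO_TESTS.get(t, [])})
    return tasks, tests
```

A change to R-002-01 touches only T-002-BE-01/-02 and their tests; T-002-BE-03 and TC-002-03 are provably out of scope and need no regeneration.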
PRDSpec Architecture: Directory-Based, Not Monolithic
Instead of a single massive PRD file, we use a structured directory approach:
specs/prd/
├── _meta.md           # Version, ownership, scope
├── vision.md          # Long-term objectives (low-change frequency)
├── glossary.md        # Domain vocabulary (critical for consistency)
├── constraints.md     # Legal, compliance, technical constraints
├── non-functional.md  # SLA, performance, security requirements
├── features/
│   ├── F-001-user-auth.md
│   ├── F-002-payment.md
│   └── F-003-admin.md
└── future/
    └── F-900-xxx.md   # Explicitly future-scoped features
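Because the layout is conventional, it can be checked mechanically. The sketch below validates the required top-level files and the feature filename convention; the exact rules encoded here are assumptions for illustration, not part of any existing tool:

```python
import re
from pathlib import Path

# Required top-level files and the assumed F-XXX-name.md convention.
REQUIRED = ["_meta.md", "vision.md", "glossary.md",
            "constraints.md", "non-functional.md"]
FEATURE_NAME = re.compile(r"^F-\d{3}-[a-z0-9-]+\.md$")

def validate_prd_dir(root):
    """Return a list of problems found in a specs/prd directory."""
    root = Path(root)
    problems = [f"missing {name}" for name in REQUIRED
                if not (root / name).exists()]
    for f in (root / "features").glob("*.md"):
        if not FEATURE_NAME.match(f.name):
            problems.append(f"bad feature filename: {f.name}")
    return problems
```

Running a check like this in CI keeps the directory a reliable substrate for the agents that consume it.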
Feature-Level PRD Template
Each feature follows a strict template that supports computational processing:
# Feature: F-002 Payment Settlement
## 1. Business Goal
- Users can complete payment transactions
- System must guarantee transaction integrity
## 2. Scope
### In Scope
- Standard payment flows
- Failure recovery mechanisms
### Out of Scope
- Refund processing (handled in F-003)
## 3. Core Rules (Mandatory)
- R-002-01: Payments must be idempotent
- R-002-02: Failures must be auditable and recoverable
## 4. User Flow (Abstract)
- State transition model (text-based, not UI-specific)
## 5. Edge Cases
- Network interruption scenarios
- Retry mechanisms
## 6. Repository Impact (Interface Points Only)
- backend-api: settlement lifecycle management
- frontend-web: payment status display
## 7. Acceptance Criteria (Language-Level)
- When user submits duplicate payment request, system returns original transaction ID
- When payment fails, system logs failure reason and maintains transaction state
The critical element here is that the Acceptance Criteria live in the PRD: they serve as the "common ancestor" for both TaskSpec and TestSpec generation.
Analysis: Solving the Token Explosion Problem
The Differential Processing Challenge
The core problem is real: if a test reveals that a PRD point needs clarification, the entire agent workflow would traditionally need to rerun. This creates two major issues:
- Token cost explosion: Each iteration consumes full context
- Process friction: Teams avoid making necessary changes due to workflow overhead
Solution: Explicit Change Impact Modeling
The solution isn't making agents "smarter" but making specifications structurally diff-aware. Here's how:
1. Stable ID System
Every requirement, task, and test case gets a permanent identifier:
PRD Rules: R-002-01, R-002-02...
Task IDs: T-002-BE-01, T-002-FE-01...
Test Cases: TC-002-01, TC-002-02...
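These identifier families are trivial to validate and to parse back into their feature number, which is what makes cross-spec joins cheap. The regex patterns below simply encode the formats shown above:

```python
import re

# One pattern per ID family; group 1 is always the feature number.
ID_PATTERNS = {
    "rule": re.compile(r"^R-(\d{3})-(\d{2})$"),             # e.g. R-002-01
    "task": re.compile(r"^T-(\d{3})-([A-Z]{2})-(\d{2})$"),  # e.g. T-002-BE-01
    "test": re.compile(r"^TC-(\d{3})-(\d{2})$"),            # e.g. TC-002-01
}

def classify(spec_id):
    """Return (kind, feature_number) for a stable ID, or None if malformed."""
    for kind, pattern in ID_PATTERNS.items():
        m = pattern.match(spec_id)
        if m:
            return kind, m.group(1)
    return None
```

Rejecting malformed IDs at commit time keeps the mapping tables below machine-joinable.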
2. Explicit Mapping Tables
TaskSpec includes mandatory mapping sections:
## Acceptance Mapping
| PRD Rule | Task ID |
|---------|--------|
| R-002-01 | T-002-BE-01, T-002-BE-02 |
| R-002-02 | T-002-BE-03 |
3. Diff-Driven Agent Inputs
Instead of passing entire specifications, agents receive targeted diffs:
Input:
- Changed: R-002-01 (content diff)
- Affected Tasks: T-002-BE-01, T-002-BE-02
- Context: Only related rules and constraints
This reduces token consumption from O(entire_spec) to O(change_scope).
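Assuming the mapping tables above are available, assembling a diff-driven input is just a matter of selecting the affected slices. The field names and structure in this sketch are illustrative, not a fixed payload format:

```python
def build_agent_input(changed_rule_id, rule_diff, rule_to_tasks, rule_texts):
    """Assemble a minimal, diff-scoped payload for a downstream agent.

    rule_diff: textual diff of the changed rule
    rule_to_tasks: mapping like {"R-002-01": ["T-002-BE-01", ...]}
    rule_texts: full text per rule, used only for the change's local context
    """
    return {
        "changed": {"id": changed_rule_id, "diff": rule_diff},
        "affected_tasks": rule_to_tasks.get(changed_rule_id, []),
        # Context carries only the changed rule, never the whole PRD.
        "context": {changed_rule_id: rule_texts[changed_rule_id]},
    }
```

The payload size is bounded by the change scope, which is exactly the O(change_scope) property claimed above.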
Agent and Skills Architecture
Core Design Principle: Agents Execute, Skills Constrain
The fundamental principle is that agents handle execution while skills encode process constraints. This separation ensures:
- Predictable token usage
- Auditable decision-making
- Consistent behavior across iterations
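One way to realize "agents execute, skills constrain" is to assemble each agent's prompt from its skill files at invocation time, so process constraints are versioned once and shared. The loader below is a sketch of that idea under assumed file paths; it is not Claude Code's actual loading mechanism:

```python
from pathlib import Path

def compose_prompt(agent_md, skill_names, skills_root="skills"):
    """Concatenate an agent definition with its skill files into one prompt.

    Skills come first so their process constraints frame the agent's role.
    The skills_root layout mirrors the directory structure shown below.
    """
    sections = []
    for name in skill_names:
        matches = list(Path(skills_root).rglob(f"{name}.md"))
        if matches:
            sections.append(matches[0].read_text())
    sections.append(agent_md)
    return "\n\n---\n\n".join(sections)
```

Because skills are plain files, a change to a skill automatically constrains every agent that lists it, without editing individual agent prompts.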
Complete Agent + Skills Directory Structure
claude_code_agents/
├── README.md
├── agents/
│   ├── prd/
│   │   ├── prd_builder.md
│   │   └── prd_reviewer.md
│   ├── task/
│   │   ├── taskspec_generator.md
│   │   └── taskspec_reviewer.md
│   └── test/
│       ├── testspec_generator.md
│       └── test_executor.md
├── skills/
│   ├── core/
│   │   ├── diff_processor.md
│   │   ├── spec_validator.md
│   │   └── impact_analyzer.md
│   ├── prd/
│   │   ├── prd_templates.md
│   │   ├── requirement_patterns.md
│   │   └── acceptance_criteria_guide.md
│   ├── task/
│   │   ├── repo_metadata.md
│   │   ├── task_decomposition.md
│   │   └── cross_repo_coordination.md
│   └── test/
│       ├── test_case_patterns.md
│       ├── e2e_scenarios.md
│       └── coverage_analysis.md
└── workflows/
    ├── requirement_change.md
    ├── new_feature.md
    └── regression_testing.md
Detailed Agent Specifications
PRD Builder Agent
# PRD Builder Agent
## Role
Transform natural language requirements into structured PRDSpec format
## Inputs
- Natural language requirements
- Existing PRD context (diff-based)
- Domain glossary
- Constraint templates
## Outputs
- Feature-level PRDSpec
- Rule IDs with stable references
- Impact assessment on existing features
## Skills Required
- prd_templates
- requirement_patterns
- diff_processor
PRD Reviewer Agent
# PRD Reviewer Agent
## Role
Validate PRD completeness, testability, and consistency
## Inputs
- Draft PRDSpec
- Historical anti-patterns
- Cross-feature dependencies
## Outputs
- Completeness assessment
- Testability gaps
- Recommended clarifications
## Skills Required
- spec_validator
- acceptance_criteria_guide
- requirement_patterns
TaskSpec Generator Agent
# TaskSpec Generator Agent
## Role
Convert PRDSpec into repository-specific implementation tasks
## Inputs
- PRDSpec (rule-level)
- Repository metadata
- Technical constraints
- Cross-repo dependencies
## Outputs
- Repo-specific TaskSpec
- Implementation task breakdown
- Acceptance mapping tables
## Skills Required
- repo_metadata
- task_decomposition
- cross_repo_coordination
- diff_processor
Critical Skills Implementation
Diff Processor Skill
# Diff Processor Skill
## Purpose
Enable agents to work with change deltas rather than full specifications
## Core Functions
1. Parse specification diffs
2. Identify impact scope
3. Extract minimal context needed
4. Determine when full-spec analysis is required
## Usage Pattern
```python
# Pseudo-code for skill usage
diff = extract_spec_diff(old_spec, new_spec)
affected_items = identify_affected_items(diff)
context = build_minimal_context(affected_items)
# The agent then operates only on this minimal context
```
Implications: Real-World Benefits
Scalability Advantages
This architecture delivers several measurable improvements:
- Token Efficiency: Context consumption scales with change size, not total specification size
- Parallel Processing: Independent changes can be processed simultaneously
- Incremental Updates: Only affected tests and tasks require regeneration
- Human Review Focus: Reviews concentrate on change deltas rather than full specifications
Quality Assurance
The structured approach provides:
- Traceability: Every test case maps back to specific requirements
- Change Impact Visibility: Teams understand exactly what will be affected
- Regression Prevention: Automated detection of requirement-breaking changes
- Documentation Synchronization: Specifications automatically reflect implementation reality
Team Productivity
Real productivity gains come from:
- Reduced Context Switching: Developers receive precisely scoped task specifications
- Faster Onboarding: New team members can understand requirements through structured IDs and mappings
- Confident Refactoring: Clear mapping between requirements and implementation enables safe changes
- Audit Trail: Complete history of requirement evolution and implementation decisions
Conclusion
The multi-agent specification management system described here represents a significant evolution in how teams can use AI tools for complex product development. By treating specifications as computational models rather than documents, and by explicitly designing for differential processing, teams can achieve the benefits of comprehensive AI assistance without the traditional scalability limitations.
The key insights are:
- Structure enables scale: Well-designed specification formats allow AI agents to work efficiently with minimal context
- Skills constrain agents: Process intelligence belongs in reusable skills, not individual agent prompts
- Differential processing is mandatory: Without explicit change management, AI-assisted workflows become prohibitively expensive
This architecture is particularly valuable for teams managing multiple repositories with shared requirements, where consistency and change management are critical success factors. Teams adopting it should start with a single feature to validate the diff-processing mechanisms before scaling to the full product requirements.
The investment in upfront specification structure pays dividends through reduced AI token costs, faster iteration cycles, and higher confidence in requirement-implementation alignment across complex, multi-repository projects.
