The Self-Extension Problem
AI agents can only use pre-defined tools. When they encounter tasks requiring capabilities they lack, they fail. Synthesis enables agents to extend themselves—safely, reliably, and with objective validation.
Why Not Just Generate Code?
LLM-generated code looks plausible. It's often syntactically correct. But it frequently doesn't work.
One-shot code generation achieves roughly 40-60% success on the first attempt. For production systems, this isn't good enough. You can't build reliable agents on a foundation of "probably works."
The key insight: LLMs are poor at one-shot generation but excellent at iteration when given feedback. Synthesis exploits this by combining LLM capabilities with test-driven development.
The Synthesis Approach
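In outline, synthesis works backwards from validation: generate a test suite from the requirement, generate a candidate implementation, run the tests in a sandbox, and feed concrete failures back to the LLM until the suite passes or an iteration budget is exhausted. The sketch below is illustrative only; the helper names (`generate_tests`, `generate_code`, `run_tests`, `refine_code`, `_build_capability`), the `results` attributes, and the iteration cap are assumptions, not the actual implementation.

```python
# Illustrative sketch of the test-driven synthesis loop; helper names,
# result attributes, and the iteration cap of 5 are assumptions.
async def synthesize(self, requirement: str, examples: list, max_iterations: int = 5):
    # 1. Tests come first: the suite defines what "working" means.
    test_suite = await self.generate_tests(requirement, examples)

    # 2. Generate an initial candidate implementation.
    code = await self.generate_code(requirement, test_suite)

    for iteration in range(1, max_iterations + 1):
        # 3. Run the suite against the candidate in an isolated sandbox.
        results = await self.run_tests(code, test_suite)
        if results.all_passed:
            # 4a. Package the validated code as a Capability (see below).
            return self._build_capability(requirement, code, test_suite, results)

        # 4b. Feed the concrete test failures back to the LLM and refine.
        code = await self.refine_code(code, test_suite, results.failures)

    return None  # synthesis failed within the iteration budget
```

Capping the loop at a handful of iterations is consistent with the iteration distribution reported below: nearly all eventual successes land within the first three attempts.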
Honest Metrics
These are real measurements, not marketing claims:
| Approach | First Attempt | After Iteration |
|---|---|---|
| One-shot generation | 40-60% | N/A |
| TDD Synthesis | 50-65% | 70-85% |
When synthesis succeeds, the iteration distribution is:
- Iteration 1: 52% of successful syntheses
- Iteration 2: 28% (cumulative 80%)
- Iteration 3: 12% (cumulative 92%)
- Iterations 4-5: 8% (cumulative 100%)
Most capabilities that will succeed do so within 2-3 iterations.
Graduated Trust Model
Synthesized capabilities don't automatically get full system access. Trust is earned through demonstrated reliability:
| Level | Requirements | Privileges |
|---|---|---|
| UNTRUSTED | Just synthesized | Docker isolation, strict limits |
| PROBATION | 10+ uses, 90%+ success | Process isolation, relaxed limits |
| TRUSTED | 100+ uses, 95%+ success, 30+ days | Minimal isolation |
| VERIFIED | Human review + signature | Direct execution |
Trust can be revoked: any failure drops trust level and requires re-earning.
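The thresholds in the table translate directly into promotion and revocation rules. A minimal sketch, assuming the capability tracks `usage_count`, `success_rate`, age in days, and a human-verification flag; the function and attribute names are illustrative, and the exact demotion policy (one level per failure here) is a design choice the source leaves open:

```python
from enum import Enum, auto

class TrustLevel(Enum):
    UNTRUSTED = auto()  # Docker isolation, strict limits
    PROBATION = auto()  # process isolation, relaxed limits
    TRUSTED = auto()    # minimal isolation
    VERIFIED = auto()   # direct execution (human review + signature)

def evaluate_trust(cap) -> TrustLevel:
    """Promotion rules using the thresholds from the table above."""
    if cap.human_verified:
        return TrustLevel.VERIFIED
    if cap.usage_count >= 100 and cap.success_rate >= 0.95 and cap.age_days >= 30:
        return TrustLevel.TRUSTED
    if cap.usage_count >= 10 and cap.success_rate >= 0.90:
        return TrustLevel.PROBATION
    return TrustLevel.UNTRUSTED

# One possible revocation policy: every failure drops the capability one level.
_DEMOTION = {
    TrustLevel.VERIFIED: TrustLevel.TRUSTED,
    TrustLevel.TRUSTED: TrustLevel.PROBATION,
    TrustLevel.PROBATION: TrustLevel.UNTRUSTED,
    TrustLevel.UNTRUSTED: TrustLevel.UNTRUSTED,
}

def on_failure(current: TrustLevel) -> TrustLevel:
    """Any failure drops the trust level; it must then be re-earned through use."""
    return _DEMOTION[current]
```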
The Capability Abstraction
```python
from dataclasses import dataclass
from typing import List, Optional, Set

# TestSuite, TestResult, TrustLevel, and Permission are project-defined types.

@dataclass
class Capability:
    """A validated, self-contained capability."""

    # Identity
    name: str
    version: str
    description: str

    # Implementation
    code: str
    entry_point: str
    dependencies: List[str]

    # Validation
    test_suite: TestSuite
    test_results: List[TestResult]

    # Trust & Permissions
    trust_level: TrustLevel
    required_permissions: Set[Permission]

    # Metadata
    usage_count: int
    success_rate: float

    # Lineage
    synthesized_from: Optional[str]  # Original request
    forked_from: Optional[str]       # Parent if forked
```
Test Generation
The first step isn't code—it's tests. The test suite defines what success looks like:
```python
async def generate_tests(self, requirement: str, examples: List) -> TestSuite:
    prompt = f"""
    Generate a comprehensive test suite for this capability:

    REQUIREMENT: {requirement}
    EXAMPLES: {examples}

    Generate tests covering:
    1. All provided examples (exact validation)
    2. Edge cases (empty inputs, large inputs, boundaries)
    3. Error conditions (invalid types, missing fields)
    4. Type checking (verify output types)
    """
    return self._parse_test_suite(await self.llm.generate(prompt))
```
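The snippet above returns a `TestSuite`, and the `Capability` dataclass stores a list of `TestResult`s, but neither shape is shown. A plausible minimal form, consistent with how they are used here; all field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class TestCase:
    name: str
    inputs: Dict[str, Any]               # arguments passed to the entry point
    expected: Any = None                  # expected return value, if exact
    expect_error: Optional[str] = None    # expected exception type, if any

@dataclass
class TestSuite:
    requirement: str
    cases: List[TestCase] = field(default_factory=list)

@dataclass
class TestResult:
    case: TestCase
    passed: bool
    actual: Any = None
    error: Optional[str] = None           # traceback or message fed back to the LLM
```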
Failure Modes
When synthesis fails, it fails for identifiable reasons:
| Failure Mode | Frequency | Mitigation |
|---|---|---|
| Unclear requirements | 35% | Better examples, clarification |
| Complex algorithms | 25% | Decomposition, human assist |
| External dependencies | 20% | Pre-verified dependency set |
| Edge case handling | 15% | More comprehensive tests |
| Other | 5% | Case-by-case analysis |
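The "pre-verified dependency set" mitigation in the table can be enforced mechanically before synthesized code ever runs, by rejecting imports outside an allowlist. A minimal sketch using standard-library AST parsing; the allowlist contents and function name are illustrative:

```python
import ast
from typing import List

# Illustrative allowlist; the real pre-verified set would be project-defined.
ALLOWED_MODULES = {"json", "math", "re", "datetime", "collections", "itertools"}

def check_dependencies(code: str) -> List[str]:
    """Return the imported modules that fall outside the pre-verified set."""
    tree = ast.parse(code)
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        violations.extend(n for n in names if n and n not in ALLOWED_MODULES)
    return violations
```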
Philosophy: Tests as Truth
Test-driven synthesis reflects a stance about AI development:
- Tests are objective proof: A capability either passes or it doesn't. No marketing, no wishful thinking.
- Iteration beats perfection: Work with LLM strengths (refinement) rather than against weaknesses (one-shot accuracy).
- Trust is earned: We don't assume synthesized capabilities are safe. They prove it through use.
"Rather than constraining AI systems with pre-defined toolsets, we can enable them to extend themselves safely and reliably."
— Synthesis design philosophy
Safety Considerations
Self-extending systems raise legitimate concerns:
- Harmful capabilities: An agent could synthesize dangerous tools. Sandboxing and trust graduation mitigate but don't eliminate this risk; a sandbox invocation sketch follows this list.
- Alignment drift: Capabilities synthesized by one agent might not align with human values. Repository oversight provides a checkpoint.
- Transparency requirement: All synthesized code is auditable. We know exactly what each capability does because it's defined by its tests.
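For UNTRUSTED capabilities, the sandboxing mentioned above can be as blunt as running the code inside a locked-down container. A sketch using the Docker CLI; the image, resource limits, and paths are illustrative defaults, not the project's actual configuration:

```python
import subprocess

def run_untrusted(code_path: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Run an UNTRUSTED capability inside a restricted Docker container.
    Image name and resource limits are illustrative assumptions."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",      # no network access
        "--memory", "256m",       # hard memory cap
        "--cpus", "0.5",          # CPU quota
        "--pids-limit", "64",     # limit process/thread creation
        "--read-only",            # read-only root filesystem
        "--cap-drop", "ALL",      # drop all Linux capabilities
        "-v", f"{code_path}:/sandbox/capability.py:ro",
        "python:3.11-slim",
        "python", "/sandbox/capability.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```

Dropping network access, Linux capabilities, and write access keeps the blast radius of a harmful capability small while it is still earning trust.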
Papers & Code
GitHub Repository
Full Synthesis implementation with Docker sandbox and repository.
nSLIP Protocol
Coordination protocol for multi-agent synthesis systems.
Related Research
- nSLIP Protocol — Agent coordination during synthesis
- Continuity Core — Memory for agents that need to grow
- Longitudinal Case Study — Self-extension in practice