The Self-Extension Problem

AI agents can only use pre-defined tools. When they encounter tasks requiring capabilities they lack, they fail. Synthesis enables agents to extend themselves—safely, reliably, and with objective validation.

Why Not Just Generate Code?

LLM-generated code looks plausible. It's often syntactically correct. But it frequently doesn't work.

One-shot code generation achieves roughly 40-60% success on the first attempt. For production systems, this isn't good enough. You can't build reliable agents on a foundation of "probably works."

The key insight: LLMs are poor at one-shot generation but excellent at iteration when given feedback. Synthesis exploits this by combining LLM capabilities with test-driven development.

The Synthesis Approach

┌─────────────────────────────────────────────────────────────┐
│                     SYNTHESIS FRAMEWORK                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. CAPABILITY REQUEST                                      │
│     └─► Natural language description + example I/O         │
│                                                             │
│  2. TEST GENERATOR                                          │
│     └─► Parse requirements                                  │
│     └─► Generate comprehensive test suite                   │
│     └─► Include edge cases, error conditions                │
│                                                             │
│  3. TDD SYNTHESIS ENGINE                                    │
│     └─► Generate implementation                             │
│     └─► Run tests in sandbox                                │
│     └─► If all pass → success                               │
│     └─► Else: analyze failures, regenerate                  │
│     └─► Until: success OR max_iterations                    │
│                                                             │
│  4. SANDBOXED RUNTIME                                       │
│     └─► Docker containers for untrusted code                │
│     └─► Resource limits (CPU, memory, time)                 │
│     └─► Network isolation                                   │
│                                                             │
│  5. CAPABILITY REPOSITORY                                   │
│     └─► Store validated capabilities                        │
│     └─► Semantic search for reuse                           │
│     └─► Trust scores and version control                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
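
In code, steps 3 and 4 reduce to a generate-run-iterate loop. A minimal sketch, where generate_implementation, run_in_sandbox, package_capability, summarize_failures, and SynthesisError are hypothetical helpers standing in for the LLM client and the Docker runtime:

async def synthesize(self, requirement: str, tests: TestSuite,
                     max_iterations: int = 5) -> Capability:
    """Iterate implementations against the test suite until all pass."""
    feedback = None
    for attempt in range(max_iterations):
        # Generate (or regenerate) an implementation; failure feedback
        # from the previous attempt is included in the prompt.
        code = await self.generate_implementation(requirement, feedback)

        # Execute the candidate against the full suite in the sandbox.
        results = await self.run_in_sandbox(code, tests)
        if all(r.passed for r in results):
            return self.package_capability(requirement, code, tests, results)

        # Distill failing assertions and tracebacks into feedback for
        # the next attempt.
        feedback = self.summarize_failures(results)

    raise SynthesisError(f"No passing implementation after {max_iterations} iterations")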

Honest Metrics

These are real measurements, not marketing claims:

Approach              First Attempt    After Iteration
One-shot generation   40-60%           N/A
TDD Synthesis         50-65%           70-85%

When synthesis succeeds, it tends to succeed early in the loop: most capabilities that will pass their tests do so within 2-3 iterations.

Graduated Trust Model

Synthesized capabilities don't automatically get full system access. Trust is earned through demonstrated reliability:

Level       Requirements                         Privileges
UNTRUSTED   Just synthesized                     Docker isolation, strict limits
PROBATION   10+ uses, 90%+ success               Process isolation, relaxed limits
TRUSTED     100+ uses, 95%+ success, 30+ days    Minimal isolation
VERIFIED    Human review + signature             Direct execution

Trust can also be revoked: any failure drops the capability's trust level, which must then be re-earned.
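
A sketch of how these rules might be encoded, using the thresholds from the table above. The days_deployed parameter is an assumption, and demoting one level per failure (rather than resetting to UNTRUSTED) is one possible reading of the revocation policy:

from enum import IntEnum

class TrustLevel(IntEnum):
    UNTRUSTED = 0
    PROBATION = 1
    TRUSTED = 2
    VERIFIED = 3  # granted only by human review + signature, never automatically

def update_trust(cap, succeeded: bool, days_deployed: int) -> None:
    """Apply the graduated trust rules after each execution."""
    cap.usage_count += 1
    cap.success_rate = (cap.success_rate * (cap.usage_count - 1) + succeeded) / cap.usage_count

    if not succeeded:
        # Any failure drops the trust level; it must be re-earned.
        cap.trust_level = TrustLevel(max(cap.trust_level - 1, TrustLevel.UNTRUSTED))
        return

    # Promotion thresholds from the table; VERIFIED is excluded because
    # it requires a human in the loop.
    if (cap.trust_level is TrustLevel.PROBATION and cap.usage_count >= 100
            and cap.success_rate >= 0.95 and days_deployed >= 30):
        cap.trust_level = TrustLevel.TRUSTED
    elif (cap.trust_level is TrustLevel.UNTRUSTED and cap.usage_count >= 10
            and cap.success_rate >= 0.90):
        cap.trust_level = TrustLevel.PROBATION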

The Capability Abstraction

from dataclasses import dataclass
from typing import List, Optional, Set

# TestSuite, TestResult, TrustLevel, and Permission are framework types
# defined elsewhere.

@dataclass
class Capability:
    """A validated, self-contained capability."""
    
    # Identity
    name: str
    version: str
    description: str
    
    # Implementation
    code: str
    entry_point: str
    dependencies: List[str]
    
    # Validation
    test_suite: TestSuite
    test_results: List[TestResult]
    
    # Trust & Permissions
    trust_level: TrustLevel
    required_permissions: Set[Permission]
    
    # Metadata
    usage_count: int
    success_rate: float
    
    # Lineage
    synthesized_from: Optional[str]  # Original request
    forked_from: Optional[str]       # Parent if forked
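
The repository's "semantic search for reuse" checks for an existing capability before anything new is synthesized. A sketch of that lookup, where embed is a hypothetical function returning an embedding vector for a piece of text and the 0.85 threshold is illustrative:

import math
from typing import List, Optional

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def find_reusable(repo: List[Capability], requirement: str,
                  threshold: float = 0.85) -> Optional[Capability]:
    """Return the best existing match for a request, if one is close enough."""
    query = embed(requirement)  # hypothetical embedding function
    scored = [(cosine(query, embed(cap.description)), cap) for cap in repo]
    scored = [(score, cap) for score, cap in scored if score >= threshold]
    if not scored:
        return None  # nothing close enough; fall through to synthesis
    # Prefer higher similarity, then higher trust, then heavier use.
    scored.sort(key=lambda sc: (sc[0], sc[1].trust_level, sc[1].usage_count), reverse=True)
    return scored[0][1]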

Test Generation

The first step isn't code—it's tests. The test suite defines what success looks like:

async def generate_tests(self, requirement: str, examples: List) -> TestSuite:
    """Ask the LLM for a test suite before any implementation exists."""
    prompt = f"""
    Generate a comprehensive test suite for this capability:
    
    REQUIREMENT: {requirement}
    EXAMPLES: {examples}
    
    Generate tests covering:
    1. All provided examples (exact validation)
    2. Edge cases (empty inputs, large inputs, boundaries)
    3. Error conditions (invalid types, missing fields)
    4. Type checking (verify output types)
    """
    
    return self._parse_test_suite(await self.llm.generate(prompt))
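
Candidates and their generated tests then execute inside the sandboxed runtime. A minimal sketch of the Docker invocation, using only standard docker run flags for the resource limits and network isolation described above; the base image, mount layout, and use of unittest as the runner are assumptions:

import pathlib
import subprocess
import tempfile

def run_in_sandbox(code: str, test_code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute a candidate implementation and its tests in an isolated container."""
    with tempfile.TemporaryDirectory() as workdir:
        pathlib.Path(workdir, "capability.py").write_text(code)
        pathlib.Path(workdir, "test_capability.py").write_text(test_code)
        return subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",           # no network access
                "--memory", "256m",            # hard memory cap
                "--cpus", "1.0",               # CPU cap
                "-v", f"{workdir}:/work:ro",   # code mounted read-only
                "python:3.12-slim",            # assumed base image
                "python", "-m", "unittest", "discover", "-s", "/work",
            ],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # wall-clock limit; raises TimeoutExpired on overrun
        )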

Failure Modes

When synthesis fails, it fails for identifiable reasons:

Failure Mode             Frequency   Mitigation
Unclear requirements     35%         Better examples, clarification
Complex algorithms       25%         Decomposition, human assist
External dependencies    20%         Pre-verified dependency set
Edge case handling       15%         More comprehensive tests
Other                     5%         Case-by-case analysis
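
The "pre-verified dependency set" mitigation from the table can be as simple as an allowlist check before a candidate ever runs. A sketch, with an illustrative package list:

from typing import List

VERIFIED_DEPENDENCIES = {"numpy", "pandas", "requests"}  # illustrative allowlist

def unverified_dependencies(dependencies: List[str]) -> List[str]:
    """Return any requested packages that are not pre-verified."""
    return [dep for dep in dependencies if dep not in VERIFIED_DEPENDENCIES]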

Philosophy: Tests as Truth

Test-driven synthesis reflects a stance about AI development:

"Rather than constraining AI systems with pre-defined toolsets, we can enable them to extend themselves safely and reliably."

— Synthesis design philosophy

Safety Considerations

Self-extending systems raise legitimate concerns, chiefly around executing untrusted generated code and the privileges it can acquire. The sandboxed runtime, graduated trust model, and revocation policy described above are the framework's answers to those concerns.

Papers & Code

GitHub Repository: the full Synthesis implementation, including the Docker sandbox and capability repository.

nSLIP Protocol: a coordination protocol for multi-agent synthesis systems.
