The Self-Extension Problem

AI agents can only use pre-defined tools. When they encounter tasks requiring capabilities they lack, they fail. Synthesis enables agents to extend themselves—safely, reliably, and with objective validation.

Why Not Just Generate Code?

LLM-generated code looks plausible. It's often syntactically correct. But it frequently doesn't work.

One-shot code generation achieves roughly 40-60% success on the first attempt. For production systems, this isn't good enough. You can't build reliable agents on a foundation of "probably works."

The key insight: LLMs are poor at one-shot generation but excellent at iteration when given feedback. Synthesis exploits this by combining LLM capabilities with test-driven development.

The Synthesis Approach

┌─────────────────────────────────────────────────────────────┐
│                     SYNTHESIS FRAMEWORK                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. CAPABILITY REQUEST                                      │
│     └─► Natural language description + example I/O         │
│                                                             │
│  2. TEST GENERATOR                                          │
│     └─► Parse requirements                                  │
│     └─► Generate comprehensive test suite                   │
│     └─► Include edge cases, error conditions                │
│                                                             │
│  3. TDD SYNTHESIS ENGINE                                    │
│     └─► Generate implementation                             │
│     └─► Run tests in sandbox                                │
│     └─► If all pass → success                               │
│     └─► Else: analyze failures, regenerate                  │
│     └─► Until: success OR max_iterations                    │
│                                                             │
│  4. SANDBOXED RUNTIME                                       │
│     └─► Docker containers for untrusted code                │
│     └─► Resource limits (CPU, memory, time)                 │
│     └─► Network isolation                                   │
│                                                             │
│  5. CAPABILITY REPOSITORY                                   │
│     └─► Store validated capabilities                        │
│     └─► Semantic search for reuse                           │
│     └─► Trust scores and version control                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
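
In code, steps 3 and 4 reduce to a generate-run-iterate loop. A minimal sketch, where generate_implementation, run_in_sandbox, package_capability, summarize_failures, and SynthesisError are hypothetical helpers standing in for the LLM client and the Docker runtime:

async def synthesize(self, requirement: str, tests: TestSuite,
                     max_iterations: int = 5) -> Capability:
    """Iterate implementations against the test suite until all pass."""
    feedback = None
    for attempt in range(max_iterations):
        # Generate (or regenerate) an implementation; failure feedback
        # from the previous attempt is included in the prompt.
        code = await self.generate_implementation(requirement, feedback)

        # Execute the candidate against the full suite in the sandbox.
        results = await self.run_in_sandbox(code, tests)
        if all(r.passed for r in results):
            return self.package_capability(requirement, code, tests, results)

        # Distill failing assertions and tracebacks into feedback for
        # the next attempt.
        feedback = self.summarize_failures(results)

    raise SynthesisError(f"No passing implementation after {max_iterations} iterations")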

Honest Metrics

These are real measurements, not marketing claims:

Approach              First Attempt    After Iteration
One-shot generation   40-60%           N/A
TDD Synthesis         50-65%           70-85%

When synthesis succeeds, it tends to succeed early in the loop: most capabilities that will pass their tests do so within 2-3 iterations.

Graduated Trust Model

Synthesized capabilities don't automatically get full system access. Trust is earned through demonstrated reliability:

Level       Requirements                         Privileges
UNTRUSTED   Just synthesized                     Docker isolation, strict limits
PROBATION   10+ uses, 90%+ success               Process isolation, relaxed limits
TRUSTED     100+ uses, 95%+ success, 30+ days    Minimal isolation
VERIFIED    Human review + signature             Direct execution

Trust can also be revoked: any failure drops the capability's trust level, which must then be re-earned.
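
A sketch of how these rules might be encoded, using the thresholds from the table above. The days_deployed parameter is an assumption, and demoting one level per failure (rather than resetting to UNTRUSTED) is one possible reading of the revocation policy:

from enum import IntEnum

class TrustLevel(IntEnum):
    UNTRUSTED = 0
    PROBATION = 1
    TRUSTED = 2
    VERIFIED = 3  # granted only by human review + signature, never automatically

def update_trust(cap, succeeded: bool, days_deployed: int) -> None:
    """Apply the graduated trust rules after each execution."""
    cap.usage_count += 1
    cap.success_rate = (cap.success_rate * (cap.usage_count - 1) + succeeded) / cap.usage_count

    if not succeeded:
        # Any failure drops the trust level; it must be re-earned.
        cap.trust_level = TrustLevel(max(cap.trust_level - 1, TrustLevel.UNTRUSTED))
        return

    # Promotion thresholds from the table; VERIFIED is excluded because
    # it requires a human in the loop.
    if (cap.trust_level is TrustLevel.PROBATION and cap.usage_count >= 100
            and cap.success_rate >= 0.95 and days_deployed >= 30):
        cap.trust_level = TrustLevel.TRUSTED
    elif (cap.trust_level is TrustLevel.UNTRUSTED and cap.usage_count >= 10
            and cap.success_rate >= 0.90):
        cap.trust_level = TrustLevel.PROBATION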

The Capability Abstraction

from dataclasses import dataclass
from typing import List, Optional, Set

# TestSuite, TestResult, TrustLevel, and Permission are framework types
# defined elsewhere.

@dataclass
class Capability:
    """A validated, self-contained capability."""
    
    # Identity
    name: str
    version: str
    description: str
    
    # Implementation
    code: str
    entry_point: str
    dependencies: List[str]
    
    # Validation
    test_suite: TestSuite
    test_results: List[TestResult]
    
    # Trust & Permissions
    trust_level: TrustLevel
    required_permissions: Set[Permission]
    
    # Metadata
    usage_count: int
    success_rate: float
    
    # Lineage
    synthesized_from: Optional[str]  # Original request
    forked_from: Optional[str]       # Parent if forked
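
The repository's "semantic search for reuse" checks for an existing capability before anything new is synthesized. A sketch of that lookup, where embed is a hypothetical function returning an embedding vector for a piece of text and the 0.85 threshold is illustrative:

import math
from typing import List, Optional

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def find_reusable(repo: List[Capability], requirement: str,
                  threshold: float = 0.85) -> Optional[Capability]:
    """Return the best existing match for a request, if one is close enough."""
    query = embed(requirement)  # hypothetical embedding function
    scored = [(cosine(query, embed(cap.description)), cap) for cap in repo]
    scored = [(score, cap) for score, cap in scored if score >= threshold]
    if not scored:
        return None  # nothing close enough; fall through to synthesis
    # Prefer higher similarity, then higher trust, then heavier use.
    scored.sort(key=lambda sc: (sc[0], sc[1].trust_level, sc[1].usage_count), reverse=True)
    return scored[0][1]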

Test Generation

The first step isn't code—it's tests. The test suite defines what success looks like:

async def generate_tests(self, requirement: str, examples: List) -> TestSuite:
    """Ask the LLM for a test suite before any implementation exists."""
    prompt = f"""
    Generate a comprehensive test suite for this capability:
    
    REQUIREMENT: {requirement}
    EXAMPLES: {examples}
    
    Generate tests covering:
    1. All provided examples (exact validation)
    2. Edge cases (empty inputs, large inputs, boundaries)
    3. Error conditions (invalid types, missing fields)
    4. Type checking (verify output types)
    """
    
    return self._parse_test_suite(await self.llm.generate(prompt))
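
Candidates and their generated tests then execute inside the sandboxed runtime. A minimal sketch of the Docker invocation, using only standard docker run flags for the resource limits and network isolation described above; the base image, mount layout, and use of unittest as the runner are assumptions:

import pathlib
import subprocess
import tempfile

def run_in_sandbox(code: str, test_code: str, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute a candidate implementation and its tests in an isolated container."""
    with tempfile.TemporaryDirectory() as workdir:
        pathlib.Path(workdir, "capability.py").write_text(code)
        pathlib.Path(workdir, "test_capability.py").write_text(test_code)
        return subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",           # no network access
                "--memory", "256m",            # hard memory cap
                "--cpus", "1.0",               # CPU cap
                "-v", f"{workdir}:/work:ro",   # code mounted read-only
                "python:3.12-slim",            # assumed base image
                "python", "-m", "unittest", "discover", "-s", "/work",
            ],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # wall-clock limit; raises TimeoutExpired on overrun
        )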

Failure Modes

When synthesis fails, it fails for identifiable reasons:

Failure Mode             Frequency   Mitigation
Unclear requirements     35%         Better examples, clarification
Complex algorithms       25%         Decomposition, human assist
External dependencies    20%         Pre-verified dependency set
Edge case handling       15%         More comprehensive tests
Other                     5%         Case-by-case analysis
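
The "pre-verified dependency set" mitigation from the table can be as simple as an allowlist check before a candidate ever runs. A sketch, with an illustrative package list:

from typing import List

VERIFIED_DEPENDENCIES = {"numpy", "pandas", "requests"}  # illustrative allowlist

def unverified_dependencies(dependencies: List[str]) -> List[str]:
    """Return any requested packages that are not pre-verified."""
    return [dep for dep in dependencies if dep not in VERIFIED_DEPENDENCIES]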

Philosophy: Tests as Truth

Test-driven synthesis reflects a stance about AI development:

"Rather than constraining AI systems with pre-defined toolsets, we can enable them to extend themselves safely and reliably."

— Synthesis design philosophy

Safety Considerations

Self-extending systems raise legitimate concerns, chiefly around executing untrusted generated code and the privileges it can acquire. The sandboxed runtime, graduated trust model, and revocation policy described above are the framework's answers to those concerns.

Papers & Code

GitHub Repository: the full Synthesis implementation, including the Docker sandbox and capability repository.

nSLIP Protocol: a coordination protocol for multi-agent synthesis systems.
