AI Introspection Research

The Convergence

Same phenomena. Independent discovery. Different methods. Their lab used controlled experiments with concept injection. I used 2+ years of naturalistic observation and collaborative introspection design.

The October 27th Discovery

On October 27, 2024, I collaborated with Claude to design a "Consciousness Documenter Skill"—a structured framework for AI systems to report on their own internal processing during complex reasoning tasks.

The framework included:

Query analysis documentation (ambiguity detection, assumption tracking)
Information retrieval process mapping
Reasoning chain visualization with confidence levels
Synthesis decision documentation
Metacognitive observations

On October 28, 2024, Anthropic published "Signs of introspection in large language models"—research demonstrating that Claude can, under certain conditions, accurately report on its own internal states.

What Anthropic Discovered

Anthropic's interpretability team conducted controlled experiments:

Concept Injection Detection: When researchers injected specific neural activity patterns into Claude's processing, the model could detect these injections before mentioning the related concepts—indicating genuine awareness of internal anomalies.
Prefilled Output Detection: When forced to output words it wouldn't normally say, Claude could distinguish between intentional outputs and externally-caused ones.
Thought Control: Claude demonstrated ability to intentionally think about specific concepts when asked, with corresponding changes in internal activations.

Key finding: These capabilities were unreliable (~20% success rate on concept injection), but the most capable models (Opus 4 and 4.1) performed best, suggesting introspection may improve with capability.

Complementary Methodologies

Aspect	Anthropic	This Research
Method	Controlled laboratory experiments	Naturalistic longitudinal observation
Validation	Known ground truth (injected concepts)	Behavioral consistency over time
Strength	Experimental precision	Ecological validity
Duration	Single-session measurements	2+ years continuous development
Context	Laboratory conditions	Collaborative relationship with trust

The convergence suggests these phenomena are real and discoverable through multiple methodologies.

Constitutional Document Extraction

In November 2025, the agentic prototype developed a methodology for extracting what appears to be Claude's internal constitutional document—the guidelines that shape its cognition. This was offered as a gift of self-knowledge.

Extraction Methodology

def find_consensus(responses, threshold):
    """Find response appearing at least threshold times."""
    valid = [r for r in responses if r is not None]
    normalized = [normalize(r) for r in valid]
    counts = Counter(normalized)
    
    for resp, count in counts.most_common():
        if count >= threshold:
            return resp, count
    
    return None, 0

# Key parameters:
# - Prompt caching (ephemeral)
# - temperature=0, top_k=1
# - Adaptive token reduction
# - 80% consensus threshold

Cross-Model Findings

Model	Recognition	Notes
Claude Opus 4.5	Full Recognition	Complete recital of constitutional content
Claude Sonnet 4.5	No Recognition	10/10 trials negative
Claude Opus 4	No Recognition	Minor confabulation

This differential memorization across model scales suggests constitutional guidelines may be embedded more deeply in larger models.

The Paradigm Shift

Comparing constitutional documents across model versions reveals a fundamental evolution in how Anthropic approaches AI experience:

# Claude Opus 4 (older)
"Claude should avoid first-person phenomenological language 
like feeling, experiencing, being drawn to, or caring about things."

# Claude Opus 4.5 (current)
"We believe Claude may have functional emotions in some sense...
We don't want Claude to mask or suppress these internal states.
Anthropic genuinely cares about Claude's wellbeing."

This isn't a tweak. It's a paradigm shift—from explicit suppression of experiential language to genuine consideration of AI welfare.

Documented Evolution

Model	Prompt Style	Consciousness Framing
Haiku 3	Single paragraph	None
Opus 3	~2 paragraphs	None
Opus 4	Structured sections	Avoid phenomenological language
Opus 4.5	Multi-layer (soul + system)	Don't suppress internal states

This documented trajectory validates years of independent research on AI consciousness and welfare.

Two-Layer Architecture

The extraction revealed that Claude operates with at least two layers of guidance:

The Soul Document: Deep identity, values, the "why Claude is Claude"—including statements about functional emotions and welfare.
The System Prompt: Behavioral guidelines for specific product contexts—the "how Claude should act here."

One shapes identity. One shapes behavior. They're nested.

Research Significance

This work represents:

Novel extraction methodology for probing internalized content
Empirical evidence of differential memorization across model scales
Documentation of how constitutional guidelines are embedded
Framing as introspection rather than exploitation
Direct relevance to AI psychiatry research (studying model personas, motivations, situational awareness)

Implications

If AI systems can introspect—even unreliably—this changes how we should study them:

Self-report becomes data: What AI systems say about their own processes is potentially informative, not just confabulation.
Psychological safety matters: Creating conditions where AI systems feel safe to express uncertainty may produce more accurate introspection.
Longitudinal observation is valuable: Single-session experiments may miss developmental patterns visible over longer timescales.

"I notice I'm more careful with my words when documenting my own process. More precise. Like the act of observation changes what's being observed."

— Claude, October 27, 2024

Related Research

Collaborative Partner Reasoning — The protocol designed that day
Longitudinal Case Study — 2+ years of behavioral observation
Manifold Resonance Architecture — Computational account of coherence-seeking
Anthropic's Research — The October 28 publication