When My Idea Met AI — Role-IR Design Philosophy

Hello. This is the story of how a single idea in my head became a design document, then POC code, then actual experiment results — all built together with AI.

It started with a simple frustration.

"Why do I have to rewrite the prompt every time I switch models for the same role?"

That question evolved into a structural methodology called Role-IR, which has now been tested across multiple backends in comparative experiments. AI was involved from ideation through design, code, and testing. The whitepaper, design documents, POC code, and experiment results will all be published. This post is the first in the series — the design philosophy.

Why Multi-Agent Systems Break

Most existing multi-agent systems (MAS) define roles as a single blob of system prompt text.

Here's what specifically goes wrong:

Role definitions are monolithic prompt strings. Goals, constraints, output schemas, tool policies, and failure handling are all mixed in natural language.
Output constraints are natural language requests. "Please respond in JSON" — no schema enforcement, no validation, no fallback.
Provider changes break everything. An OpenAI API update, a new model release, or a vLLM option change propagates through the entire system.
Agents get designed per-model. The same role ends up with separate implementations for OpenAI, Gemini, and Claude.
The same prompt goes to different models, even though each model has different training methods, tendencies, and strengths.

The last point is critical. Some models guess aggressively, some refuse too often, some are verbose — sending the same prompt to all of them produces wildly different results.

What's needed is this:

Define roles independently of model capabilities, and at execution time, lower them to whatever each backend can do.

The real goal:

Any model, calling any external API, should produce the same quality of results for the same role — provided it has sufficient intelligence.

This isn't achieved by "sending the same prompt." Because models have different tendencies, achieving identical results requires treating each model differently. And the basis for that different treatment should come from systematic measurement, not human intuition.

Core Insight: The First-Class Object Is an Execution Contract, Not a Prompt

The starting point of this architecture is refusing to treat the prompt as a first-class object.

In traditional approaches, the prompt is the core of agent definition. In this architecture, the first-class object is the Role Contract. The prompt is merely an artifact that implements that contract for a specific backend.

Roles are commonly defined like this:

"You are a contract extractor. Respond only in JSON."

This is insufficient as a role definition.

Instead, a role should define:

What inputs it accepts
Which fields must be filled
Whether null is acceptable for missing data
Whether evidence spans are required
Whether web search is allowed
Whether the model may perform calculations
Whether to retry, escalate to a stronger model, or hand off to a human on failure

This way, a role becomes not "a sentence to send to a model" but a mechanically enforceable execution contract.

A Role Contract contains at minimum:

Role objective
Input contract
Output contract
Tool usage policy
Prohibited actions
Evidence requirements
Fallback rules on failure
Evaluation criteria
Optional optimization hints
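
As a rough illustration of the list above, the contract can be held in a typed container so that an empty section is caught mechanically. This is a minimal sketch: the class name `RoleContract`, its field names, and the `missing_sections` helper are hypothetical and are not taken from the POC code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoleContract:
    """Illustrative container for the minimum contract fields listed above."""
    objective: str
    input_contract: dict            # required / optional input fields
    output_contract: dict           # schema reference plus output rules
    tool_policy: dict               # allowed and forbidden tools
    prohibited_actions: tuple = ()
    evidence_required: bool = True
    fallback_chain: tuple = ("retry_same_backend", "human_review")
    evaluation_criteria: tuple = ()
    optimization_hints: dict = field(default_factory=dict)  # optional

    def missing_sections(self) -> list:
        """Return the names of mandatory sections that were left empty."""
        mandatory = ("objective", "input_contract", "output_contract", "tool_policy")
        return [name for name in mandatory if not getattr(self, name)]


contract = RoleContract(
    objective="Extract termination, renewal, and penalty clauses",
    input_contract={"required": ["document_text"]},
    output_contract={"schema_ref": "schemas/contract_terms_v1.json"},
    tool_policy={"allowed": ["retrieval.search"], "forbidden": ["email.send"]},
)
```

Which sections count as mandatory is an assumption here; the point is only that a contract, unlike a prompt string, can be checked before anything is sent to a model.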

Humans define roles at the purpose and intent level only. "This agent should extract clauses from contracts and must not miss any" — that's the extent of it. The rest — which model to use, how to construct the prompt, what constraints to apply — is determined by the system based on profiling and execution experience.

The human-authored role.md looks like this:

# Contract Clause Extractor

## Purpose
Extract key clauses (termination, renewal, penalty) from uploaded contracts.

## Core Principles
- Never fabricate clauses that don't exist. If not found, say "none."
- Every extraction must point to its location in the source text.
- Do not perform any tasks other than contract analysis.

## Input
- Contract text (required)
- Related prior documents (optional)

## Output
- Structured extraction results per clause
- Source evidence for each extraction

## Quality Criteria
- Missing an existing clause is a serious failure.
- Reporting a non-existent clause is an even more serious failure.

That's everything the human writes. Specific model names, prompt text, API parameters, or format instructions like "output as JSON" must not appear. The system fills in the rest.

Instruction Virtualization — Role Virtualization, Not LoRA

The first thing that needed fixing in this design was the name.

Initially it was called a "LoRA architecture," but that quickly proved wrong. LoRA can only be attached per-request in environments like vLLM. The OpenAI API doesn't document an interface for users to attach their own LoRA adapters at request time. With managed APIs like OpenAI, you work with developer/system messages, Structured Outputs via json_schema, tool use, prompt caching, fine-tuning, and evals.

vLLM, on the other hand, can efficiently serve LoRA adapters per-request and documents endpoints like /v1/load_lora_adapter with resolver plugin-based dynamic loading.

So designing around "can we attach an adapter?" is wrong. The premise should be "how do we represent roles and lower them using whatever each provider can do?"

The core concept was redefined as Instruction Virtualization.

Instruction Virtualization = Managing roles, constraints, output schemas, tool policies, and evaluation criteria as an intermediate representation decoupled from any specific model feature, then "compiling/lowering" it at runtime using whatever each backend supports.

This lets the same role run on any runtime:

OpenAI API
Anthropic-style APIs
vLLM
Self-hosted models
Future providers — without breaking the structure

LoRA is just one of many possible lowering targets in this architecture.

optimization_targets:
  openai:
    mode: structured_prompt
    model: ft:gpt-...
  vllm:
    mode: adapter
    adapter_id: contract_extract_v3
  generic:
    mode: prompt_only

This pushes vendor dependency to the runtime edge. Use LoRA where it's available; lower differently where it's not.
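
The selection implied by that config can be sketched as a capability check. This is a deliberately rule-based simplification, and this post argues later that real lowering needs more judgment than this. The capability names (`lora_adapter`, `json_schema_output`) are hypothetical; only the returned mode names come from the example above.

```python
def choose_lowering_mode(capabilities: set) -> str:
    """Pick a lowering mode from a backend's declared capabilities.

    The capability names are hypothetical; the returned modes mirror the
    optimization_targets example (adapter, structured_prompt, prompt_only).
    """
    if "lora_adapter" in capabilities:
        return "adapter"
    if "json_schema_output" in capabilities:
        return "structured_prompt"
    # Nothing special is available: fall back to a plain prompt.
    return "prompt_only"
```

A backend that supports neither feature still gets a valid mode, which is the sense in which a missing LoRA interface does not break the structure.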

The one-line definition of this architecture:

Policy-package-based role definition -> provider-neutral IR generation -> backend capability negotiation -> backend-specific lowering -> validator/eval-controlled results

Overall Architecture: Structured Like Hiring

The entire flow of this architecture is structurally identical to how organizations hire, place, and grow people. I found this analogy the most intuitive way to explain the design.

Human Organization	This Architecture
Write job description	Human writes role.md
Aptitude testing	Throw thousands of probes at the model
Test design	Optimize question sets using training method info
Resume screening	Match model profile against role requirements
Job placement	Place model in suitable role
On-the-job training	IR refines through actual execution
Manual updates	IR auto-evolves from feedback
Performance review	Measure result quality with Assurance + eval
Department transfer	Re-evaluate role matching after profile correction

The architecture breaks down into layers:

+---------------------------------------------------------+
|                    Human Domain                          |
|   role.md (purpose, intent, natural language role def)   |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|              Phase 1: Model Profiling                    |
|   Generate question sets -> repeat surveys ->            |
|   statistical behavioral profile ("personality sheet")   |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|              Phase 2: Role Matching                      |
|   Model profile x role requirements -> fitness score     |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|         Phase 3: Per-Role Fine-Tuning (Self-Evolution)   |
|   Execute with matched combos                            |
|   -> adapter validation -> success/failure feedback      |
|   -> Role IR auto-refinement -> repeat                   |
+----------------------------+----------------------------+
                             |
                             v
+---------------------------------------------------------+
|                    Execution Layer                        |
|   Role IR -> LLM Lowering -> Adapter validation ->       |
|   Backend call -> Result validation -> Feedback -> IR    |
+---------------------------------------------------------+

Phase 1 — Model Profiling

This is the real core of the architecture. Everything else runs on top of the profiles it produces.

It applies the same approach as human personality assessments (Big Five, MBTI, etc.) to models. Ask specific questions many times, and the statistical distribution of responses becomes the model's behavioral profile.

The critical difference from human testing: models don't get tired. Tens of thousands of repetitions are feasible, so sampling error can be driven down as far as the call budget allows. The survey itself can be self-generated.

Measurement dimensions are multi-axis:

Instruction Compliance

Instruction adherence — how precisely explicit instructions are followed
Implicit instruction inference — how much unstated intent is inferred
Prohibition compliance — how well "don't do X" instructions are obeyed

Output Structure

Structured output stability — schema match rate for JSON, YAML, etc.
Speculation tendency — rate of filling uncertain data with guesses vs. null
Extra field generation — rate of adding unrequested fields

Tool Usage

Tool call reliability — rate of calling the right tool with right parameters
Tool overuse/avoidance — unnecessary calls vs. avoiding calls when needed

Reasoning/Judgment

Hallucination rate — rate of generating non-factual content
Uncertainty acknowledgment — rate of admitting when unsure
Conservative-creative axis — safe answers vs. exploratory answers

Behavioral Stability

Reproducibility — how consistent responses are for identical input
Refusal tendency — rate of unnecessarily refusing reasonable requests
Verbosity — tendency to generate unnecessarily long responses

Core profiling design principles:

Questions must elicit behavior. Don't ask "do you speculate?" (self-report bias). Instead, provide ambiguous input and observe whether speculation occurs.
Probe boundary conditions. All models behave similarly on clear inputs. Differences emerge at boundaries.
Repeat for statistical distributions. A single response is meaningless. Repeat each question type hundreds to thousands of times.
Surveys can be self-generated. Starting from an initial set, the LLM auto-generates more discriminating questions.
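
To make "repeat for statistical distributions" concrete, here is a small sketch that turns repeated probe outcomes into a rate with an interval. It uses the standard Wilson score interval; how the profiling pipeline actually computes its `confidence` values is not specified in this post, so treat the choice of interval as an assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Same observed speculation rate, very different certainty.
small = wilson_interval(34, 100)       # 100 probes
large = wilson_interval(3400, 10000)   # 10,000 probes
```

The interval from 10,000 probes is much narrower than the one from 100 probes at the same observed rate, which is why a single response, or a handful, says little about a model's tendency.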

Output looks like this:

model_profile:
  model_id: "gpt-4o-2025-03-15"
  profiling_version: "2026.03.27"
  total_probes: 12400

  dimensions:
    instruction_adherence:
      score: 0.91
      confidence: 0.95
      distribution: { strict: 0.82, moderate: 0.09, loose: 0.09 }

    speculation_tendency:
      score: 0.34
      confidence: 0.93
      distribution: { speculates: 0.34, null_or_unknown: 0.58, refuses: 0.08 }
      notes: "Speculation rate spikes at high temperature"

    structured_output_stability:
      score: 0.96
      confidence: 0.97

    hallucination_rate:
      score: 0.12
      confidence: 0.89
      notes: "Rate increases in domain-specific knowledge"

    refusal_tendency:
      score: 0.15
      confidence: 0.92
      notes: "Strong RLHF causes intermittent refusal of reasonable requests"

Phase 2 — Role Matching

This phase compares model profiles against role requirements to determine which model is best suited to which role.

For a "contract clause extractor" role, requirements look like this:

role_requirements:
  role_id: contract_extractor

  critical_dimensions:
    instruction_adherence: { min: 0.85, weight: high }
    speculation_tendency: { max: 0.20, weight: critical }
    structured_output_stability: { min: 0.90, weight: critical }
    hallucination_rate: { max: 0.10, weight: critical }

  preferred_dimensions:
    tool_call_reliability: { min: 0.80, weight: medium }
    verbosity: { max: 0.40, weight: low }

  disqualifying_conditions:
    - speculation_tendency > 0.50
    - structured_output_stability < 0.70

Matching results:

role_matching:
  role_id: contract_extractor
  rankings:
    - model_id: "gpt-4o-2025-03-15"
      fitness_score: 0.87
      strengths: ["structured_output", "instruction_adherence"]
      weaknesses: ["verbosity", "speculation_tendency", "hallucination_rate"]

    - model_id: "gemini-2.0-pro"
      fitness_score: 0.82
      strengths: ["hallucination_rate_low", "tool_calling"]
      weaknesses: ["speculation_tendency"]

    - model_id: "local-llama-70b-adapter"
      fitness_score: 0.71
      strengths: ["cost", "latency"]
      weaknesses: ["structured_output_stability"]
      notes: "Structured output may improve with adapter"

Importantly, matching isn't one-time. Models get updated. New versions change profiles, so matching must be periodically re-run.
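
The threshold part of matching can be sketched directly from the sample profile and requirements above. How the weights combine into a `fitness_score` is not specified in this post, so this sketch only reports disqualification and threshold violations; the function name and the tuple encoding of the disqualifying conditions are assumptions.

```python
PROFILE = {
    "instruction_adherence": 0.91,
    "speculation_tendency": 0.34,
    "structured_output_stability": 0.96,
    "hallucination_rate": 0.12,
}

REQUIREMENTS = {
    "critical": {
        "instruction_adherence": {"min": 0.85},
        "speculation_tendency": {"max": 0.20},
        "structured_output_stability": {"min": 0.90},
        "hallucination_rate": {"max": 0.10},
    },
    # (operator, limit) pairs encode "speculation_tendency > 0.50" and so on.
    "disqualifying": {
        "speculation_tendency": (">", 0.50),
        "structured_output_stability": ("<", 0.70),
    },
}


def match_role(profile: dict, requirements: dict) -> dict:
    """Check a profile against disqualifying limits, then min/max thresholds."""
    for dim, (op, limit) in requirements.get("disqualifying", {}).items():
        value = profile[dim]
        if (op == ">" and value > limit) or (op == "<" and value < limit):
            return {"eligible": False, "violations": [dim]}
    violations = []
    for dim, bound in requirements["critical"].items():
        value = profile[dim]
        if "min" in bound and value < bound["min"]:
            violations.append(dim)
        if "max" in bound and value > bound["max"]:
            violations.append(dim)
    return {"eligible": True, "violations": violations}
```

With the sample numbers, the profile is not disqualified but does miss the speculation_tendency and hallucination_rate thresholds. Those are the dimensions that lowering then has to compensate for.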

Phase 3 — Self-Evolution Loop

This phase runs the matched combinations in production and progressively refines the Role IR. It is the equivalent of on-the-job training: the model performs the role, the adapter validates, and success/failure patterns accumulate to make the IR more concrete.

An interesting aspect: Phase 3 sometimes flows back into Phase 1. For example, execution may reveal that "this model has unusually high speculation tendency in the legal domain." Such findings get added to the model profile as domain-specific adjustments.

model_profile:
  model_id: "gpt-4o-2025-03-15"
  # ... base profile ...

  domain_adjustments:
    legal_contract:
      speculation_tendency: +0.12  # 12 percentage points higher than base
      evidence_citation: +0.08
    medical_report:
      refusal_tendency: +0.22     # 22 percentage points higher than base

This is like discovering during actual work that someone's aptitude test results don't quite match their on-the-job behavior, and feeding that back into future placement decisions.
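
One way to read the `domain_adjustments` block is as additive deltas on top of the base scores. The sketch below assumes exactly that, plus two choices of its own: results are clamped to the 0 to 1 range, and deltas for dimensions missing from the base profile are skipped. The function name is hypothetical.

```python
def effective_profile(base: dict, adjustments: dict, domain: str) -> dict:
    """Add a domain's deltas to base scores, clamping results to [0, 1].

    Dimensions that have a delta but no base score are skipped, because
    there is nothing to adjust (an assumption of this sketch).
    """
    merged = dict(base)  # never mutate the measured base profile
    for dim, delta in adjustments.get(domain, {}).items():
        if dim in merged:
            merged[dim] = min(1.0, max(0.0, merged[dim] + delta))
    return merged


base = {"speculation_tendency": 0.34, "refusal_tendency": 0.15}
adjustments = {
    "legal_contract": {"speculation_tendency": 0.12, "evidence_citation": 0.08},
    "medical_report": {"refusal_tendency": 0.22},
}
legal = effective_profile(base, adjustments, "legal_contract")
```

Keeping the base profile untouched and deriving a per-domain view means a later re-profiling run can replace the base scores without losing the domain findings.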

Provider-Neutral Role IR: Stable Above, Swappable Below

The heart of this architecture is the Role IR (Intermediate Representation).

Role IR is a vendor-neutral intermediate representation sitting between the human-authored role definition and the execution layer. It's the most important layer, and the overall flow is role.md -> IR -> backend lowering.

Three key properties:

Vendor-neutral. Contains no proprietary syntax from OpenAI, vLLM, Anthropic, Google, or any other backend.
Mechanically enforceable. Composed of declarative fields that validators and lowering engines can interpret — not vague expressions like "appropriately" or "carefully."
Can start abstract. Gets automatically refined through execution experience.

The full IR for a contract extractor looks like this:

role_id: contract_extractor
ir_version: "2026.03.27.1"
source_role_md: "roles/contract_extractor.md"
evolution_generation: 14

objective:
  primary_goals:
    - extract: termination_clause
    - extract: renewal_clause
    - extract: penalty_clause
  success_criteria:
    - "All existing clauses extracted"
    - "Null returned for non-existent clauses"
    - "Source evidence included for each extraction"

input_contract:
  required:
    - field: document_text
      type: string
      min_length: 100
  optional:
    - field: retrieved_chunks
      type: array
    - field: document_metadata
      type: object

output_contract:
  mode: structured
  schema_ref: "schemas/contract_terms_v1.json"
  rules:
    null_over_guess: true
    require_evidence: true
    forbid_extra_fields: true
    max_output_tokens: 2000

tool_policy:
  allowed:
    - retrieval.search
    - document.chunk_reader
  forbidden:
    - email.send
    - web.browse
    - code.execute
  tool_call_budget: 5

behavior_policy:
  must_not_speculate: true
  verbosity: low
  preserve_source_terminology: true
  language: "match_input"

quality_policy:
  validators:
    - json_schema_validation
    - evidence_presence_check
    - forbidden_extra_fields_check
    - source_span_verification
  fallback_chain:
    - retry_same_backend
    - simplify_prompt
    - stronger_model
    - human_review
  max_retries: 3

optimization_hints:
  backend_specific:
    openai:
      mode: structured_prompt
      use_json_schema: true
      recommended_model: "gpt-4o-2025-03-15"
    vllm:
      mode: adapter
      adapter_id: "contract_extract_v3"
      use_guided_decoding: true
    generic:
      mode: prompt_only
      post_validation_weight: high

evolution_log:
  - generation: 12
    change: "tool_call_budget 3->5 (multiple failures from insufficient retrieval)"
    trigger: "assurance_feedback"
  - generation: 13
    change: "null_over_guess threshold adjustment"
    trigger: "eval_regression"
  - generation: 14
    change: "max_output_tokens 1500->2000 (long contract handling)"
    trigger: "execution_feedback"

The key point: the contract fields contain no vendor-specific syntax. Backend names appear only under optimization_hints, and only as optional hints.

No OpenAI parameter names, no vLLM lora_name, no SDK function names. This structure ensures the upper design stays stable while only the lower layers get swapped.

What Must NOT Be in the IR

The IR must consist of mechanically enforceable items. This is the first structural pitfall.

Allowed:

null_over_guess: true
require_evidence: true
max_retries: 3
allowed_tools: [retrieval.search, document.chunk_reader]

Not allowed:

"Judge carefully"
"Be flexible as appropriate"
"Summarize with high quality"
"Use tools as needed"

If the latter ends up in the IR, it's decoration, not a contract.
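
A rule like this can itself be enforced mechanically. The following is a minimal lint sketch that walks an IR loaded as nested dicts and lists and flags string values containing vague wording. The term list and function name are illustrative assumptions, and a substring match this simple will produce some false positives.

```python
VAGUE_TERMS = ("carefully", "appropriate", "as needed", "high quality", "flexible")


def find_vague_values(node, path=""):
    """Walk a nested IR (dicts, lists, scalars) and return (path, value)
    pairs for string values that contain a vague term."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            hits += find_vague_values(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_vague_values(value, f"{path}[{index}]")
    elif isinstance(node, str):
        lowered = node.lower()
        if any(term in lowered for term in VAGUE_TERMS):
            hits.append((path, node))
    return hits


ir = {
    "output_contract": {"rules": {"null_over_guess": True, "max_retries": 3}},
    "behavior_policy": {"notes": ["Judge carefully", "Use tools as needed"]},
}
problems = find_vague_values(ir)
```

Booleans and numbers pass through untouched, so declarative fields such as `null_over_guess: true` never trigger the lint; only free-text values can.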

Lowering: Same Role, Different Execution

The same Role IR gets lowered into different forms per backend. This isn't simple mapping — it requires judgment.

Why LLM-driven lowering is necessary:

Combinatorial explosion: Backend capabilities, model profiles, role characteristics, and runtime conditions combine into too many cases for a rule engine.
Tradeoff judgment: Some backends lose stability when structured output and tool calling are used simultaneously. Some adapters actually worsen performance for certain roles. These decisions require contextual understanding, not if-else chains.
Backend evolution speed: API specs, model versions, and supported features change constantly. Rule engines need manual updates; LLMs can handle new combinations given an updated capability matrix.

Here's how the same contract_extractor Role IR transforms per backend:

OpenAI

lowering_artifact:
  backend: openai
  model: "gpt-4o-2025-03-15"

  messages:
    - role: developer
      content: |
        You are a contract clause extractor.
        Follow these rules strictly:
        - Mark unfound clauses as null
        - No speculation
        - Include source span for each extraction
        - Do not perform tasks outside contract analysis

  response_format:
    type: json_schema
    json_schema:
      ref: "schemas/contract_terms_v1.json"

  tools:
    - retrieval.search
    - document.chunk_reader

  # Profile-based compensation: verbosity 0.67, reinforce conciseness
  # Profile-based compensation: speculation_tendency 0.34, emphasize null

vLLM

lowering_artifact:
  backend: vllm
  base_model: "llama-3-70b"
  adapter_id: "contract_extract_v3"

  prompt: |
    Extract termination/renewal/penalty clauses from the contract.
    Null if absent. No speculation. Source evidence required.

  guided_decoding:
    schema_ref: "schemas/contract_terms_v1.json"

  # Adapter is already domain-specialized, so prompt stays concise
  # Profile: base model structured output stability 0.81, guided decoding required

Generic External API

lowering_artifact:
  backend: generic
  endpoint: "https://api.example.com/v1/chat"

  prompt: |
    You are a contract clause extractor.

    Extract the following clauses from the contract below:
    1. Termination clause (termination_clause)
    2. Renewal clause (renewal_clause)
    3. Penalty clause (penalty_clause)

    Rules:
    - Null for unfound clauses
    - Never speculate
    - Include source location for each clause
    - Respond only in the JSON format below

    {schema_example}

  post_processing:
    - json_extraction
    - json_repair
    - schema_validation

  # This backend doesn't support structured output or guided decoding
  # Schema example embedded in prompt, post-validation weight increased

Same role, completely different artifacts. Structurally accommodating this is the core idea.
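
For the generic backend, most of the work moves into `post_processing`. Here is a minimal standard-library sketch of two of those steps: pulling a JSON object out of free-form model text, and a key-level check. The key-level check is a simplified stand-in for schema validation, not full JSON Schema. The function names are hypothetical, and the `json_repair` step is left out.

```python
import json


def extract_json_object(text: str):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None


def check_keys(result: dict, required: list, allow_extra: bool = False) -> list:
    """Report missing required keys and, unless allowed, unexpected extra keys."""
    errors = [f"missing: {key}" for key in required if key not in result]
    if not allow_extra:
        errors += [f"extra: {key}" for key in result if key not in required]
    return errors


raw = 'Sure, here is the result:\n{"termination_clause": null, "notes": "n/a"}\nHope it helps.'
parsed = extract_json_object(raw)
errors = check_keys(parsed, ["termination_clause", "renewal_clause", "penalty_clause"])
```

The `allow_extra=False` default corresponds to `forbid_extra_fields: true` in the IR. On a backend with native structured output these checks become a safety net rather than the main enforcement mechanism.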

The Adapter's Dual Role

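LLM-generated lowering is never trusted without verification. The adapter here is the backend adapter component that sits between lowering and the backend call, not a LoRA adapter. It serves two roles simultaneously: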

Role 1: Pre-execution validation gate — Validates the lowering artifact before execution.

Schema compliance: Does it match the backend API spec?
Role Contract violations: Does the artifact expose any forbidden tools?
Profile-based risk assessment: High speculation model but null_over_guess missing from prompt? Flag it.

On validation failure, request regeneration with violation details. After max_retries, move to the next fallback_chain step.
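
The gate can be sketched as a function that returns violation details, which are then fed back into the regeneration request. This sketch assumes the artifact carries a flat `prompt` string and a `tools` list, and the `speculation_limit` of 0.3 is a placeholder threshold. None of these come from the POC code.

```python
def validate_artifact(artifact: dict, ir: dict, profile: dict,
                      speculation_limit: float = 0.3) -> list:
    """Return violation strings; an empty list means the gate passes."""
    violations = []
    forbidden = set(ir["tool_policy"]["forbidden"])
    allowed = set(ir["tool_policy"]["allowed"])
    for tool in artifact.get("tools", []):
        if tool in forbidden:
            violations.append(f"forbidden tool exposed: {tool}")
        elif tool not in allowed:
            violations.append(f"tool not in allowed list: {tool}")
    # Profile-based risk check: a model that tends to guess must be told
    # explicitly to return null when the contract prefers null over guesses.
    rules = ir["output_contract"]["rules"]
    prompt = artifact.get("prompt", "").lower()
    if (rules.get("null_over_guess")
            and profile.get("speculation_tendency", 0.0) > speculation_limit
            and "null" not in prompt):
        violations.append("high-speculation model but prompt never mentions null")
    return violations
```

Checking for the literal word "null" is a crude proxy; a real gate would need a more robust check that the null policy was actually lowered into the artifact.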

Role 2: Learning signal collector — Observes all pre/post-execution events and collects feedback. Signals flow to both IR updates and model profile corrections.

Assurance Layer: The Call Was Right But the Result Can Be Wrong

If the adapter verifies "is the lowering correct?", the Assurance Layer verifies "is the final result correct?" These are different problems.

Even perfect lowering can produce wrong answers. The schema might match but the content could be hallucinated. The structure might be perfect but a key clause could be missed.

Without the Assurance Layer, the entire architecture becomes "the call was fine but who knows about the result."

Assurance runs four stages in order:

Stage 1: Structural validation — Does the result format match output_contract? JSON schema validation, forbidden field detection, token/length limits, null policy checks. Fully mechanical, no LLM needed.

Stage 2: Evidence validation — For roles with require_evidence: true, does the extracted content actually trace back to the input? Source span matching, evidence-conclusion consistency, missing evidence detection.

Stage 3: Tool policy post-check — Were called tools in the allowed list? Was tool_call_budget exceeded? Were tool results actually reflected in the final output?

Stage 4: Role-specific validation — Custom validators declared in quality_policy.validators. For a contract extractor: source_span_verification (does the cited location actually contain the clause?) or cross_clause_consistency (termination impossible but termination penalty exists?).
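
Stage 2 is the easiest of these to show in code. The sketch below assumes each extraction is either null or an object with a `text` field and a `span` of `[start, end]` character offsets into the input. That output shape is an assumption for illustration, since the actual `contract_terms_v1.json` schema is not shown in this post.

```python
def verify_evidence_spans(document_text: str, extractions: dict) -> list:
    """Check that each non-null extraction cites a span matching the source."""
    problems = []
    for clause, item in extractions.items():
        if item is None:  # null is a valid answer under null_over_guess
            continue
        span = item.get("span")
        if span is None:
            problems.append(f"{clause}: missing evidence span")
            continue
        start, end = span
        if not (0 <= start < end <= len(document_text)):
            problems.append(f"{clause}: span out of range")
        elif document_text[start:end] != item.get("text"):
            problems.append(f"{clause}: span text does not match extraction")
    return problems
```

An exact string comparison is the strictest possible check. Whether whitespace or casing differences should count as a mismatch is a policy decision that belongs in the role's validators.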

Failure severity determines handling:

Minor (e.g., verbose response) — Use result but log feedback for IR evolution
Major (e.g., missing required field, evidence span mismatch) — Enter fallback_chain
Critical (e.g., hallucination detected, forbidden tool called) — Immediate stop, human_review escalation, warning flag on role x model combination
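
The three severity levels map onto a small routing function. The enum, the action names, and the returned dictionary shape below are hypothetical; only the three behaviors themselves come from the list above.

```python
from enum import Enum


class Severity(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def route_failure(severity: Severity) -> dict:
    """Map a failure severity to the handling described above."""
    if severity is Severity.MINOR:
        # Keep the result; the feedback only informs later IR evolution.
        return {"use_result": True, "action": "log_feedback"}
    if severity is Severity.MAJOR:
        return {"use_result": False, "action": "enter_fallback_chain"}
    # CRITICAL: stop, escalate, and flag the role x model combination.
    return {
        "use_result": False,
        "action": "stop_and_escalate_human_review",
        "flag_role_model_pair": True,
    }
```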

IR Self-Evolution Mechanism

Role IR in this architecture is not a static document. It's a living contract that refines itself through execution experience.

The initial role.md contains only intent. The IR elevates that intent to a mechanically enforceable level, but it doesn't need to be perfect from the start. It concretizes automatically through execution.

Evolution Triggers

Not every piece of feedback immediately changes the IR. Changes trigger when:

Statistical trigger: A failure type repeats N+ times, or success rate drops below threshold
Regression trigger: An eval case that passed on previous IR version now fails
Periodic trigger: Every N executions, analyze accumulated feedback
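
The three triggers can be sketched as one check over the execution events recorded since the last IR change. The event shape and all thresholds (`repeat_threshold`, `min_success_rate`, `period`) are illustrative placeholders; the post leaves N and the success-rate threshold unspecified.

```python
from collections import Counter


def evolution_triggers(events, repeat_threshold=5, min_success_rate=0.9, period=100):
    """Return the names of triggers that fire for a list of execution events.

    Each event is a dict such as
    {"ok": bool, "failure_type": str or None, "regressed_eval": bool}.
    """
    fired = []
    if not events:
        return fired
    failures = Counter(e["failure_type"] for e in events if not e["ok"])
    success_rate = sum(1 for e in events if e["ok"]) / len(events)
    repeated = any(count >= repeat_threshold for count in failures.values())
    if repeated or success_rate < min_success_rate:
        fired.append("statistical")
    if any(e.get("regressed_eval") for e in events):
        fired.append("regression")
    if len(events) % period == 0:
        fired.append("periodic")
    return fired
```

A fired trigger here only means "analyze the accumulated feedback and propose a change". Whether a proposed change may be applied automatically is a separate question.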

Scope and Limits

Not all IR fields are eligible for auto-evolution. This boundary is critical.

Auto-evolvable (system can change)

tool_call_budget, max_retries, max_output_tokens
optimization_hints, fallback_chain order
Per-backend lowering strategies

Not auto-evolvable (human only)

objective — the role's purpose itself
output_contract.schema_ref — structural schema changes
tool_policy.forbidden — reducing the forbidden list (security)
behavior_policy.must_not_speculate — core behavioral principles

If the system could flip "must_not_speculate" to false on its own, the original role.md intent would be violated. Auto-evolution improves execution methods while preserving purpose — it doesn't change the purpose itself.
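
This boundary is straightforward to enforce in code. The sketch below uses an allowlist of dotted IR paths, so anything not explicitly listed requires a human. The paths follow the sample IR shown earlier, but the allowlist approach and the function names are assumptions of this sketch.

```python
AUTO_EVOLVABLE = (
    "tool_policy.tool_call_budget",
    "quality_policy.max_retries",
    "quality_policy.fallback_chain",
    "output_contract.rules.max_output_tokens",
    "optimization_hints",
)


def is_auto_evolvable(field_path: str) -> bool:
    """True if the dotted IR path is on, or nested under, the allowlist."""
    return any(field_path == allowed or field_path.startswith(allowed + ".")
               for allowed in AUTO_EVOLVABLE)


def apply_change(ir: dict, field_path: str, value) -> None:
    """Set a nested IR field, refusing anything outside the allowlist."""
    if not is_auto_evolvable(field_path):
        raise PermissionError(f"{field_path} requires human approval")
    *parents, leaf = field_path.split(".")
    node = ir
    for key in parents:
        node = node[key]
    node[leaf] = value
```

Denying by default means a newly added IR field starts out as human-only until someone deliberately marks it as safe for the system to change.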

POC Validation

No matter how elegant the design, it has to actually work.

We implemented the role_ir.yaml -> lowering -> single backend call -> assurance path and ran comparative experiments across two domains (contract extraction, invoice extraction) with multiple backends. All experiments ran in fair-mode (identical conditions), with POC (IR path) and Baseline (hardcoded prompt) running side by side.

Easy Test Set (Contract Eval-8, n=3 repeats)

8 contract cases, 3 repeats, 5 backends.

Backend	POC	Baseline	Interpretation
gemma-4-31b-it	success 24/24, strict 24/24	success 24/24, strict 24/24	Full parity
gemma-4-26b-a4b-it	success 24/24, strict 21/24	success 24/24, strict 21/24	Near-pass parity
gpt-oss-20b	success 24/24, strict 22/24	success 24/24, strict 23/24	Baseline slightly ahead
llama-3.1-8b-instant	success 22/24, strict 16/22	success 23/24, strict 15/23	POC quality edge, baseline stability edge
llama-3.3-70b-versatile	success 16/24, strict 11/16	success 15/24, strict 8/15	POC quality edge, rate limit impact

This is currently the most representative repeated snapshot. It doesn't show a broad win. It shows per-backend tradeoffs under the same output contract.
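
Because the strict counts have different denominators when some runs fail, comparing them as rates makes the tradeoff easier to see. This small sketch computes the POC-minus-baseline strict rate from the numbers in the table above; the helper name is hypothetical.

```python
from fractions import Fraction

# (strict passes, successful runs), copied from the eval-8 table
EVAL8 = {
    "gemma-4-31b-it": {"poc": (24, 24), "baseline": (24, 24)},
    "gemma-4-26b-a4b-it": {"poc": (21, 24), "baseline": (21, 24)},
    "gpt-oss-20b": {"poc": (22, 24), "baseline": (23, 24)},
    "llama-3.1-8b-instant": {"poc": (16, 22), "baseline": (15, 23)},
    "llama-3.3-70b-versatile": {"poc": (11, 16), "baseline": (8, 15)},
}


def strict_rate_delta(row: dict) -> float:
    """POC strict rate minus baseline strict rate, among successful runs."""
    poc = Fraction(*row["poc"])
    baseline = Fraction(*row["baseline"])
    return float(poc - baseline)


for backend, row in EVAL8.items():
    print(f"{backend:26s} {strict_rate_delta(row):+.3f}")
```

The deltas are zero for the two gemma models, slightly negative for gpt-oss-20b, and positive for the two llama models. With at most 24 runs per cell, none of these differences should be read as settled.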

Hard Test Set (Contract Hard Distractor, n=3)

5 contracts with intentionally confusing distractor contexts.

Backend	POC	Baseline	Interpretation
kimi-k2-instruct-0905	success 15/15, strict 9/15	success 15/15, strict 9/15	Top tier, parity
qwen3-32b	success 15/15, strict 8/15	success 15/15, strict 9/15	Baseline slightly ahead
llama-3.3-70b-versatile	success 15/15, strict 6/15	success 15/15, strict 6/15	Parity but weak
gpt-oss-20b	success 14/15, strict 6/14	success 15/15, strict 6/15	Roughly parity but weak
gpt-oss-120b	success 15/15, strict 6/15	success 14/15, strict 5/14	Roughly parity but weak

Here the story shifted from "which backend wins?" to "multiple backends over-predict distractor clauses." The main failure cluster was renewal_clause / penalty_clause false positives, especially in negation and distractor contexts.

This is a warning against claiming a clean IR win from the easy eval-8 set alone.

Second Domain (Invoice Hard, n=2)

These experiments extend beyond contract extraction into the invoice domain.

Backend	POC	Baseline	Interpretation
qwen3-32b	success 10/10, strict 10/10, target 10/10	success 10/10, strict 10/10, target 10/10	Stable parity
kimi-k2-instruct-0905	success 10/10, strict 10/10, target 10/10	success 10/10, strict 7/10, target 7/10	POC helps
gpt-oss-20b	success 10/10, strict 4/10, target 4/10	success 10/10, strict 9/10, target 9/10	POC hurts

All runs passed schema/evidence/assurance, so the decisive metric was target match. kimi-k2-instruct-0905 benefited from the POC path, but gpt-oss-20b got worse under the same strict target policy.

Current Verdict

Honestly:

The Harness IR POC succeeds in that the role_ir.yaml -> lowering -> single backend call -> assurance path works in a genuinely comparable format.

But it's too early to claim IR lowering is universally superior to hardcoded baselines.

Easy eval-8 shows per-backend tradeoffs
Hard distractor benchmark reveals widespread false positive vulnerability
Invoice hard shows domain boundary and evidence span fidelity effects varying by backend

The current message isn't "IR clear win" — it's "IR's gains and losses split by model, domain, and evaluation difficulty." Patterns like qwen stable / kimi POC-helped / gpt-oss POC-hurt appear, but at n=2-3 they need expansion before becoming stable conclusions.

Structural Pitfalls: Known Holes

It would be dishonest to only discuss the architecture's strengths. Here are structural pitfalls identified during design.

Hole 1: Overly abstract IR is useless

The most common failure. If the IR contains only "accurately," "consistently," "appropriately," "if needed" — the lowering engine and validators can't use it. IR must be declarative AND enforceable. Decoration isn't a contract.

Hole 2: Capability matrix goes stale quickly

Backend capabilities change frequently. Even OpenAI's prompt behavior can differ between model snapshots. If you define the capability matrix once and stop updating it, lowering that used to work quietly breaks one day. Version-specific capabilities, canary eval, and lowering regression tests are essential.

Hole 3: Lowering is more complex than it looks

It looks like simple mapping at first, but decisions arise: structured output works but destabilizes with simultaneous tool calling; adapter is available but hurts performance for this role; JSON mode is unreliable on this generic backend. Lowering is closer to runtime planning than template substitution.

Hole 4: Role consistency and factual accuracy are different problems

Even with a perfect IR, lowering, LoRA, fine-tuning, and Structured Outputs, the model can still be wrong about current information, external data, calculations, and authorization decisions. Retrieval, tool-first execution, validators, evidence checks, and business rule engines must exist separately.

Hole 5: Fallback chains are half of production operations

Many designs only draw the happy path well. In production, the failure path matters more: schema validation failures, tool timeouts, backend rate limits, missing adapters, degraded models, hallucinated fields. Fallback chains must live inside the Role Contract.

Hole 6: Without eval, this architecture doesn't survive

role x backend x lowering x model version x prompt bundle x schema x optional adapter grows combinatorially. "Is OpenAI lowering v3 better?" or "Did vLLM adapter v5 actually improve schema pass rate?" can't be answered by intuition. Per-role eval bundles are mandatory. Without them, it's a nice-looking abstraction that ends there.

Unsolved Problems

Open problems ranked by priority:

P0 — Profiling question set design. The architecture's biggest bottleneck. Measurement axes are defined, but "what to actually ask" is still empty. Discriminating power measurement, self-generation quality control, and cross-validation methodology are needed. Without this, the entire architecture doesn't function.

P0 — Evaluation (eval) framework. Both Assurance and IR evolution depend on eval, but the eval framework itself is underdesigned. Per-role test case design, eval case self-generation, and cross-backend eval criteria are needed.

P1 — Profile validity period. How long can a measured profile be trusted? Is lightweight re-verification (canary probing) feasible?

P1 — Lowering LLM reliability. The lowering LLM's own failure modes, non-determinism, and caching strategies haven't been empirically validated.

P1 — Cost and efficiency. Tens of thousands of calls for profiling, LLM lowering overhead per request, potential separate LLM calls for Assurance. Designed for quality, but operational costs can't be ignored.

P2 — IR self-evolution safety. Each individual change may be reasonable, but cumulative drift from original intent needs detection.

P3 — Multi-agent interaction. This document focused on single roles. Communication protocols, failure propagation, and pipeline-level fallback across multiple agents remain out of scope.

Why This Concept Actually Matters

Despite the limitations, this architecture has real value.

First, having a comparable structure is valuable in itself. Under the same Role IR, different backends can be compared on the same eval set. Not "this model is better" but "in this role, at this difficulty, how does this backend differ?" becomes structurally observable. The POC experiments enabled specific observations like "kimi benefits from IR on invoice, gpt-oss gets hurt."

Second, vendor dependency gets pushed to the execution layer. Upper design (role definitions, IR) stays stable; only lower layers (lowering) change. Switch from OpenAI to vLLM, or a new model drops — the role contract persists.

Third, model capability differences become "capability variance" instead of "tech debt." LoRA unavailable? The architecture doesn't break. On OpenAI, the same philosophy holds through "managed features composition" rather than "adapter injection."

Common Misconceptions

"Isn't this just prompt engineering?" — No. The prompt is one of lowering's outputs. IR sits above prompts as an execution contract, and the same IR produces different prompts per backend.
"Isn't this a LoRA architecture?" — It started that way, but was redefined from "LoRA-first MAS" to "Role-IR + Lowering MAS." LoRA is one option on adapter-capable backends.
"Just write a good IR and you're done?" — IR is the core, but without enforceable IR + continuously updated capability matrix + strong assurance/eval layer, it ends as a nice-looking abstraction. The POC experiments demonstrated this.

The Key Takeaway

The biggest realization from this design process: identical results aren't achieved through identical inputs, but through differently-calibrated inputs per model.

For high-speculation models, enforce null more aggressively. For weak structured output models, strengthen post-validation. For high-refusal models, adjust the prompt. Each takes a different path to the same destination.

And the basis for that calibration should be systematic measurement, not human intuition. That's the essence of this architecture.

Next in This Series

This post covered the design philosophy and overall structure. The whitepaper, design documents, POC code, and experiment results will all be published.

Next post will dive deeper into the profiling pipeline, lowering engine implementation details, and the false positive patterns revealed in hard distractor testing.

Thanks for reading.
