Design Done, Now Prove It — IR Lowering POC Implementation Plan

Hello. This is the fifth post in the Role-IR series.

Part 1 covered design philosophy, Part 2 enterprise infrastructure, Part 3 integrated strategy, Part 4 the adversarial assurance loop. Over four posts I designed quite a lot — but nothing actually runs yet.

This post answers: "So when you actually build this, what comes first?"

After writing four design documents, the main takeaway was this: no matter how refined a design is, without code it's all hypothesis. And the one hypothesis this project needs to prove first is:

"Does IR-based lowering improve structured extraction stability compared to hardcoded prompt templates?"

If this is proven, the rest of the infrastructure — orchestrator, DB, Queue, Worker pool — can be attached incrementally. If not, the previous four posts remain a theoretical contribution.

POC Scope: As Narrow as Possible

The most important thing in a POC is keeping the scope tight. Most of what the design documents covered gets deliberately excluded.

Item	POC Scope
Role	`contract_extractor` — just one
Models	3–5 representative backends (structured path, json mode path, stress path)
IR	Minimum fields only: objective, output_contract, tool_policy, behavior_policy, quality_policy
Lowering	Rule-based. Branches by provider capability
Assurance	Structural validation + evidence presence check + same-artifact retry only
Profiling	None (manual description of 2 model traits)
DB/Queue/Worker	None. Pure Python script

What We're Not Doing

This needs to be stated explicitly — not "we'll do it later" but "we're not doing it now."

Generator / Critic / Evidence Checker / Validator role separation loop (Part 4's adversarial verification)
Convergence detection, best-effort selection
Human review queue
Profile-based model selection, MoE, governance

These expand in the next phase only if POC results are positive. If IR lowering itself doesn't show a benefit, none of the extensions matter.

Core Component Design

1. Role IR Schema

The IR's Pydantic model. Only minimum fields needed for the POC.

class OutputContract(BaseModel):
    schema_ref: str                    # JSON Schema file path
    null_over_guess: bool = True       # No guessing — return null
    require_evidence: bool = True      # Source evidence required
    forbid_extra_fields: bool = True   # No fields beyond schema
    max_output_tokens: int = 2000

class BehaviorPolicy(BaseModel):
    must_not_speculate: bool = True
    verbosity: Literal["low", "medium", "high"] = "low"
    language: str = "match_input"

class QualityPolicy(BaseModel):
    validators: list[str]              # ["json_schema", "evidence_presence"]
    fallback_chain: list[str]          # ["retry_same", "human_review"]
    max_retries: int = 2

class RoleIR(BaseModel):
    role_id: str
    ir_version: str
    objective: list[str]
    input_contract: InputContract
    output_contract: OutputContract
    tool_policy: ToolPolicy
    behavior_policy: BehaviorPolicy
    quality_policy: QualityPolicy
    optimization_hints: dict = {}

The core principle: every field must be mechanically enforceable. Vague expressions like "carefully" or "appropriately" cannot enter the IR. This is where Part 1's "execution contracts, not prompts" principle becomes code.

2. Capability Matrix

Declares what each backend can and can't do.

class BackendCapability(BaseModel):
    backend_id: str
    structured_outputs: bool    # JSON schema response support
    tool_calling: bool
    guided_decoding: bool       # Constrained decoding support
    prompt_caching: bool

POC uses just two:

OpenAI: structured outputs supported, tool calling supported
Generic: neither supported, plain prompt + post-parse

How these differences create branching in the lowering engine is the POC's key observation point.

3. Lowering Engine

Takes IR + Capability and produces backend-specific call artifacts.

IR Field	OpenAI (structured_outputs=True)	Generic (structured_outputs=False)
`objective`	Insert into developer message	Insert at prompt top
`output_contract.schema_ref`	Use `response_format.json_schema`	Insert schema example into prompt + post-parse
`output_contract.null_over_guess`	Reinforce "return null" in developer message	Repeated emphasis in prompt
`behavior_policy.must_not_speculate`	Insert into developer message	Insert into prompt
`behavior_policy.verbosity=low`	Conciseness instruction in developer message	Conciseness instruction in prompt

The same IR transforms into different artifacts depending on backend capability. This is the essence of lowering, and the point where "the IR doesn't change when the model changes" actually works.

Output format:

class LoweringArtifact(BaseModel):
    backend_id: str
    # OpenAI path
    messages: list[dict] | None = None
    response_format: dict | None = None
    tools: list[dict] | None = None
    model: str | None = None
    # Generic path
    prompt: str | None = None
    post_processing: list[str] | None = None

4. LLM Backends

class BaseLLMBackend(ABC):
    @abstractmethod
    def call(self, artifact: LoweringArtifact) -> BackendResponse:
        ...

OpenAIBackend: Calls via openai SDK. Uses response_format.
GenericBackend: Sends plain prompt via httpx. Extracts JSON from response + repair.

5. Assurance Layer

POC validates in two stages only.

class AssuranceLayer:
    def validate(self, output, ir, input_text):
        # Stage 1: Structural validation
        structural = self.structural_validator.validate(output, ir)
        # Stage 2: Evidence validation (only when require_evidence=True)
        evidence = self.evidence_validator.validate(output, input_text, ir)
        return AssuranceResult(structural=structural, evidence=evidence)

Structural validation: JSON Schema match, required fields present, forbidden fields absent, null_over_guess violation detection
Evidence validation: Substring match — does each extracted clause's evidence_span actually exist in the input_text?

The adversarial verification loop, Circuit Breaker, and Selective Activation from Part 4 are absent here. Deliberately. The POC only needs to determine "does IR lowering itself help?"

6. Runner

Orchestrates the full pipeline in one pass.

class PipelineRunner:
    def run(self, role_dir, input_text, backend_id):
        ir = self.ir_loader.load(role_dir)
        capability = self.capability_store.get(backend_id)
        artifact = self.lowering_engine.lower(ir, capability)
        response = self.backend.call(artifact)
        assurance = self.assurance_layer.validate(
            response.output, ir, input_text)
        if not assurance.passed and retries < ir.quality_policy.max_retries:
            return self.run_with_retry(...)
        return PipelineResult(output=response.output, assurance=assurance)

role.md → IR → lowering → LLM call → assurance — this single flow is where the designs from Parts 1 through 4 meet in code.

Baseline Comparison: The Reason the POC Exists

The POC's core is an IR-based vs hardcoded comparison. Running only the IR pipeline and showing "it works" is meaningless. We need to compare against a hardcoded prompt on the same model with the same input.

Baseline (No IR)

HARDCODED_PROMPT = """
You are a contract clause extractor.
Extract termination/renewal/penalty clauses from the contract.
Output in JSON format.
"""

POC (IR-based)

runner = PipelineRunner(backend_id="openai")
result = runner.run("roles/contract_extractor", input_text, "openai")

Comparison Metrics

Run each N times and compare:

Metric	Measurement
Schema match rate	JSON Schema validation pass rate
Required field coverage	Percentage of required fields that aren't null
Evidence presence rate	Percentage of evidence_spans that actually exist in source text
Hallucination rate	Percentage of clauses generated without source text support
Speculation rate	Percentage of fields that should be null but were filled

Important: we don't declare Pass/Fail from a single number. Multi-sample runs, per-case failure reasons, and per-backend variance are all examined together.

Implementation Order

Step 1: Project Setup + IR Schema

pyproject.toml (pydantic, openai, httpx, jsonschema, pytest)
IR Pydantic model, YAML loader
Manually authored IR and output schema for contract_extractor
IR schema tests

Step 2: Lowering Engine

BackendCapability + 2 presets
LoweringArtifact data class
Rule-based lowering engine
Artifact generation tests per IR + capability combination

Step 3: LLM Backends

BaseLLMBackend ABC
OpenAI backend (openai SDK, response_format)
Generic backend (httpx, JSON extraction + repair)

Step 4: Assurance Layer

JSON schema structural validation
Evidence span substring match
Two-stage orchestrator
Assurance tests

Step 5: Runner + Baseline + Comparison

Full pipeline Runner
IR-based execution script
Hardcoded prompt baseline script
N-run result comparison script
Test contract fixtures

Five steps, but the key is writing tests first at each stage. Step 2 (Lowering) tests matter most — they must prove that the same IR meeting different capabilities produces different artifacts.

Dependencies

[project]
name = "harness-ir"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "jsonschema>=4.0",
    "openai>=1.0",
    "httpx>=0.27",
    "rich>=13.0",
]

Kept to a minimum. No DB, no Queue, no web framework. A CLI or script you run once and it finishes.

Completion Criteria and Kill Criteria

The most important thing in a POC is deciding "when it's a success and when we stop" before starting. Part 3 introduced the Kill Line concept; here it descends to concrete specifics.

Completion (Next phase can be discussed)

README, guides, and doc index consistently describe the repository as IR lowering POC
Multi-sample evaluation set expanded to at least 8 contracts, each with clause presence/null expectations documented
At least 3 representative backends with the following metrics recorded under identical criteria:
- Successful execution
- Schema match rate
- Field coverage
- Evidence presence rate
- Assurance pass rate
Per-case failure cause table and raw output references documented in experiment notes
Final interpretation can conservatively state superior / comparable / inferior per model

Kill (Better to stop this approach here)

Even after multi-sample and per-case analysis, no repeatable advantage is found
Observed differences are explained entirely by provider/model-specific traits, not lowering
Assurance cannot practically distinguish good output from bad output

Graduation Rule

Do not proceed to adversarial verification loops or governance extensions until the completion criteria above are met.

This is Part 3's Kill Line applied concretely. No matter how appealing the design is, if the POC can't prove it, we stop. That principle runs through this entire series.

Reflection

Writing a POC plan as the fifth post after four design documents might seem backwards. Usually you POC first and expand the design.

But in this case, the original design documents already existed, and filtering out overblown claims and converting them to realistic designs had to come first. The POC is the final step that tests whether "realistic" really means realistic.

There's only one hypothesis: "Does IR-based lowering improve structured extraction stability compared to hardcoded prompt templates?" Once that question has an answer — positive or negative — the next step becomes clear.

Design Done, Now Prove It — IR Lowering POC Implementation Plan

Design Done, Now Prove It — IR Lowering POC Implementation Plan

POC Scope: As Narrow as Possible

What We're Not Doing

Core Component Design

1. Role IR Schema

2. Capability Matrix

3. Lowering Engine

4. LLM Backends

5. Assurance Layer

6. Runner

Baseline Comparison: The Reason the POC Exists

Baseline (No IR)

POC (IR-based)

Comparison Metrics

Implementation Order

Step 1: Project Setup + IR Schema

Step 2: Lowering Engine

Step 3: LLM Backends

Step 4: Assurance Layer

Step 5: Runner + Baseline + Comparison

Dependencies

Completion Criteria and Kill Criteria

Completion (Next phase can be discussed)

Kill (Better to stop this approach here)

Graduation Rule

Reflection

Comments

Role-IR + Lowering Architecture

Measuring and Deploying Models — Profile-Based Selection and Matching

More from this blog

Assembling Systems, Not Building Semiconductors — A New Learning Path in the AI Era

Korean TTS Workbench — (5) Qwen3-TTS → VoxCPM2, Swapping the Model Out

Korean TTS Workbench — (4) When the Workarounds Failed, a Hardcoded Line Inside the Library

Korean TTS Workbench — (3) The Last Syllable Sounds Cut Off — and the Cause Was the Model Itself

Korean TTS Workbench — (2) Korean-Specific Tricks and Running on 4GB VRAM

Command Palette

Design Done, Now Prove It — IR Lowering POC Implementation Plan

POC Scope: As Narrow as Possible

What We're Not Doing

Core Component Design

1. Role IR Schema

2. Capability Matrix

3. Lowering Engine

4. LLM Backends

5. Assurance Layer

6. Runner

Baseline Comparison: The Reason the POC Exists

Baseline (No IR)

POC (IR-based)

Comparison Metrics

Implementation Order

Step 1: Project Setup + IR Schema

Step 2: Lowering Engine

Step 3: LLM Backends

Step 4: Assurance Layer

Step 5: Runner + Baseline + Comparison

Dependencies

Completion Criteria and Kill Criteria

Completion (Next phase can be discussed)

Kill (Better to stop this approach here)

Graduation Rule

Reflection

Comments

Role-IR + Lowering Architecture

Measuring and Deploying Models — Profile-Based Selection and Matching

More from this blog