Design Done, Now Prove It — IR Lowering POC Implementation Plan

role.md → IR → lowering → LLM call → assurance: a minimal implementation to test one hypothesis

I build data and AI systems that have to survive real constraints: time, cost, memory, and messy integration boundaries.

Hello. This is the fifth post in the Role-IR series.

Part 1 covered design philosophy, Part 2 enterprise infrastructure, Part 3 integrated strategy, Part 4 the adversarial assurance loop. Over four posts I designed quite a lot — but nothing actually runs yet.

This post answers: "So when you actually build this, what comes first?"

After writing four design documents, the main takeaway was this: no matter how refined a design is, without code it's all hypothesis. And the one hypothesis this project needs to prove first is:

"Does IR-based lowering improve structured extraction stability compared to hardcoded prompt templates?"

If this is proven, the rest of the infrastructure — orchestrator, DB, Queue, Worker pool — can be attached incrementally. If not, the previous four posts remain a theoretical contribution.


POC Scope: As Narrow as Possible

The most important thing in a POC is keeping the scope tight. Most of what the design documents covered gets deliberately excluded.

| Item | POC Scope |
| --- | --- |
| Role | contract_extractor, just one |
| Models | 3–5 representative backends (structured path, json mode path, stress path) |
| IR | Minimum fields only: objective, output_contract, tool_policy, behavior_policy, quality_policy |
| Lowering | Rule-based. Branches by provider capability |
| Assurance | Structural validation + evidence presence check + same-artifact retry only |
| Profiling | None (manual description of 2 model traits) |
| DB/Queue/Worker | None. Pure Python script |

What We're Not Doing

This needs to be stated explicitly — not "we'll do it later" but "we're not doing it now."

  • Generator / Critic / Evidence Checker / Validator role separation loop (Part 4's adversarial verification)
  • Convergence detection, best-effort selection
  • Human review queue
  • Profile-based model selection, MoE, governance

These expand in the next phase only if POC results are positive. If IR lowering itself doesn't show a benefit, none of the extensions matter.


Core Component Design

1. Role IR Schema

The IR's Pydantic model. Only minimum fields needed for the POC.

from typing import Literal

from pydantic import BaseModel, Field

# InputContract and ToolPolicy are not shown in this post; these empty
# placeholders only keep the snippet importable.
class InputContract(BaseModel):
    pass

class ToolPolicy(BaseModel):
    pass

class OutputContract(BaseModel):
    schema_ref: str                    # JSON Schema file path
    null_over_guess: bool = True       # No guessing: return null instead
    require_evidence: bool = True      # Source evidence required
    forbid_extra_fields: bool = True   # No fields beyond schema
    max_output_tokens: int = 2000

class BehaviorPolicy(BaseModel):
    must_not_speculate: bool = True
    verbosity: Literal["low", "medium", "high"] = "low"
    language: str = "match_input"

class QualityPolicy(BaseModel):
    validators: list[str]              # ["json_schema", "evidence_presence"]
    fallback_chain: list[str]          # ["retry_same", "human_review"]
    max_retries: int = 2

class RoleIR(BaseModel):
    role_id: str
    ir_version: str
    objective: list[str]
    input_contract: InputContract
    output_contract: OutputContract
    tool_policy: ToolPolicy
    behavior_policy: BehaviorPolicy
    quality_policy: QualityPolicy
    optimization_hints: dict = Field(default_factory=dict)

The core principle: every field must be mechanically enforceable. Vague expressions like "carefully" or "appropriately" cannot enter the IR. This is where Part 1's "execution contracts, not prompts" principle becomes code.
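To make "mechanically enforceable" concrete, here is a minimal sketch of what it means for a field to reject vague values. It uses stdlib dataclasses as a stand-in for the Pydantic models above, purely to stay dependency-free. `BehaviorPolicySketch` and `is_enforceable` are hypothetical names for illustration, not part of the POC.

```python
from dataclasses import dataclass

# Mirrors the Literal["low", "medium", "high"] in BehaviorPolicy.
ALLOWED_VERBOSITY = {"low", "medium", "high"}


@dataclass(frozen=True)
class BehaviorPolicySketch:
    """Stdlib stand-in for the Pydantic BehaviorPolicy (illustration only)."""
    must_not_speculate: bool = True
    verbosity: str = "low"
    language: str = "match_input"

    def __post_init__(self):
        # A closed value set can be checked by code; "carefully" cannot.
        if self.verbosity not in ALLOWED_VERBOSITY:
            raise ValueError(
                f"verbosity must be one of {sorted(ALLOWED_VERBOSITY)}, "
                f"got {self.verbosity!r}"
            )
        if not isinstance(self.must_not_speculate, bool):
            raise TypeError("must_not_speculate must be a bool")


def is_enforceable(raw: dict) -> bool:
    """Return True if the raw mapping builds a valid policy, False otherwise."""
    try:
        BehaviorPolicySketch(**raw)
        return True
    except (TypeError, ValueError):
        # TypeError also covers unknown keys such as {"tone": "appropriate"}.
        return False
```

With this sketch, `{"verbosity": "low"}` is accepted, while `{"verbosity": "carefully"}` and an unknown key like `{"tone": "appropriate"}` are rejected before anything reaches the lowering engine.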

2. Capability Matrix

Declares what each backend can and can't do.

class BackendCapability(BaseModel):
    backend_id: str
    structured_outputs: bool    # JSON schema response support
    tool_calling: bool
    guided_decoding: bool       # Constrained decoding support
    prompt_caching: bool

The POC defines just two capability presets:

  • OpenAI: structured outputs supported, tool calling supported
  • Generic: neither supported, plain prompt + post-parse

How these differences create branching in the lowering engine is the POC's key observation point.

3. Lowering Engine

Takes IR + Capability and produces backend-specific call artifacts.

| IR Field | OpenAI (structured_outputs=True) | Generic (structured_outputs=False) |
| --- | --- | --- |
| objective | Insert into developer message | Insert at prompt top |
| output_contract.schema_ref | Use response_format.json_schema | Insert schema example into prompt + post-parse |
| output_contract.null_over_guess | Reinforce "return null" in developer message | Repeated emphasis in prompt |
| behavior_policy.must_not_speculate | Insert into developer message | Insert into prompt |
| behavior_policy.verbosity=low | Conciseness instruction in developer message | Conciseness instruction in prompt |

The same IR transforms into different artifacts depending on backend capability. This is the essence of lowering, and the point where "the IR doesn't change when the model changes" actually works.
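The branching in the table can be sketched as a single rule-based function. This is a simplified illustration under several assumptions, not the POC's engine: `MiniIR` and `Capability` are reduced stdlib stand-ins for `RoleIR` and `BackendCapability`, the schema is inlined rather than loaded from `schema_ref`, the generic path embeds the schema itself rather than an example, and the `post_processing` step names are made up. The `response_format` dict follows the shape of OpenAI's JSON Schema response format as I understand it, so check it against the current SDK docs before relying on it.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    backend_id: str
    structured_outputs: bool


@dataclass(frozen=True)
class MiniIR:
    objective: list
    schema: dict  # inlined here; the post's IR stores a schema_ref path instead
    null_over_guess: bool = True
    must_not_speculate: bool = True


def instruction_lines(ir: MiniIR) -> list:
    """Instructions shared by both paths, derived only from IR fields."""
    lines = list(ir.objective)
    if ir.null_over_guess:
        lines.append("If a value is not stated in the input, return null.")
    if ir.must_not_speculate:
        lines.append("Do not speculate beyond the input text.")
    return lines


def lower(ir: MiniIR, cap: Capability) -> dict:
    """Rule-based lowering: the branch depends on capability, not on the IR."""
    text = "\n".join(instruction_lines(ir))
    if cap.structured_outputs:
        # Structured path: the backend enforces the schema.
        return {
            "backend_id": cap.backend_id,
            "messages": [{"role": "developer", "content": text}],
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": "extraction", "schema": ir.schema},
            },
        }
    # Generic path: no enforcement, so the schema goes into the prompt
    # and the output is parsed and validated afterwards.
    prompt = text + "\nRespond with JSON matching this schema:\n" + json.dumps(ir.schema)
    return {
        "backend_id": cap.backend_id,
        "prompt": prompt,
        "post_processing": ["extract_json", "validate_schema"],  # hypothetical names
    }
```

Calling `lower` twice with the same `MiniIR` and two different `Capability` values yields one artifact with `messages` and `response_format` and another with `prompt` and `post_processing`, while both carry the same "return null" instruction.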

Output format:

class LoweringArtifact(BaseModel):
    backend_id: str
    # OpenAI path
    messages: list[dict] | None = None
    response_format: dict | None = None
    tools: list[dict] | None = None
    model: str | None = None
    # Generic path
    prompt: str | None = None
    post_processing: list[str] | None = None

4. LLM Backends

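from abc import ABC, abstractmethod

class BaseLLMBackend(ABC):
    @abstractmethod
    def call(self, artifact: LoweringArtifact) -> BackendResponse:
        """Send one lowered artifact to the backend and return its response."""
        ...

  • OpenAIBackend: Calls via the openai SDK. Uses response_format.
  • GenericBackend: Sends a plain prompt via httpx, then extracts JSON from the response and repairs it.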
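For the generic path, the extraction half of "JSON extraction + repair" could look like the sketch below. It scans free-form model output for the first span that decodes as a JSON object, using the standard library's `json.JSONDecoder.raw_decode`. The function name is hypothetical, and the repair step (trailing commas, unquoted keys and so on) is deliberately left out.

```python
import json


def extract_first_json_object(text: str):
    """Return the first decodable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails after it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Not valid JSON from this brace; try the next one.
        start = text.find("{", start + 1)
    return None
```

This tolerates chatty prefixes and suffixes around the JSON, which is the most common failure on backends without structured outputs. Anything it returns still has to pass the same structural validation as the OpenAI path.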

5. Assurance Layer

POC validates in two stages only.

class AssuranceLayer:
    def validate(self, output, ir, input_text):
        # Stage 1: Structural validation
        structural = self.structural_validator.validate(output, ir)
        # Stage 2: Evidence validation (only when require_evidence=True)
        evidence = None
        if ir.output_contract.require_evidence:
            evidence = self.evidence_validator.validate(output, input_text, ir)
        return AssuranceResult(structural=structural, evidence=evidence)

  • Structural validation: JSON Schema match, required fields present, forbidden fields absent, null_over_guess violation detection
  • Evidence validation: Substring match. Does each extracted clause's evidence_span actually exist in the input_text?
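The two checks that do not need a schema library are small enough to sketch directly. The `evidence_span` field name comes from the description above; the shape of the output (a list of clause dicts) and both function names are assumptions for illustration.

```python
def missing_evidence(clauses: list, input_text: str) -> list:
    """Return indexes of clauses whose evidence_span is absent from input_text."""
    missing = []
    for i, clause in enumerate(clauses):
        span = clause.get("evidence_span")
        # A missing or empty span counts as no evidence.
        if not span or span not in input_text:
            missing.append(i)
    return missing


def extra_fields(output: dict, allowed: set) -> set:
    """Return top-level keys the schema does not allow (forbid_extra_fields)."""
    return set(output) - allowed
```

Exact substring matching is strict on purpose: a span that differs from the source by whitespace or punctuation is reported as missing. Whether that strictness produces too many false failures is one of the things the per-case failure analysis should reveal.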

The adversarial verification loop, Circuit Breaker, and Selective Activation from Part 4 are absent here. Deliberately. The POC only needs to determine "does IR lowering itself help?"

6. Runner

Orchestrates the full pipeline in one pass.

class PipelineRunner:
    def run(self, role_dir, input_text, backend_id):
        ir = self.ir_loader.load(role_dir)
        capability = self.capability_store.get(backend_id)
        # Lower once; every retry reuses the same artifact (same-artifact retry).
        artifact = self.lowering_engine.lower(ir, capability)
        # One initial attempt plus up to max_retries retries.
        for attempt in range(ir.quality_policy.max_retries + 1):
            response = self.backend.call(artifact)
            assurance = self.assurance_layer.validate(
                response.output, ir, input_text)
            if assurance.passed:
                break
        return PipelineResult(output=response.output, assurance=assurance)

role.md → IR → lowering → LLM call → assurance — this single flow is where the designs from Parts 1 through 4 meet in code.


Baseline Comparison: The Reason the POC Exists

The POC's core is an IR-based vs hardcoded comparison. Running only the IR pipeline and showing "it works" is meaningless. We need to compare against a hardcoded prompt on the same model with the same input.

Baseline (No IR)

HARDCODED_PROMPT = """
You are a contract clause extractor.
Extract termination/renewal/penalty clauses from the contract.
Output in JSON format.
"""

POC (IR-based)

runner = PipelineRunner(backend_id="openai")
result = runner.run("roles/contract_extractor", input_text, "openai")

Comparison Metrics

Run each N times and compare:

| Metric | Measurement |
| --- | --- |
| Schema match rate | JSON Schema validation pass rate |
| Required field coverage | Percentage of required fields that aren't null |
| Evidence presence rate | Percentage of evidence_spans that actually exist in source text |
| Hallucination rate | Percentage of clauses generated without source text support |
| Speculation rate | Percentage of fields that should be null but were filled |

Important: we don't declare Pass/Fail from a single number. Multi-sample runs, per-case failure reasons, and per-backend variance are all examined together.
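As a rough sketch of what "examined together" might look like in the comparison script, the function below aggregates N runs into rates plus a spread, so variance is visible next to the mean. The record keys (`schema_ok`, `spans_total`, `spans_found`) and function names are hypothetical; only two of the five metrics are shown.

```python
from statistics import mean, pstdev


def pass_rate(flags: list) -> float:
    """Fraction of truthy values; 0.0 for an empty list."""
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0


def summarize(runs: list) -> dict:
    """Aggregate per-run records into rates plus spread across runs.

    Each run is a dict with hypothetical keys:
      schema_ok (bool), spans_total (int), spans_found (int)
    """
    schema_rate = pass_rate([r["schema_ok"] for r in runs])
    # Evidence presence per run; runs that produced no spans are skipped.
    per_run_evidence = [
        r["spans_found"] / r["spans_total"] for r in runs if r["spans_total"] > 0
    ]
    return {
        "n": len(runs),
        "schema_match_rate": schema_rate,
        "evidence_presence_mean": mean(per_run_evidence) if per_run_evidence else 0.0,
        "evidence_presence_stdev": pstdev(per_run_evidence) if per_run_evidence else 0.0,
    }
```

Running `summarize` separately for the baseline and the IR pipeline on each backend gives comparable rows. The per-case failure reasons still have to be recorded by hand next to these numbers.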


Implementation Order

Step 1: Project Setup + IR Schema

  • pyproject.toml (pydantic, pyyaml, jsonschema, openai, httpx, rich; pytest for tests)
  • IR Pydantic model, YAML loader
  • Manually authored IR and output schema for contract_extractor
  • IR schema tests

Step 2: Lowering Engine

  • BackendCapability + 2 presets
  • LoweringArtifact data class
  • Rule-based lowering engine
  • Artifact generation tests per IR + capability combination

Step 3: LLM Backends

  • BaseLLMBackend ABC
  • OpenAI backend (openai SDK, response_format)
  • Generic backend (httpx, JSON extraction + repair)

Step 4: Assurance Layer

  • JSON schema structural validation
  • Evidence span substring match
  • Two-stage orchestrator
  • Assurance tests

Step 5: Runner + Baseline + Comparison

  • Full pipeline Runner
  • IR-based execution script
  • Hardcoded prompt baseline script
  • N-run result comparison script
  • Test contract fixtures

Five steps, but the key is writing tests first at each stage. Step 2 (Lowering) tests matter most — they must prove that the same IR meeting different capabilities produces different artifacts.


Dependencies

[project]
name = "harness-ir"
requires-python = ">=3.11"
dependencies = [
    "pydantic>=2.0",
    "pyyaml>=6.0",
    "jsonschema>=4.0",
    "openai>=1.0",
    "httpx>=0.27",
    "rich>=13.0",
]

Kept to a minimum. No DB, no Queue, no web framework. A CLI or script you run once and it finishes.


Completion Criteria and Kill Criteria

Just as important as a tight scope is deciding, before starting, when the POC counts as a success and when we stop. Part 3 introduced the Kill Line concept; here it becomes concrete.

Completion (Next phase can be discussed)

  • README, guides, and doc index consistently describe the repository as an IR lowering POC
  • Multi-sample evaluation set expanded to at least 8 contracts, each with clause presence/null expectations documented
  • At least 3 representative backends with the following metrics recorded under identical criteria:
    • Successful execution
    • Schema match rate
    • Field coverage
    • Evidence presence rate
    • Assurance pass rate
  • Per-case failure cause table and raw output references documented in experiment notes
  • Final interpretation can conservatively state superior / comparable / inferior per model

Kill (Better to stop this approach here)

  • Even after multi-sample and per-case analysis, no repeatable advantage is found
  • Observed differences are explained entirely by provider/model-specific traits, not lowering
  • Assurance cannot practically distinguish good output from bad output

Graduation Rule

Do not proceed to adversarial verification loops or governance extensions until the completion criteria above are met.

This is Part 3's Kill Line applied concretely. No matter how appealing the design is, if the POC can't prove it, we stop. That principle runs through this entire series.


Reflection

Writing a POC plan as the fifth post after four design documents might seem backwards. Usually you POC first and expand the design.

But in this case, the original design documents already existed, and filtering out overblown claims and converting them to realistic designs had to come first. The POC is the final step that tests whether "realistic" really means realistic.

There's only one hypothesis: "Does IR-based lowering improve structured extraction stability compared to hardcoded prompt templates?" Once that question has an answer — positive or negative — the next step becomes clear.