Design Done, Now Prove It — IR Lowering POC Implementation Plan
role.md → IR → lowering → LLM call → assurance: a minimal implementation to test one hypothesis
Hello. This is the fifth post in the Role-IR series.
Part 1 covered design philosophy, Part 2 enterprise infrastructure, Part 3 integrated strategy, Part 4 the adversarial assurance loop. Over four posts I designed quite a lot — but nothing actually runs yet.
This post answers: "So when you actually build this, what comes first?"
After writing four design documents, the main takeaway was this: no matter how refined a design is, without code it's all hypothesis. And the one hypothesis this project needs to prove first is:
"Does IR-based lowering improve structured extraction stability compared to hardcoded prompt templates?"
If the answer is yes, the rest of the infrastructure (orchestrator, DB, Queue, Worker pool) can be attached incrementally. If not, the previous four posts remain a theoretical contribution.
POC Scope: As Narrow as Possible
The most important thing in a POC is keeping the scope tight. Most of what the design documents covered gets deliberately excluded.
| Item | POC Scope |
| Role | contract_extractor — just one |
| Models | 3–5 representative backends (structured path, json mode path, stress path) |
| IR | Minimum fields only: objective, output_contract, tool_policy, behavior_policy, quality_policy |
| Lowering | Rule-based. Branches by provider capability |
| Assurance | Structural validation + evidence presence check + same-artifact retry only |
| Profiling | None (manual description of 2 model traits) |
| DB/Queue/Worker | None. Pure Python script |
What We're Not Doing
This needs to be stated explicitly — not "we'll do it later" but "we're not doing it now."
- Generator / Critic / Evidence Checker / Validator role separation loop (Part 4's adversarial verification)
- Convergence detection, best-effort selection
- Human review queue
- Profile-based model selection, MoE, governance
These expand in the next phase only if POC results are positive. If IR lowering itself doesn't show a benefit, none of the extensions matter.
Core Component Design
1. Role IR Schema
The IR's Pydantic model. Only minimum fields needed for the POC.
from typing import Literal

from pydantic import BaseModel

class OutputContract(BaseModel):
    schema_ref: str                   # JSON Schema file path
    null_over_guess: bool = True      # No guessing: return null
    require_evidence: bool = True     # Source evidence required
    forbid_extra_fields: bool = True  # No fields beyond schema
    max_output_tokens: int = 2000

class BehaviorPolicy(BaseModel):
    must_not_speculate: bool = True
    verbosity: Literal["low", "medium", "high"] = "low"
    language: str = "match_input"

class QualityPolicy(BaseModel):
    validators: list[str]      # ["json_schema", "evidence_presence"]
    fallback_chain: list[str]  # ["retry_same", "human_review"]
    max_retries: int = 2

# InputContract and ToolPolicy are not shown in this post.
class RoleIR(BaseModel):
    role_id: str
    ir_version: str
    objective: list[str]
    input_contract: InputContract
    output_contract: OutputContract
    tool_policy: ToolPolicy
    behavior_policy: BehaviorPolicy
    quality_policy: QualityPolicy
    optimization_hints: dict = {}
The core principle: every field must be mechanically enforceable. Vague expressions like "carefully" or "appropriately" cannot enter the IR. This is where Part 1's "execution contracts, not prompts" principle becomes code.
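One way to read "mechanically enforceable" is that every policy value comes from a closed set that a loader can check. The sketch below is only an illustration of that idea. It uses stdlib dataclasses in place of the Pydantic models above so that it runs without dependencies, and `BehaviorPolicySketch`, `ALLOWED_VERBOSITY`, and `is_enforceable` are hypothetical names, not part of the POC.

```python
from dataclasses import dataclass

# Closed set of values, mirroring Literal["low", "medium", "high"] in the IR.
ALLOWED_VERBOSITY = ("low", "medium", "high")


@dataclass(frozen=True)
class BehaviorPolicySketch:
    """Stdlib stand-in for BehaviorPolicy; not the POC's actual model."""
    must_not_speculate: bool = True
    verbosity: str = "low"

    def __post_init__(self):
        # A flag is enforceable only if it is a real bool, not free text.
        if not isinstance(self.must_not_speculate, bool):
            raise ValueError("must_not_speculate must be a bool")
        # Free-form wording such as "carefully" is rejected at load time.
        if self.verbosity not in ALLOWED_VERBOSITY:
            raise ValueError(f"verbosity must be one of {ALLOWED_VERBOSITY}")


def is_enforceable(**fields) -> bool:
    """Return True if the given fields form a valid, checkable policy."""
    try:
        BehaviorPolicySketch(**fields)
        return True
    except (ValueError, TypeError):
        # TypeError covers field names the policy does not define.
        return False
```

With this shape, `is_enforceable(verbosity="carefully")` is rejected before any prompt is generated, which is the property the IR is meant to guarantee. In the Pydantic version, the `Literal` annotation should give the same rejection for `verbosity` without a hand-written check.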
2. Capability Matrix
Declares what each backend can and can't do.
class BackendCapability(BaseModel):
    backend_id: str
    structured_outputs: bool  # JSON schema response support
    tool_calling: bool
    guided_decoding: bool     # Constrained decoding support
    prompt_caching: bool
POC uses just two:
- OpenAI: structured outputs supported, tool calling supported
- Generic: neither supported, plain prompt + post-parse
How these differences create branching in the lowering engine is the POC's key observation point.
3. Lowering Engine
Takes IR + Capability and produces backend-specific call artifacts.
| IR Field | OpenAI (structured_outputs=True) | Generic (structured_outputs=False) |
| objective | Insert into developer message | Insert at prompt top |
| output_contract.schema_ref | Use response_format.json_schema | Insert schema example into prompt + post-parse |
| output_contract.null_over_guess | Reinforce "return null" in developer message | Repeated emphasis in prompt |
| behavior_policy.must_not_speculate | Insert into developer message | Insert into prompt |
| behavior_policy.verbosity=low | Conciseness instruction in developer message | Conciseness instruction in prompt |
The same IR transforms into different artifacts depending on backend capability. This is the essence of lowering, and the point where "the IR doesn't change when the model changes" actually works.
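The table above can be sketched as a single rule-based function. This is a minimal illustration under stated assumptions, not the POC's engine: `IRSketch` and `Capability` are reduced, hypothetical stand-ins for `RoleIR` and `BackendCapability`, the instruction wording is invented, artifacts are plain dicts, and the `response_format` shape follows my understanding of OpenAI's structured outputs parameter and should be checked against the current API reference.

```python
import json
from dataclasses import dataclass


@dataclass
class Capability:
    backend_id: str
    structured_outputs: bool


@dataclass
class IRSketch:
    # Reduced, hypothetical subset of RoleIR fields for illustration.
    objective: list
    schema: dict  # JSON Schema already loaded from schema_ref
    null_over_guess: bool = True
    must_not_speculate: bool = True
    verbosity: str = "low"


def instruction_lines(ir):
    """Instructions both paths need, derived only from IR fields."""
    lines = list(ir.objective)
    if ir.null_over_guess:
        lines.append("If a value is not stated in the input, return null.")
    if ir.must_not_speculate:
        lines.append("Do not speculate beyond the input text.")
    if ir.verbosity == "low":
        lines.append("Be concise.")
    return lines


def lower(ir, cap):
    """Rule-based lowering: one IR, two artifact shapes."""
    text = "\n".join(instruction_lines(ir))
    if cap.structured_outputs:
        # Structured path: the schema travels in response_format, not in prose.
        return {
            "backend_id": cap.backend_id,
            "messages": [{"role": "developer", "content": text}],
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": "extraction", "schema": ir.schema},
            },
        }
    # Generic path: the schema is embedded in the prompt and parsed afterwards.
    prompt = (
        text
        + "\n\nRespond with JSON only, matching this schema:\n"
        + json.dumps(ir.schema, indent=2)
    )
    if ir.null_over_guess:
        prompt += "\nReminder: use null rather than guessing."
    return {
        "backend_id": cap.backend_id,
        "prompt": prompt,
        "post_processing": ["extract_json", "validate_schema"],
    }
```

Calling `lower` twice with the same `IRSketch` and two different `Capability` values yields one artifact with `messages` and `response_format` and another with `prompt` and `post_processing`, which is the difference the Step 2 tests are meant to pin down.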
Output format:
class LoweringArtifact(BaseModel):
    backend_id: str
    # OpenAI path
    messages: list[dict] | None = None
    response_format: dict | None = None
    tools: list[dict] | None = None
    model: str | None = None
    # Generic path
    prompt: str | None = None
    post_processing: list[str] | None = None
4. LLM Backends
from abc import ABC, abstractmethod

class BaseLLMBackend(ABC):
    @abstractmethod
    def call(self, artifact: LoweringArtifact) -> BackendResponse:
        ...
- OpenAIBackend: Calls via openai SDK. Uses response_format.
- GenericBackend: Sends plain prompt via httpx. Extracts JSON from response + repair.
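The "extract JSON + repair" step on the generic path could look roughly like the following. This is a sketch under assumptions, not the POC's parser: `extract_json` is a hypothetical helper, and the only repair it attempts is removing trailing commas, which can in principle alter a string value that happens to contain a comma followed by a closing bracket.

```python
import json
import re

# Built at runtime so this example contains no literal code-fence marker.
FENCE = "`" * 3


def extract_json(text):
    """Best-effort JSON object extraction from a free-form model reply.

    Returns the parsed dict, or None if nothing parseable is found.
    """
    # 1. Prefer the body of a fenced block if the model wrapped its answer.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text

    # 2. Keep only the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = candidate[start:end + 1]

    # 3. Try strict parsing first, then one conservative repair:
    #    drop trailing commas before a closing brace or bracket.
    repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
    for attempt in (candidate, repaired):
        try:
            parsed = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` instead of raising keeps the decision about retrying in the runner, where `max_retries` lives.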
5. Assurance Layer
POC validates in two stages only.
class AssuranceLayer:
    def validate(self, output, ir, input_text):
        # Stage 1: Structural validation
        structural = self.structural_validator.validate(output, ir)
        # Stage 2: Evidence validation (only when require_evidence=True)
        evidence = None
        if ir.output_contract.require_evidence:
            evidence = self.evidence_validator.validate(output, input_text, ir)
        return AssuranceResult(structural=structural, evidence=evidence)
- Structural validation: JSON Schema match, required fields present, forbidden fields absent, null_over_guess violation detection
- Evidence validation: Substring match — does each extracted clause's evidence_span actually exist in the input_text?
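The two checks above can be sketched with the standard library alone. The output shape (a list of clause dicts, each with an `evidence_span` key) and the helper names are assumptions made for illustration. The whitespace normalization is also my addition on top of a plain substring match, so that a line wrap in the contract does not count as missing evidence.

```python
def normalize(text):
    """Collapse whitespace so line wraps in the source do not break matching."""
    return " ".join(text.split())


def check_evidence(clauses, input_text):
    """Return the clauses whose evidence_span is missing or not in the input.

    `clauses` is assumed to be a list of dicts with an "evidence_span" key;
    that output shape is hypothetical.
    """
    haystack = normalize(input_text)
    failures = []
    for clause in clauses:
        span = clause.get("evidence_span")
        if not span or normalize(span) not in haystack:
            failures.append(clause)
    return failures


def check_extra_fields(output, allowed_fields):
    """forbid_extra_fields: report top-level keys outside the schema."""
    return sorted(set(output) - set(allowed_fields))
```

An empty list from both functions means the output passed these two checks. Anything else doubles as the per-case failure reason that the comparison step needs.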
The adversarial verification loop, Circuit Breaker, and Selective Activation from Part 4 are absent here. Deliberately. The POC only needs to determine "does IR lowering itself help?"
6. Runner
Orchestrates the full pipeline in one pass.
class PipelineRunner:
    def run(self, role_dir, input_text, backend_id):
        ir = self.ir_loader.load(role_dir)
        capability = self.capability_store.get(backend_id)
        artifact = self.lowering_engine.lower(ir, capability)
        # Same-artifact retry: one initial attempt plus up to max_retries retries
        for _attempt in range(ir.quality_policy.max_retries + 1):
            response = self.backend.call(artifact)
            assurance = self.assurance_layer.validate(
                response.output, ir, input_text)
            if assurance.passed:
                break
        return PipelineResult(output=response.output, assurance=assurance)
role.md → IR → lowering → LLM call → assurance — this single flow is where the designs from Parts 1 through 4 meet in code.
Baseline Comparison: The Reason the POC Exists
The POC's core is an IR-based vs hardcoded comparison. Running only the IR pipeline and showing "it works" is meaningless. We need to compare against a hardcoded prompt on the same model with the same input.
Baseline (No IR)
HARDCODED_PROMPT = """
You are a contract clause extractor.
Extract termination/renewal/penalty clauses from the contract.
Output in JSON format.
"""
POC (IR-based)
runner = PipelineRunner(backend_id="openai")
result = runner.run("roles/contract_extractor", input_text, "openai")
Comparison Metrics
Run each N times and compare:
| Metric | Measurement |
| Schema match rate | JSON Schema validation pass rate |
| Required field coverage | Percentage of required fields that aren't null |
| Evidence presence rate | Percentage of evidence_spans that actually exist in source text |
| Hallucination rate | Percentage of clauses generated without source text support |
| Speculation rate | Percentage of fields that should be null but were filled |
Important: we don't declare Pass/Fail from a single number. Multi-sample runs, per-case failure reasons, and per-backend variance are all examined together.
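As a rough sketch of how N runs might be aggregated, the function below turns per-run records into three of the rates from the table plus a spread value. The record layout and the names `rate` and `summarize` are assumptions for illustration. Averaging per-run evidence rates rather than pooling all spans is one possible choice, not something the design prescribes.

```python
from statistics import mean, pstdev


def rate(flags):
    """Fraction of True values; None when there is nothing to measure."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else None


def summarize(runs):
    """Aggregate per-run records into comparison metrics.

    Each run is assumed to be a dict like:
      {"schema_ok": bool,
       "evidence_found": [bool, ...],         # one flag per evidence_span
       "should_be_null_filled": [bool, ...]}  # one flag per field expected null
    """
    per_run_evidence = [rate(r["evidence_found"]) for r in runs]
    # Runs that produced no evidence spans are excluded from the evidence rate.
    per_run_evidence = [x for x in per_run_evidence if x is not None]
    return {
        "n": len(runs),
        "schema_match_rate": rate(r["schema_ok"] for r in runs),
        "evidence_presence_rate": mean(per_run_evidence) if per_run_evidence else None,
        "evidence_presence_spread": pstdev(per_run_evidence) if per_run_evidence else None,
        "speculation_rate": rate(
            f for r in runs for f in r["should_be_null_filled"]
        ),
    }
```

Running `summarize` once over the baseline runs and once over the IR-based runs, per backend, gives two dictionaries to place side by side. The spread value is there so that a difference in means is not read without its variance.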
Implementation Order
Step 1: Project Setup + IR Schema
- pyproject.toml (pydantic, openai, httpx, jsonschema, pytest)
- IR Pydantic model, YAML loader
- Manually authored IR and output schema for contract_extractor
- IR schema tests
Step 2: Lowering Engine
- BackendCapability + 2 presets
- LoweringArtifact data class
- Rule-based lowering engine
- Artifact generation tests per IR + capability combination
Step 3: LLM Backends
- BaseLLMBackend ABC
- OpenAI backend (openai SDK, response_format)
- Generic backend (httpx, JSON extraction + repair)
Step 4: Assurance Layer
- JSON schema structural validation
- Evidence span substring match
- Two-stage orchestrator
- Assurance tests
Step 5: Runner + Baseline + Comparison
- Full pipeline Runner
- IR-based execution script
- Hardcoded prompt baseline script
- N-run result comparison script
- Test contract fixtures
Five steps, but the key is writing tests first at each stage. Step 2 (Lowering) tests matter most — they must prove that the same IR meeting different capabilities produces different artifacts.
Dependencies
[project]
name = "harness-ir"
requires-python = ">=3.11"
dependencies = [
"pydantic>=2.0",
"pyyaml>=6.0",
"jsonschema>=4.0",
"openai>=1.0",
"httpx>=0.27",
"rich>=13.0",
]
Kept to a minimum. No DB, no Queue, no web framework. A CLI or script you run once and it finishes.
Completion Criteria and Kill Criteria
The most important thing in a POC is deciding "when it's a success and when we stop" before starting. Part 3 introduced the Kill Line concept; here it is turned into concrete criteria.
Completion (Next phase can be discussed)
- README, guides, and doc index consistently describe the repository as IR lowering POC
- Multi-sample evaluation set expanded to at least 8 contracts, each with clause presence/null expectations documented
- At least 3 representative backends with the following metrics recorded under identical criteria:
  - Successful execution
  - Schema match rate
  - Field coverage
  - Evidence presence rate
  - Assurance pass rate
- Per-case failure cause table and raw output references documented in experiment notes
- Final interpretation can conservatively state superior / comparable / inferior per model
Kill (Better to stop this approach here)
- Even after multi-sample and per-case analysis, no repeatable advantage is found
- Observed differences are explained entirely by provider/model-specific traits, not lowering
- Assurance cannot practically distinguish good output from bad output
Graduation Rule
Do not proceed to adversarial verification loops or governance extensions until the completion criteria above are met.
This is Part 3's Kill Line applied concretely. No matter how appealing the design is, if the POC can't prove it, we stop. That principle runs through this entire series.
Reflection
Writing a POC plan as the fifth post after four design documents might seem backwards. Usually you POC first and expand the design.
But in this case, the original design documents already existed, and filtering out overblown claims and converting them to realistic designs had to come first. The POC is the final step that tests whether "realistic" really means realistic.
There's only one hypothesis: "Does IR-based lowering improve structured extraction stability compared to hardcoded prompt templates?" Once that question has an answer — positive or negative — the next step becomes clear.

