Measuring and Deploying Models — Profile-Based Selection and Matching

Hello. This is the sixth post in the Role-IR series.

Parts 1 through 5 covered design philosophy, enterprise infrastructure, integrated strategy, adversarial verification, and the POC implementation plan. The Part 5 POC focused on testing whether declarative IR outperforms hardcoded prompts. This post asks the next question:

"How can we select and deploy models more effectively?"

In the current structure, model selection for each role is a manual human decision. That works when you have 2–3 candidates, but as the pool grows, you need a systematic way to judge "which model fits this role."

This post explores building that judgment by measuring model behavioral traits as numerical scores and comparing them against role requirements. This is still at the experiment planning stage — nothing here has been validated yet.

Core Distinction: Policies Stay Declarative, Selection Goes Numerical

First, an important clarification.

null_over_guess: true
forbid_extra_fields: true
require_evidence: true

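Rules like these must stay declarative so they can be enforced mechanically. Expressing "no speculation" as a numerical value such as 0.05 blurs the boundary: a validator can check a true/false rule, but it cannot decide whether a given output stays within "0.05 of speculation."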
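To make "mechanical enforcement" concrete, here is a minimal sketch of a validator for the three flags above. The output shape (each field carrying a `value` and an `evidence` quote), the function name `check_output`, and the reading of `null_over_guess` as "evidence must appear verbatim in the source text" are all assumptions for illustration, not the Role-IR implementation.

```python
POLICY = {"null_over_guess": True, "forbid_extra_fields": True, "require_evidence": True}


def check_output(output, source_text, schema_fields, policy=POLICY):
    """Return a list of policy violations; an empty list means the output passes."""
    violations = []
    for name, entry in output.items():
        if name not in schema_fields:
            if policy["forbid_extra_fields"]:
                violations.append(f"extra field: {name}")
            continue
        value, evidence = entry.get("value"), entry.get("evidence")
        if value is None:
            continue  # null is always an acceptable answer
        if not evidence:
            if policy["require_evidence"]:
                violations.append(f"no evidence: {name}")
        elif policy["null_over_guess"] and evidence not in source_text:
            # Assumption: a quote that is not in the source marks a guess
            # that should have been null.
            violations.append(f"unsupported evidence: {name}")
    return violations
```

Each check is a yes/no decision, which is exactly what a score like 0.05 cannot give you.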

On the other hand, numerical representation is useful for:

Model behavioral traits (instruction adherence, speculation tendency, evidence citation frequency)
Role requirement intensity (structured output importance, speculation tolerance)
Fitness between role and model

Policies are enforced declaratively; selection is judged numerically — that's the premise of this post.

Premise: API-Centric Experimentation

One more premise matters: the experiments in this post are conducted mostly through external API calls.

No direct modification of model internal parameters
Fine-tuning, LoRA, adapter training are not required at this stage
We're validating black-box API profiling + numerical selection/control

This is not an experiment about "changing models." It's about "measuring models better at the API level and deploying them better."

Hypothesis Layers

The hypotheses for profile-based selection stack in layers.

Representation Hypothesis

Expressing role requirements and model behavioral traits as numerical scores enables more stable selection than declarative metadata alone.

Matching Hypothesis

Using fitness scores between role scores and model profiles yields better structured extraction performance than fixed model selection.

Lowering Hypothesis

Reflecting numerical selection information (model choice, reinforced instructions, per-field routing) in lowering reduces quality variance from the same Role IR.

Governance Hypothesis

Profile-based role assignment (generator/critic/evidence checker) makes the adversarial verification loop converge faster and produces better best-effort results.

Testing all four at once would make it impossible to attribute a result to any one layer. They must be built up one layer at a time.

Experiment Stages

Stage 1: Establish Baseline

Confirm that the Part 5 POC serves as a stable comparison baseline.

Metrics:

Schema match rate, field coverage, evidence presence rate, per-model variance

Success criteria:

Difference between baseline and IR approach is reproducible per model
Variability on repeated calls to the same backend is measurable

Stage 2: API-Based Model Profiling

Question: Can model behavioral traits be measured reliably through API calls alone?

Run probe sets hundreds to thousands of times per model to collect behavioral scores.

Measurement axes:

instruction_adherence
structured_output_stability
field_completeness
speculation_tendency
evidence_citation_tendency
reproducibility

Example output:

model_profile:
  model_id: gpt-oss-120b
  profiling_version: "2026-04-06"
  total_probes: 1200
  dimensions:
    structured_output_stability: { score: 0.93, confidence: 0.95 }
    speculation_tendency: { score: 0.22, confidence: 0.91 }
    evidence_citation_tendency: { score: 0.81, confidence: 0.89 }

The key insight: these scores are interpretable behavioral scores, not embedding vectors. Each dimension must be human-readable.
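As a rough sketch of how probe results could be rolled up into such a profile, the function below treats each probe as a pass/fail observation on one dimension and reports the pass rate. The function name `build_profile` and the input shape are hypothetical. How the `confidence` value in the example above is computed is not specified, so this sketch reports a binomial standard error and the sample count instead.

```python
import math
from collections import defaultdict


def build_profile(probe_results):
    """Aggregate (dimension, passed) pairs, one per probe call, into scores."""
    counts = defaultdict(lambda: [0, 0])  # dimension -> [passes, total]
    for dimension, passed in probe_results:
        counts[dimension][0] += int(passed)
        counts[dimension][1] += 1

    profile = {}
    for dimension, (passes, total) in counts.items():
        score = passes / total
        # Standard error of a proportion; shrinks as the probe count grows.
        stderr = math.sqrt(score * (1 - score) / total)
        profile[dimension] = {
            "score": round(score, 3),
            "stderr": round(stderr, 4),
            "n": total,
        }
    return profile
```

Because every dimension is a plain pass rate, a human can read the profile directly and trace any score back to the probes behind it.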

Success criteria:

Repeated profiling of the same model doesn't fluctuate significantly
Profile differences between different models are actually separable

If profile reproducibility fails, everything downstream is meaningless. This is the first Kill Line.
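One possible way to turn both success criteria into a check, assuming each profiling run is reduced to a flat dict of dimension scores: measure the largest gap between repeated runs of the same model, then ask whether two models differ by more than that noise. The helper names and the rule itself are illustrative assumptions, not a validated statistical test.

```python
from itertools import combinations


def max_drift(runs):
    """Largest per-dimension score gap between any two runs of one model."""
    return max(
        (abs(a[d] - b[d]) for a, b in combinations(runs, 2) for d in a.keys() & b.keys()),
        default=0.0,
    )


def mean_profile(runs):
    """Average each dimension that appears in every run."""
    dims = set.intersection(*(set(r) for r in runs))
    return {d: sum(r[d] for r in runs) / len(runs) for d in dims}


def separable(runs_x, runs_y):
    """True if some dimension differs between models by more than either model's own drift."""
    noise = max(max_drift(runs_x), max_drift(runs_y))
    mx, my = mean_profile(runs_x), mean_profile(runs_y)
    return any(abs(mx[d] - my[d]) > noise for d in mx.keys() & my.keys())
```

If `max_drift` for a single model is as large as the gaps between models, the profiles carry no usable signal and the Kill Line applies.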

Stage 3: Numerical Role Requirements

The role side needs the same axes to enable comparison with models.

Core principle:

Hard constraints from output_contract, tool_policy, behavior_policy stay declarative
Only preferences/risks/intensity get extracted as numerical scores

role_requirement:
  role_id: contract_extractor
  dimensions:
    structured_output_required: 0.98
    speculation_tolerance: 0.05
    evidence_requirement: 0.95
    verbosity_preference: 0.10

Risk: Converting role semantics to numbers too aggressively reduces explainability. Deciding what to quantify and what to keep declarative is the critical judgment at this stage.

Stage 4: Role-Model Fitness Matching

Once model profiles and role requirements share the same axes, fitness scores become calculable.

Fitness score example:

fit_score =
  + w1 × structured_output_match
  + w2 × evidence_match
  - w3 × speculation_penalty
  - w4 × refusal_penalty
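The formula leaves the individual terms undefined, so here is one hedged way to fill them in. In this sketch a "match" is 1 minus the shortfall between what the role requires and what the model delivers, and the speculation penalty is how far the model's tendency exceeds the role's tolerance. The term definitions, the equal default weights, and the `refusal_rate` dimension are assumptions made for illustration.

```python
def shortfall(required, actual):
    """How far the model falls below the requirement (0 if it meets it)."""
    return max(0.0, required - actual)


def fit_score(role, model, weights=(1.0, 1.0, 1.0, 1.0)):
    w1, w2, w3, w4 = weights
    structured_output_match = 1.0 - shortfall(
        role["structured_output_required"], model["structured_output_stability"]
    )
    evidence_match = 1.0 - shortfall(
        role["evidence_requirement"], model["evidence_citation_tendency"]
    )
    speculation_penalty = max(
        0.0, model["speculation_tendency"] - role["speculation_tolerance"]
    )
    refusal_penalty = model.get("refusal_rate", 0.0)  # hypothetical dimension
    return (
        w1 * structured_output_match
        + w2 * evidence_match
        - w3 * speculation_penalty
        - w4 * refusal_penalty
    )


def select_model(role, candidates):
    """Pick the model id with the highest fit score for this role."""
    return max(candidates, key=lambda model_id: fit_score(role, candidates[model_id]))
```

With the `contract_extractor` requirements and the example profile from Stage 2, this works out to 0.95 + 0.86 - 0.17 = 1.64 under equal weights. The weights are the part that would need tuning against real extraction results.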

Experiment groups:

Fixed single model
Manual human selection
Fitness score-based automatic selection

Success criteria:

Automatic selection outperforms fixed model on average scores
Effect is larger for roles that need model switching

If automatic selection doesn't beat baseline, it's just added complexity. This is the second Kill Line.

Stage 5: Profile-Based Lowering Adjustment

Question: If the selected model's profile also informs the lowering strategy, does quality improve further?

This stage connects profiles not just to model selection but also to decisions about how the selected model is run.

Examples:

Low structured output stability → Strengthen schema description + repeat null rules + enhance post-processing
High speculation tendency → Reinforce uncertainty instructions + emphasize evidence requirements
Verbose model → Strong verbosity constraints
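The three rules above could be expressed as a small lookup from profile to lowering adjustments. The thresholds, the adjustment labels, and the `verbosity` dimension are hypothetical; a real lowering step would translate each label into concrete prompt or post-processing changes.

```python
def lowering_adjustments(
    profile, low_stability=0.85, high_speculation=0.20, high_verbosity=0.60
):
    """Map a flat model profile to a list of lowering adjustment labels."""
    adjustments = []
    if profile.get("structured_output_stability", 1.0) < low_stability:
        adjustments += [
            "strengthen_schema_description",
            "repeat_null_rules",
            "enhance_post_processing",
        ]
    if profile.get("speculation_tendency", 0.0) > high_speculation:
        adjustments += [
            "reinforce_uncertainty_instructions",
            "emphasize_evidence_requirements",
        ]
    if profile.get("verbosity", 0.0) > high_verbosity:
        adjustments.append("strong_verbosity_constraints")
    return adjustments
```

A model with stable structure but a speculation tendency of 0.22 would, under these thresholds, receive only the two uncertainty and evidence adjustments.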

Success criteria:

Same model with profile-based lowering outperforms basic lowering
Improvement shows especially in schema_pass and evidence_rate

If matching works but reflecting profiles in lowering makes no measurable difference, using only matching is the right call.

Stage 6: Field-Level Routing

Question: Does assigning different models/strategies per field rather than per document improve performance?

This stage validates a structure that spends more only on high-risk fields.

Examples:

termination_clause: Conservative model
penalty_clause: Critic-friendly model
renewal_clause: Low-cost model
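A minimal sketch of such a routing table, assuming fields are grouped by model tier so that each tier is called once per document. The tier labels mirror the examples above; the fallback to the low-cost tier for unlisted fields is an assumption.

```python
from collections import defaultdict

FIELD_ROUTES = {
    "termination_clause": "conservative",
    "penalty_clause": "critic_friendly",
    "renewal_clause": "low_cost",
}
DEFAULT_TIER = "low_cost"  # assumption: unlisted fields go to the cheapest tier


def plan_calls(fields, routes=FIELD_ROUTES, default=DEFAULT_TIER):
    """Group the requested fields by the model tier that should extract them."""
    plan = defaultdict(list)
    for field in fields:
        plan[routes.get(field, default)].append(field)
    return dict(plan)
```

Grouping matters for the cost criterion: routing per field without batching would multiply the number of API calls per document.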

Success criteria:

Core field accuracy improves while limiting total cost increase
Risk field failure rate decreases compared to single-model approach

Stage 7: Adversarial Governance MVP

Question: Does using profiles to assign generator/critic/evidence checker improve adversarial verification loop performance?

Connects to the adversarial assurance loop from Part 4.

Minimum loop:

Generator selection (profile-based)
Critic selection (profile-based — conservatism/speculation suppression traits)
Critic issue generation
Generator revision
Structural/evidence validation
Terminate when score delta becomes small
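The loop above can be sketched with the model calls passed in as plain callables, which keeps the control flow testable without any API access. The function name, the `min_delta` and `max_iters` defaults, and the rule of returning the best-scoring draft seen so far are assumptions; `score` stands in for the structural/evidence validation step.

```python
def run_loop(generate, critique, revise, score, min_delta=0.01, max_iters=5):
    """Generator/critic loop that stops when the score stops moving."""
    draft = generate()
    best, best_score = draft, score(draft)
    previous = best_score
    for _ in range(max_iters):
        issues = critique(draft)
        if not issues:
            break  # the critic has nothing left to raise
        draft = revise(draft, issues)
        current = score(draft)
        if current > best_score:
            best, best_score = draft, current
        if abs(current - previous) < min_delta:
            break  # score delta became small: treat as converged
        previous = current
    return best, best_score
```

Returning the best draft rather than the last one is what makes the result "best-effort": a revision that makes things worse never replaces a better earlier draft.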

Success criteria:

Convergence rate improves over simple retry without critic
Best-effort result quality improves
Quality gain per iteration justifies the added cost

Unstructured Models: Valuable as Stress Test Targets

Separate from profiling, there's an interesting observation: models with weak structured output aren't useless — quite the opposite.

Models that don't reliably support JSON mode strongly stress-test:

Prompt-only structured output stability
JSON extraction/repair logic
Schema validation, evidence validation
Retry policies, fallback policies

These models may be weak as "production models," but they're excellent as stress test subjects measuring how well the harness defends.
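As an example of the extraction logic such a model exercises, here is a small sketch that pulls the first parseable JSON object out of free-form model output. It uses only the standard library's `json.JSONDecoder.raw_decode`; the function name `extract_json` is hypothetical, and a real harness would likely add repair steps (trailing commas, single quotes) on top.

```python
import json


def extract_json(text):
    """Return the first JSON object embedded in text, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)  # try the next candidate brace
    return None
```

Returning `None` instead of raising lets the caller decide between the retry policy and the fallback policy.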

Based on current observations, an experiment strategy:

Model Family	Use Case
Gemma 3	Unstructured output/post-processing/validation stress test
Gemma 4 26B	JSON capable but evidence quality wobbles — mid-tier bench
Gemma 4 31B	Upper baseline for structured output (reference)

Splitting the models into three tiers, a gold model (structured baseline), a stress model (format failure inducer), and a near-miss model (format passes but evidence wobbles), lets you measure harness defense capability layer by layer.

Idea: Extending to Vector Representations

Everything so far has been matching based on interpretable behavioral scores. Taking it one step further, you could imagine representing role and model semantics as vectors (embeddings) themselves.

For example:

Comparing semantic similarity between roles in embedding space
Compressing model behavioral patterns into latent vectors
Using LoRA or adapters to tune role-specific traits at the model level

However, this direction isn't immediately feasible in the current setup. It could be implemented with adapter layers on local models, but external API-based systems don't expose model internals. For now it remains an idea: a direction we could expand into later.

Practically, confirming whether API-based behavioral score profiling delivers enough value comes first.

Kill Lines

If any of the following emerge, the direction should be narrowed:

Profile reproducibility too low: If the same model's profile fluctuates heavily across repeated measurements, numerical scoring itself has little value
Automatic matching no better than the alternatives: If fitness-based selection beats neither the fixed-model baseline nor manual selection, it's just added complexity
Profile-based lowering produces no real improvement: If matching is valid but reflecting profiles in lowering makes no difference, narrow the scope to matching only
Governance loop cost too high: If latency/cost increase exceeds quality improvement, restrict to high-risk fields only

Kill Lines don't mean abandoning everything. They mean stopping at the stage whose value can't be demonstrated. If Stage 4 (matching) works but Stage 5 (lowering) doesn't, using only matching is the right call.

Reflection

The part I spent the most time on while writing this was the boundary of quantification. Measuring model behavioral traits as scores and matching them to roles is a reasonable next step, but pushing further into embeddings or vector spaces requires local models and adapters. For API-based experiments, interpretable behavioral scores are the realistic starting point.

The core comes down to this:

Declarative IR handles contracts and verification. Numerical profiles handle selection and deployment.

This is still an experiment plan, not validated results. But the direction itself feels realistic.

The next post will look back at the entire series as a closing piece.
