Measuring and Deploying Models — Profile-Based Selection and Matching
Using API-based behavioral profiling to select the right model for each role
Hello. This is the sixth post in the Role-IR series.
Parts 1 through 5 covered design philosophy, enterprise infrastructure, integrated strategy, adversarial verification, and the POC implementation plan. The Part 5 POC focused on proving whether declarative IR outperforms hardcoded prompts. This post asks the next question:
"How can we select and deploy models more effectively?"
In the current structure, model selection for each role is a manual human decision. That works when you have 2–3 candidates, but as the pool grows, you need a systematic way to judge "which model fits this role."
This post explores building that judgment by measuring model behavioral traits as numerical scores and comparing them against role requirements. This is still at the experiment planning stage — nothing here has been validated yet.
Core Distinction: Policies Stay Declarative, Selection Goes Numerical
First, an important clarification.
null_over_guess: true
forbid_extra_fields: true
require_evidence: true
Rules like these must stay declarative for mechanical enforcement. Expressing "no speculation" as a numerical value of 0.05 blurs the boundary.
On the other hand, numerical representation is useful for:
- Model behavioral traits (instruction adherence, speculation tendency, evidence citation frequency)
- Role requirement intensity (structured output importance, speculation tolerance)
- Fitness between role and model
Policies are enforced declaratively; selection is judged numerically — that's the premise of this post.
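To make "mechanical enforcement" concrete, here is a minimal sketch of a checker driven by the three flags above. The function name `enforce_policy`, the output shape (each field as a dict with `value` and `evidence`), and the reading of each flag are assumptions for illustration, not part of the Role-IR definition.

```python
# Hypothetical policy block, mirroring the three declarative flags above.
POLICY = {"null_over_guess": True, "forbid_extra_fields": True, "require_evidence": True}

def enforce_policy(output, schema_fields, policy):
    """Return a list of violations; an empty list means the output passes."""
    violations = []
    if policy.get("forbid_extra_fields"):
        # Any key outside the schema is rejected outright.
        violations += [f"extra field: {n}" for n in sorted(set(output) - schema_fields)]
    for name in sorted(schema_fields):
        field = output.get(name) or {}
        value, evidence = field.get("value"), field.get("evidence")
        if value is None:
            # Assumed reading of null_over_guess: a null is an acceptable answer.
            if not policy.get("null_over_guess"):
                violations.append(f"null value: {name}")
            continue
        if policy.get("require_evidence") and not evidence:
            # A value with no evidence is treated as a guess.
            violations.append(f"no evidence: {name}")
    return violations
```

Each flag is a yes/no switch with a yes/no outcome, so there is no threshold to argue about. That is the property a score like 0.05 would lose.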
Premise: API-Centric Experimentation
There's an important premise. Experiments in this post are mostly conducted through external API calls.
- No direct modification of model internal parameters
- Fine-tuning, LoRA, adapter training are not required at this stage
- We're validating black-box API profiling + numerical selection/control
This is not an experiment about "changing models." It's about "measuring models better at the API level and deploying them better."
Hypothesis Layers
The hypotheses for profile-based selection stack in layers.
Representation Hypothesis
Expressing role requirements and model behavioral traits as numerical scores enables more stable selection than declarative metadata alone.
Matching Hypothesis
Using fitness scores between role scores and model profiles yields better structured extraction performance than fixed model selection.
Lowering Hypothesis
Reflecting numerical selection information (model choice, reinforced instructions, per-field routing) in lowering reduces quality variance from the same Role IR.
Governance Hypothesis
Profile-based role assignment (generator/critic/evidence checker) makes the adversarial verification loop converge faster and produces better best-effort results.
Testing all four at once isn't practical, because a failure couldn't be traced to a specific layer. They must be built up one layer at a time.
Experiment Stages
Stage 1: Establish Baseline
Confirm that the Part 5 POC serves as a stable comparison baseline.
Metrics:
- Schema match rate, field coverage, evidence presence rate, per-model variance
Success criteria:
- Difference between baseline and IR approach is reproducible per model
- Variability on repeated calls to the same backend is measurable
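The baseline metrics above could be computed along these lines. This is a sketch under assumptions: the output shape (field name mapped to a dict with `value` and `evidence`), the function names, and the use of population standard deviation as the variability measure are all illustrative choices, not the POC's actual implementation.

```python
from statistics import mean, pstdev

def run_metrics(outputs, schema_fields):
    """Compute baseline metrics over one run (a list of per-document outputs)."""
    schema_ok, coverage, evidence = [], [], []
    for out in outputs:
        # Schema match: exactly the expected field names, no more, no fewer.
        schema_ok.append(1.0 if set(out) == schema_fields else 0.0)
        filled = [f for f in schema_fields if (out.get(f) or {}).get("value") is not None]
        coverage.append(len(filled) / len(schema_fields))
        with_evidence = [f for f in filled if out[f].get("evidence")]
        evidence.append(len(with_evidence) / len(filled) if filled else 0.0)
    return {
        "schema_match_rate": mean(schema_ok),
        "field_coverage": mean(coverage),
        "evidence_presence_rate": mean(evidence),
    }

def variability(runs):
    """Per-metric spread across repeated runs against the same backend."""
    return {key: pstdev([run[key] for run in runs]) for key in runs[0]}
```

Running `run_metrics` several times per backend and feeding the results to `variability` gives a number for the second success criterion, so "measurable" means something specific.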
Stage 2: API-Based Model Profiling
Question: Can model behavioral traits be measured reliably through API calls alone?
Run probe sets hundreds to thousands of times per model to collect behavioral scores.
Measurement axes:
- instruction_adherence
- structured_output_stability
- field_completeness
- speculation_tendency
- evidence_citation_tendency
- reproducibility
Example output:
model_profile:
  model_id: gpt-oss-120b
  profiling_version: "2026-04-06"
  total_probes: 1200
  dimensions:
    structured_output_stability: { score: 0.93, confidence: 0.95 }
    speculation_tendency: { score: 0.22, confidence: 0.91 }
    evidence_citation_tendency: { score: 0.81, confidence: 0.89 }
The key insight: these scores are interpretable behavioral scores, not embedding vectors. Each dimension must be human-readable.
Success criteria:
- Repeated profiling of the same model doesn't fluctuate significantly
- Profile differences between different models are actually separable
If profile reproducibility fails, everything downstream is meaningless. This is the first Kill Line.
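One way to make the reproducibility check concrete is sketched below. It assumes each probe yields a pass/fail observation per dimension, that a dimension's score is the pass fraction, and that two profiling runs count as reproducible when every shared dimension differs by at most a chosen tolerance. The names `score_dimension` and `is_reproducible` and the 0.05 tolerance are hypothetical. How the `confidence` value in the example profile would be derived is not defined here, so the sketch reports a standard error instead.

```python
from math import sqrt

def score_dimension(probe_results):
    """probe_results: list of bools, True when the probe showed the behavior."""
    n = len(probe_results)
    score = sum(probe_results) / n
    # Standard error of a proportion; shrinks as the probe count grows.
    stderr = sqrt(score * (1 - score) / n)
    return {"score": round(score, 3), "stderr": round(stderr, 4), "n": n}

def is_reproducible(run_a, run_b, tolerance=0.05):
    """True when every dimension's score agrees within the tolerance."""
    return all(abs(run_a[dim]["score"] - run_b[dim]["score"]) <= tolerance for dim in run_a)
```

If `is_reproducible` fails for repeated runs on the same model, the first Kill Line is hit before any matching logic is written.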
Stage 3: Numerical Role Requirements
The role side needs the same axes to enable comparison with models.
Core principle:
- Hard constraints from output_contract, tool_policy, and behavior_policy stay declarative
- Only preferences/risks/intensity get extracted as numerical scores
role_requirement:
  role_id: contract_extractor
  dimensions:
    structured_output_required: 0.98
    speculation_tolerance: 0.05
    evidence_requirement: 0.95
    verbosity_preference: 0.10
Risk: Over-aggressive numerical conversion of role semantics reduces explainability. Where to quantify and where to keep declarative is the critical judgment at this stage.
Stage 4: Role-Model Fitness Matching
Once model profiles and role requirements share the same axes, fitness scores become calculable.
Fitness score example:
fit_score =
+ w1 × structured_output_match
+ w2 × evidence_match
- w3 × speculation_penalty
- w4 × refusal_penalty
Experiment groups:
- Fixed single model
- Manual human selection
- Fitness score-based automatic selection
Success criteria:
- Automatic selection outperforms fixed model on average scores
- Effect is larger for roles that need model switching
If automatic selection doesn't beat baseline, it's just added complexity. This is the second Kill Line.
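The fitness formula above leaves the individual terms undefined, so the sketch below fills them in with assumptions: a "match" is 1 minus the shortfall between what the role requires and what the model delivers, the speculation penalty is the amount by which the model exceeds the role's tolerance, and the refusal penalty is a hypothetical `refusal_rate` dimension that does not appear in the example profile. Equal weights are also an assumption. The first profile reuses the example scores from Stage 2; the second is invented purely for comparison.

```python
ROLE = {"structured_output_required": 0.98, "speculation_tolerance": 0.05, "evidence_requirement": 0.95}

MODELS = {
    "gpt-oss-120b": {"structured_output_stability": 0.93, "speculation_tendency": 0.22,
                     "evidence_citation_tendency": 0.81, "refusal_rate": 0.0},
    # Hypothetical second candidate, made up for illustration only.
    "hypothetical_model_b": {"structured_output_stability": 0.99, "speculation_tendency": 0.04,
                             "evidence_citation_tendency": 0.90, "refusal_rate": 0.02},
}

def fit_score(role, model, weights=(1.0, 1.0, 1.0, 1.0)):
    w1, w2, w3, w4 = weights
    # Match terms: full credit unless the model falls short of the requirement.
    structured_output_match = 1.0 - max(0.0, role["structured_output_required"] - model["structured_output_stability"])
    evidence_match = 1.0 - max(0.0, role["evidence_requirement"] - model["evidence_citation_tendency"])
    # Penalty terms: only the excess over the role's tolerance counts.
    speculation_penalty = max(0.0, model["speculation_tendency"] - role["speculation_tolerance"])
    refusal_penalty = model.get("refusal_rate", 0.0)
    return (w1 * structured_output_match + w2 * evidence_match
            - w3 * speculation_penalty - w4 * refusal_penalty)

def select_model(role, models):
    """Pick the model id with the highest fitness for this role."""
    return max(models, key=lambda model_id: fit_score(role, models[model_id]))
```

With these definitions the example profile scores roughly 1.64 against the `contract_extractor` requirements, and the invented candidate scores higher. Whether such a ranking predicts real extraction quality is what this stage has to test.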
Stage 5: Profile-Based Lowering Adjustment
Question: If the selected model's profile also informs the lowering strategy, does quality improve further?
This stage connects profiles not just to model selection but also to decisions about how the selected model is run.
Examples:
- Low structured output stability → Strengthen schema description + repeat null rules + enhance post-processing
- High speculation tendency → Reinforce uncertainty instructions + emphasize evidence requirements
- Verbose model → Strong verbosity constraints
Success criteria:
- Same model with profile-based lowering outperforms basic lowering
- Particularly improved schema_pass and evidence_rate
If matching works but feeding the profile into lowering makes no measurable difference, using only matching is the right call.
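The example rules in this stage can be sketched as a function from a profile to a list of lowering adjustments. The thresholds, the adjustment labels, and the `verbosity` dimension are assumptions made for illustration. The lowering step would still need to translate each label into actual prompt or post-processing changes.

```python
def lowering_adjustments(profile, low_stability=0.85, high_speculation=0.20, high_verbosity=0.60):
    """Map a flat profile (dimension -> score) to lowering adjustments.

    All three thresholds are placeholder values, not measured cut-offs.
    """
    adjustments = []
    if profile.get("structured_output_stability", 1.0) < low_stability:
        adjustments += ["strengthen_schema_description", "repeat_null_rules", "enhance_post_processing"]
    if profile.get("speculation_tendency", 0.0) > high_speculation:
        adjustments += ["reinforce_uncertainty_instructions", "emphasize_evidence_requirements"]
    if profile.get("verbosity", 0.0) > high_verbosity:
        adjustments.append("strong_verbosity_constraints")
    return adjustments
```

For the example profile from Stage 2 (stability 0.93, speculation 0.22), only the two speculation-related adjustments would fire under these thresholds.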
Stage 6: Field-Level Routing
Question: Does assigning different models/strategies per field rather than per document improve performance?
This stage validates a structure that spends more only on high-risk fields.
Examples:
- termination_clause: Conservative model
- penalty_clause: Critic-friendly model
- renewal_clause: Low-cost model
Success criteria:
- Core field accuracy improves while limiting total cost increase
- Risk field failure rate decreases compared to single-model approach
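A routing table for the example fields might look like the sketch below. The tier names follow the examples above, while the model ids and relative cost units are placeholders invented for illustration, not measurements of any real model.

```python
# Field -> tier, taken from the examples in this stage.
FIELD_ROUTES = {
    "termination_clause": "conservative",
    "penalty_clause": "critic_friendly",
    "renewal_clause": "low_cost",
}

# Tier -> (model id, relative cost per call). Both values are hypothetical.
MODEL_POOL = {
    "conservative": ("model-conservative", 3.0),
    "critic_friendly": ("model-critic-friendly", 2.0),
    "low_cost": ("model-low-cost", 1.0),
}

def route_fields(fields, routes=FIELD_ROUTES, pool=MODEL_POOL, default_tier="low_cost"):
    """Return (field -> model id, total relative cost) for one document."""
    plan, total_cost = {}, 0.0
    for field in fields:
        # Fields without an explicit route fall back to the cheapest tier.
        model_id, cost = pool[routes.get(field, default_tier)]
        plan[field] = model_id
        total_cost += cost
    return plan, total_cost
```

Comparing `total_cost` against the cost of sending every field to the conservative model gives the "limited cost increase" side of the success criteria. Accuracy per field has to come from the evaluation set.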
Stage 7: Adversarial Governance MVP
Question: Does using profiles to assign generator/critic/evidence checker improve adversarial verification loop performance?
Connects to the adversarial assurance loop from Part 4.
Minimum loop:
- Generator selection (profile-based)
- Critic selection (profile-based — conservatism/speculation suppression traits)
- Critic issue generation
- Generator revision
- Structural/evidence validation
- Terminate when score delta becomes small
Success criteria:
- Convergence rate improves over simple retry without critic
- Best-effort result quality improves
- Cost stays acceptable relative to the number of iterations needed
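The minimum loop above can be sketched as follows, with the model calls passed in as plain callables so the control flow can be tested without any API. The function name `run_loop`, the callable signatures, and the default `max_iters` and `min_delta` values are assumptions. Profile-based selection of the generator and critic is taken to have happened before the loop starts, and `score` stands in for the structural/evidence validation step.

```python
def run_loop(generate, critique, revise, score, max_iters=5, min_delta=0.01):
    """Generator/critic loop that keeps the best-scoring draft seen so far.

    generate() -> draft
    critique(draft) -> list of issues (empty when the critic has no objections)
    revise(draft, issues) -> new draft
    score(draft) -> float, higher is better
    """
    draft = generate()
    best, best_score = draft, score(draft)
    history = [best_score]
    for _ in range(max_iters):
        issues = critique(draft)
        if not issues:
            break  # Critic is satisfied.
        draft = revise(draft, issues)
        current = score(draft)
        delta = current - history[-1]
        history.append(current)
        if current > best_score:
            best, best_score = draft, current
        if abs(delta) < min_delta:
            break  # Score has stopped moving; return the best effort.
    return best, history
```

Returning the best draft rather than the last one is what makes the result "best-effort": a revision that makes things worse never replaces an earlier, better draft. The length of `history` is the iteration count that the cost criterion would be measured against.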
Unstructured Models: Valuable as Stress Test Targets
Separate from profiling, there's an interesting observation: models with weak structured output aren't useless — quite the opposite.
Models that don't reliably support JSON mode strongly stress-test:
- Prompt-only structured output stability
- JSON extraction/repair logic
- Schema validation, evidence validation
- Retry policies, fallback policies
These models may be weak as "production models," but they're excellent as stress test subjects measuring how well the harness defends.
Based on current observations, an experiment strategy:
| Model Family | Use Case |
| --- | --- |
| Gemma 3 | Unstructured output/post-processing/validation stress test |
| Gemma 4 26B | JSON capable but evidence quality wobbles — mid-tier bench |
| Gemma 4 31B | Upper baseline for structured output (reference) |
Gold model (structured baseline), Stress model (format failure inducer), Near-miss model (format passes but evidence wobbles) — this three-tier split lets you measure harness defense capability layer by layer.
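The three-tier split can be expressed as a simple classification over two measured rates. The function name `classify_tier` and the 0.9 floors are placeholder assumptions. Real cut-offs would have to come from the baseline measurements.

```python
def classify_tier(schema_pass_rate, evidence_rate, schema_floor=0.9, evidence_floor=0.9):
    """Assign a bench role to a model from two measured rates (thresholds assumed)."""
    if schema_pass_rate < schema_floor:
        return "stress"      # Format fails often: exercises repair and retry paths.
    if evidence_rate < evidence_floor:
        return "near_miss"   # Format passes but evidence wobbles.
    return "gold"            # Structured baseline.
```

Checking the format first reflects the layering in the text: a model that cannot hold the schema tests the harness's format defenses before evidence quality even comes into play.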
Idea: Extending to Vector Representations
Everything so far has been matching based on interpretable behavioral scores. Taking it one step further, you could imagine representing role and model semantics as vectors (embeddings) themselves.
For example:
- Comparing semantic similarity between roles in embedding space
- Compressing model behavioral patterns into latent vectors
- Using LoRA or adapters to tune role-specific traits at the model level
However, this direction isn't immediately feasible in the current setup. It could be implemented with adapter layers on local models, but external API-based systems don't expose model internals. For now it remains an idea: a direction we could expand into later.
Practically, confirming whether API-based behavioral score profiling delivers enough value comes first.
Kill Lines
If any of the following emerge, the direction should be narrowed:
- Profile reproducibility too low: If the same model's profile fluctuates heavily across repeated measurements, numerical scoring itself has little value
- Automatic matching no better than the alternatives: If fitness-based selection doesn't beat the fixed-model baseline or manual selection, it's just added complexity
- Profile-based lowering produces no real improvement: If matching is valid but feeding the profile into lowering shows no difference, narrow the scope to matching
- Governance loop cost too high: If latency/cost increase exceeds quality improvement, restrict to high-risk fields only
Kill Lines don't mean abandoning everything. They mark where to stop when a stage can't be proven. If Stage 4 (matching) works but Stage 5 (lowering) doesn't, using only matching is the right call.
Reflection
The part I spent the most time on while writing this was the boundary of quantification. Measuring model behavioral traits as scores and matching them to roles is a reasonable next step, but pushing further into embeddings or vector spaces requires local models and adapters. For API-based experiments, interpretable behavioral scores are the realistic starting point.
The core comes down to this:
Declarative IR handles contracts and verification. Numerical profiles handle selection and deployment.
This is still an experiment plan, not validated results. But the direction itself feels realistic.
The next post will look back at the entire series as a closing piece.

