Measuring and Deploying Models — Profile-Based Selection and Matching

Using API-based behavioral profiling to select the right model for each role

I build data and AI systems that have to survive real constraints: time, cost, memory, and messy integration boundaries.


Hello. This is the sixth post in the Role-IR series.

Parts 1 through 5 covered design philosophy, enterprise infrastructure, integrated strategy, adversarial verification, and the POC implementation plan. The Part 5 POC focused on proving whether declarative IR outperforms hardcoded prompts. This post asks the next question:

"How can we select and deploy models more effectively?"

In the current structure, model selection for each role is a manual human decision. That works when you have 2–3 candidates, but as the pool grows, you need a systematic way to judge "which model fits this role."

This post explores building that judgment by measuring model behavioral traits as numerical scores and comparing them against role requirements. This is still at the experiment planning stage — nothing here has been validated yet.


Core Distinction: Policies Stay Declarative, Selection Goes Numerical

First, an important clarification.

null_over_guess: true
forbid_extra_fields: true
require_evidence: true

Rules like these must stay declarative for mechanical enforcement. Expressing "no speculation" as a numerical value of 0.05 blurs the boundary.
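To make "mechanical enforcement" concrete, here is a minimal sketch of a checker for the three flags above. The output shape (each field as a `value`/`evidence` pair), the `check_policy` name, and my reading of `null_over_guess` as "every expected field must be present, with an explicit null when unknown" are assumptions for illustration, not something the series has fixed.

```python
from typing import Any

# Hypothetical policy flags mirroring the three declarative rules above.
POLICY = {"null_over_guess": True, "forbid_extra_fields": True, "require_evidence": True}


def check_policy(
    output: dict[str, dict[str, Any]],
    allowed_fields: set[str],
    policy: dict[str, bool] = POLICY,
) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    violations: list[str] = []

    # forbid_extra_fields: anything outside the contract is rejected outright.
    if policy.get("forbid_extra_fields"):
        violations += [f"extra field: {k}" for k in sorted(set(output) - allowed_fields)]

    for name in sorted(allowed_fields):
        if name not in output:
            # null_over_guess (assumed reading): unknown fields must be explicit nulls.
            if policy.get("null_over_guess"):
                violations.append(f"missing field (expected explicit null): {name}")
            continue
        entry = output[name]
        # require_evidence: a non-null value must carry a non-empty evidence string.
        if entry["value"] is not None and policy.get("require_evidence") and not entry.get("evidence"):
            violations.append(f"value without evidence: {name}")

    return violations
```

Each rule either passes or fails. There is no threshold to tune, which is the property a 0.05-style score would lose.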

On the other hand, numerical representation is useful for:

  • Model behavioral traits (instruction adherence, speculation tendency, evidence citation frequency)
  • Role requirement intensity (structured output importance, speculation tolerance)
  • Fitness between role and model

Policies are enforced declaratively; selection is judged numerically — that's the premise of this post.


Premise: API-Centric Experimentation

There's an important premise. Experiments in this post are mostly conducted through external API calls.

  • No direct modification of model internal parameters
  • Fine-tuning, LoRA, adapter training are not required at this stage
  • We're validating black-box API profiling + numerical selection/control

This is not an experiment about "changing models." It's about "measuring models better at the API level and deploying them better."


Hypothesis Layers

The hypotheses for profile-based selection stack in layers.

Representation Hypothesis

Expressing role requirements and model behavioral traits as numerical scores enables more stable selection than declarative metadata alone.

Matching Hypothesis

Using fitness scores between role scores and model profiles yields better structured extraction performance than fixed model selection.

Lowering Hypothesis

Reflecting numerical selection information (model choice, reinforced instructions, per-field routing) in lowering reduces quality variance from the same Role IR.

Governance Hypothesis

Profile-based role assignment (generator/critic/evidence checker) makes the adversarial verification loop converge faster and produces better best-effort results.

Testing all four at once is impossible. They must be built up one layer at a time.


Experiment Stages

Stage 1: Establish Baseline

Confirm that the Part 5 POC serves as a stable comparison baseline.

Metrics:

  • Schema match rate, field coverage, evidence presence rate, per-model variance

Success criteria:

  • Difference between baseline and IR approach is reproducible per model
  • Variability on repeated calls to the same backend is measurable
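As a rough illustration of how these metrics could be computed over repeated calls to one backend, here is a sketch. The run record shape (`schema_ok` plus a `fields` mapping of `value`/`evidence` pairs) and the `summarize_runs` name are hypothetical. The Part 5 POC may log results differently.

```python
from statistics import mean, pvariance


def summarize_runs(runs: list[dict], expected_fields: list[str]) -> dict[str, float]:
    """Aggregate repeated runs of one model into the Stage 1 baseline metrics."""
    coverage: list[float] = []
    evidence: list[float] = []
    for run in runs:
        # A field counts as covered when it is present with a non-null value.
        filled = [f for f in expected_fields if run["fields"].get(f, {}).get("value") is not None]
        coverage.append(len(filled) / len(expected_fields))
        # Evidence rate is measured only over the fields that were actually filled.
        with_evidence = [f for f in filled if run["fields"][f].get("evidence")]
        evidence.append(len(with_evidence) / len(filled) if filled else 0.0)
    return {
        "schema_match_rate": mean([1.0 if r["schema_ok"] else 0.0 for r in runs]),
        "field_coverage": mean(coverage),
        "evidence_presence_rate": mean(evidence),
        # Variance across repeated calls: the "is variability measurable" criterion.
        "coverage_variance": pvariance(coverage),
    }
```

Running this once per model gives the per-model variance that later stages compare against.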

Stage 2: API-Based Model Profiling

Question: Can model behavioral traits be measured reliably through API calls alone?

Run probe sets hundreds to thousands of times per model to collect behavioral scores.

Measurement axes:

  • instruction_adherence
  • structured_output_stability
  • field_completeness
  • speculation_tendency
  • evidence_citation_tendency
  • reproducibility

Example output:

model_profile:
  model_id: gpt-oss-120b
  profiling_version: "2026-04-06"
  total_probes: 1200
  dimensions:
    structured_output_stability: { score: 0.93, confidence: 0.95 }
    speculation_tendency: { score: 0.22, confidence: 0.91 }
    evidence_citation_tendency: { score: 0.81, confidence: 0.89 }

The key insight: these scores are interpretable behavioral scores, not embedding vectors. Each dimension must be human-readable.
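One way to turn raw probe outcomes into a readable dimension score is a pass fraction plus an uncertainty band. The sketch below uses a Wilson score interval. That is my choice for illustration: the example profile does not say how its `confidence` value is derived, so the `interval` field here is a stand-in, not that value.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def dimension_score(successes: int, trials: int) -> dict:
    """Summarize one behavioral dimension from pass/fail probe outcomes."""
    low, high = wilson_interval(successes, trials)
    return {
        "score": round(successes / trials, 3),
        "interval": (round(low, 3), round(high, 3)),
        "probes": trials,
    }
```

With 1,200 probes the band is a few points wide. That gives a concrete sense of how many probes a dimension needs before two models can be told apart.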

Success criteria:

  • Repeated profiling of the same model doesn't fluctuate significantly
  • Profile differences between different models are actually separable

If profile reproducibility fails, everything downstream is meaningless. This is the first Kill Line.
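A reproducibility check can be as simple as profiling the same model twice and looking at the largest per-dimension gap. The sketch below assumes profiles flattened to `dimension -> score`. The 0.05 tolerance is an arbitrary placeholder, not a validated threshold.

```python
def profile_drift(run_a: dict[str, float], run_b: dict[str, float]) -> dict[str, float]:
    """Absolute score difference per dimension measured in both profiling runs."""
    shared = run_a.keys() & run_b.keys()
    return {dim: abs(run_a[dim] - run_b[dim]) for dim in sorted(shared)}


def is_reproducible(run_a: dict[str, float], run_b: dict[str, float], tolerance: float = 0.05) -> bool:
    """True when every shared dimension moved by at most `tolerance` (placeholder value)."""
    drift = profile_drift(run_a, run_b)
    # No shared dimensions means there is nothing to compare, so treat it as a failure.
    return bool(drift) and max(drift.values()) <= tolerance
```

The same drift function applied across two different models gives a first look at the separability criterion: between-model gaps should be clearly larger than within-model drift.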


Stage 3: Numerical Role Requirements

The role side needs the same axes to enable comparison with models.

Core principle:

  • Hard constraints from output_contract, tool_policy, behavior_policy stay declarative
  • Only preferences/risks/intensity get extracted as numerical scores

Example:

role_requirement:
  role_id: contract_extractor
  dimensions:
    structured_output_required: 0.98
    speculation_tolerance: 0.05
    evidence_requirement: 0.95
    verbosity_preference: 0.10

Risk: Over-aggressive numerical conversion of role semantics reduces explainability. Where to quantify and where to keep declarative is the critical judgment at this stage.


Stage 4: Role-Model Fitness Matching

Once model profiles and role requirements share the same axes, fitness scores become calculable.

Fitness score example:

fit_score =
  + w1 × structured_output_match
  + w2 × evidence_match
  - w3 × speculation_penalty
  - w4 × refusal_penalty
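The formula leaves the individual terms undefined, so here is one possible reading as a sketch. The definitions are my assumptions: a "match" is 1 minus the shortfall against the requirement, the speculation penalty is the amount by which the model exceeds the role's tolerance, and `refusal_rate` is a hypothetical profile dimension. The weights are placeholders.

```python
def shortfall_match(required: float, observed: float) -> float:
    """1.0 when the model meets the requirement, lower by the size of the shortfall."""
    return 1.0 - max(0.0, required - observed)


def fit_score(role: dict[str, float], model: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted fitness between a role requirement and a model profile."""
    structured = shortfall_match(role["structured_output_required"], model["structured_output_stability"])
    evidence = shortfall_match(role["evidence_requirement"], model["evidence_citation_tendency"])
    # Penalize only the speculation that exceeds what the role tolerates.
    speculation_penalty = max(0.0, model["speculation_tendency"] - role["speculation_tolerance"])
    refusal_penalty = model.get("refusal_rate", 0.0)  # hypothetical dimension
    return (
        weights["w1"] * structured
        + weights["w2"] * evidence
        - weights["w3"] * speculation_penalty
        - weights["w4"] * refusal_penalty
    )


def select_model(role: dict[str, float], profiles: dict[str, dict[str, float]], weights: dict[str, float]) -> str:
    """Return the id of the profile with the highest fitness for this role."""
    return max(profiles, key=lambda model_id: fit_score(role, profiles[model_id], weights))
```

Because each term stays a named, bounded quantity, a selection can be explained afterwards ("lost 0.17 on speculation") rather than reduced to a single opaque number.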

Experiment groups:

  1. Fixed single model
  2. Manual human selection
  3. Fitness score-based automatic selection

Success criteria:

  • Automatic selection outperforms fixed model on average scores
  • Effect is larger for roles that need model switching

If automatic selection doesn't beat baseline, it's just added complexity. This is the second Kill Line.


Stage 5: Profile-Based Lowering Adjustment

Question: If the selected model's profile also informs the lowering strategy, does quality improve further?

This stage connects profiles not just to model selection but also to decisions about how the selected model is executed.

Examples:

  • Low structured output stability → Strengthen schema description + repeat null rules + enhance post-processing
  • High speculation tendency → Reinforce uncertainty instructions + emphasize evidence requirements
  • Verbose model → Strong verbosity constraints
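The mapping above could be sketched as a small rule table that turns a profile into a list of lowering adjustments. The thresholds, the `verbosity` dimension name, and the adjustment labels are all hypothetical. A real lowering step would translate each label into prompt or post-processing changes.

```python
def lowering_adjustments(
    profile: dict[str, float],
    stability_floor: float = 0.85,      # placeholder threshold
    speculation_ceiling: float = 0.20,  # placeholder threshold
    verbosity_ceiling: float = 0.50,    # placeholder threshold
) -> list[str]:
    """Pick lowering reinforcements from a model profile; missing dimensions add nothing."""
    adjustments: list[str] = []
    if profile.get("structured_output_stability", 1.0) < stability_floor:
        adjustments += ["strengthen_schema_description", "repeat_null_rules", "enhance_post_processing"]
    if profile.get("speculation_tendency", 0.0) > speculation_ceiling:
        adjustments += ["reinforce_uncertainty_instructions", "emphasize_evidence_requirements"]
    if profile.get("verbosity", 0.0) > verbosity_ceiling:
        adjustments.append("strong_verbosity_constraints")
    return adjustments
```

Keeping the output as labels rather than prompt text makes the experiment easier to read: each run can record exactly which adjustments were active.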

Success criteria:

  • Same model with profile-based lowering outperforms basic lowering
  • Particularly improved schema_pass and evidence_rate

If matching works but applying the profile to lowering shows no difference, using only matching is the right call.


Stage 6: Field-Level Routing

Question: Does assigning different models/strategies per field rather than per document improve performance?

This stage validates a structure that spends more only on high-risk fields.

Examples:

  • termination_clause: Conservative model
  • penalty_clause: Critic-friendly model
  • renewal_clause: Low-cost model
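A routing table for the examples above might look like the following sketch. The tier names and the default tier are assumptions. Grouping fields by tier means each tier needs only one extraction call per document, which is where the cost control would come from.

```python
# Hypothetical field -> tier routing, taken from the three examples above.
ROUTES = {
    "termination_clause": "conservative",
    "penalty_clause": "critic_friendly",
    "renewal_clause": "low_cost",
}


def route_fields(
    fields: list[str],
    routes: dict[str, str] = ROUTES,
    default_tier: str = "low_cost",
) -> dict[str, list[str]]:
    """Group requested fields by model tier; unrouted fields fall back to the default tier."""
    plan: dict[str, list[str]] = {}
    for field in fields:
        plan.setdefault(routes.get(field, default_tier), []).append(field)
    return plan
```

Which concrete model backs each tier would come from the Stage 4 fitness scores, so the routing table itself never names a model.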

Success criteria:

  • Core field accuracy improves while limiting total cost increase
  • Risk field failure rate decreases compared to single-model approach

Stage 7: Adversarial Governance MVP

Question: Does using profiles to assign generator/critic/evidence checker improve adversarial verification loop performance?

Connects to the adversarial assurance loop from Part 4.

Minimum loop:

  1. Generator selection (profile-based)
  2. Critic selection (profile-based — conservatism/speculation suppression traits)
  3. Critic issue generation
  4. Generator revision
  5. Structural/evidence validation
  6. Terminate when score delta becomes small
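Steps 3 through 6 can be sketched as a loop over injected callables. Everything here is a stand-in: `generate`, `critique`, `revise`, and `score` represent the profile-selected generator, the critic, and the structural/evidence validator, and `min_delta` and `max_iters` are placeholder values.

```python
from typing import Any, Callable


def run_loop(
    generate: Callable[[], Any],
    critique: Callable[[Any], list[str]],
    revise: Callable[[Any, list[str]], Any],
    score: Callable[[Any], float],
    min_delta: float = 0.01,  # placeholder convergence threshold
    max_iters: int = 5,       # placeholder iteration budget
) -> tuple[Any, list[float]]:
    """Generate, critique, and revise until the score stops moving; return the best draft."""
    draft = generate()
    previous = score(draft)
    best, best_score = draft, previous
    history = [previous]
    for _ in range(max_iters):
        issues = critique(draft)
        if not issues:  # critic has nothing left to raise
            break
        draft = revise(draft, issues)
        current = score(draft)
        history.append(current)
        if current > best_score:  # keep the best-effort result, not just the latest
            best, best_score = draft, current
        if abs(current - previous) < min_delta:  # score delta became small
            break
        previous = current
    return best, history
```

Returning the score history alongside the best draft gives the convergence rate and per-iteration cost measurements directly.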

Success criteria:

  • Convergence rate improves over simple retry without critic
  • Best-effort result quality improves
  • Cost stays efficient relative to the number of iterations

Unstructured Models: Valuable as Stress Test Targets

Separate from profiling, there's an interesting observation: models with weak structured output aren't useless — quite the opposite.

Models that don't reliably support JSON mode strongly stress-test:

  • Prompt-only structured output stability
  • JSON extraction/repair logic
  • Schema validation, evidence validation
  • Retry policies, fallback policies

These models may be weak as "production models," but they're excellent as stress test subjects measuring how well the harness defends.
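As an example of the extraction logic such models exercise, here is a minimal sketch that pulls the first decodable JSON object out of free-form model output using the standard library's `json.JSONDecoder.raw_decode`. The `extract_json_object` name is mine. A real harness would add repair steps and schema validation on top.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object that decodes cleanly from `text`, or None."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # this brace did not open valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None
```

A `None` result is what should trigger the retry or fallback policy, so the stress model ends up exercising both layers.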

Based on current observations, an experiment strategy:

Model Family | Use Case
Gemma 3      | Unstructured output/post-processing/validation stress test
Gemma 4 26B  | JSON capable but evidence quality wobbles; mid-tier bench
Gemma 4 31B  | Upper baseline for structured output (reference)

Gold model (structured baseline), Stress model (format failure inducer), Near-miss model (format passes but evidence wobbles) — this three-tier split lets you measure harness defense capability layer by layer.


Idea: Extending to Vector Representations

Everything so far has been matching based on interpretable behavioral scores. Taking it one step further, you could imagine representing role and model semantics as vectors (embeddings) themselves.

For example:

  • Comparing semantic similarity between roles in embedding space
  • Compressing model behavioral patterns into latent vectors
  • Using LoRA or adapters to tune role-specific traits at the model level

However, this direction isn't immediately feasible in the current setup. It's implementable with adapter layers on local models, but external API-based systems don't expose model internals. At this point, it's an idea — "this is a direction we could expand into."

Practically, confirming whether API-based behavioral score profiling delivers enough value comes first.


Kill Lines

If any of the following emerge, the direction should be narrowed:

  1. Profile reproducibility too low: If the same model's profile fluctuates heavily across repeated measurements, numerical scoring itself has little value
  2. Automatic matching no better than manual selection: If fitness-based selection doesn't beat baseline, it's just added complexity
  3. Profile-based lowering produces no real improvement: If matching is valid but applying the profile to lowering makes no measurable difference, narrow the scope
  4. Governance loop cost too high: If latency/cost increase exceeds quality improvement, restrict to high-risk fields only

Kill Lines don't mean abandoning everything. They mean stopping at the stage that can't be proven and keeping what the earlier stages established. If Stage 4 (matching) works but Stage 5 (lowering) doesn't, using only matching is the right call.


Reflection

The part I spent the most time on while writing this was the boundary of quantification. Measuring model behavioral traits as scores and matching them to roles is a reasonable next step, but pushing further into embeddings or vector spaces requires local models and adapters. For API-based experiments, interpretable behavioral scores are the realistic starting point.

The core comes down to this:

Declarative IR handles contracts and verification. Numerical profiles handle selection and deployment.

This is still an experiment plan, not validated results. But the direction itself feels realistic.

The next post will look back at the entire series as a closing piece.