Scoring, tie-breaks, and decisions
How Router scores eligible endpoints, handles missing evidence, breaks near-ties, and turns ranking into a stable decision artifact.
Once Router has a clean eligible set, it compares those endpoints across multiple metrics.
What gets scored
The current baseline compares eligible endpoints across:
- quality
- latency
- throughput
- cost
- reliability
- preference
Measured evidence is preferred when it exists. Declared and catalog-derived data help fill the rest of the picture, and neutral defaults prevent unknown metrics from becoming accidental hard penalties.
What strategy actually changes
The strategy does not change which metrics exist. It changes how much each metric matters in the total score.
- quality: leans hardest on benchmark-backed quality evidence
- latency: leans hardest on effective latency and throughput
- cost: leans hardest on observed or catalog cost, especially when budget context exists
- balanced: keeps the broadest mix across quality, latency, cost, and reliability
For the explicit mode-by-mode guide, read /router/strategy-modes-and-tradeoffs.
Baseline weight sets
The current reference router uses these baseline weights:
| Strategy | quality | latency | throughput | cost | reliability | preference |
|---|---|---|---|---|---|---|
| balanced | 0.30 | 0.20 | 0.10 | 0.20 | 0.15 | 0.05 |
| quality | 0.50 | 0.10 | 0.05 | 0.10 | 0.20 | 0.05 |
| latency | 0.15 | 0.45 | 0.15 | 0.05 | 0.15 | 0.05 |
| cost | 0.15 | 0.10 | 0.05 | 0.50 | 0.15 | 0.05 |
These exact values are current baseline behavior, not a timeless protocol guarantee.
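To make the table concrete, here is a minimal sketch of a weighted total, assuming each metric has already been normalized to the 0..1 range. The names `BASELINE_WEIGHTS` and `total_score` and the example scores are illustrative, not part of the reference router's API. Each row of the table sums to 1.0, so the weighted total also stays within 0..1.

```python
# Baseline weights copied from the table above (current baseline, not a guarantee).
BASELINE_WEIGHTS = {
    "balanced": {"quality": 0.30, "latency": 0.20, "throughput": 0.10,
                 "cost": 0.20, "reliability": 0.15, "preference": 0.05},
    "quality":  {"quality": 0.50, "latency": 0.10, "throughput": 0.05,
                 "cost": 0.10, "reliability": 0.20, "preference": 0.05},
    "latency":  {"quality": 0.15, "latency": 0.45, "throughput": 0.15,
                 "cost": 0.05, "reliability": 0.15, "preference": 0.05},
    "cost":     {"quality": 0.15, "latency": 0.10, "throughput": 0.05,
                 "cost": 0.50, "reliability": 0.15, "preference": 0.05},
}


def total_score(metric_scores, strategy="balanced"):
    """Weighted sum of normalized metric scores (hypothetical helper)."""
    weights = BASELINE_WEIGHTS[strategy]
    return sum(weights[name] * metric_scores[name] for name in weights)


# Made-up scores for one endpoint: strong on quality, weak on cost.
example = {"quality": 0.9, "latency": 0.4, "throughput": 0.5,
           "cost": 0.3, "reliability": 0.8, "preference": 0.5}

print(round(total_score(example, "balanced"), 3))  # 0.605
print(round(total_score(example, "quality"), 3))   # 0.73
```

The same endpoint scores higher under the quality strategy than under balanced because only the weights change, not the metric values.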
How evidence sources map into the score
Operators should read the baseline evidence story like this:
- benchmark results primarily improve quality evidence
- observed latency samples primarily improve latency scoring
- observed throughput samples primarily improve throughput scoring
- catalog model economics and observed cost estimates primarily improve cost scoring
- failure behavior primarily improves reliability scoring
This is why a benchmark page and a routing decision page are both needed. The benchmark explains why an endpoint may be strong on quality, while the routing decision explains whether that quality advantage actually won against latency, cost, reliability, and policy.
Important metric details
Quality
- uses judge_score when present
- otherwise uses quality_score
- otherwise falls back to 0.5 and marks the metric unknown
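The fallback order above can be sketched as a small resolver. This is an illustration only: the function name, the dictionary-shaped evidence record, and the choice to treat a `None` value as "not present" are assumptions, not the reference router's actual interface.

```python
from typing import Optional, Tuple

NEUTRAL_QUALITY = 0.5  # neutral default used when no quality evidence exists


def resolve_quality(evidence: dict) -> Tuple[float, bool]:
    """Return (score, known) following the judge_score -> quality_score -> 0.5 order.

    Hypothetical helper: a missing key or a None value counts as "not present".
    """
    for key in ("judge_score", "quality_score"):
        value: Optional[float] = evidence.get(key)
        if value is not None:
            return float(value), True
    # No evidence at all: neutral score, and the metric is flagged as unknown.
    return NEUTRAL_QUALITY, False


print(resolve_quality({"judge_score": 0.8, "quality_score": 0.6}))  # (0.8, True)
print(resolve_quality({"quality_score": 0.6}))                      # (0.6, True)
print(resolve_quality({}))                                          # (0.5, False)
```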
Latency
The baseline derives an effective latency from p50 and p95, then normalizes that value against target and max latency defaults.
Throughput
tokens_per_sec is normalized logarithmically against a target throughput.
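The exact formula is not spelled out here, so the following is only one plausible shape for a logarithmic normalization. The `log1p` ratio, the clamp to 0..1, and the default target of 50 tokens per second are all assumptions made for illustration.

```python
import math


def throughput_score(tokens_per_sec: float, target_tps: float = 50.0) -> float:
    """Map throughput to 0..1 on a log scale (assumed formula, not the baseline's).

    Reaching the target yields 1.0; gains beyond the target are not rewarded.
    """
    if target_tps <= 0:
        raise ValueError("target_tps must be positive")
    raw = math.log1p(max(tokens_per_sec, 0.0)) / math.log1p(target_tps)
    return max(0.0, min(1.0, raw))


for tps in (0.0, 10.0, 25.0, 50.0, 500.0):
    print(tps, round(throughput_score(tps), 3))
```

A log scale gives diminishing returns: under this sketch, going from 10 to 25 tokens per second moves the score more than going from 25 to 40 does.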
Cost
Cost can be driven by catalog-derived cost estimates or observed cost estimates, and it becomes more important when budget or cost posture is part of the request.
Reliability
Reliability uses 1 - failure_rate when present, otherwise a mildly optimistic default of 0.7.
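As a short worked example, assuming failure_rate is a fraction between 0 and 1 and that an absent value is represented as `None` (the function name is illustrative):

```python
from typing import Optional

DEFAULT_RELIABILITY = 0.7  # mildly optimistic default when no failure data exists


def reliability_score(failure_rate: Optional[float]) -> float:
    """Return 1 - failure_rate, or the default when no failure_rate is available."""
    if failure_rate is None:
        return DEFAULT_RELIABILITY
    return 1.0 - failure_rate


print(round(reliability_score(0.02), 2))  # 0.98
print(reliability_score(None))            # 0.7
```

Under this reading, an endpoint with no failure data scores the same as one with a measured failure rate of 0.3, so well-behaved endpoints with real evidence still rank above unmeasured ones on this metric.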
Preference
Preference encodes locality and preferred capability matches. It also gets a bonus when an active role binding exists.
Unknown-metric redistribution
If every eligible candidate has a given metric marked unknown, the router:
- removes that metric's base weight
- redistributes the removed weight proportionally across the remaining known metrics
This prevents the score from being anchored to a dimension nobody has evidence for.
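The redistribution step can be sketched as follows. The function name and the set-of-metric-names input are assumptions for illustration; the weights in the example are the balanced row of the baseline table.

```python
def redistribute_weights(weights: dict, unknown_for_all: set) -> dict:
    """Drop metrics that are unknown for every candidate and spread their
    weight across the remaining metrics in proportion to their base weights.
    """
    known = {m: w for m, w in weights.items() if m not in unknown_for_all}
    known_total = sum(known.values())
    if known_total == 0:
        # Assumption: with no known metrics there is nothing to redistribute to.
        return {}
    removed = sum(w for m, w in weights.items() if m in unknown_for_all)
    return {m: w + removed * (w / known_total) for m, w in known.items()}


balanced = {"quality": 0.30, "latency": 0.20, "throughput": 0.10,
            "cost": 0.20, "reliability": 0.15, "preference": 0.05}

# Suppose no eligible candidate has any throughput evidence.
adjusted = redistribute_weights(balanced, {"throughput"})
print({m: round(w, 4) for m, w in adjusted.items()})
print(round(sum(adjusted.values()), 6))  # 1.0
```

Because the split is proportional, the relative importance of the remaining metrics is unchanged: quality is still 1.5 times as heavy as latency after the adjustment.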
Small bonuses that refine close contests
On top of weighted metrics, the baseline adds a small 0.01 bonus each for:
- role preferred-capability matches
- task preferred-capability matches
Those bonuses are deliberately small so they refine close contests without overwhelming the main metric mix.
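A small arithmetic sketch shows the intended scale. It assumes each bonus is applied at most once per candidate (one for a role match, one for a task match); the function name and the example totals are made up.

```python
CAPABILITY_BONUS = 0.01  # per matching category, as described above


def apply_bonuses(weighted_total: float, role_match: bool, task_match: bool) -> float:
    """Add the small preferred-capability bonuses to a weighted total.

    Assumption: each bonus is a flat 0.01 applied once, not once per capability.
    """
    return weighted_total + CAPABILITY_BONUS * (int(role_match) + int(task_match))


# Close contest: the bonuses are enough to flip the order.
print(round(apply_bonuses(0.700, True, True), 3))    # 0.72
print(round(apply_bonuses(0.715, False, False), 3))  # 0.715

# Clear gap: the bonuses cannot overturn a 0.05 lead.
print(round(apply_bonuses(0.750, False, False), 3))  # 0.75
```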
Near-tie behavior
When total scores are effectively tied, the current baseline resolves the order deterministically by comparing, in sequence:
- higher quality score
- lower effective latency
- higher reliability score
- stable lexical endpoint_id order
This keeps the result inspectable and reproducible even when candidates are very close.
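One way to express this ordering is a single composite sort key. The sketch below assumes that "effectively tied" means equal after rounding the total to four decimals. That tolerance, the `Scored` record, and its field names are illustrative choices, not the baseline's definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scored:
    """Hypothetical per-candidate scoring record."""
    endpoint_id: str
    total: float
    quality: float
    effective_latency_ms: float
    reliability: float


TIE_DECIMALS = 4  # assumption: totals equal at 4 decimals count as tied


def rank(candidates: List[Scored]) -> List[Scored]:
    """Sort best-first: total, then quality, latency, reliability, endpoint_id."""
    return sorted(
        candidates,
        key=lambda c: (
            -round(c.total, TIE_DECIMALS),  # higher total first
            -c.quality,                     # higher quality first
            c.effective_latency_ms,         # lower latency first
            -c.reliability,                 # higher reliability first
            c.endpoint_id,                  # stable lexical fallback
        ),
    )


ranked = rank([
    Scored("ep-b", 0.70001, 0.8, 120.0, 0.9),
    Scored("ep-a", 0.70002, 0.8, 120.0, 0.9),
    Scored("ep-c", 0.70000, 0.9, 300.0, 0.5),
    Scored("ep-d", 0.65000, 1.0, 10.0, 1.0),
])
print([c.endpoint_id for c in ranked])  # ['ep-c', 'ep-a', 'ep-b', 'ep-d']
```

Because the key ends in endpoint_id, the result does not depend on input order, which is what makes the ranking reproducible.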
What makes the final decision stable
The fallback chain is not a second pass. It is simply the remaining scored candidates, in ranked order, once the winner is chosen.
A useful RouterDecision must say:
- what policy snapshot was applied
- which candidates were rejected and why
- which candidates were scored
- which endpoint won
- why that endpoint won
- whether measured evidence was used
- which scoring version produced the result
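The checklist above maps naturally onto a small record type. The sketch below is a hypothetical shape, not the protocol's RouterDecision schema: every field name, the example endpoint ids, and the reason strings are invented for illustration. Only the scoring version value matches what the current baseline stamps.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class RouterDecision:
    """Illustrative decision artifact; field names are assumptions."""
    policy_snapshot_id: str      # what policy snapshot was applied
    rejected: Dict[str, str]     # endpoint_id -> rejection reason
    scored: List[str]            # scored endpoint_ids, best first
    winner: str                  # which endpoint won
    winner_reason: str           # why that endpoint won
    used_measured_evidence: bool
    scoring_version: str

    @property
    def fallback_chain(self) -> List[str]:
        # Not a second pass: just the scored candidates after the winner.
        return [e for e in self.scored if e != self.winner]


decision = RouterDecision(
    policy_snapshot_id="policy-example-1",
    rejected={"ep-x": "missing required capability"},
    scored=["ep-a", "ep-b", "ep-c"],
    winner="ep-a",
    winner_reason="highest total score under balanced strategy",
    used_measured_evidence=True,
    scoring_version="baseline-v2",
)

print(decision.fallback_chain)  # ['ep-b', 'ep-c']
print(json.dumps(asdict(decision), indent=2))
```

Serializing the whole record, including the rejected candidates and the scoring version, is what lets someone reconstruct later why a given endpoint was chosen.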
Versioned scoring matters
The current baseline stamps the decision with scoring_version: "baseline-v2".
That matters because metric formulas, weights, or tie-break logic can evolve over time. A decision artifact without a scoring version is harder to interpret historically.
Stable but not frozen
The protocol requires explainability and stable semantics. It does not require every implementation to use the same weights forever. Instead, it requires routers to make their choices legible in decision artifacts.
Read next
Candidate selection and eligibility
How candidates enter the router, which hard checks remove them, and why role-aware eligibility always happens before scoring.
Fallbacks, failures, and observability
How RouterDecision, fallback ordering, no-match outcomes, and observability artifacts fit into one inspectable routing story.