Benchmarks and evaluation

Use the Models -> Benchmark surface to grade the configured endpoint set and feed routing-quality evidence back into the runtime.

Benchmarking is the operator's bridge between model setup and routing strategy.

What the benchmark page is for

The Models -> Benchmark surface is where operators:

run full or quick benchmark suites
compare endpoint output quality
inspect recent run history
understand how benchmark results feed observed routing profiles

Why this page matters to Router

Benchmark results are not isolated lab output.

They are written back into the quality evidence that Router can later use for candidate ranking and decision explainability.

That is why the first full benchmark is part of setup, not just periodic maintenance.
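
As a rough illustration of what "written back into the quality evidence" can mean, the sketch below folds judge scores from a benchmark run into a per-endpoint running mean. The `ObservedProfile` class, its fields, and the 0 to 1 score range are assumptions made for this example; they are not the product's actual schema or update rule.

```python
from dataclasses import dataclass


@dataclass
class ObservedProfile:
    """Hypothetical per-endpoint quality evidence (illustrative, not the real schema)."""
    endpoint: str
    quality: float = 0.0  # running mean of judge scores, assumed to lie in [0, 1]
    samples: int = 0      # number of graded outputs folded in so far


def fold_judge_scores(profile: ObservedProfile, scores: list) -> ObservedProfile:
    """Fold a batch of judge scores into the profile's running mean."""
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge score out of range: {score}")
        profile.samples += 1
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        profile.quality += (score - profile.quality) / profile.samples
    return profile


# Two benchmark runs against the same (made-up) endpoint.
profile = ObservedProfile(endpoint="local-a")
fold_judge_scores(profile, [0.8, 0.6])
fold_judge_scores(profile, [1.0])
```

Under these assumptions an endpoint with no benchmark history has zero samples and therefore no quality evidence to rank on, which is one way to see why the first full benchmark belongs to setup.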

Configured endpoint set

The active local and remote endpoints that actually compete for the work.

Run the full benchmark

Exercise the candidate set and grade outcomes through the benchmark judge path.

Observed profiles update

Benchmark-derived quality and health signals become routable evidence.

Choose routing strategy

Pick balanced, quality, latency, or cost based on the benchmark evidence rather than prior assumptions.

Validate live routed requests

Confirm that Router and Observe tell the same story once traffic starts flowing.

Re-benchmark after inventory changes

Any material provider, model, or role change should refresh the evidence before further tuning.

Benchmarking is the evidence loop that turns a configured inventory into an informed routing strategy and a checkable live decision.
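
The re-benchmark step in the loop above can be sketched as a staleness check: fingerprint the parts of the inventory that count as material, and compare against the fingerprint recorded at the last full benchmark. The field names (`provider`, `model`, `role`) and both helper functions are hypothetical and exist only for this illustration.

```python
import hashlib
import json
from typing import Optional


def inventory_fingerprint(endpoints: list) -> str:
    """Hash the fields whose change counts as 'material' in this sketch."""
    # Sort so that reordering the inventory does not change the fingerprint.
    material = sorted((e["provider"], e["model"], e["role"]) for e in endpoints)
    payload = json.dumps(material, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rebenchmark(current: list, fingerprint_at_last_run: Optional[str]) -> bool:
    """True if no benchmark has run yet or the material inventory has changed."""
    if fingerprint_at_last_run is None:
        return True
    return inventory_fingerprint(current) != fingerprint_at_last_run


# Made-up inventory; the names are placeholders, not real endpoints.
inventory = [
    {"provider": "local", "model": "model-a", "role": "general"},
    {"provider": "remote", "model": "model-b", "role": "general"},
]
baseline = inventory_fingerprint(inventory)
```

Storing `baseline` alongside the benchmark run would let an operator script flag stale evidence before any further strategy tuning.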

What benchmark evidence actually affects

Benchmarking most directly strengthens the quality dimension of routing by populating judge-backed or quality-backed observed profiles.

It does not replace the other decision inputs Router uses later:

observed latency still feeds the latency side of candidate scoring
catalog model economics and observed cost still feed the cost side of candidate scoring
reliability and policy still decide whether an endpoint should even stay eligible

Benchmarking also helps operators see:

whether latency tradeoffs are real enough to justify a latency strategy
whether cost spread is meaningful enough to justify a cost strategy
whether weak or unstable endpoints should be removed before any strategy choice

That means the benchmark is not the whole routing story, but it is the main evidence source that keeps strategy selection from turning into guesswork.
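
To make the division of inputs concrete, here is a minimal sketch of how benchmark-derived quality might combine with observed latency and cost once reliability and policy have filtered the candidate set. The weights, the `Candidate` fields, and the min-max normalization are all assumptions chosen for illustration; the document does not specify Router's actual scoring formula.

```python
from dataclasses import dataclass

# Hypothetical (quality, latency, cost) weights per strategy; illustrative only.
STRATEGY_WEIGHTS = {
    "balanced": (0.34, 0.33, 0.33),
    "quality": (0.80, 0.10, 0.10),
    "latency": (0.10, 0.80, 0.10),
    "cost": (0.10, 0.10, 0.80),
}


@dataclass
class Candidate:
    name: str
    quality: float         # benchmark-derived, higher is better, assumed in [0, 1]
    latency_ms: float      # observed, lower is better
    cost: float            # catalog or observed cost per request, lower is better
    eligible: bool = True  # outcome of reliability and policy checks


def _lower_is_better(value: float, values: list) -> float:
    """Min-max normalize so the lowest value scores 1.0 and the highest 0.0."""
    lo, hi = min(values), max(values)
    return 1.0 if hi == lo else (hi - value) / (hi - lo)


def rank(candidates: list, strategy: str) -> list:
    """Return eligible candidate names, best first, under the given strategy."""
    wq, wl, wc = STRATEGY_WEIGHTS[strategy]
    pool = [c for c in candidates if c.eligible]  # eligibility is decided first
    if not pool:
        return []
    latencies = [c.latency_ms for c in pool]
    costs = [c.cost for c in pool]
    scored = [
        (
            wq * c.quality
            + wl * _lower_is_better(c.latency_ms, latencies)
            + wc * _lower_is_better(c.cost, costs),
            c.name,
        )
        for c in pool
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Made-up candidates; "flaky" is excluded regardless of its scores.
candidates = [
    Candidate("remote-large", quality=0.9, latency_ms=1200, cost=0.02),
    Candidate("local-small", quality=0.5, latency_ms=300, cost=0.001),
    Candidate("flaky", quality=0.95, latency_ms=200, cost=0.0005, eligible=False),
]
quality_order = rank(candidates, "quality")
latency_order = rank(candidates, "latency")
```

In this toy setup the quality strategy prefers `remote-large` and the latency strategy prefers `local-small`, while the ineligible endpoint never appears in either ranking. The point is only that the quality term depends on benchmark evidence; without it, the choice between strategies rests on assumed numbers.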