Eval (songbird-eval)

LLM evaluation for the Songbird copilot. The eval drives the real in-app chat loop in-process: ml.chat runs against the production dispatch handlers (classifier, model tiers, orchestrator, nudges, read-before-write guard, splice handlers, deletion guard, receipts). There is no mocked loop, so a change to the chat pipeline is automatically a change to what gets evaluated. The crate lives at rust/crates/ml/songbird-eval; the prompt set and the baseline project fixture are in its assets/ directory.

Running

# Full sweep against the in-app loop (Gemini tiers, auto-routed)
cargo run -p songbird-eval --release

# Specific prompts / explicit model tier
cargo run -p songbird-eval --release -- --ids 1,2,3 --model pro

# Claude Code backend (relays through the user's `claude` CLI; tool calls
# come back via songbird-mcp → the same dispatch handlers)
cargo build --release -p songbird-mcp
cargo run -p songbird-eval --release -- --model claude-code

# Regenerate a report, or compare two result sets
cargo run -p songbird-eval --release -- --report-only --results-dir crates/ml/songbird-eval/results/auto
cargo run -p songbird-eval --release -- --compare results/auto results/claude-code

The Gemini key resolves from ~/.songbird/settings.json (same as the app); setting GEMINI_API_KEY overrides it for the judge. Don’t run two evals concurrently with one key: the judge gets rate-limited, and zeroed judgments are flagged JUDGE-FAILED rather than silently scored.
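
The JUDGE-FAILED behaviour can be pictured roughly as below. This is a minimal sketch, not the crate's code: the `Judgment` enum, `classify`, and `summarize` are hypothetical names, and it assumes a failed judge call surfaces either as no result or as an all-zero score pair.

```rust
/// Outcome of the LLM judge for one run (illustrative type, not the crate's).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Judgment {
    Scored { relevance: u8, quality: u8 },
    JudgeFailed,
}

/// Assumption: `None` means the judge call errored (e.g. rate limit),
/// and an all-zero pair is treated as a failed judgment, not a real score.
fn classify(raw: Option<(u8, u8)>) -> Judgment {
    match raw {
        Some((r, q)) if r > 0 || q > 0 => Judgment::Scored { relevance: r, quality: q },
        _ => Judgment::JudgeFailed,
    }
}

/// Mean judge score over scored runs only, plus the count of flagged runs.
fn summarize(runs: &[Judgment]) -> (Option<f64>, usize) {
    let scored: Vec<f64> = runs
        .iter()
        .filter_map(|j| match j {
            Judgment::Scored { relevance, quality } => Some(f64::from(*relevance + *quality)),
            Judgment::JudgeFailed => None,
        })
        .collect();
    let failed = runs.len() - scored.len();
    let mean = if scored.is_empty() {
        None
    } else {
        Some(scored.iter().sum::<f64>() / scored.len() as f64)
    };
    (mean, failed)
}
```

Keeping failed judgments out of the mean and reporting them as a separate count is the point of the flag: a rate-limited sweep shows up as a visible gap instead of a lower average.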

Scoring (max 10 per prompt)

| Dimension | Range | Source |
| --- | --- | --- |
| validity | 0–3 | deterministic: did a parse-clean edit actually land in real state |
| relevance | 0–4 | LLM judge; capped at 2 if the run was silent (tools ran, no final message) |
| quality | 0–3 | LLM judge; capped at 1 if the prompt’s deterministic assertion failed |

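
The caps in the table can be sketched as a small scoring function. The struct and field names here are illustrative assumptions, not the crate's actual types; only the ranges and caps come from the table.

```rust
/// Per-run inputs to the score (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct RunScores {
    validity: u8,           // deterministic, 0..=3
    relevance: u8,          // LLM judge, 0..=4
    quality: u8,            // LLM judge, 0..=3
    silent: bool,           // tools ran, no final user-facing message
    assertion_failed: bool, // the prompt's deterministic assertion failed
}

/// Apply the caps from the table and sum to a total in 0..=10.
fn total_score(r: RunScores) -> u8 {
    let validity = r.validity.min(3);
    // A silent run cannot score above 2 on relevance.
    let relevance_cap = if r.silent { 2 } else { 4 };
    // A failed assertion caps quality at 1, whatever the judge said.
    let quality_cap = if r.assertion_failed { 1 } else { 3 };
    validity + r.relevance.min(relevance_cap) + r.quality.min(quality_cap)
}
```

Under these assumptions a run that the judge rates perfectly but that ends silently totals 8, and one that is also missing its asserted edit totals 6.
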
Hard checks reported per run: structural violations (baseline tracks destroyed → validity 0), assertions (assert specs in prompts.json: chord brackets, note moved/removed/added, velocity and pitch bounds — delta-based against the baseline state snapshot, since real parsed state has one clip per section), and silent completions (tools executed, no final user-facing message).
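
As a rough illustration of the delta-based checks described above, the sketch below compares a baseline snapshot with the post-run state. The `Snapshot` and `Note` shapes are assumptions made for the example (track id mapped to a set of start-tick and pitch pairs); the real parsed state and the assert specs in prompts.json will differ.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical note identity: (start tick, MIDI pitch).
type Note = (u32, u8);
/// Hypothetical project snapshot: track id -> notes on that track.
type Snapshot = BTreeMap<String, BTreeSet<Note>>;

/// Structural violation: baseline tracks that no longer exist after the run.
fn destroyed_tracks(baseline: &Snapshot, after: &Snapshot) -> Vec<String> {
    baseline
        .keys()
        .filter(|id| !after.contains_key(*id))
        .cloned()
        .collect()
}

/// Delta for one track relative to the baseline: (added, removed).
/// A moved note shows up as one removed entry plus one added entry.
fn note_delta(
    baseline: &Snapshot,
    after: &Snapshot,
    track: &str,
) -> (BTreeSet<Note>, BTreeSet<Note>) {
    let empty = BTreeSet::new();
    let before = baseline.get(track).unwrap_or(&empty);
    let now = after.get(track).unwrap_or(&empty);
    let added = now.difference(before).copied().collect();
    let removed = before.difference(now).copied().collect();
    (added, removed)
}
```

Comparing against the baseline rather than inspecting the final state alone is what lets an assertion such as "note added" pass or fail without depending on how the parsed state groups notes into clips.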

Why in-process, not mocked

The previous JS harness (eval/run_eval.js, retired 2026-06-11) re-implemented the loop, tools, and guards as mocks, and they drifted: the mock taught the wrong chord syntax, couldn’t observe state destruction, and never exercised intent routing. The in-process eval found two production bugs in its first hour — a mixer intent misroute (“lower the kick volume” edited note velocities) and a panic on malformed MCP trackId input that poisoned the state mutex. The bar: if the eval can’t fail the way production fails, it isn’t evaluating production. Historical JS-era reports remain archived under eval/.
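
The mutex-poisoning failure is a general Rust hazard: a panic while a `std::sync::Mutex` guard is held poisons the lock for every later caller. The sketch below shows the defensive shape under stated assumptions; `AppState`, `set_volume`, and a numeric trackId are all hypothetical and are not taken from the Songbird handlers.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical shared state: track id -> volume.
struct AppState {
    volumes: HashMap<u32, f32>,
}

/// Validate untrusted input before taking the lock and return an error
/// instead of panicking, so a malformed id cannot poison the mutex.
fn set_volume(state: &Mutex<AppState>, raw_track_id: &str, volume: f32) -> Result<(), String> {
    let id: u32 = raw_track_id
        .trim()
        .parse()
        .map_err(|e| format!("invalid trackId {raw_track_id:?}: {e}"))?;
    let mut guard = state
        .lock()
        .map_err(|_| "state mutex poisoned".to_string())?;
    match guard.volumes.get_mut(&id) {
        Some(v) => {
            *v = volume;
            Ok(())
        }
        None => Err(format!("unknown trackId {id}")),
    }
}
```

A mocked handler would never hold the real lock, which is why this class of failure is only observable when the eval runs the production dispatch path.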