Eval (songbird-eval)
LLM evaluation for the Songbird copilot. The eval drives the real in-app
chat loop in-process — ml.chat against the production dispatch handlers
(classifier, model tiers, orchestrator, nudges, read-before-write guard,
splice handlers, deletion guard, receipts). There is no mocked loop: a
change to the chat pipeline is automatically a change to what gets
evaluated.
The crate lives at rust/crates/ml/songbird-eval; the prompt set and the
baseline project fixture are in its assets/.
Running
API keys are read from ~/.songbird/settings.json (same as the app);
GEMINI_API_KEY overrides the key used for the judge. Don’t run two evals
concurrently with one key: the judge rate-limits, and zeroed judgments are
flagged JUDGE-FAILED rather than silently scored.
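The key precedence described here can be sketched as a small pure function. This is an illustration only, not the crate's code: `resolve_judge_key` and its parameters are hypothetical names, and treating a blank override as unset is an assumption.

```rust
/// Hypothetical helper: choose the judge key.
/// Assumption: the override wins when it is present and non-blank;
/// otherwise the value loaded from settings.json is used.
fn resolve_judge_key(
    env_override: Option<String>,
    settings_key: Option<String>,
) -> Option<String> {
    env_override
        .filter(|key| !key.trim().is_empty()) // ignore blank overrides
        .or(settings_key)                     // fall back to the settings file value
}
```

A caller would likely pass `std::env::var("GEMINI_API_KEY").ok()` as the first argument and whatever it parsed from the settings file as the second.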
Scoring (max 10 per prompt)
| Dimension | Range | Source |
|---|---|---|
| validity | 0–3 | deterministic — did a parse-clean edit actually land in real state |
| relevance | 0–4 | LLM judge — capped at 2 if the run was silent (tools ran, no final message) |
| quality | 0–3 | LLM judge — capped at 1 if the prompt’s deterministic assertion failed |
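
As a rough sketch of how the caps in the table combine into a per-prompt total: the type and function names below (`RawScores`, `total_score`) are hypothetical and not taken from the crate; only the ranges and caps come from the table.

```rust
/// Hypothetical container for the three per-prompt dimensions.
#[derive(Debug, Clone, Copy)]
struct RawScores {
    validity: u8,  // 0..=3, deterministic
    relevance: u8, // 0..=4, from the LLM judge
    quality: u8,   // 0..=3, from the LLM judge
}

/// Clamp each dimension to its range, apply the two caps, and sum.
/// `silent`: tools ran but no final message was produced.
/// `assertion_failed`: the prompt's deterministic assertion did not hold.
fn total_score(raw: RawScores, silent: bool, assertion_failed: bool) -> u8 {
    let relevance_cap = if silent { 2 } else { 4 };
    let quality_cap = if assertion_failed { 1 } else { 3 };
    raw.validity.min(3)
        + raw.relevance.min(relevance_cap)
        + raw.quality.min(quality_cap)
}
```

Under these assumptions, a run that the judge rates 4 and 3 but that was silent and failed its assertion totals at most 3 + 2 + 1 = 6 rather than 10.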
Each run also records deterministic assertions (the assert specs in
prompts.json: chord brackets, note moved/removed/added, velocity and pitch
bounds; checked as deltas against the baseline state snapshot, since real
parsed state has one clip per section) and silent completions (tools
executed, no final user-facing message).
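
A delta-based check compares the state after the run with the baseline snapshot, so notes that were already present do not count as evidence of an edit. The sketch below is a minimal illustration of that idea; `Note`, `Delta`, `note_delta`, and `note_added_within` are hypothetical names, and the real state model and assert spec format may differ.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified note; the real parsed state carries more fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Note {
    pitch: u8,
    velocity: u8,
    start_tick: u32,
}

/// Notes present only after the run, and notes present only in the baseline.
struct Delta {
    added: Vec<Note>,
    removed: Vec<Note>,
}

/// Set-style difference between the baseline snapshot and the post-run state.
/// Assumption: identical duplicate notes are treated as one.
fn note_delta(baseline: &[Note], after: &[Note]) -> Delta {
    let before: HashSet<Note> = baseline.iter().copied().collect();
    let now: HashSet<Note> = after.iter().copied().collect();
    Delta {
        added: after.iter().copied().filter(|n| !before.contains(n)).collect(),
        removed: baseline.iter().copied().filter(|n| !now.contains(n)).collect(),
    }
}

/// Hypothetical "note added" assertion with velocity bounds: passes only if
/// at least one new note exists and every new note is within the bounds.
fn note_added_within(baseline: &[Note], after: &[Note], min_vel: u8, max_vel: u8) -> bool {
    let delta = note_delta(baseline, after);
    !delta.added.is_empty()
        && delta.added.iter().all(|n| (min_vel..=max_vel).contains(&n.velocity))
}
```

Because only the difference is inspected, a bounds check like this applies to what the run changed and ignores content that shipped in the baseline fixture.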
Why in-process, not mocked
The previous JS harness (eval/run_eval.js, retired 2026-06-11)
re-implemented the loop, tools, and guards as mocks, and they drifted: the
mock taught the wrong chord syntax, couldn’t observe state destruction, and
never exercised intent routing. The in-process eval found two production
bugs in its first hour — a mixer intent misroute (“lower the kick volume”
edited note velocities) and a panic on malformed MCP trackId input that
poisoned the state mutex. The bar: if the eval can’t fail the way
production fails, it isn’t evaluating production. Historical JS-era reports
remain archived under eval/.