# Eval (`eval/`)
LLM evaluation framework for testing the Songbird AI copilot's ability to generate and edit `.bird` files.
## Overview
Runs a suite of 50 music composition prompts through the Gemini API, simulates tool calls (no live C++ backend needed), validates the generated `.bird` files, and uses LLM-as-judge scoring for quality assessment.
## Files
| File | Purpose |
|---|---|
| `run_eval.js` | Main evaluation runner — sends prompts to Gemini, handles multi-turn tool simulation, validates output, scores with LLM-as-judge. |
| `prompts.json` | 50 evaluation prompts covering music composition tasks (create beats, add tracks, change arrangements, etc.). |
| `baseline.bird` | Baseline `.bird` project file used as the starting state for each evaluation prompt. |
| `report.md` | Generated evaluation report with pass/fail rates and scoring. |
| `results/` | Per-prompt result files (JSON) containing generated `.bird` files, tool calls, scores, and timing. |
## Usage
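A typical invocation, assuming the standard Node.js entry point (the exact flags and environment variable names should be confirmed against `run_eval.js`):

```shell
# Hypothetical invocation sketch — verify against run_eval.js before use.
export GEMINI_API_KEY="..."   # key for the Gemini API calls
node eval/run_eval.js         # runs all prompts, writes results/ and report.md
```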
## Architecture
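The runner's core is a multi-turn loop: each tool call the model emits is applied to an in-memory project state rather than the live C++ backend, and the result is fed back until the model stops requesting tools. A minimal sketch of that loop, with hypothetical tool names (`add_track`, `set_tempo`) and a stub model standing in for the real Gemini call:

```javascript
// Illustrative tool handlers that mutate a plain-object project state
// in place of the C++ backend. Names and shapes are assumptions, not
// the actual run_eval.js API.
const toolHandlers = {
  add_track: (state, args) => {
    state.tracks.push({ name: args.name, clips: [] });
    return { ok: true };
  },
  set_tempo: (state, args) => {
    state.tempo = args.bpm;
    return { ok: true };
  },
};

// Drive the "model" until it stops requesting tools. `model` stands in
// for a Gemini call that returns either tool requests or a final answer.
function runTurns(model, state) {
  const transcript = [];
  for (let turn = 0; turn < 10; turn++) { // cap turns to avoid loops
    const reply = model(transcript);
    if (!reply.toolCalls || reply.toolCalls.length === 0) {
      return { state, transcript };
    }
    for (const call of reply.toolCalls) {
      const result = toolHandlers[call.name](state, call.args);
      transcript.push({ call, result }); // fed back on the next turn
    }
  }
  return { state, transcript };
}

// Example: a stub model that adds one track, then finishes.
const stubModel = (transcript) =>
  transcript.length === 0
    ? { toolCalls: [{ name: "add_track", args: { name: "Drums" } }] }
    : { toolCalls: [], text: "Added a drum track." };

const final = runTurns(stubModel, { tempo: 120, tracks: [] });
console.log(final.state.tracks.length); // 1
```

Because the state is a plain object, the resulting project can be serialized and validated after the loop without touching the audio engine.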
## Scoring
Each prompt is scored on:

- **Structural validity** — Does the `.bird` file parse correctly?
- **Tool use correctness** — Were the right tools called with valid arguments?
- **Musical quality** — LLM-as-judge evaluates musicality, complexity, and adherence to the prompt.
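One plausible way to combine the three axes into a per-prompt result, where the hard gating on the first two checks and the 0.6 pass threshold are illustrative assumptions, not the actual `run_eval.js` logic:

```javascript
// Illustrative score aggregation. `judgeScore` is assumed to be a
// 1-5 LLM-as-judge rating; weights and threshold are made up here.
function scorePrompt({ parses, toolCallsValid, judgeScore }) {
  // Structural validity and tool correctness act as hard gates: a
  // file that does not parse cannot pass regardless of judged quality.
  if (!parses || !toolCallsValid) return { score: 0, pass: false };
  const score = judgeScore / 5; // normalize to 0..1
  return { score, pass: score >= 0.6 };
}

console.log(scorePrompt({ parses: true, toolCallsValid: true, judgeScore: 4 }));
// { score: 0.8, pass: true }
```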