Automated Testing

Songbird has four automated-testing surfaces. They cut the stack at different points, run at very different speeds, and answer different questions. Pick the cheapest one that can actually fail in a way you’d care about.

Mode	What it exercises	Typical time	Picks up failures in…
Rust tests	Engine, DSP, sync dispatch, state	seconds	Audio rendering, scheduling, undo, state mutations
WebSocket tap	Full sync engine + headless backend, no React	hundreds of ms	Command/event round-trips, state propagation, persistence
Vitest	React components in isolation	milliseconds per test	Component logic, Zustand store wiring, hook behavior
agent-browser e2e	Real UI in Chrome talking to a real headless backend	tens of seconds	Layout, drag-and-drop, pointer events, full-flow regressions

Picking the right mode

What did you touch?
├─ DSP / engine / sync dispatch logic
│     → 1. Rust tests              (cargo test)
│
├─ A sync-engine channel (defs.rs / commands.rs / state.rs)
│     → 1. Rust tests for the dispatch
│     → 2. WebSocket tap for the cross-boundary behavior, if non-trivial
│
├─ Pure React component or hook
│     → 3. Vitest
│
├─ A UI flow that spans multiple components, panels, or drag interactions
│     → 4. agent-browser e2e
│
└─ Anything else (e.g. fixed a typo, renamed a variable)
      → ./utils/validate is enough.

A few principles, in order of importance:

./utils/validate is the default. It runs Rust tests, Vitest, ESLint, and tsc --noEmit. For most changes, that’s the whole bar.
agent-browser is opt-in. Don’t spin up the headless harness unless the user explicitly asked for browser verification, or the change is one that can only be caught visually (a layout regression, a drag-and-drop gesture, a focus trap).
Prefer the layer where the bug would actually live. A bug in the clip scheduler is a Rust test; a bug in dispatch wiring is a WebSocket tap; a bug in how a knob renders is Vitest; a bug in cross-panel state flow is agent-browser.
Determinism beats realism. Rust tests on synthesized audio beat ear-tests on real audio; a WebSocket tap that asserts on specific events beats a browser snapshot you’d have to eyeball.

1. Rust tests

The Rust workspace is exhaustively unit-tested. Audio code is verified with deterministic output checks — RMS, peak amplitude, zero-crossing frequency estimation, spectral energy ratios, golden-output fingerprints — so engine work doesn’t need a sound card.
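As a rough illustration of what such checks compute, the sketch below synthesizes a sine tone and measures its RMS, peak, and zero-crossing frequency. It is written in JavaScript only to match the tap example later in this document; the arithmetic carries over to a Rust test unchanged. All function names here (sine, rms, peak, zeroCrossingFreq) are hypothetical and are not taken from the Songbird workspace.

```javascript
// Hypothetical helpers for illustration; not part of the Songbird codebase.

// Synthesize `seconds` of a sine wave at `freq` Hz.
function sine(freq, sampleRate, seconds, amplitude = 1.0) {
  const n = Math.round(sampleRate * seconds);
  const out = new Float64Array(n);
  for (let i = 0; i < n; i++) {
    out[i] = amplitude * Math.sin((2 * Math.PI * freq * i) / sampleRate);
  }
  return out;
}

// Root-mean-square level of a buffer.
function rms(buf) {
  let sum = 0;
  for (const s of buf) sum += s * s;
  return Math.sqrt(sum / buf.length);
}

// Largest absolute sample value.
function peak(buf) {
  let p = 0;
  for (const s of buf) p = Math.max(p, Math.abs(s));
  return p;
}

// Estimate frequency by counting negative-to-positive crossings (one per cycle).
function zeroCrossingFreq(buf, sampleRate) {
  let crossings = 0;
  for (let i = 1; i < buf.length; i++) {
    if (buf[i - 1] < 0 && buf[i] >= 0) crossings++;
  }
  return (crossings * sampleRate) / buf.length;
}

const SR = 48000;
const tone = sine(440, SR, 1.0, 0.5);
console.log(rms(tone).toFixed(4), peak(tone).toFixed(4), zeroCrossingFreq(tone, SR));
```

For a 0.5-amplitude sine, RMS should sit near 0.5/√2 ≈ 0.354 and peak near 0.5. The zero-crossing estimate counts whole cycles, so it can undershoot by about one cycle per buffer; assert with a tolerance rather than exact equality.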

# All Rust tests + clippy
./utils/validate rust

# Just one crate (each line runs from the repo root; the subshell keeps your cwd unchanged)
(cd rust && cargo test -p songbird-engine)
(cd rust && cargo test -p songbird-sync)

Conventions:

Tests live in sibling tests.rs files, never inline #[cfg(test)] mod tests { … } blocks. The parent file declares #[cfg(test)] mod tests;.
For DSP atoms/molecules, use the golden-file pattern documented in the dsp-golden-testing skill.
For end-to-end engine flows (load session → play → render → assert), see rust/crates/engine/songbird-engine/tests/ and the audio_pipeline_test example.

When this is enough

The change is pure backend (engine, DSP, state, dispatch handler logic).
The change is observable in the engine’s output samples, emitted events, or StateStore mutations.

When this isn’t enough

The change moves data across the Tauri / WebSocket boundary in a way Rust tests don’t exercise (use mode 2).
The change affects how the React UI renders or behaves (use mode 3 or 4).

2. WebSocket tap (spoofing sync engine commands)

The headless server speaks the same WebSocket protocol the Tauri React UI uses. A test client that opens a WebSocket connection to ws://localhost:<port> and sends framed JSON gets to drive the entire backend (sync engine, state, audio engine, plugin host) without spinning up a browser. This is the fastest way to assert on cross-boundary behavior.

Wire protocol

The protocol is documented in rust/crates/app/songbird-headless/src/main.rs. In short:

Invoke a command (request/response):

{
  "eventId": "__songbird__invoke",
  "payload": {
    "name": "transport.play",
    "params": {},
    "resultId": 1
  }
}

The server replies with:

{ "type": "result", "id": 1, "result": <json> }

Listen for events (server → client):

{ "type": "event", "event": "transport:state", "payload": { "playing": true, "..." } }

name values are the same "channel.action" strings the React UI uses (transport.play, mixer.volume, clip.add, …). event values match the channel event names (transport:state, mixer:audio_clip_peaks, …).

Example: a Node.js tap

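# Start the headless backend and give it a few seconds to come up.
./utils/build/agent-headless.sh --skip-build &
sleep 8
WS=$(jq -r .ws .context/ports.json)

# --input-type=module makes Node treat stdin as an ES module, so the
# import statement parses on Node versions that default stdin to CommonJS.
# The heredoc is unquoted on purpose: the shell substitutes ${WS} below.
node --input-type=module <<EOF
import WebSocket from 'ws';
const ws = new WebSocket('ws://localhost:${WS}');
let nextId = 1;
const pending = new Map();  // resultId -> resolver for the matching invoke
const events = [];          // every server event, in arrival order

// Send one command and resolve with the server's result.
function invoke(name, params = {}) {
  return new Promise((resolve) => {
    const id = nextId++;
    pending.set(id, resolve);
    ws.send(JSON.stringify({ eventId: '__songbird__invoke', payload: { name, params, resultId: id } }));
  });
}

ws.on('open', async () => {
  await invoke('transport.play');
  await new Promise(r => setTimeout(r, 1000));  // let the transport run for 1 s
  await invoke('transport.stop');
  const positions = events.filter(e => e.event === 'transport:position');
  console.log('captured', positions.length, 'position events, last:', positions.at(-1));
  process.exit(0);
});
ws.on('message', (raw) => {
  const m = JSON.parse(raw.toString());
  if (m.type === 'result') {
    pending.get(m.id)?.(m.result);
    pending.delete(m.id);  // drop the resolver once it has fired
  } else if (m.type === 'event') events.push(m);
});
EOF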
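The inline script above never times out and can only inspect events after the fact. One way to make taps easier to assert with is to pull the bookkeeping into a small socket-agnostic helper. The sketch below is an illustration under assumptions, not code from the repository: createTap, waitForEvent, handleMessage, and the 2000 ms default timeout are all hypothetical names and values. Only the frame shapes come from the wire protocol described above.

```javascript
// Hypothetical helper. `send` is any function that transmits a string,
// for example (s) => ws.send(s) with the ws client from the example above.
function createTap(send) {
  let nextId = 1;
  const pending = new Map(); // resultId -> callback that settles the invoke
  const waiters = [];        // outstanding waitForEvent registrations
  const events = [];         // every event seen, in arrival order

  return {
    events,

    // Send a command; resolve with its result or reject after timeoutMs.
    invoke(name, params = {}, timeoutMs = 2000) {
      const id = nextId++;
      return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
          pending.delete(id);
          reject(new Error('timeout waiting for result of ' + name));
        }, timeoutMs);
        pending.set(id, (result) => { clearTimeout(timer); resolve(result); });
        send(JSON.stringify({
          eventId: '__songbird__invoke',
          payload: { name, params, resultId: id },
        }));
      });
    },

    // Resolve with the next event named `event` whose payload passes `predicate`.
    waitForEvent(event, predicate = () => true, timeoutMs = 2000) {
      return new Promise((resolve, reject) => {
        const waiter = { event, predicate, resolve: null };
        const timer = setTimeout(() => {
          const idx = waiters.indexOf(waiter);
          if (idx >= 0) waiters.splice(idx, 1);
          reject(new Error('timeout waiting for event ' + event));
        }, timeoutMs);
        waiter.resolve = (m) => { clearTimeout(timer); resolve(m); };
        waiters.push(waiter);
      });
    },

    // Feed every raw server message here (string or Buffer).
    handleMessage(raw) {
      const m = JSON.parse(raw.toString());
      if (m.type === 'result') {
        const settle = pending.get(m.id);
        pending.delete(m.id);
        if (settle) settle(m.result);
      } else if (m.type === 'event') {
        events.push(m);
        const ready = waiters.filter((w) => w.event === m.event && w.predicate(m.payload));
        for (const w of ready) {
          waiters.splice(waiters.indexOf(w), 1);
          w.resolve(m);
        }
      }
    },
  };
}

// Demo with a fake transport that answers every invoke with "ok".
const tap = createTap((frame) => {
  const { payload } = JSON.parse(frame);
  queueMicrotask(() =>
    tap.handleMessage(JSON.stringify({ type: 'result', id: payload.resultId, result: 'ok' })));
});
tap.invoke('transport.play').then((r) => console.log('result:', r)); // logs "result: ok"
```

Wired to a real socket, this would look like `const tap = createTap((s) => ws.send(s)); ws.on('message', (raw) => tap.handleMessage(raw));`, after which a test can `await tap.waitForEvent('transport:state', (p) => p.playing)` instead of sleeping for a fixed interval.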

When this is the right tool

You need to assert on events, not pixels.
You’re verifying a multi-step backend flow (dispatch → state mutation → event emission → second dispatch).
You want test runs that take ~1 s, not ~30 s.
You’re exercising MCP-style scenarios (everything reachable via dispatch_command is reachable here too).

When this isn’t enough

The bug is in how React subscribes to or renders the events (use mode 3 or 4).
The bug only manifests with real user input timing (drag latency, pointer-event coalescing) — use mode 4.

Engine behavior under tests

By default ./utils/build/agent-headless.sh runs the engine on a virtual audio clock (--virtual-audio), so transport advances, the clip scheduler fires, and meters emit — without opening a real device. This means a WebSocket tap can assert on transport:position movement, level meters, clip triggers, etc. Pass --no-audio if you want the engine quiet, or --real-audio if you genuinely need cpal.
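One way to turn "transport:position movement" into a deterministic assertion is to check that a value extracted from successive events never goes backwards and ends up ahead of where it started. The payload shape of transport:position is not documented in this section, so the sketch takes a caller-supplied extractor instead of assuming a field name; the advances helper and the sample payloads are hypothetical.

```javascript
// Returns true when the extracted values never decrease and the last value
// is strictly greater than the first. Fewer than two events counts as "no movement".
function advances(events, getValue) {
  const values = events.map(getValue);
  if (values.length < 2) return false;
  for (let i = 1; i < values.length; i++) {
    if (values[i] < values[i - 1]) return false;
  }
  return values[values.length - 1] > values[0];
}

// Hypothetical captured events; the `beats` field is made up for this demo.
const captured = [0.0, 0.5, 1.0, 1.5].map((beats) => ({
  type: 'event',
  event: 'transport:position',
  payload: { beats },
}));
console.log(advances(captured, (e) => e.payload.beats)); // true
```

A check like this fails loudly if the virtual clock stalls or the tap was started with --no-audio, which a bare event count might not reveal.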

3. Vitest (React component tests)

Component-level React tests live next to the component (Foo.test.ts next to Foo.tsx). They use Vitest + Testing Library.

./utils/validate vitest
# or
cd react_ui && npx vitest run path/to/Foo.test.ts

Conventions:

Mock the sync engine surface (@/sync/api) — don’t spin up a real backend. Vitest is for component logic, not integration.
Set Zustand store state with useFooStore.setState({ … }) in beforeEach so selectors read known values.
Don’t write a Vitest test that exercises real WebSocket / Tauri behavior — that’s mode 2 or mode 4.

When this is the right tool

A component renders the wrong thing for a given prop combination.
A custom hook computes the wrong value.
A Zustand selector triggers the wrong re-renders.

When this isn’t enough

The bug is in how the backend produces the data the component renders (use mode 1 or 2).
The bug only appears when multiple components interact, or when real layout / pointer events are in play (use mode 4).

4. agent-browser (end-to-end)

Drives real Chrome via CDP against a real headless backend. Slow, context-heavy, but the only mode that catches layout regressions, drag-and-drop bugs, and full-stack flow problems.

./utils/build/agent-headless.sh &
sleep 8
agent-browser open "$(jq -r .browser_url .context/ports.json)"
agent-browser wait 3000
agent-browser snapshot -i              # interactive elements with refs
agent-browser click @e7                # click by ref
agent-browser screenshot --annotate    # labeled image for vision review

The harness uses virtual audio by default, so transport actually ticks under tests (clip-trigger timing, meter movement, playhead extrapolation all work without opening a real device). Full skill: /.claude/skills/e2e-headless/SKILL.md. Browser-driver reference: /.claude/skills/agent-browser/SKILL.md.

Per-worktree ports

agent-headless.sh derives a stable port offset per worktree so that concurrent worktrees rarely collide: hash($PWD) mod 89 + 10, which maps each worktree to a deterministic slot in [10, 98]. Offsets 0–9 are reserved for explicit manual use (launch.sh, dev). The slots are not guaranteed unique: assuming a uniform hash, 7 active worktrees have roughly a 21% chance that at least two share a slot, and the odds rise with more. On collision, the script fails fast on a port-already-bound check; pass --port-offset N to override.
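The mapping and the collision odds can be sanity-checked with a few lines. The sketch below is illustrative only: the hash that agent-headless.sh actually uses is not specified here, so a 32-bit FNV-1a stands in for it, and the function names (fnv1a, portOffset, collisionOdds) and the sample path are hypothetical. The odds calculation is the standard birthday-problem formula and assumes offsets are uniformly distributed over the 89 slots.

```javascript
// Stand-in hash (32-bit FNV-1a over UTF-16 code units). This is NOT
// necessarily the hash the script uses; it only illustrates the mapping.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// hash(path) mod 89 + 10 yields an integer in [10, 98].
function portOffset(path) {
  return (fnv1a(path) % 89) + 10;
}

// Probability that at least two of `n` worktrees share a slot, assuming
// offsets are uniform over `slots` values (birthday problem).
function collisionOdds(n, slots = 89) {
  let allDistinct = 1;
  for (let i = 0; i < n; i++) allDistinct *= (slots - i) / slots;
  return 1 - allDistinct;
}

console.log(portOffset('/home/me/songbird-wt-a')); // hypothetical worktree path
console.log(collisionOdds(7).toFixed(3));
```

Under the uniformity assumption, seven worktrees collide a little over one time in five, which is why the fail-fast check and --port-offset exist.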

When this is the right tool

The user explicitly asked for browser verification.
The change crosses multiple panels / organisms.
The change involves pointer-event drag, focus management, or layout.

When this is too much

The behavior is observable via mode 1, 2, or 3. Use those first — they’re faster, more reliable, and don’t burn agent context on screenshots.

Watching live

Start the agent-browser dashboard before driving the UI:

agent-browser dashboard start    # http://localhost:4848

Every session (named or default) streams its viewport, command activity feed, network, and console into the dashboard. Useful when a human wants to watch an agent drive a flow without sharing the browser window.

Choosing in practice — a worked example

You’re fixing a bug where audio clips don’t trigger when transport starts before the clip’s onset. Here’s how the modes stack up:

Mode 1 (Rust tests). Yes — write a test in songbird-engine that builds a session with a clip at bar 3, calls transport.play() from bar 1, drives process() for enough frames to reach bar 3, and asserts the clip emitted audio. This is where the bug lives, so this is where the regression test goes. Fast, deterministic, catches the regression.
Mode 2 (WebSocket tap). Optional — could write a tap that does the same flow over WS and asserts on transport:position + a meter event. But the bug isn’t on the boundary, so this would just be a second copy of the Rust test with more moving parts. Skip.
Mode 3 (Vitest). No — the React side renders whatever the engine emits; if the engine emits the wrong samples, no component test catches it. Skip.
Mode 4 (agent-browser). Only if you also changed the UI’s transport controls or playhead rendering. For an engine-only fix, this would just be a slow screenshot of a working DAW. Skip unless the user asked.

Result: one Rust test, ~50 lines, done in seconds. The other three modes would each add cost without adding signal.