Songbird has four automated-testing surfaces. They cut the stack at different points, run at very different speeds, and answer different questions. Pick the cheapest one that can actually fail in a way you’d care about.
| Mode | What it exercises | Typical time | Picks up failures in… |
| --- | --- | --- | --- |
| Rust tests | Engine, DSP, sync dispatch, state | seconds | Audio rendering, scheduling, undo, state mutations |
| WebSocket tap | Full sync engine + headless backend, no React | hundreds of ms | Command/event round-trips, state propagation, persistence |
| Vitest | React components in isolation | milliseconds per test | Component logic, Zustand store wiring, hook behavior |
| agent-browser e2e | Real UI in Chrome talking to a real headless backend | tens of seconds | Layout, drag-and-drop, pointer events, full-flow regressions |

Picking the right mode

What did you touch?
├─ DSP / engine / sync dispatch logic
│     → 1. Rust tests              (cargo test)

├─ A sync-engine channel (defs.rs / commands.rs / state.rs)
│     → 1. Rust tests for the dispatch
│     → 2. WebSocket tap for the cross-boundary behavior, if non-trivial

├─ Pure React component or hook
│     → 3. Vitest

├─ A UI flow that spans multiple components, panels, or drag interactions
│     → 4. agent-browser e2e

└─ Anything else (e.g. fixed a typo, renamed a variable)
      → ./utils/validate is enough.
A few principles, in order of importance:
  1. ./utils/validate is the default. It runs Rust tests, Vitest, ESLint, and tsc --noEmit. For most changes, that’s the whole bar.
  2. agent-browser is opt-in. Don’t spin up the headless harness unless the user explicitly asked for browser verification, or the change is one that can only be caught visually (a layout regression, a drag-and-drop gesture, a focus trap).
  3. Prefer the layer where the bug would actually live. A bug in the clip scheduler is a Rust test; a bug in dispatch wiring is a WebSocket tap; a bug in how a knob renders is Vitest; a bug in cross-panel state flow is agent-browser.
  4. Determinism beats realism. Rust tests on synthesized audio beat ear-tests on real audio; a WebSocket tap that asserts on specific events beats a browser snapshot you’d have to eyeball.
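
The decision tree above can also be read as a small lookup. The sketch below is illustrative only: the change-kind names (`engine`, `syncChannel`, `component`, `uiFlow`) and the `pickModes` helper are hypothetical and do not exist in the repo.

```javascript
// Hypothetical change kinds mapped to the testing modes from the tree above.
// Anything not listed falls through to ./utils/validate.
function pickModes(changeKind, { nonTrivialBoundary = false } = {}) {
  switch (changeKind) {
    case 'engine': // DSP / engine / sync dispatch logic
      return ['rust'];
    case 'syncChannel': // defs.rs / commands.rs / state.rs
      // The WebSocket tap is only added when the cross-boundary behavior is non-trivial.
      return nonTrivialBoundary ? ['rust', 'websocket-tap'] : ['rust'];
    case 'component': // pure React component or hook
      return ['vitest'];
    case 'uiFlow': // spans multiple components, panels, or drag interactions
      return ['agent-browser'];
    default: // typo fix, rename, and so on
      return ['validate'];
  }
}

console.log(pickModes('syncChannel', { nonTrivialBoundary: true }));
console.log(pickModes('rename'));
```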

1. Rust tests

The Rust workspace is exhaustively unit-tested. Audio code is verified with deterministic output checks — RMS, peak amplitude, zero-crossing frequency estimation, spectral energy ratios, golden-output fingerprints — so engine work doesn’t need a sound card.
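
To make the idea of deterministic output checks concrete, here is a minimal sketch of three of the metrics named above (RMS, peak amplitude, zero-crossing frequency estimation) run on a synthesized sine. It is written in JavaScript for illustration; the real tests are Rust, and every function name here is an assumption rather than the workspace's API.

```javascript
// Synthesize a mono sine wave so the check needs no audio device.
function sine(freqHz, amplitude, sampleRate, seconds) {
  const n = Math.round(sampleRate * seconds);
  return Float64Array.from({ length: n }, (_, i) =>
    amplitude * Math.sin((2 * Math.PI * freqHz * i) / sampleRate));
}

// Root-mean-square level of the buffer.
function rms(samples) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Largest absolute sample value.
function peak(samples) {
  let max = 0;
  for (const s of samples) max = Math.max(max, Math.abs(s));
  return max;
}

// Estimate frequency from sign changes: a periodic tone crosses zero twice per cycle.
function zeroCrossingHz(samples, sampleRate) {
  let crossings = 0;
  for (let i = 1; i < samples.length; i++) {
    if ((samples[i - 1] >= 0) !== (samples[i] >= 0)) crossings++;
  }
  return crossings / 2 / (samples.length / sampleRate);
}

const tone = sine(440, 0.5, 48000, 1);
// For a sine of amplitude A, RMS should be close to A / sqrt(2) and peak close to A.
console.log(rms(tone), peak(tone), zeroCrossingHz(tone, 48000));
```

A test built this way asserts with a tolerance (for example, estimated frequency within a couple of Hz of 440) rather than on exact sample values.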
# All Rust tests + clippy
./utils/validate rust

# Just one crate
cd rust && cargo test -p songbird-engine
cd rust && cargo test -p songbird-sync
Conventions:
  • Tests live in sibling tests.rs files, never inline #[cfg(test)] mod tests { … } blocks. The parent file declares #[cfg(test)] mod tests;.
  • For DSP atoms/molecules, use the golden-file pattern documented in the dsp-golden-testing skill.
  • For end-to-end engine flows (load session → play → render → assert), see rust/crates/engine/songbird-engine/tests/ and the audio_pipeline_test example.

When this is enough

  • The change is pure backend (engine, DSP, state, dispatch handler logic).
  • The change is observable in the engine’s output samples, emitted events, or StateStore mutations.

When this isn’t enough

  • The change moves data across the Tauri / WebSocket boundary in a way Rust tests don’t exercise (use mode 2).
  • The change affects how the React UI renders or behaves (use mode 3 or 4).

2. WebSocket tap (spoofing sync engine commands)

The headless server speaks the same WebSocket protocol the Tauri React UI uses. A test client that opens a WebSocket connection to ws://localhost:<port> and sends framed JSON gets to drive the entire backend (sync engine, state, audio engine, plugin host) without spinning up a browser. This is the fastest way to assert on cross-boundary behavior.

Wire protocol

The protocol is documented in rust/crates/app/songbird-headless/src/main.rs. Short form:
Invoke a command (request/response):
{
  "eventId": "__songbird__invoke",
  "payload": {
    "name": "transport.play",
    "params": {},
    "resultId": 1
  }
}
The server replies with:
{ "type": "result", "id": 1, "result": <json> }
Listen for events (server → client):
{ "type": "event", "event": "transport:state", "payload": { "playing": true, "..." } }
name values are the same "channel.action" strings the React UI uses (transport.play, mixer.volume, clip.add, …). event values match the channel event names (transport:state, mixer:audio_clip_peaks, …).
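
The frame shapes above can be handled without a socket, which is handy for unit-testing a tap's own plumbing. This is a minimal sketch: `buildInvoke` and `classify` are hypothetical helper names, and only the field names are taken from the frames shown above.

```javascript
// Build the request frame for one command invocation.
function buildInvoke(name, params, resultId) {
  return JSON.stringify({
    eventId: '__songbird__invoke',
    payload: { name, params, resultId },
  });
}

// Sort an incoming server frame into a result, an event, or something unexpected.
function classify(raw) {
  const m = JSON.parse(raw);
  if (m.type === 'result') return { kind: 'result', id: m.id, value: m.result };
  if (m.type === 'event') return { kind: 'event', name: m.event, payload: m.payload };
  return { kind: 'unknown', message: m };
}

console.log(buildInvoke('transport.play', {}, 1));
console.log(classify('{"type":"event","event":"transport:state","payload":{"playing":true}}'));
```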

Example: a Node.js tap

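# Start the headless backend in the background and give it time to boot.
./utils/build/agent-headless.sh --skip-build &
sleep 8
WS=$(jq -r .ws .context/ports.json)

# --input-type=module makes Node treat the piped script as ESM explicitly;
# older Node versions default stdin to CommonJS and reject the import below.
# The heredoc is unquoted so the shell expands ${WS} before Node runs.
node --input-type=module <<EOF
import WebSocket from 'ws';
const ws = new WebSocket('ws://localhost:${WS}');
let nextId = 1;
const pending = new Map(); // resultId -> resolver for in-flight invokes
const events = [];         // every server event, in arrival order

// Send one command and resolve with the matching result frame.
function invoke(name, params = {}) {
  return new Promise((resolve) => {
    const id = nextId++;
    pending.set(id, resolve);
    ws.send(JSON.stringify({ eventId: '__songbird__invoke', payload: { name, params, resultId: id } }));
  });
}

ws.on('open', async () => {
  // Play for one second, stop, then report the position events seen.
  await invoke('transport.play');
  await new Promise(r => setTimeout(r, 1000));
  await invoke('transport.stop');
  const positions = events.filter(e => e.event === 'transport:position');
  console.log('captured', positions.length, 'position events, last:', positions.at(-1));
  process.exit(0);
});
ws.on('message', (raw) => {
  const m = JSON.parse(raw.toString());
  if (m.type === 'result') {
    // Resolve the waiting invoke and drop it from the pending map.
    pending.get(m.id)?.(m.result);
    pending.delete(m.id);
  } else if (m.type === 'event') {
    events.push(m);
  }
});
EOF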

When this is the right tool

  • You need to assert on events, not pixels.
  • You’re verifying a multi-step backend flow (dispatch → state mutation → event emission → second dispatch).
  • You want test runs that take ~1 s, not ~30 s.
  • You’re exercising MCP-style scenarios (everything reachable via dispatch_command is reachable here too).

When this isn’t enough

  • The bug is in how React subscribes to or renders the events (use mode 3 or 4).
  • The bug only manifests with real user input timing (drag latency, pointer-event coalescing) — use mode 4.

Engine behavior under tests

By default ./utils/build/agent-headless.sh runs the engine on a virtual audio clock (--virtual-audio), so transport advances, the clip scheduler fires, and meters emit — without opening a real device. This means a WebSocket tap can assert on transport:position movement, level meters, clip triggers, etc. Pass --no-audio if you want the engine quiet, or --real-audio if you genuinely need cpal.
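
Because the virtual clock makes transport advance, a tap can assert on movement instead of eyeballing a log. The sketch below assumes each transport:position payload carries a numeric `position` field; that field name is a guess for illustration and should be checked against the real payload before use. `positionsAdvance` is a hypothetical helper.

```javascript
// Returns true when the captured transport:position events never go backwards
// and the last value is strictly greater than the first.
// ASSUMPTION: payloads expose a numeric field (default name 'position').
function positionsAdvance(events, field = 'position') {
  const values = events
    .filter((e) => e.event === 'transport:position')
    .map((e) => e.payload[field]);
  if (values.length < 2) return false; // not enough samples to show movement
  const monotonic = values.every((v, i) => i === 0 || v >= values[i - 1]);
  return monotonic && values[values.length - 1] > values[0];
}

// Example with hand-written events standing in for a captured run.
const captured = [
  { type: 'event', event: 'transport:position', payload: { position: 0 } },
  { type: 'event', event: 'transport:state', payload: { playing: true } },
  { type: 'event', event: 'transport:position', payload: { position: 480 } },
  { type: 'event', event: 'transport:position', payload: { position: 960 } },
];
console.log(positionsAdvance(captured));
```

Under --no-audio the same check would be expected to fail, since nothing drives the transport.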

3. Vitest (React component tests)

Component-level React tests live next to the component (Foo.test.ts next to Foo.tsx). They use Vitest + Testing Library.
./utils/validate vitest
# or
cd react_ui && npx vitest run path/to/Foo.test.ts
Conventions:
  • Mock the sync engine surface (@/sync/api) — don’t spin up a real backend. Vitest is for component logic, not integration.
  • Seed Zustand store state with useFooStore.setState({ … }) in beforeEach so selectors read known values.
  • Don’t write a Vitest test that exercises real WebSocket / Tauri behavior — that’s mode 2 or mode 4.

When this is the right tool

  • A component renders the wrong thing for a given prop combination.
  • A custom hook computes the wrong value.
  • A Zustand selector triggers the wrong re-renders.

When this isn’t enough

  • The bug is in how the backend produces the data the component renders (use mode 1 or 2).
  • The bug only appears when multiple components interact, or when real layout / pointer events are in play (use mode 4).

4. agent-browser (end-to-end)

Drives real Chrome via CDP against a real headless backend. Slow, context-heavy, but the only mode that catches layout regressions, drag-and-drop bugs, and full-stack flow problems.
./utils/build/agent-headless.sh &
sleep 8
agent-browser open "$(jq -r .browser_url .context/ports.json)"
agent-browser wait 3000
agent-browser snapshot -i              # interactive elements with refs
agent-browser click @e7                # click by ref
agent-browser screenshot --annotate    # labeled image for vision review
The harness uses virtual audio by default, so transport actually ticks under tests (clip-trigger timing, meter movement, playhead extrapolation all work without opening a real device). Full skill: /.claude/skills/e2e-headless/SKILL.md. Browser-driver reference: /.claude/skills/agent-browser/SKILL.md.

Per-worktree ports

agent-headless.sh derives a stable port offset per worktree so that concurrent worktrees rarely collide: hash($PWD) mod 89 + 10, which gives each worktree a deterministic slot in [10, 98]. Offsets 0–9 are reserved for explicit manual use (launch.sh, dev). Two worktrees can still hash to the same slot (~22% odds that at least one pair collides with 7 active worktrees, and higher beyond that); when that happens the script fails fast on a port-already-bound check. Pass --port-offset N to override.
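
The arithmetic behind the slot range and the collision odds can be sketched as follows. The hash here is a stand-in (32-bit FNV-1a); the function the script actually uses is not stated in this document, so `fnv1a` and `portOffset` are assumptions for illustration. The collision estimate is the standard birthday calculation over 89 equally likely slots.

```javascript
// Stand-in hash (32-bit FNV-1a). ASSUMPTION: the real script may hash differently.
function fnv1a(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return h;
}

// hash mod 89 lands in [0, 88]; adding 10 shifts it to [10, 98].
function portOffset(worktreePath) {
  return (fnv1a(worktreePath) % 89) + 10;
}

// Probability that at least two of `worktrees` paths share a slot.
function collisionProbability(worktrees, slots = 89) {
  let allDistinct = 1;
  for (let i = 0; i < worktrees; i++) allDistinct *= (slots - i) / slots;
  return 1 - allDistinct;
}

console.log(portOffset('/home/me/songbird-feature-a')); // some value in [10, 98]
console.log(collisionProbability(7)); // roughly 0.21
```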

When this is the right tool

  • The user explicitly asked for browser verification.
  • The change crosses multiple panels / organisms.
  • The change involves pointer-event drag, focus management, or layout.

When this is too much

  • The behavior is observable via mode 1, 2, or 3. Use those first — they’re faster, more reliable, and don’t burn agent context on screenshots.

Watching live

Start the agent-browser dashboard before driving the UI:
agent-browser dashboard start    # http://localhost:4848
Every session (named or default) streams its viewport, command activity feed, network, and console into the dashboard. Useful when a human wants to watch an agent drive a flow without sharing the browser window.

Choosing in practice — a worked example

You’re fixing a bug where audio clips don’t trigger when transport starts before the clip’s onset. Here’s how the modes stack up:
  • Mode 1 (Rust tests). Yes — write a test in songbird-engine that builds a session with a clip at bar 3, calls transport.play() from bar 1, drives process() for enough frames to reach bar 3, and asserts the clip emitted audio. This is where the bug lives, so this is where the regression test goes. Fast, deterministic, catches the regression.
  • Mode 2 (WebSocket tap). Optional — could write a tap that does the same flow over WS and asserts on transport:position + a meter event. But the bug isn’t on the boundary, so this would just be a second copy of the Rust test with more moving parts. Skip.
  • Mode 3 (Vitest). No — the React side renders whatever the engine emits; if the engine emits the wrong samples, no component test catches it. Skip.
  • Mode 4 (agent-browser). Only if you also changed the UI’s transport controls or playhead rendering. For an engine-only fix, this would just be a slow screenshot of a working DAW. Skip unless the user asked.
Result: one Rust test, ~50 lines, done in seconds. The other three modes would each add cost without adding signal.