
Node Graph Extension Plan

Living document: phases land incrementally. This top section is the at-a-glance status; phase sections below have the details.

Scope: extend the existing audio graph with texture (visual) signals, a control-rate tier for off-audio-thread compute, and protocol-adapter nodes (OSC, etc.) so that visualizers, Veo video, image-conditioned ML generation, and external control surfaces all live in one model.

Implementation strategy: incremental. Ship one port type and one use case at a time, not a big-bang rewrite.

Phase status

| # | Phase | Status | Notes |
|---|-------|--------|-------|
| 1 | Texture port + Veo viewer | ✅ shipped | VisualSlice + visual channel; Veo creates a VideoSource node wired to ViewportSink; framerate-aware drift controller; FE-side metadata probing (no hardcoded duration/fps). |
| 2 | Control-rate scheduler | ✅ shipped | ControlRateScheduler on DispatchContext; Veo LRO migrated; real ml.cancel_veo (cancel-by-nonce). No fixed tick rate: nodes self-pace. |
| 3 | Visualizer-as-node | ✅ partial | Kind + input-source picks persist via VisualizerSource node. Per-kind detailed settings + true audio_buffer port deferred. |
| 4 | Protocols channel (OSC + future) | ✅ scaffolding | One channel hosts all external-input adapters (OSC today; HID / gamepad / WebSocket / sensors as future adapter variants). UDP listener runs as a control-rate task; protocols:message events end-to-end. Routing UI deferred. |
| 5 | three.js compositor | ✅ shipped | Single shared `<GpuCanvas>` drives both visualizer and video modes. Video frames feed THREE.VideoTexture via VideoTextureGL. The `<video>` element stays in the DOM at `opacity-0 pointer-events-none` so the browser still does GPU decode + `rVFC`, but the visible output comes from the canvas, frame-synced through the GPU pipeline. (visibility:hidden would stop WebKit from compositing the layer, which suspends rVFC; opacity-0 keeps it firing.) |
| | Image → ML audio (Lyria/Magenta) | 🚫 out of scope | Architecture preserves room (see “Out of scope” below). |
| | Visual node editor (Max/PD) | 🚫 out of scope | Graph is real and addressable from code; UI a separate question. |
| | Distributed / networked nodes | 🚫 out of scope | Songbird is single-process today. |

Motivation

The immediate forcing functions are Veo video clips (need a way to play generated MP4s synced to the song) and visualizers (currently a hardcoded panel with no routing). Looking forward, Lyria/Magenta-style ML models that take image embeddings as input are coming, and the team wants OSC for external control surfaces. Each of these has been treated as a one-off so far. They share enough structure that a common abstraction pays off, and the existing audio graph is already richer than “audio + MIDI” — it just doesn’t span all the signal types we now need.

What the graph is today

The Songbird audio engine (rust/crates/engine/songbird-engine) runs a unified DAG on the audio thread. Nodes wrap a Processor trait (PluginChainProcessor or SingleProcessor). Connections carry typed signals between nodes:

| Signal type | How it flows today |
|-------------|--------------------|
| Audio buffers | Port 0 (main stereo), ports 1+ (sidechain/aux). Multi-channel. |
| MIDI events | Same graph as audio, midi_per_node parallel buffer. |
| Sidechain audio | First-class via SidechainPort + SidechainConnection. |
| Sends/returns | Explicit nodes + Connections between fader and bus inputs. |
| Hardware I/O | OutputRouting::HardwareOutput(n) per node. |
| Modulation | Lfo, EnvelopeFollower, StepSequencer → ModulationRouting → plugin params. Audio-thread, sample-accurate. |
| Automation | Bezier/step/exp curves on plugin params, audio-thread. |

What’s not in the graph:
  • Texture / visual signals — visualizers read from RT meter buffers via ad-hoc subscriptions; the Veo player is wired to a global videoFilePath.
  • Off-audio-thread compute — Lyria streaming, ML inference, audio separator, time-stretch run around the graph as standalone modules that pre/post-process clip data. They’re not nodes.
  • External I/O — OSC, network, gamepad/HID, sensors. Nothing.
  • A formal control-rate tier — modulation and automation are control-rate in spirit but run on the audio thread because they’re cheap pure DSP. ML/OSC/decode can’t.

What we add

Two architectural primitives, in this order:

1. Texture as a port type

A texture<2D, RGBA> signal. Same shape as TouchDesigner TOPs: any node that produces a per-frame texture, any node that consumes one. The Veo player and the visualizer both become texture-source nodes feeding a viewport sink (the Video tab and Visualizer tab respectively).

Why this is the right primitive: video, shaders, and image generators all output the same thing, a 2D RGBA buffer per frame. They differ only in how the pixels are sourced. Downstream (compositing, blending, sampling, displaying) doesn’t care. So the abstraction isn’t “video node,” it’s “node with a texture port.”

Sub-distinctions inside texture-producing nodes:
  • TextureSource (pure producer): video player, image, generator (noise, gradient), camera feed.
  • TextureTransform (consumer + producer): shader, blend, color correct, compositor.
  • TextureSink: viewport (Video tab, Visualizer tab), recorder, texture → audio (e.g. spectrogram-as-texture written back).
GPU resource ownership, scheduling, and presentation are the actual architectural lift. The first two nodes (VeoSource, ViewportSink) prove the abstraction; subsequent texture nodes (shader, compositor) should be 50–100 LOC each.

2. A control-rate scheduler tier

A second graph tick rate, running on a worker thread (not the audio thread). Targets:
  • 30–60 Hz for visual updates and OSC drain.
  • On-demand for ML inference (caller schedules a tick when input changes).
Communication into the audio-rate graph uses the existing SPSC ring infrastructure that Lyria already uses today.

Nodes that live on this tier:
  • ML inference (Lyria stream wrapper, image→audio gen, embedding extractors)
  • Veo decode driver (frame timing computed here, GPU upload happens on the render thread)
  • OSC adapter (UDP socket reader)
  • Network adapters (WebSocket, etc.)
  • Sensor adapters (HID, gyroscope)
  • Image processing (texture → embedding)
These nodes are free to allocate, await, hold locks, do file I/O, talk to the network. None of that is allowed on the audio thread.

Texture and control-rate are independent

Importantly: the texture port type and the control-rate tier are orthogonal. You can have audio-thread texture work (rare; e.g. an offline render) and you can have control-rate audio work (e.g. ML that emits audio buffers via the existing Lyria ring). We name them together because both are needed for the immediate use cases, but the design keeps them separable.

Cross-tier texture access (gesture / camera / vision use cases)

A class of future nodes needs to consume textures from the control-rate tier — not just produce them. The driving example is laptop / webcam input for gesture control: a Webcam node captures camera frames as a texture stream, a PoseExtractor (or hand-tracker, face-mesh, optical-flow) node runs ML inference on those frames to emit a vector or event<trigger> port, and downstream the existing modulation-routing layer maps those values to plugin parameters or triggers. The same pipeline shape covers depth sensors, AR markers, and any computer-vision-driven control. For this to work, the design needs three properties:
  1. Texture sources can run on the control-rate tier. Webcam capture isn’t part of the audio-rate world and shouldn’t pretend to be. The texture port type is independent of which tier produces it — a control-rate node can have a texture output port and present a new frame on each tick.
  2. Control-rate nodes can read texture ports. Pose detection, embedding extraction, and any vision ML need to sample texture data on the worker thread. This means the GPU/texture resource layer must allow reads from off-render-thread consumers (typically via readback into a CPU buffer; on browsers, by drawing the video element to a <canvas> and calling getImageData; on native, via wgpu staging buffers).
  3. Texture → vector/event/scalar conversions are first-class. Vision ML emits structured data (landmarks, embeddings, classifications), not pixels. Those become standard vector / event<trigger> / scalar ports — no new port type — so a hand-position landmark driving a synth filter is the same kind of wire as an LFO or an OSC value driving the same parameter.
Practical implication for the texture port design: don’t treat textures as render-thread-only opaque GPU handles. The port representation has to expose a CPU-readable view (or an explicit readback API) so that Phase-1 design doesn’t lock out gesture/vision in subsequent phases. Readback is expensive — that’s fine, control-rate can absorb it. Worth noting: the immediate Veo + visualizer use cases never need texture readback (they go GPU → display sink directly), so this constraint costs us nothing on day one. But getting the abstraction right early means webcam/gesture is “another control-rate node” later, not an architectural retrofit.

Why OSC needs the control-rate tier specifically

This came up in design discussion and is worth pinning down because it’s the cleanest illustration of the tier split. OSC arrives over UDP at unpredictable rates and times — a control surface might burst 30 messages in 5 ms during a fader sweep, an iPad patch might send at 60 Hz, a sensor might send at 100 Hz. Two constraints force OSC off the audio thread:
  1. Socket I/O is forbidden on the audio thread. recv() is a syscall and can block on the network stack. Even non-blocking reads require system calls and kernel buffer interaction. The audio thread’s ~5 ms block budget can’t absorb a syscall storm.
  2. OSC packet arrival isn’t aligned with audio blocks. Even if reads were free, you’d still have to buffer arbitrary-size bursts and drain them at some rate that isn’t necessarily a multiple of the audio block rate.
So the OSC node must run on a worker thread. The control-rate scheduler is just the formalized version of “drain the OSC socket at ~120–240 Hz, decode messages, emit them as typed events into the graph.” The rate question is real but secondary; the primary reason is “you can’t do socket I/O on the audio thread, full stop.”

Once the messages are decoded, they cross into the audio-rate graph through the same SPSC ring pattern Lyria uses. From the audio thread’s perspective an OSC-sourced control value is indistinguishable from an LFO-sourced one: both arrive as scalar samples on a modulation port.
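The hand-off can be pictured as a bounded single-producer/single-consumer ring that the audio side drains once per block. The sketch below is a single-threaded TypeScript illustration only: the engine is Rust and its ring uses atomics, and the names ControlRing, push, and drainLatest are hypothetical. It also assumes a "latest value wins" policy, which the open questions further down leave undecided.

```typescript
// Hypothetical sketch of a bounded SPSC ring: the control-rate side pushes
// decoded values, the audio side drains once per block. Names are illustrative.
class ControlRing {
  private readonly slots: number[];
  private head = 0; // next slot to read (consumer-owned)
  private tail = 0; // next slot to write (producer-owned)

  constructor(capacity: number) {
    // One slot stays empty so that head === tail always means "empty".
    this.slots = new Array<number>(capacity + 1).fill(0);
  }

  /** Producer side: returns false (drops the value) when the ring is full. */
  push(value: number): boolean {
    const next = (this.tail + 1) % this.slots.length;
    if (next === this.head) return false;
    this.slots[this.tail] = value;
    this.tail = next;
    return true;
  }

  /** Consumer side: drains everything queued and keeps the newest value. */
  drainLatest(previous: number): number {
    let latest = previous;
    while (this.head !== this.tail) {
      latest = this.slots[this.head] ?? latest;
      this.head = (this.head + 1) % this.slots.length;
    }
    return latest;
  }
}

const ring = new ControlRing(4);
// A fader sweep bursts more messages than the ring can hold.
const accepted = [0.1, 0.2, 0.3, 0.4, 0.5].map((v) => ring.push(v));
// The audio block picks up whatever arrived since the previous block.
const blockValue = ring.drainLatest(0);
```

Dropping on overflow keeps the producer from ever blocking; whether dropped control values are acceptable depends on the time-semantics question, so treat that choice as an assumption of the sketch.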

Port type discipline

A small, semantic set of port types beats a per-protocol explosion.

| Port type | Carries |
|-----------|---------|
| audio_buffer | Per-block PCM samples (existing). |
| midi_event | Per-block MIDI events (existing). |
| texture | Per-frame 2D RGBA (new). |
| scalar | Float: automation, mod, knob position. |
| vector | Fixed-size float array: embeddings, MFCC, EQ curves. |
| event<note> | Discrete pitched events (any source). |
| event<trigger> | Discrete fire-and-forget events. |
| string | Prompts, labels, OSC addresses, chat content. |

OSC, MIDI, HID, WebSocket are adapter nodes that translate protocol → these typed ports. Anything consuming event<note> doesn’t know or care if the source is a hardware controller, an OSC /synth/note message, or a step sequencer. This keeps the type system small and protocols pluggable.
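One way to picture the adapter boundary is a tagged union of port values plus a function that maps a decoded protocol message onto it. This is a hypothetical TypeScript sketch: the field names, the OscMessage shape, and the address-to-port mapping are assumptions made for illustration, and the block-based types (audio_buffer, midi_event, texture) are left out.

```typescript
// Hypothetical TypeScript mirror of the port-type table; field names are
// assumptions made for illustration, not the engine's actual Rust types.
type PortValue =
  | { type: "scalar"; value: number }
  | { type: "vector"; values: number[] }
  | { type: "event<note>"; pitch: number; velocity: number }
  | { type: "event<trigger>" }
  | { type: "string"; text: string };

// A decoded OSC message, reduced to what this sketch needs.
interface OscMessage {
  address: string;
  args: Array<number | string>;
}

// Adapter: protocol-specific message in, protocol-agnostic port value out.
// The address-to-port mapping here is invented for the example.
function oscToPort(msg: OscMessage): PortValue | null {
  const [a, b] = msg.args;
  if (msg.address === "/synth/note" && typeof a === "number" && typeof b === "number") {
    return { type: "event<note>", pitch: a, velocity: b };
  }
  if (typeof a === "number") return { type: "scalar", value: a };
  if (typeof a === "string") return { type: "string", text: a };
  return null; // no arguments: nothing to emit in this sketch
}

// A consumer only sees the port type, never the protocol it came from.
function describePort(value: PortValue): string {
  switch (value.type) {
    case "event<note>":
      return `note ${value.pitch} @ ${value.velocity}`;
    case "scalar":
      return `scalar ${value.value}`;
    case "vector":
      return `vector[${value.values.length}]`;
    case "event<trigger>":
      return "trigger";
    case "string":
      return `string "${value.text}"`;
  }
}
```

A MIDI or HID adapter would return the same PortValue variants, so describePort (standing in for any downstream consumer) needs no changes when a protocol is added.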

Phasing

Each phase ships independently and proves the abstraction before the next is built.

Phase 1 — Texture port type + Veo viewer (validates texture) ✅

Status: shipped.

What landed:
  • VisualSlice in songbird-state with VisualNode / VisualEdge / VisualNodeKind { VideoSource, ViewportSink } / VisualPortKind { Texture }. Side-routed persistence to daw.visual.json mirroring the mixer/plugin/ai/sections pattern (#[serde(skip)] in StateManager).
  • New visual sync channel under songbird-sync/src/channels/visual/ with add_video_source, add_viewport_sink, connect, disconnect, remove_node, move_node, update_video_source_metadata, get_state commands and a single visual:state event (full slice broadcast). All mutations are idempotent where it matters (sinks unique per target, edges dedupe on same from/to).
  • Veo handler now creates a VideoSource node with a start_beat anchor + auto-connects to the singleton video_tab ViewportSink, then pushes visual:state before ml:veo_complete so the FE Zustand mirror is hot when the panel sees completion.
  • FE: useVisualStore (Zustand mirror), visual channel registered in wiring.ts, VideoPlayer reads a VideoSource node from the graph and falls back to the legacy videoFilePath for drag-drop.
  • Drift controller is now framerate-aware: FRAME_THRESHOLD = max(33 ms, 1.5 / fps), with the second term in seconds. The 24 fps Veo source no longer stutters because the threshold (62.5 ms) is wider than one frame interval (about 41.7 ms). Audio position is also clamped to [0, durationSeconds] so 8 s clips in 30 s songs don’t constantly hard-seek past their own end.
  • No hardcoded source metadata: Veo creates the node with zero values for duration / framerate / dimensions. The FE probes real values via <video>.loadedmetadata (duration + dimensions) and requestVideoFrameCallback interval averaging (framerate), then patches the node via the new visual.update_video_source_metadata command. Same path covers any container the browser plays — drag-dropped arbitrary videos populate the same fields.
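The drift threshold, the position clamp, and the framerate probe described in the list above reduce to three small pure functions. This is a sketch under assumptions: the function names are hypothetical, the 30 fps fallback for an unprobed source is invented here, and only the formulas come from the text.

```typescript
// Sketch of the drift-controller inputs described above. Function names and
// the 30 fps fallback are assumptions; only the formulas come from the text.

/** Average frame rate from requestVideoFrameCallback timestamps (in ms). */
function estimateFps(timestampsMs: number[]): number {
  if (timestampsMs.length < 2) return 0; // not enough samples yet
  const first = timestampsMs[0] ?? 0;
  const last = timestampsMs[timestampsMs.length - 1] ?? 0;
  // The mean of consecutive intervals equals the total span over the count.
  const meanIntervalMs = (last - first) / (timestampsMs.length - 1);
  return meanIntervalMs > 0 ? 1000 / meanIntervalMs : 0;
}

/** FRAME_THRESHOLD = max(33 ms, 1.5 / fps), expressed in seconds. */
function frameThresholdSeconds(fps: number): number {
  const safeFps = fps > 0 ? fps : 30; // metadata not probed yet
  return Math.max(0.033, 1.5 / safeFps);
}

/** Clamp the audio position into the clip so short clips do not hard-seek. */
function clampToClip(audioSeconds: number, durationSeconds: number): number {
  return Math.min(Math.max(audioSeconds, 0), durationSeconds);
}
```

At 24 fps the threshold works out to 1.5 / 24 = 0.0625 s, wider than the 1 / 24 ≈ 0.0417 s frame interval; at 60 fps the 33 ms floor takes over. An audio position of 12 s against an 8 s clip clamps to 8 s.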
Open follow-ups carried into later phases:
  • Multi-clip / playhead-based source selection. Phase 1 picks the latest VideoSource by Map insertion order; Phase 3+ should pick the active clip whose [startBeat, startBeat + duration] window contains the playhead.
  • Snap start_beat to the live transport position when Veo generation is dispatched. Today it defaults to 0 because transport.position_beats lives on a runtime atomic, not the persisted slice. Wiring it through the dispatch context is a small follow-up, deferred so it doesn’t block Phase 2.
  • Render video clips as blocks in the Arrangement view (so users can drag/move/delete them without the dispatch CLI).
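The playhead-based selection from the first follow-up could look roughly like the sketch below. The VideoClip shape is hypothetical, durationBeats assumes the probed duration in seconds has already been converted using the song tempo, the window is treated as half-open so adjacent clips never both match, and preferring the later-starting clip on overlap is an assumed tie-break.

```typescript
// Hypothetical shape of a VideoSource node; durationBeats assumes the probed
// duration in seconds has already been converted using the song tempo.
interface VideoClip {
  id: string;
  startBeat: number;
  durationBeats: number;
}

/** Returns the clip whose window contains the playhead, preferring the latest start. */
function activeClip(clips: VideoClip[], playheadBeat: number): VideoClip | undefined {
  let best: VideoClip | undefined;
  for (const clip of clips) {
    const end = clip.startBeat + clip.durationBeats;
    // Half-open window [startBeat, end): an assumption of this sketch.
    const contains = playheadBeat >= clip.startBeat && playheadBeat < end;
    if (contains && (best === undefined || clip.startBeat >= best.startBeat)) {
      best = clip;
    }
  }
  return best;
}
```

Returning undefined when no window contains the playhead leaves it to the caller to decide whether the viewport shows nothing or holds the last frame.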

Phase 2 — Control-rate scheduler (validates the second tier) ✅

Status: shipped.

What landed:
  • songbird-sync/src/control_rate.rs: ControlRateScheduler is a managed pool of long-running async tasks keyed by TaskKey, built on a shared multi-threaded tokio runtime (one worker per CPU). Cheap to clone (internals are Arc-wrapped), embedded in DispatchContext so every dispatch handler can reach it.
  • API: register(key, future) → key, cancel(&key) → bool, is_registered, keys, forget. Registration is last-writer-wins — re-registering the same key aborts the prior task. Idempotent semantics for “ensure exactly one X is running” UI patterns.
  • Veo’s LRO poll migrated: the dispatch handler now calls ctx.control_rate.register(TaskKey::new(format!("veo:{nonce}")), ...) instead of fire-and-forget spawn_async. ml.cancel_veo actually cancels the in-flight task now (was a no-op stub before) — by nonce when the FE supplies one, otherwise cancels every veo:*-prefixed task as a safety net.
  • Unit tests cover register_and_cancel (in-flight task is aborted before it can complete) and register_replaces_existing_key (re-registering aborts the prior task).
Design note — no fixed tick rate. The original plan called for a 60 Hz tick loop with per-node rate overrides. Concrete experience with the node kinds we actually want (Veo LRO, OSC drain, ML inference, webcam capture) showed each one knows its own pacing better than a uniform tick: Veo’s LRO polls every 5 s, OSC drains on incoming UDP, ML on input change. So the scheduler has no tick; nodes implement their own tokio::time::sleep, tokio::select!, or stream loops. The scheduler is just “registry of long-lived async tasks with cancellation handles.” Simpler and more flexible than a fixed cadence. Future nodes (OSC, ML, webcam) all slot in by registering an async function with the scheduler — same pattern Veo now uses. SPSC ring marshalling into the audio thread is unchanged from how Lyria already does it; no new infrastructure.
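The registry semantics (last-writer-wins registration, cancel by key, cancel by prefix as a safety net) can be modelled in a few lines. This is a hypothetical TypeScript analogue, not the Rust API: the real scheduler aborts tokio tasks, whereas here a CancelToken flag stands in for the abort and a callback stands in for the future.

```typescript
// Hypothetical TypeScript analogue of the registry semantics described above.
// The real scheduler is Rust on tokio; CancelToken stands in for task abort.
class CancelToken {
  cancelled = false;
}

class TaskRegistry {
  private readonly tasks = new Map<string, CancelToken>();

  /** Last-writer-wins: registering an existing key cancels the prior task. */
  register(key: string, start: (token: CancelToken) => void): string {
    this.cancel(key);
    const token = new CancelToken();
    this.tasks.set(key, token);
    start(token); // the task paces itself and checks token.cancelled
    return key;
  }

  /** Returns true when a task was registered under the key. */
  cancel(key: string): boolean {
    const token = this.tasks.get(key);
    if (token === undefined) return false;
    token.cancelled = true;
    this.tasks.delete(key);
    return true;
  }

  /** Safety net: cancel every task whose key starts with the prefix. */
  cancelPrefix(prefix: string): number {
    const matching = Array.from(this.tasks.keys()).filter((k) => k.startsWith(prefix));
    for (const key of matching) this.cancel(key);
    return matching.length;
  }

  isRegistered(key: string): boolean {
    return this.tasks.has(key);
  }
}

const registry = new TaskRegistry();
const tokens: CancelToken[] = [];
registry.register("veo:abc", (t) => { tokens.push(t); });
registry.register("veo:abc", (t) => { tokens.push(t); }); // replaces the first
registry.register("osc:8000", (t) => { tokens.push(t); });
const cancelledVeo = registry.cancelPrefix("veo:");
```

After this runs, both veo tokens are cancelled (the first by replacement, the second by the prefix sweep) and the osc task is untouched, which is the behaviour the two unit tests in the list above describe.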

Phase 3 — Visualizer-as-node (validates cross-tier composition) ✅ (partial)

Status: kind + input-source persistence shipped. Per-kind detailed settings and a true audio_buffer port type are deferred.

What landed:
  • New VisualizerSource { panel_id, kind, input_source } variant in VisualNodeKind. One node per panel id (the only panel today is "main"); future per-track inline visualizers mint their own ids.
  • visual.upsert_visualizer_source command — singleton-per-panel, last-writer-wins, returns the node id. Idempotent so re-mounts don’t leak nodes.
  • The Visualizer panel hydrates visKind and inputSource from the graph on first availability and writes them back on every change. Hydration guard prevents the writeback from clobbering the initial pull and from re-hydrating after the user edits locally.
  • input_source is round-tripped as JSON because VisInputSource is a structural FE-side type with three variants (main, track, input). Carrying it as opaque JSON avoids re-deriving it Rust-side for a feature that never reads it Rust-side.
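The hydration guard can be read as a one-way latch: the first graph state is pulled in without a writeback, and from then on only local edits flow to the graph. The sketch below is hypothetical. The class and field names are invented, the kind strings are placeholders, and the policy that an edit made before the first graph state also closes the latch is an assumption.

```typescript
// Hypothetical sketch of the hydration guard; names and the exact policy for
// edits made before the first graph state arrives are assumptions.
interface VisualizerPicks {
  kind: string;
  inputSource: string; // opaque JSON, as carried on the node
}

class HydrationGuard {
  private hydrated = false;
  readonly writes: VisualizerPicks[] = []; // stands in for upsert commands

  constructor(public picks: VisualizerPicks) {}

  /** Called whenever the graph mirror updates. Only the first pull is applied. */
  onGraphState(fromGraph: VisualizerPicks | undefined): void {
    if (this.hydrated || fromGraph === undefined) return;
    this.picks = { ...fromGraph };
    this.hydrated = true; // no writeback: the values just came from the graph
  }

  /** Called on user edits. Local edits win from here on and are written back. */
  onUserEdit(next: VisualizerPicks): void {
    this.picks = next;
    this.hydrated = true; // later graph states must not overwrite the edit
    this.writes.push(next);
  }
}
```

Because mount-time defaults never pass through onUserEdit, they are never written back, so they cannot clobber the persisted node before the first pull completes.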
What’s deferred:
  • Per-kind settings (routing matrices, base params). The Visualizer panel’s nearfieldRouting, scatterBase, etc. stay React-side. Migrating these is a bigger refactor with no behavioral win for v1 (they reset to defaults on reload today, which is acceptable). When persistence becomes a requirement, model them as per-node settings: serde_json::Value and patch via a new visual.update_visualizer_source_settings command.
  • AudioTap as a separate node. The plan called for splitting VisualizerSource into AudioTap → VisualizerSource over an audio_buffer port. We held input_source as a field on VisualizerSource for now: there is no FE-side or BE-side runtime use of an audio_buffer port (visualizers read getRtBuffer() directly), so introducing it as state today buys nothing but bookkeeping. When ML/embedding nodes need real audio_buffer routing, model AudioTap then.
  • Visualizers don’t actually produce textures yet. They render to canvases owned by the panel, not to a ViewportSink consuming a texture port. Phase 1 + Phase 3 share the design intent; the full picture (every visualizer kind is a control-rate node producing texture) is a follow-up that lands once we want multi-output (recording the visualizer to disk, layering with Veo, etc.).
Done-when (partial): the Visualizer’s kind + input-source picks are durable state on the visual graph. The “thin UI on top of which audio tap is wired” framing is true in spirit (state lives on the graph), but multi-tap routing is not wired through yet.

Phase 4 — Protocols channel + OSC adapter ✅ (scaffolding)

Status: scaffolding shipped. UDP listener works end-to-end via the control-rate scheduler. Routing UI deferred.

Channel naming. Songbird’s existing channels map to concepts (mixer, clip, transport, ml, recording), not technologies, so OSC shouldn’t be its own channel any more than “VST3” or “ReWire” should be. The shared concept is “protocol adapters that translate external traffic into Songbird’s typed ports.” OSC, HID, gamepad, WebSocket, sensors all fit, share the same lifecycle (start a listener, decode incoming traffic, emit typed events), and share the same downstream consumer (a routing UI that maps incoming addresses → Songbird parameters). One protocols channel hosts them all.

What landed:
  • rosc 0.10 added to workspace deps; pure-Rust OSC parser.
  • New protocols sync channel under songbird-sync/src/channels/protocols/:
    • AdapterConfig enum — one variant per protocol (AdapterConfig::Osc { port } today; Hid { … }, Websocket { … }, etc. land as new arms). Statically exhaustive in dispatch; FE wire-encodes as JSON so adding an adapter doesn’t require a TS migration.
    • protocols/adapters/ subdirectory — one file per protocol’s runtime. Adding a new adapter is one new file + one new arm.
    • protocols.start_listener { config } — config is the JSON-serialized AdapterConfig. Registers a control-rate task keyed protocols:listener:<listener_id> where listener_id is derived deterministically from the config ("osc:8000" etc.) so registration is idempotent.
    • protocols.stop_listener { listener_id } — real cancellation via Phase 2’s scheduler.
    • protocols.list_listeners — diagnostics.
    • protocols:message { protocol, listenerId, source, payload } event — protocol lets subscribers route by adapter without parsing payload.
    • protocols:listener_status { protocol, listenerId, listening, error? } event — bind/unbind/error lifecycle.
  • OSC adapter (protocols/adapters/osc.rs) projects every standard OscType onto JSON: primitives (Int / Float / String / Long / Double / Bool / Char) become primitive JSON; structured types (Time / Color / Midi / Blob / Array) use a {type, ...} envelope.
  • FE protocols channel is registered in wiring.ts with a diagnostic subscriber that logs to the sync:protocols console namespace.
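The OSC-to-JSON projection from the list above can be sketched as a switch over argument variants. This is a hypothetical TypeScript model of what the Rust adapter does: the variant names follow the text, but the envelope field names (bytes, items, r/g/b/a) are assumptions, and the Time and Midi variants are omitted for brevity.

```typescript
// Hypothetical TypeScript model of the projection; the variant names follow
// the list above, but the envelope field names are assumptions.
type OscArg =
  | { kind: "Int" | "Float" | "Long" | "Double"; value: number }
  | { kind: "String" | "Char"; value: string }
  | { kind: "Bool"; value: boolean }
  | { kind: "Blob"; bytes: number[] }
  | { kind: "Color"; r: number; g: number; b: number; a: number }
  | { kind: "Array"; items: OscArg[] };

type Json = number | string | boolean | { [key: string]: Json } | Json[];

function projectArg(arg: OscArg): Json {
  switch (arg.kind) {
    case "Int":
    case "Float":
    case "Long":
    case "Double":
    case "String":
    case "Char":
    case "Bool":
      return arg.value; // primitives map straight onto JSON primitives
    case "Blob":
      return { type: "blob", bytes: arg.bytes };
    case "Color":
      return { type: "color", r: arg.r, g: arg.g, b: arg.b, a: arg.a };
    case "Array":
      return { type: "array", items: arg.items.map(projectArg) };
  }
}
```

The {type, ...} envelope lets a subscriber tell a structured value from a primitive without knowing the OSC type tags. Note that a 64-bit Long does not fit losslessly in a JSON number, which this sketch ignores.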
What’s deferred (the actual user-facing feature):
  • Routing UI. Map /track1/volume → fader on track 1, with ranges, smoothing, learn-mode (move a controller, Songbird captures the address). This is its own substantial UX surface: picking the routing model (per-track? per-param? per-route persistence?) warrants its own design pass.
  • Address namespacing / templates. TouchOSC-style “layouts” that pre-define a controller’s expected addresses. Out of scope for v1.
  • Outbound OSC (sending values back to the controller for motorized faders / LED feedback). Same channel, mirror command shape — left for whoever builds the routing UI.
  • Other adapters (HID, gamepad, WebSocket, sensors). Same control-rate-task pattern; nothing structurally new. Add them when you have a use case driving the design.
Done-when (scaffolding): calling protocols.start_listener { config: '{"protocol":"osc","port":8000}' } and sending OSC messages from any controller (TouchOSC, Max, Python’s python-osc, etc.) produces protocols:message events visible in the DevTools console. The architectural pattern (UDP → control-rate node → typed event → FE channel) is validated end-to-end. Building on this to wire concrete parameters is straightforward; whoever ships the routing UI inherits a working signal pipeline.
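The path from the JSON config string to the scheduler key can be sketched as follows. The function names are hypothetical and the validation rules are assumptions; only the config shape, the "osc:8000" id format, and the protocols:listener: key prefix come from the text.

```typescript
// Hypothetical mirror of AdapterConfig on the FE side. Only the Osc variant
// exists today; the "osc:8000" id format comes from the text above.
type AdapterConfig = { protocol: "osc"; port: number };

function parseConfig(json: string): AdapterConfig {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) throw new Error("config must be an object");
  const { protocol, port } = raw as { protocol?: unknown; port?: unknown };
  if (protocol !== "osc") throw new Error(`unknown protocol: ${String(protocol)}`);
  if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error("port must be an integer in 1..65535");
  }
  return { protocol, port };
}

/** Same config in, same id out, which is what makes registration idempotent. */
function listenerId(config: AdapterConfig): string {
  return `${config.protocol}:${config.port}`;
}

/** Key under which the listener task is registered with the scheduler. */
function taskKey(config: AdapterConfig): string {
  return `protocols:listener:${listenerId(config)}`;
}
```

Calling taskKey(parseConfig('{"protocol":"osc","port":8000}')) yields protocols:listener:osc:8000. Combined with last-writer-wins registration, starting the same listener twice replaces the task instead of binding the port a second time.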

Phase 5 — three.js compositor ✅ (shipped)

Status: the Video tab and the existing GL visualizers share a single <GpuCanvas> at the Visualizer panel level. Switching between mode === 'visualizer' and mode === 'video' doesn’t churn GL/GPU contexts: the canvas stays mounted, only its content swaps. Video display goes through THREE.VideoTexture on a fullscreen quad; the bare <video> is kept in the DOM but rendered invisible.

What landed:
  • VideoTextureGL.tsx (r3f component) wraps an HTMLVideoElement in THREE.VideoTexture, renders on a fullscreen-aligned plane with object-contain semantics. Always positions the plane (canvas-size fallback before the video has decoded its first frame) so it never renders as a 1px dot at origin.
  • A module-scope useActiveVideoElementStore (Zustand) holds the currently-mounted <video> element. VideoPlayer writes to it via a stable useCallback ref-callback; the parent’s GpuCanvas reads from it via a small <VideoTextureSlot> helper. No prop drilling, no per-render ref-callback churn (which earlier caused an infinite render loop).
  • The <video> is rendered at opacity-0 pointer-events-none so the browser still does GPU decode (VideoToolbox / equivalent) and requestVideoFrameCallback keeps firing on the same element the drift controller drives. Don’t switch to visibility:hidden: WebKit stops compositing hidden video layers, which suspends rVFC and leaves the canvas black.
  • The shared GpuCanvas mounts whenever mode === 'visualizer' || mode === 'video' — single context for the lifetime of the Visualizer panel.
  • Drift controller tuned for canvas display: ±1 % max playbackRate nudge (was ±5 %) and a 2-frame deadband. The canvas presents every frame straight from the texture, so rate changes that were masked by the browser’s compositor are now directly visible.
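The tuned nudge policy from the last item (±1 % cap, 2-frame deadband) can be expressed as a pure function of the measured drift. This is a sketch under assumptions: the proportional gain is invented, the sign convention (positive drift means the video is behind the audio) is assumed, and the real controller's hard-seek path is not modelled.

```typescript
// Sketch of the rate-nudge policy described above. The proportional gain and
// the sign convention (positive drift = video is behind audio) are assumptions.
const MAX_NUDGE = 0.01; // ±1 % playbackRate
const DEADBAND_FRAMES = 2;
const GAIN_PER_SECOND = 0.1; // hypothetical: rate change per second of drift

function playbackRateFor(driftSeconds: number, fps: number): number {
  const deadbandSeconds = DEADBAND_FRAMES / fps;
  if (Math.abs(driftSeconds) <= deadbandSeconds) return 1; // inside deadband
  const nudge = driftSeconds * GAIN_PER_SECOND;
  return 1 + Math.min(Math.max(nudge, -MAX_NUDGE), MAX_NUDGE);
}
```

At 24 fps the deadband is 2 / 24 ≈ 83 ms, so a 50 ms drift leaves playbackRate at 1, while a 500 ms drift saturates at the 1 % cap in whichever direction closes the gap.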
Fallback: when glRenderer is toggled off (debug setting), no GpuCanvas mounts in video mode and the <video> element flips to visible so the tab isn’t blank.

Path to compositor nodes from here:
  1. Pipeline executor (~200 lines): walks the visual graph in topological order, allocates THREE.WebGLRenderTarget per intermediate node, binds inputs, dispatches passes. Lets a Veo → ColorCorrect → Composite → ViewportSink chain “just work.”
  2. First shader node (~50 lines): THREE.ShaderMaterial wrapping a fragment shader + uniforms; input/output texture ports.
  3. Visualizers as overlays: existing GL visualizers can layer over Veo by rendering in the same scene as VideoTextureGL — they already use the same canvas now, so it’s a structural change inside one component, not cross-component plumbing.
Why “seed”: this phase lands the substrate, the right rendering foundation. The pipeline executor and the concrete shader/compositor nodes (the three steps listed above) are still ahead.
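The ordering step of the pipeline executor is a standard topological sort. The sketch below uses Kahn's algorithm over node ids; the Edge shape and node names are hypothetical, and render-target allocation, input binding, and draw dispatch are deliberately left out.

```typescript
// Hypothetical sketch of the executor's ordering step (Kahn's algorithm).
// Node ids and the edge shape are illustrative; render-target allocation and
// draw dispatch are left out.
interface Edge {
  from: string;
  to: string;
}

function topologicalOrder(nodes: string[], edges: Edge[]): string[] {
  const indegree = new Map<string, number>();
  const outgoing = new Map<string, string[]>();
  for (const id of nodes) {
    indegree.set(id, 0);
    outgoing.set(id, []);
  }
  for (const { from, to } of edges) {
    const targets = outgoing.get(from);
    const count = indegree.get(to);
    if (targets === undefined || count === undefined) {
      throw new Error(`edge references unknown node: ${from} -> ${to}`);
    }
    targets.push(to);
    indegree.set(to, count + 1);
  }
  // Sources (no inputs) are ready immediately.
  const ready = nodes.filter((id) => indegree.get(id) === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift() as string;
    order.push(id);
    for (const next of outgoing.get(id) ?? []) {
      const remaining = (indegree.get(next) ?? 0) - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next); // all inputs rendered
    }
  }
  if (order.length !== nodes.length) throw new Error("visual graph has a cycle");
  return order;
}
```

For a Veo → ColorCorrect → Composite → ViewportSink chain with a visualizer also feeding Composite, every node is visited only after all of its inputs, so each pass can sample render targets that have already been drawn. A feedback-buffer node would need special handling, because a true cycle is rejected here.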
Done-when (seed): Veo video plays via three.js VideoTexture in the Video tab with audio sync intact, and adding a one-shot color filter (sepia, grayscale, threshold) on top is genuinely 50 lines.

Open questions for the next iteration:
  • Do the visualizer tab’s <GpuCanvas> and the Video tab’s <GpuCanvas> share a renderer / scene, or stay separate? Sharing buys natural overlay (visualizer-on-top-of-video) but the existing visualizer infrastructure is already mounted per panel. Probably keep separate scenes per panel and let the pipeline executor stitch when needed.
  • WebGPU vs WebGL2 backend: three.js auto-picks WebGL2; opting into the WebGPU renderer is a one-line change but only worth it once a node demands compute shaders / multi-target rendering that WebGL2 can’t handle.
Why three.js, not canvas2D or raw WebGPU. Canvas2D gives basic layered drawing (drawImage, blend modes) but no programmable shaders, so it hits a ceiling fast for compositing video + visualizers + shader effects. Raw WebGPU is the right substrate but requires building a full pipeline framework. Three.js is already in the codebase (react_ui/src/lib/gpu/GpuCanvas.tsx and the existing SpectrumGL / WaveformGL / CosmosGL / FlowShieldGL / LightCubeGL / GeometricGL visualizers), abstracts WebGPU vs WebGL2, and gives you VideoTexture, render targets, ShaderMaterial, and a scene graph out of the box. The compositor we’d build on raw WebGPU is roughly what three.js already is. So Phase 5 lands on three.js, the same substrate the existing GL visualizers use.

Scope:
  • VideoSource → three.js VideoTexture. The Veo node still sources its frames from a hidden <video> element (browser does the GPU decode), but the texture lives in three.js, not in a DOM <video> rendered to the page. Audio sync stays the same — requestVideoFrameCallback is on the underlying <video>, the drift controller already uses it.
  • ViewportSink → <GpuCanvas> render target. Replaces the current bare <video> element in the Video tab with a three.js scene that renders the VideoTexture. The same GpuCanvas already hosts the visualizers, so sharing the canvas means the existing visualizer scenes can layer over Veo without glue code.
  • Pipeline executor (~200 lines): walks the visual graph in topological order, allocates render targets, binds input textures to materials, dispatches draws. This is the piece that makes shader/compositor nodes “just work” once added.
  • Future nodes become trivial: a shader node is a THREE.ShaderMaterial wrapping a fragment shader; a blend node is a two-input material; a feedback-buffer node is a render target ping-pong. Each is ~50 lines.
Done-when: generated Veo video plays in the Video tab via a three.js VideoTexture, audio sync intact (drift controller still hooks the underlying <video> element), and a follow-up “add a colour-correct shader node” exercise is genuinely 50 lines. The existing GpuCanvas-based visualizers can render in the same scene as the Veo video without bespoke code.

Open questions for implementation time:
  • Should the visualizer scene and the Veo scene share one GpuCanvas or be separate three.js scenes composited at the viewport level? Sharing means cheaper GPU upload and natural layering; separating means cleaner per-node ownership. Probably separate scenes that the pipeline executor stitches into a composite.
  • Does the FE-side audio drift controller need to move into the three.js render loop, or stay in its current requestVideoFrameCallback + subscribeRtBuffer form? Today’s form works fine and is decoupled from rendering; keep as-is.
  • WebGPU backend selection: three.js auto-picks WebGL2 today; manually opting into the WebGPU renderer is a one-line change but only worth it if a node turns out to need WebGPU-only features (compute shaders, etc.). Defer until a concrete node demands it.

Out of scope (for now)

  • Image → ML audio generation. ImageSource/Webcam → TextureToEmbedding → LyriaInput/Magenta is exactly the composition this graph is designed to enable, but it’s deferred: it only ships once Phase 1 (texture) and Phase 2 (control-rate tier) prove themselves on simpler ground. The architecture explicitly preserves room for it; we’re not building it yet.
  • Replacing the audio-thread DAG. The existing track/insert/send/master topology stays as-is; new node types extend it, they don’t replace it.
  • A user-facing visual node editor (Max/PD style). The graph is real and addressable from code; whether it gets a UI is a separate question.
  • Multi-output sinks (e.g. recording the texture stream to disk). Easy to add once the abstraction holds, but not part of v1.
  • Distributed / networked nodes. Songbird is a single-process app today.

Open questions

  • GPU resource ownership. Does each texture-producing node own its output texture, or does the scheduler pool textures? The latter scales better but requires reference counting. Constraint from the vision/gesture use case: textures must be readable from the control-rate tier, not just the render thread — so whatever ownership model wins, it has to expose a readback path.
  • Time semantics for the control-rate tier. Does a control-rate node tick produce a value tagged with a future audio-block timestamp (sample-accurate scheduling), or is it just “latest value wins”? Modulation is sample-accurate today; OSC probably can’t be.
  • Persistence. Do graph topology changes go through the existing sync engine like everything else? (Probably yes — it’s just state.) Per-node settings serialize to the project file.
  • Threading model. One worker thread for control-rate, or a pool? Network and ML probably want separate threads to avoid head-of-line blocking.

References

  • Existing graph: rust/crates/engine/songbird-engine/src/
  • Existing modulation/automation: same crate
  • Lyria off-graph compute pattern (the closest thing to a control-rate node today): rust/crates/integration/songbird-lyria/
  • Veo (this plan’s forcing function): rust/crates/integration/songbird-veo/
  • TouchDesigner TOP/CHOP architecture is the design north star for the texture port type and node-graph composition discipline.