MITUS

Stream viewer + agent — architecture

GOAL & WALKTHROUGH

Mitus records a remote desktop, transcribes its audio, extracts scene-change frames, and exposes both to an LLM agent for ad-hoc Q&A.

What it is

A two-machine setup: the sender (a Wayland desktop) captures screen + audio and ships an encoded stream to the receiver. The receiver records to disk, runs scene detection on the live feed to extract per-event JPEG frames, transcribes the audio, and presents the result in a GTK4 GUI. The GUI doubles as an LLM client: select a frame or transcript span, hit Enter, and an agent (Claude SDK or any OpenAI-compatible endpoint) answers using the selected media as context.

Why the split

Capture wants Wayland + a VAAPI-friendly GPU; analysis wants CUDA for both faster-whisper and ffmpeg scene detection. Different machines, different drivers — the network stream is the seam. The receiver also runs the GUI because the recordings are stored locally and the agent talks to large frames as files, not blobs over a wire.

Two transport modes

Both modes produce the same on-disk session layout (data/<session_id>/stream/, frames/, audio/, transcript.json) so the GUI doesn't care which path the bytes took. The choice is a CLI flag.

  • Python (default). Sender is a bash watchdog wrapping the ffmpeg CLI. Receiver is cht/stream/recorder.py: an ffmpeg listener that writes fragmented MP4, relays UDP to mpv, and emits scene frames from a showinfo stdout pipe. Simple, all in one process; every restart costs a few seconds.
  • Rust (--rust). A standalone Rust workspace under media/: cht-client on the sender, cht-server on the receiver. Wire protocol is a typed WirePacket framing instead of raw mpegts. Scene detection still runs in Python via a Unix-socket relay from the server. Connect time drops from ~20s to ~3s; session reload from disk is 1–2s.
The media/ directory holds the Rust transport. While both modes coexist, that name is a misnomer — a future rename is planned. For now, "Rust transport" and "media/" mean the same thing.

What the agent sees

Two reference syntaxes resolve to media when sent: @F0001-@F0042 for frames, @T0001-@T0010 for transcript segments. The single-word verbs describe and answer are sent verbatim — no system prompt, no boilerplate. If you want detail, you type it. The agent runner injects only the referenced frame paths and transcript text alongside the user message.
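
A minimal sketch of how such references might be resolved before the message goes out — the helper name, the range handling, and the frame filename pattern are illustrative assumptions, not the actual parser in cht/agent/:

# Sketch only — illustrative names; the real parser lives in cht/agent/.
# Turns @F/@T tokens (single refs or hyphen ranges, per the GUI's @F1-3
# shorthand) into frame paths and transcript segment indices.
import re
from pathlib import Path

REF = re.compile(r"@([FT])(\d+)(?:-(\d+))?")

def resolve_refs(text: str, session_dir: Path):
    frames, segments = [], []
    for kind, start, end in REF.findall(text):
        lo, hi = int(start), int(end or start)
        for i in range(lo, hi + 1):
            if kind == "F":
                # frame filename pattern is an assumption
                frames.append(session_dir / "frames" / f"F{i:04d}.jpg")
            else:
                segments.append(i)
    return frames, segments

# "answer @F0001-3 @T0005" -> three frame paths + transcript segment 5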

USAGE

How to start a session — sender side, receiver side, both transports.

Both ctrl/client.sh and ctrl/app.sh take a transport flag — --python (default) or --rust. The ctrl/ wrappers are the entrypoints; media/ctrl/* and sender/stream_av.sh are implementation details they dispatch to.

Receiver (mcrn) — GUI

Python transport (default):

./ctrl/app.sh --python

Rust transport:

./ctrl/server.sh         # cht-server on TCP :4447 (Rust mode only)
./ctrl/app.sh --rust

Python mode does its own TCP listening inside the GUI process — no separate server step.

Sender

Python transport:

./ctrl/client.sh --python [RECEIVER_IP] [PORT]   # default port 4444

(Runs sender/stream_av.sh under sudo — root is required for kmsgrab.)

Rust transport:

./ctrl/client.sh --rust [server_addr]            # default mcrndeb:4447

Sync

Both machines share the same source tree; ctrl/sync.sh rsyncs from the dev host to mcrndeb. The receiver's filesystem is also bind-mounted at ~/mcrn on the dev host for quick file access.

Inside the GUI

  • Frames panel — click to select; ←/→ navigate.
  • Transcript panel — click to select; ↑/↓ navigate; Shift to extend.
  • Enter — sends answer + selected refs to the agent.
  • Describe / Answer buttons — same idea, single-word verb prepended.
  • Agent input — type freely; @F1-3 and @T5 attach refs.
  • Esc — clear selection. Del — clear agent output.
  • Ctrl+R — manual segment cut.

Agent provider

Resolution order in cht/agent/runner.py:

  • GROQ_API_KEY → OpenAI-compatible client against Groq.
  • OPENAI_API_KEY → OpenAI / OpenAI-compatible.
  • (default) → Claude Code SDK using your local CC subscription.
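
A sketch of that resolution order — constructor names and the Groq base URL are illustrative, not the real API in cht/agent/runner.py:

# Sketch only: pick a connection module based on which key is present.
import os

def pick_provider():
    if os.environ.get("GROQ_API_KEY"):
        # OpenAI-compatible client pointed at Groq (base URL illustrative)
        return ("openai_connection", "https://api.groq.com/openai/v1")
    if os.environ.get("OPENAI_API_KEY"):
        return ("openai_connection", None)       # stock OpenAI endpoint
    return ("claude_sdk_connection", None)       # local Claude Code subscription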

SYSTEM ARCHITECTURE

End-to-end view: sender capture → network → receiver record + analyse → GUI + agent. Both transports converge on the same on-disk session layout.

[Diagram: system architecture — node categories: Python, Rust, hardware/external, filesystem]

PYTHON PIPELINE

Default mode. Bash + ffmpeg CLI on the sender; StreamRecorder + SessionProcessor in cht/stream/ on the receiver. Scene detection rides the recorder's ffmpeg stdout pipe — sub-second latency, no extra process.

[Diagram: Python pipeline — node categories: Python module, external binary (ffmpeg), hardware/OS source, filesystem output]
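
The shape of that single invocation, sketched as the argument list a helper like the one in cht/stream/ffmpeg.py might build — ports, file names, and the threshold are placeholders, not the real values in receive_record_relay_and_detect:

# Illustrative only: one ffmpeg, three outputs (record, relay, scene detect).
def recorder_cmd(session_dir, scene_threshold=0.3):
    return [
        "ffmpeg", "-nostdin",
        "-i", "tcp://0.0.0.0:4444?listen",              # sender connects here
        # 1) record: no re-encode; fragmented MP4 stays readable mid-stream
        "-map", "0", "-c", "copy",
        "-movflags", "+frag_keyframe+empty_moov",
        f"{session_dir}/stream/recording.mp4",
        # 2) live monitor: relay to mpv over UDP, still no re-encode
        "-map", "0", "-c", "copy", "-f", "mpegts", "udp://127.0.0.1:5555",
        # 3) scene detection: decode, keep only scene-change frames,
        #    announce them via showinfo and emit MJPEG frames on a pipe
        "-map", "0:v",
        "-vf", f"select='gt(scene,{scene_threshold})',showinfo",
        "-fps_mode", "vfr", "-f", "image2pipe", "-c:v", "mjpeg", "pipe:1",
    ]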

RUST CLIENT — sender

media/client/ — replaces sender/stream_av.sh when running with --rust. Two backends: subprocess (default, wraps ffmpeg CLI) and an experimental direct VAAPI capture/encoder.

[Diagram: Rust client pipeline]

RUST SERVER — receiver

media/server/ — replaces StreamRecorder when running with --rust. TCP listener with a typed WirePacket framing; routes Video/Audio/Control packets to ffmpeg recording, ADTS audio, and a Unix-socket scene relay.

[Diagram: Rust server pipeline]

RUST CRATES

Cargo workspace under media/: three crates (cht-common, cht-client, cht-server) and their external deps. Designed to be reusable as a standalone tool — mpr is expected to depend on it too.

[Diagram: Rust crates]

REPOSITORY STRUCTURE

Top-level layout. Python app under cht/; Rust transport under media/; sender bash under sender/; ops scripts under ctrl/.

cht/
├── cht/                    Python app (GTK4 GUI, recording, transcribe, agent)
│   ├── app.py · window.py     entrypoint + main window
│   ├── config.py · session.py app config, session manifest
│   ├── stream/                recorder · processor · tracker · lifecycle · ffmpeg helpers
│   ├── audio/                 waveform engine
│   ├── transcriber/           faster-whisper engine
│   ├── scrub/                 proxy manager (scrub-mode preview)
│   ├── index/                 frame index helpers
│   ├── agent/                 runner · base · tools · claude_sdk_connection · openai_connection
│   └── ui/                    timeline · monitor · scrub_bar · frames_panel · transcript_panel · agent_input · agent_output · markdown · keyboard · mpv · waveform
├── media/                  Rust transport workspace (Cargo) — rename planned (see Two transport modes)
│   ├── common/                cht-common  — WirePacket, ControlMessage, logging
│   ├── client/                cht-client  — sender (Wayland, VAAPI)
│   ├── server/                cht-server  — receiver (TCP listener, ffmpeg fan-out)
│   └── ctrl/                  build.sh · client.sh · server.sh
├── sender/                 Python-mode sender — stream_av.sh (bash watchdog around ffmpeg CLI)
├── ctrl/                   app.sh · server.sh · client.sh · sync.sh · bench.py · e2e_test.sh
├── tests/                  pytest suites — config · ffmpeg · manager · processor · timeline · tracker
├── data/                   runtime — sessions, active-session pointer (gitignored)
├── logs/                   runtime logs (gitignored)
├── docs/                   this site — index.html · viewer.html · graphs/ · render.sh
└── pyproject.toml · uv.lock   Python deps via uv

DESIGN NOTES

Why some non-obvious choices look the way they do.

Same on-disk layout from both transports

The GUI, transcript, scene index, and agent never branch on transport mode — they only read files. The recording layout is the contract; the network protocol underneath is replaceable. This is what made the Rust port feasible without rewriting the analysis side.
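
A sketch of that contract as path helpers — only stream/, frames/, audio/ and transcript.json come from the layout above; anything more specific is an assumption:

# The on-disk contract both transports write and everything else reads.
from pathlib import Path

class SessionPaths:
    def __init__(self, data_root: Path, session_id: str):
        self.root = data_root / session_id
        self.stream = self.root / "stream"             # fragmented MP4 recording
        self.frames = self.root / "frames"             # one JPEG per scene event
        self.audio = self.root / "audio"               # audio for transcription
        self.transcript = self.root / "transcript.json"

# GUI, transcriber, and agent resolve everything through paths like these;
# none of them knows which transport produced the bytes.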

Scene detection lives in the recorder, not the processor

In Python mode, scene-change frames come straight off the recorder's ffmpeg stdout pipe — sub-second, single process. Polling the fragmented MP4 from a separate process would add 3–5 s of disk-IPC latency. In Rust mode the same property is approximated by relaying raw H.264 over scene.sock to a separate ffmpeg, but that relay turns out to be the source of most current scene-detection pain (see The scene detection saga below).

Why bother with the Rust port

Two measured wins drove the work: connect time dropped from ~20 s (CLI ffmpeg startup + mpegts negotiation) to ~3 s (typed handshake), and session reload from disk dropped to 1–2 s. The Python recorder still works fine for development; the Rust path matters when you reconnect a lot.

One-word verbs, no system prompt

Pressing Enter sends answer + selected refs verbatim. There is no system prompt and no instruction template wrapping the message. If a question needs detail, the user types it — the model sees exactly what you'd see, not a contract you'd have to debug.

Subprocess backend over a custom encoder

The Rust client wraps the same ffmpeg CLI the Python sender uses, demuxes its NUT output in-process, and ships EncodedPackets. Less code to own than a direct VAAPI encode path, and it inherits ffmpeg's robustness around odd Wayland/DRM transitions. The direct VAAPI backend exists but is experimental.

Sender as a watchdog, not a daemon

Python-mode stream_av.sh is a bash loop that restarts ffmpeg on stall (no progress for 10 s) and restarts immediately on the DRM-plane format change that fullscreen apps trigger. Cheaper and more reliable than building stall detection into a long-lived process.
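
The real watchdog is bash; a minimal Python sketch of the same stall-restart loop, using ffmpeg's -progress output as the liveness signal (an assumption — stream_av.sh may detect progress differently, and the DRM-plane restart path is omitted):

# Sketch only: restart ffmpeg whenever it stops reporting progress for 10 s.
import select
import subprocess
import time

def run_with_watchdog(ffmpeg_args, stall_timeout=10.0):
    while True:                                   # restart forever, like the bash loop
        proc = subprocess.Popen(
            ["ffmpeg", "-progress", "pipe:1", *ffmpeg_args],
            stdout=subprocess.PIPE)
        last = time.monotonic()
        while proc.poll() is None:
            ready, _, _ = select.select([proc.stdout], [], [], 1.0)
            if ready and proc.stdout.readline():
                last = time.monotonic()           # progress seen, reset the clock
            if time.monotonic() - last > stall_timeout:
                proc.kill()                       # stalled: kill and restart
                break
        proc.wait()
        time.sleep(1)                             # brief backoff before restarting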

Struggles — the scene detection saga

Scene detection is the part of the system that has fought back the hardest. The short version: scene detection wants to live in the same ffmpeg process that does the decoding, and every architecture change has had to relearn that.

1. The "one behind" bug and the flush trick

The original Python pipeline ran scene detection as a branch of the same ffmpeg that records: select='gt(scene,T)',showinfo → MJPEG. The MJPEG encoder + muxer holds the selected frame in its internal buffer until another selected frame pushes it out — so the JPEG you receive at time T is actually the previous scene change, not the current one. A classic "one behind".

Workaround: a flush trick — select extra adjacent frames after each scene change so the real frame gets pushed through immediately (SCENE_FLUSH_FRAMES, see cht/config.py, used in cht/stream/ffmpeg.py :: receive_record_relay_and_detect). It worked reliably only because everything ran in one ffmpeg process.

2. The Rust relay broke it

When transport moved to Rust, the recorder split into two processes: Rust-side ffmpeg writes fMP4 + UDP, and a separate Python-side ffmpeg consumes raw H.264 from scene.sock for scene detection. Two new failure modes appeared:

  • The flush trick stopped flushing. The MJPEG encoder behaves differently in a standalone pipe-fed ffmpeg vs. as a branch of a multi-output process — adjacent extra frames no longer reliably push the previous selection through.
  • Decoder corruption from dropped packets. The Rust relay uses try_send with a 100 ms socket write timeout (media/server/src/session.rs). On any backpressure the relay drops H.264 packets, which corrupts the downstream decoder until the next keyframe — and missed keyframes mean missed scene detections.

3. Three dead ends

  • fMP4-tip extraction. Trigger on showinfo, then extract the frame from the just-written fragmented MP4. Fragments only finalize at keyframe boundaries (~2 s with GOP 30), so ffprobe reports stale duration and the extracted frame comes from the previous scene.
  • Single Rust ffmpeg with mixed outputs. The clean fix would be one ffmpeg in Rust doing record (-c:v copy) + relay (-c:v copy) + scene detect (decode + filter). It doesn't work — ffmpeg won't mix -c:v copy outputs with -filter_complex on a pipe input under -hwaccel cuda.
  • Tighter retry intervals on the extractor. Dropping retry from 1 s to 0.3 s made things worse — concurrent ffmpeg processes thrashing the GPU rather than completing.

4. Where it actually landed

Current working approach (Rust mode): the relay-fed scene detector fires showinfo with a timestamp, then Python extracts the frame from the recording file at that timestamp, with a wall-clock offset computed from the session-dir name. Reliable frames; ~1 s latency per scene from fMP4 fragment lag plus the per-extract ffmpeg spawn (~0.5 s). It's the system limping along until the proper fix lands. See def/10-scene-detect-to-rust.md and def/ISSUES.md R1, R3 for the full record.
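
A sketch of that per-scene extraction step — the offset arithmetic and output naming are illustrative, not the exact code in cht/stream/:

# Sketch only: pull one frame out of the growing fMP4 at a scene timestamp.
# wall_start is parsed from the session-dir name in the real code.
import subprocess

def extract_scene_frame(recording, scene_wallclock, wall_start, out_jpg):
    offset = scene_wallclock - wall_start         # seconds into the recording
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", f"{offset:.3f}",                   # seek to the scene change
        "-i", str(recording),
        "-frames:v", "1",                         # grab a single frame
        str(out_jpg),
    ], check=True)

# Each call spawns a fresh ffmpeg — the ~0.5 s startup cost mentioned above.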

Lesson. The flush hack is a dead end in any pipe-fed context. Don't try to make it work over relay — move scene detection back into the same process that has the decoded frames. That's the only configuration that has ever been quiet.

Future work

Near term — scene detection as a 3rd output of the Rust server's ffmpeg

Spec: def/10-scene-detect-to-rust.md. Add a third branch to the existing ffmpeg the Rust server already runs:

  • Output 1: -c:v copy → fMP4 (unchanged)
  • Output 2: -c:v copy → UDP relay (unchanged)
  • Output 3: CUDA decode → select='gt(scene,T)',showinfo → MJPEG out a second pipe / second Unix socket

This restores the single-process invariant — scene detection sees the same decoded frames as the recording branch, the flush behavior matches, no relay packet drops. Removes detect_scenes_from_pipe() in cht/stream/ffmpeg.py, the stdin-feeder thread in cht/stream/processor.py, and scene_relay_task in media/server/src/session.rs.
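
An illustrative shape for that invocation — the threshold, destinations, and exact option set come from def/10-scene-detect-to-rust.md, not from here:

# Sketch only: one ffmpeg owned by the Rust server, three outputs.
three_output_cmd = [
    "ffmpeg", "-hwaccel", "cuda", "-f", "h264", "-i", "pipe:0",   # H.264 off the wire
    # 1) recording — unchanged
    "-map", "0:v", "-c:v", "copy",
    "-movflags", "+frag_keyframe+empty_moov", "recording.mp4",
    # 2) live UDP relay — unchanged
    "-map", "0:v", "-c:v", "copy", "-f", "mpegts", "udp://127.0.0.1:5555",
    # 3) scene detection — decode, select scene changes, MJPEG out a second pipe
    "-map", "0:v", "-vf", "select='gt(scene,0.3)',showinfo",
    "-fps_mode", "vfr", "-f", "image2pipe", "-c:v", "mjpeg", "pipe:1",
]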

Adjacent improvements once that lands:

  • Long-running extractor. Keep one ffmpeg open and pipe seek commands rather than spawning per frame — eliminates the ~0.5 s startup hit.
  • PTS on the wire. Have the Rust server send recording PTS alongside scene events so Python doesn't have to guess a wall-clock offset from the session-dir name (which is also why the first scene frame currently lands 7–10 s late in Rust mode — def/ISSUES.md R1).

End goal — in-process libav filter graph

Spec: def/09-media-transport.md. Rust server decodes via NVDEC, runs the scene filter in-process via the libav API, and writes JPEGs directly. No ffmpeg subprocess, no pipe, no relay, no extraction — scene-to-frame latency drops to near zero. The 3rd-output step above is the bridge: same single-process discipline, easier to land, and a clean rewrite target once it works.

Other items deferred to that broader port:

  • Frame buffer / fast scrub. GPU ring buffer of the last N decoded frames exposed over shared memory to the Python scrub UI — replaces the mpv proxy MJPEG hack (see def/07-scrub-perf-ceiling.md).
  • Typed control protocol. The current WirePacket framing covers session lifecycle but not parameter changes; spec 09 sketches a control-message channel for things like live scene_threshold updates and reconnect-with-PTS.
  • Audio in the live UDP relay. Rust mode currently has no audio in the live monitor (def/ISSUES.md R2) because the server's ffmpeg only takes video on its stdin. Resolved naturally once the server's ffmpeg also receives the audio track.