03 - Embed Images for LLM Analysis

Date

2025-10-28

Context

The hybrid OCR approach was fast and accurate, but its output formatting was messy, and vision models hallucinated text. Rather than fighting with text extraction, this change embeds the actual frame images in the enhanced transcript and lets the end user's LLM analyze them alongside the full audio context.

Problem

OCR/vision models either hallucinate or produce messy text
Code formatting/indentation is hard to preserve
User wants to analyze frames with their own LLM (Claude, GPT, etc.)
Need to keep file size reasonable (original frames at ~200KB per image are too big)

Solution: Image Embedding

Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:

See the actual screen content (no hallucination)
Understand code structure, layout, and formatting visually
Have full audio transcript context for each frame
Analyze dashboards, terminals, and editors from the real pixels rather than from lossy extracted text

Implementation

Quality Optimization:

Default JPEG quality: 80 (good tradeoff between size and readability)
Configurable via --embed-quality (0-100)
Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
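
As a rough illustration of the configurable quality setting, the sketch below wires an `--embed-quality` flag through `argparse` and rejects values outside 0-100. The helper names `quality_value` and `build_parser` are hypothetical, and this is not the actual contents of process_meeting.py, which has more flags than the two shown here.

```python
import argparse


def quality_value(raw: str) -> int:
    """Parse a JPEG quality setting and reject values outside 0-100."""
    try:
        value = int(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"quality must be an integer, got {raw!r}")
    if not 0 <= value <= 100:
        raise argparse.ArgumentTypeError(f"quality must be in 0-100, got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the two embedding flags are sketched here.
    parser = argparse.ArgumentParser(prog="process_meeting.py")
    parser.add_argument("video")
    parser.add_argument("--embed-images", action="store_true")
    parser.add_argument("--embed-quality", type=quality_value, default=80)
    return parser


args = build_parser().parse_args(
    ["samples/video.mkv", "--embed-images", "--embed-quality", "60"]
)
print(args.embed_images, args.embed_quality)
```

Validating in the `type=` callable means a bad value fails at argument parsing with a usage message, before any video processing starts.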

Format:

[MM:SS] SPEAKER:
  Audio transcript text here

[MM:SS] SCREEN CONTENT:
  IMAGE (base64, 52KB):
  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>

  TEXT:
  | Optional OCR text for reference
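
The SCREEN CONTENT entry above could be produced along these lines. This is a minimal sketch using only the standard library: `format_timestamp` and `format_screen_entry` are hypothetical names rather than functions from meetus/transcript_merger.py, and it assumes the reported size is that of the base64 payload, which the format does not specify.

```python
import base64


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS (minutes are not wrapped at 60)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_screen_entry(seconds: float, jpeg_bytes: bytes, ocr_text: str = "") -> str:
    """Build one SCREEN CONTENT entry with the frame embedded as a data URI."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    # Assumption: the size shown is the base64 payload, not the raw JPEG.
    size_kb = round(len(encoded) / 1024)
    lines = [
        f"[{format_timestamp(seconds)}] SCREEN CONTENT:",
        f"  IMAGE (base64, {size_kb}KB):",
        f"  <image>data:image/jpeg;base64,{encoded}</image>",
    ]
    if ocr_text:
        # Optional OCR text, one "| " prefixed line per source line.
        lines += ["", "  TEXT:"]
        lines += [f"  | {line}" for line in ocr_text.splitlines()]
    return "\n".join(lines)


# Placeholder bytes that only start like a JPEG; a real frame would be read from disk.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 2000
print(format_screen_entry(75, fake_jpeg, "def f():\n    pass")[:80])
```

Prefixing each OCR line with `| ` keeps the original indentation intact, which matters when the text is code.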

Features:

Base64 encoding for easy embedding
Size tracking and reporting
Optional text content alongside images
Works with scene detection for smart frame selection

Usage

# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v

# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v

# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v

# Iterate on scene threshold (reuse whisper)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v

File Sizes

Example for 20 frames:

Quality 60: ~30-50KB per image = 0.6-1MB total
Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
Quality 90: ~80-120KB per image = 1.6-2.4MB total
Original: ~200KB per image = 4MB total
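
The totals above are per-image size times frame count. The sketch below reproduces that arithmetic and optionally applies the 4/3 growth that base64 encoding adds to any binary payload. Whether the figures in the table are measured before or after base64 encoding is not stated, so the `base64_encoded` switch is an assumption added for illustration, as is the function name `estimate_total_kb`.

```python
def estimate_total_kb(
    frames: int,
    kb_per_image: tuple[int, int],
    base64_encoded: bool = False,
) -> tuple[float, float]:
    """Return the (low, high) total size in KB for a number of frames."""
    low, high = kb_per_image
    # Base64 maps every 3 bytes to 4 characters, so payloads grow by about a third.
    factor = 4 / 3 if base64_encoded else 1.0
    return (frames * low * factor, frames * high * factor)


# Per-image ranges taken from the table above.
ranges = {"60": (30, 50), "80": (40, 80), "90": (80, 120), "original": (200, 200)}
for label, per_image in ranges.items():
    low, high = estimate_total_kb(20, per_image)
    print(f"quality {label}: {low / 1000:.1f}-{high / 1000:.1f}MB for 20 frames")
```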

Benefits

✓ No hallucination: LLM sees actual pixels
✓ Perfect formatting: Code structure preserved visually
✓ Full context: Audio transcript + visual frame together
✓ User's choice: Use your preferred LLM (Claude, GPT, etc.)
✓ Reasonable size: Quality 80 gives roughly 2.5-5x smaller files vs original (~40-80KB vs ~200KB per image)
✓ Simple workflow: One file contains everything

Use Cases

Code walkthroughs: LLM can see actual code structure and indentation
Dashboard analysis: Charts, graphs, metrics visible to LLM
Terminal sessions: Commands and output in proper context
UI reviews: Actual interface visible with audio commentary

Files Modified

meetus/transcript_merger.py - Image encoding and embedding
meetus/workflow.py - Wire through config
process_meeting.py - CLI flags
meetus/output_manager.py - Cleaner directory naming (date + increment)

Output Directory Naming

Also changed output directory format for clarity:

Old: 20251028_054553-video (confusing timestamps)
New: 20251028-001-video (clear date + run number)
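
One way to generate the date-plus-increment name is to scan the output root for the current date's runs and add one to the highest number found. The sketch below is not the code in meetus/output_manager.py. `next_output_dir` is a hypothetical helper, and it assumes the counter is shared by all videos processed on the same date, which the naming example does not confirm.

```python
import re
import tempfile
from datetime import date
from pathlib import Path


def next_output_dir(base: Path, video_stem: str, today: date) -> Path:
    """Return base/YYYYMMDD-NNN-stem, where NNN is one past today's highest run."""
    prefix = today.strftime("%Y%m%d")
    pattern = re.compile(rf"^{prefix}-(\d{{3}})-")
    # Collect the run numbers of directories already created for this date.
    runs = [
        int(m.group(1))
        for p in base.iterdir()
        if p.is_dir() and (m := pattern.match(p.name))
    ]
    return base / f"{prefix}-{max(runs, default=0) + 1:03d}-{video_stem}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "20251028-001-video").mkdir()
    print(next_output_dir(root, "video", date(2025, 10, 28)).name)  # 20251028-002-video
```

The function only computes the path. Creating the directory is left to the caller, so two concurrent runs could pick the same number unless creation is guarded.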
