mitus/def/03-embed-images-for-llm.md
Mariano Gabriel 118ef04223 embed images
2025-10-28 08:02:45 -03:00

03 - Embed Images for LLM Analysis

Date

2025-10-28

Context

The hybrid OCR approach was fast and accurate, but its output formatting was messy, and vision models hallucinated text. Rather than keep fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end user's LLM analyze them alongside the full audio context.

Problem

  • OCR/vision models either hallucinate or produce messy text
  • Code formatting/indentation is hard to preserve
  • User wants to analyze frames with their own LLM (Claude, GPT, etc.)
  • Need to keep file size reasonable (~200KB per image is too big)

Solution: Image Embedding

Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:

  • See the actual screen content (no hallucination)
  • Understand code structure, layout, and formatting visually
  • Have full audio transcript context for each frame
  • Analyze dashboards, terminals, and editors from the original pixels instead of lossy extracted text

Implementation

Quality Optimization:

  • Default JPEG quality: 80 (good tradeoff between size and readability)
  • Configurable via --embed-quality (0-100)
  • Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
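The 0-100 range for --embed-quality could be enforced at parse time. The following is a minimal sketch, assuming process_meeting.py uses argparse; the helper names jpeg_quality and build_parser are hypothetical, and only the two flags discussed here are modelled.

```python
import argparse


def jpeg_quality(value: str) -> int:
    """Parse a JPEG quality value and reject anything outside 0-100."""
    try:
        quality = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"quality must be an integer, got {value!r}")
    if not 0 <= quality <= 100:
        raise argparse.ArgumentTypeError(f"quality must be in 0-100, got {quality}")
    return quality


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real CLI has more flags than the two shown here.
    parser = argparse.ArgumentParser(prog="process_meeting.py")
    parser.add_argument("video", help="path to the input video")
    parser.add_argument("--embed-images", action="store_true",
                        help="embed frames as base64 JPEG in the transcript")
    parser.add_argument("--embed-quality", type=jpeg_quality, default=80,
                        help="JPEG quality for embedded frames (0-100)")
    return parser


args = build_parser().parse_args(
    ["samples/video.mkv", "--embed-images", "--embed-quality", "60"]
)
print(args.embed_images, args.embed_quality)
```

Validating in the type callback means an out-of-range value fails with a usage error before any frames are extracted.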

Format:

[MM:SS] SPEAKER:
  Audio transcript text here

[MM:SS] SCREEN CONTENT:
  IMAGE (base64, 52KB):
  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>

  TEXT:
  | Optional OCR text for reference
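One way to produce a SCREEN CONTENT entry in this layout is sketched below. It is not the code in meetus/transcript_merger.py: the function names are hypothetical, the input is assumed to be already-compressed JPEG bytes, and the reported size is assumed to be the size of the base64 text.

```python
import base64


def format_timestamp(seconds: float) -> str:
    """Render a time offset as [MM:SS]; minutes keep counting past 59."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def format_screen_entry(seconds: float, jpeg_bytes: bytes, ocr_text: str = "") -> str:
    """Build one SCREEN CONTENT entry with the frame embedded as a data URI."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = round(len(encoded) / 1024)  # assumption: report the encoded size
    lines = [
        f"{format_timestamp(seconds)} SCREEN CONTENT:",
        f"  IMAGE (base64, {size_kb}KB):",
        f"  <image>data:image/jpeg;base64,{encoded}</image>",
    ]
    if ocr_text.strip():
        # The TEXT section is optional and only emitted when OCR text exists.
        lines += ["", "  TEXT:"]
        lines += [f"  | {line}" for line in ocr_text.splitlines()]
    return "\n".join(lines)


# Placeholder bytes that only start like a JPEG; a real frame would be read from disk.
fake_jpeg = b"\xff\xd8\xff" + bytes(3069)
print(format_screen_entry(125, fake_jpeg, "def f():\n    pass")[:80])
```

Prefixing each OCR line with "| " keeps its leading whitespace intact, which matters when the text is code.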

Features:

  • Base64 encoding for easy embedding
  • Size tracking and reporting
  • Optional text content alongside images
  • Works with scene detection for smart frame selection
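Because the images are plain base64 inside the transcript, they can be recovered and measured with the standard library alone. The sketch below assumes the `<image>data:image/jpeg;base64,...</image>` wrapper shown in the format above; extract_images and size_report are hypothetical helpers, not functions from the project.

```python
import base64
import re

# Matches the <image>data:image/jpeg;base64,...</image> wrapper from the format above.
IMAGE_PATTERN = re.compile(r"<image>data:image/jpeg;base64,([A-Za-z0-9+/=]+)</image>")


def extract_images(transcript: str) -> list[bytes]:
    """Decode every embedded JPEG found in the transcript text."""
    return [base64.b64decode(m.group(1)) for m in IMAGE_PATTERN.finditer(transcript)]


def size_report(transcript: str) -> str:
    """Summarise how many images are embedded and how large they are once decoded."""
    images = extract_images(transcript)
    if not images:
        return "0 images embedded"
    total_kb = sum(len(img) for img in images) / 1024
    average_kb = total_kb / len(images)
    return f"images: {len(images)}, decoded: {total_kb:.1f}KB, average: {average_kb:.1f}KB"


# Placeholder payload that only starts like a JPEG.
payload = base64.b64encode(b"\xff\xd8\xff" + bytes(1021)).decode("ascii")
sample = "\n".join([
    "[00:05] SPEAKER:",
    "  Here is the dashboard.",
    "",
    "[00:05] SCREEN CONTENT:",
    "  IMAGE (base64, 1KB):",
    f"  <image>data:image/jpeg;base64,{payload}</image>",
])
print(size_report(sample))
```

The same extraction step is what a consumer would use to pass the frames to an LLM API that expects images as separate attachments rather than inline text.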

Usage

# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v

# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v

# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v

# Iterate on scene threshold (reuse whisper)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v

File Sizes

Example for 20 frames:

  • Quality 60: ~30-50KB per image = 0.6-1MB total
  • Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
  • Quality 90: ~80-120KB per image = 1.6-2.4MB total
  • Original: ~200KB per image = 4MB total
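The totals above are simple multiplication, reproduced in the sketch below. The note does not say whether the per-image figures are raw JPEG sizes or base64 sizes; if they are raw, base64 adds roughly a third, which base64_length makes explicit. Both function names are hypothetical.

```python
def base64_length(raw_bytes: int) -> int:
    """Number of base64 characters needed for raw_bytes bytes, padding included."""
    return (raw_bytes + 2) // 3 * 4


def total_range_mb(frames: int, low_kb: int, high_kb: int) -> tuple[float, float]:
    """Total size range in MB, using the 1MB = 1000KB rounding of the list above."""
    return frames * low_kb / 1000, frames * high_kb / 1000


# Per-image ranges in KB, copied from the list above.
PER_IMAGE_KB = {
    "quality 60": (30, 50),
    "quality 80": (40, 80),
    "quality 90": (80, 120),
    "original": (200, 200),
}

for label, (low, high) in PER_IMAGE_KB.items():
    low_mb, high_mb = total_range_mb(20, low, high)
    print(f"{label}: {low_mb:.1f}-{high_mb:.1f}MB for 20 frames")

# Assumption: a 60KB raw JPEG, to show the base64 overhead.
raw = 60 * 1024
print(f"{raw} raw bytes -> {base64_length(raw)} base64 characters")
```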

Benefits

  • No hallucination: LLM sees actual pixels
  • Perfect formatting: Code structure preserved visually
  • Full context: Audio transcript + visual frame together
  • User's choice: Use your preferred LLM (Claude, GPT, etc.)
  • Reasonable size: Quality 80 gives roughly 2.5-5x smaller files than the original (~40-80KB vs ~200KB per image)
  • Simple workflow: One file contains everything

Use Cases

  • Code walkthroughs: LLM can see actual code structure and indentation
  • Dashboard analysis: Charts, graphs, metrics visible to LLM
  • Terminal sessions: Commands and output in proper context
  • UI reviews: Actual interface visible with audio commentary

Files Modified

  • meetus/transcript_merger.py - Image encoding and embedding
  • meetus/workflow.py - Wire through config
  • process_meeting.py - CLI flags
  • meetus/output_manager.py - Cleaner directory naming (date + increment)

Output Directory Naming

Also changed output directory format for clarity:

  • Old: 20251028_054553-video (confusing timestamps)
  • New: 20251028-001-video (clear date + run number)
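The date-plus-increment scheme could be computed as in the sketch below. It is not the code in meetus/output_manager.py: next_output_dir is a hypothetical name, and the run number is assumed to count all runs on a given date rather than runs per video.

```python
import re
from datetime import date
from pathlib import Path


def next_output_dir(base: Path, video: Path, today: date) -> Path:
    """Return base/YYYYMMDD-NNN-stem using the next free run number for that date."""
    prefix = today.strftime("%Y%m%d")
    pattern = re.compile(rf"^{prefix}-(\d{{3}})-")
    used = []
    if base.is_dir():
        for entry in base.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                used.append(int(match.group(1)))
    # Assumption: numbering restarts at 001 each day and is shared across videos.
    run = max(used, default=0) + 1
    return base / f"{prefix}-{run:03d}-{video.stem}"


print(next_output_dir(Path("output"), Path("samples/video.mkv"), date(2025, 10, 28)).name)
```

Taking the maximum existing run number rather than counting directories avoids reusing a number after an earlier run directory has been deleted.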