05 - Reference Frame Files Instead of Embedding

Date

2025-10-28

Context

Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.

Problem

Enhanced transcript with embedded base64 images was 3.7MB
Large file size makes it slow to read/process
Difficult to inspect individual frames
Harder to share and version control

Solution: Reference Frame Paths

Instead of embedding base64 image data, reference the frame files by their relative paths.

Before (Embedded):

[00:08] SCREEN CONTENT:
  IMAGE (base64, 85KB):
  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>

File size: 3.7MB

After (Referenced):

[00:08] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00257.jpg

File size: ~50KB
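
The size difference can be checked with rough arithmetic. The sketch below assumes about 40 frames at roughly 85KB per embedded frame (the figures quoted in this note) plus around 50KB of transcript text. The function name `embedded_size_bytes` is illustrative and not part of the project.

```python
import base64


def embedded_size_bytes(frame_count: int, bytes_per_embedded_frame: int, text_bytes: int) -> int:
    """Approximate size of a transcript that embeds every frame inline."""
    return frame_count * bytes_per_embedded_frame + text_bytes


# Figures from this note: ~40 frames, ~85KB per embedded frame, ~50KB of text.
total = embedded_size_bytes(40, 85_000, 50_000)
print(f"estimated embedded transcript: {total / 1_000_000:.2f} MB")

# base64 encodes every 3 raw bytes as 4 ASCII characters, so embedded image
# data is about one third larger than the same JPEG stored on disk.
raw = bytes(3000)
encoded = base64.b64encode(raw)
print(f"base64 inflation factor: {len(encoded) / len(raw):.2f}")
```

The estimate lands in the same range as the observed 3.7MB; the exact number depends on the per-frame sizes, which vary.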

Implementation

Directory Structure:

output/20251028-003-zaca-run-scrapers/
├── frames/
│   ├── zaca-run-scrapers_00257.jpg
│   ├── zaca-run-scrapers_00487.jpg
│   └── ...
├── zaca-run-scrapers.json (whisper transcript)
└── zaca-run-scrapers_enhanced.txt (references frames/ directory)

Enhanced Transcript Format:

================================================================================
ENHANCED MEETING TRANSCRIPT
Audio transcript + Screen frames
================================================================================

[00:30] SPEAKER:
  Bueno, te dio un tour para el proyecto...

[00:08] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00257.jpg

[01:00] SPEAKER:
  Mayormente en Scrapping lo que tenemos...

[01:15] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00487.jpg
  TEXT:
  | Code snippet from screen (if OCR was used)
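
Because frame references are plain text lines, a consumer can recover them with a small parser. The following is a minimal sketch that assumes the layout shown above (a `[MM:SS] SCREEN CONTENT:` header followed by an indented `Frame:` line). The function name `extract_frame_refs` and the regular expressions are illustrative and not taken from the project code.

```python
import re
from typing import List, Tuple

# Matches an entry header such as "[01:15] SCREEN CONTENT:".
HEADER_RE = re.compile(r"^\[(\d+):(\d{2})\] (SPEAKER|SCREEN CONTENT):\s*$")
# Matches an indented frame reference such as "  Frame: frames/x_00487.jpg".
FRAME_RE = re.compile(r"^\s+Frame:\s*(\S+)\s*$")


def extract_frame_refs(transcript: str) -> List[Tuple[int, str]]:
    """Return (seconds, relative_frame_path) for each SCREEN CONTENT entry."""
    refs = []
    current_seconds = None  # set only while reading a SCREEN CONTENT entry
    for line in transcript.splitlines():
        header = HEADER_RE.match(line)
        if header:
            minutes, seconds, kind = header.groups()
            if kind == "SCREEN CONTENT":
                current_seconds = int(minutes) * 60 + int(seconds)
            else:
                current_seconds = None
            continue
        frame = FRAME_RE.match(line)
        if frame and current_seconds is not None:
            refs.append((current_seconds, frame.group(1)))
    return refs


sample = """\
[01:00] SPEAKER:
  Mayormente en Scrapping lo que tenemos...

[01:15] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00487.jpg
  TEXT:
  | Code snippet from screen (if OCR was used)
"""
print(extract_frame_refs(sample))
```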

Benefits

✓ Much smaller files: ~50KB vs 3.7MB (74x smaller)
✓ Easier to inspect: Can view individual frames directly
✓ LLM can access images: Frame paths allow LLM to load images on demand
✓ Better version control: Text files are small and diffable
✓ Cleaner structure: Frames organized in dedicated directory
✓ Flexible: Can still do OCR/vision analysis if needed (adds TEXT section)

Flags

--embed-images: Despite its name, this flag no longer embeds image data. It skips OCR/vision analysis and only references the frame files.

Faster (no analysis needed)
Lets LLM analyze raw images
Enhanced transcript only contains frame references

Without --embed-images: Run OCR/vision analysis

Extracts text from frames
Enhanced transcript includes both frame reference AND extracted text
Useful for code/dashboard analysis
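
The two modes differ only in whether a TEXT section follows the frame reference. The sketch below illustrates that difference and is not the project's `_format_detailed()` implementation. The function `format_screen_entry` and its `analyze` callback (a stand-in for the OCR/vision step) are hypothetical names.

```python
from typing import Callable, Optional


def format_screen_entry(
    seconds: int,
    frame_path: str,
    analyze: Optional[Callable[[str], str]] = None,
) -> str:
    """Format one SCREEN CONTENT entry.

    analyze=None mirrors --embed-images (frame reference only). Otherwise
    analyze is a hypothetical stand-in for OCR/vision and returns the text
    extracted from the frame.
    """
    lines = [
        f"[{seconds // 60:02d}:{seconds % 60:02d}] SCREEN CONTENT:",
        f"  Frame: {frame_path}",
    ]
    if analyze is not None:
        text = analyze(frame_path)
        if text:
            lines.append("  TEXT:")
            lines.extend(f"  | {row}" for row in text.splitlines())
    return "\n".join(lines)


def fake_ocr(path: str) -> str:
    """Placeholder analyzer used only for this example."""
    return "def run():\n    pass"


# Reference only, as with --embed-images.
print(format_screen_entry(8, "frames/zaca-run-scrapers_00257.jpg"))
# Reference plus extracted text, as without the flag.
print(format_screen_entry(75, "frames/zaca-run-scrapers_00487.jpg", analyze=fake_ocr))
```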

Usage

# Reference frames only (no OCR, faster)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v

# Reference frames + OCR text extraction
python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v

# Adjust frame quality (smaller frame files)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v

Files Modified

meetus/transcript_merger.py - Modified _format_detailed() to output frame paths instead of base64
process_meeting.py - Updated help text and examples to reflect frame referencing
All processors (OCR, vision, hybrid) already include frame_path in results (no changes needed)

Workflow Example

# First run: Generate everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v

# Result:
# - output/20251028-004-meeting/
#   - frames/ (40 frames, ~80KB each)
#   - meeting.json (whisper transcript)
#   - meeting_enhanced.txt (~50KB, references frames/)

# LLM can now:
# 1. Read enhanced transcript
# 2. See timeline of audio + screen changes
# 3. Load individual frames as needed from frames/ directory
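
Loading a frame on demand means resolving its path relative to the enhanced transcript's directory. The sketch below shows one way a consumer might do that. The helpers `load_frame` and `as_data_uri` are hypothetical and not part of the project, and the check that rejects paths outside the output directory is an added precaution rather than documented behavior.

```python
import base64
from pathlib import Path


def load_frame(transcript_path: Path, frame_ref: str) -> bytes:
    """Read a frame referenced relative to the transcript's directory."""
    base_dir = transcript_path.resolve().parent
    frame_path = (base_dir / frame_ref).resolve()
    # Refuse references that resolve outside the output directory.
    if base_dir not in frame_path.parents:
        raise ValueError(f"frame reference leaves output directory: {frame_ref}")
    return frame_path.read_bytes()


def as_data_uri(jpeg_bytes: bytes) -> str:
    """Encode one frame as a data URI, only when a consumer needs it inline."""
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("ascii")
```

With this approach the transcript stays small, and only the frames a reader actually requests are read and encoded.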
