# 05 - Reference Frame Files Instead of Embedding

## Date
2025-10-28

## Context
Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.

## Problem
- Enhanced transcript with embedded base64 images was 3.7MB
- Large file size makes it slow to read/process
- Difficult to inspect individual frames
- Harder to share and version control

## Solution: Reference Frame Paths

Instead of embedding base64 image data, reference the frame files by their relative paths.

### Before (Embedded):
```
[00:08] SCREEN CONTENT:
  IMAGE (base64, 85KB):
  data:image/jpeg;base64,/9j/4AAQSkZJRg...
```
File size: 3.7MB

### After (Referenced):
```
[00:08] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00257.jpg
```
File size: ~50KB

## Implementation

**Directory Structure:**
```
output/20251028-003-zaca-run-scrapers/
├── frames/
│   ├── zaca-run-scrapers_00257.jpg
│   ├── zaca-run-scrapers_00487.jpg
│   └── ...
├── zaca-run-scrapers.json          (whisper transcript)
└── zaca-run-scrapers_enhanced.txt  (references frames/ directory)
```

**Enhanced Transcript Format:**
```
================================================================================
ENHANCED MEETING TRANSCRIPT
Audio transcript + Screen frames
================================================================================

[00:30] SPEAKER: Bueno, te dio un tour para el proyecto...

[00:08] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00257.jpg

[01:00] SPEAKER: Mayormente en Scrapping lo que tenemos...

[01:15] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00487.jpg
  TEXT:
    | Code snippet from screen (if OCR was used)
```

## Benefits

✓ **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
✓ **Easier to inspect**: Can view individual frames directly
✓ **LLM can access images**: Frame paths allow LLM to load images on demand
✓ **Better version control**: Text files are small and diffable
✓ **Cleaner structure**: Frames organized in dedicated directory
✓ **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)

## Flags

**`--embed-images`**: Skip OCR/vision analysis, just reference frame files
- Faster (no analysis needed)
- Lets LLM analyze raw images
- Enhanced transcript only contains frame references

**Without `--embed-images`**: Run OCR/vision analysis
- Extracts text from frames
- Enhanced transcript includes both frame reference AND extracted text
- Useful for code/dashboard analysis

## Usage

```bash
# Reference frames only (no OCR, faster)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v

# Reference frames + OCR text extraction
python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v

# Adjust frame quality (smaller files)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
```

## Files Modified

- `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
- `process_meeting.py` - Updated help text and examples to reflect frame referencing
- All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)

## Workflow Example

```bash
# First run: Generate everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v

# Result:
# - output/20251028-004-meeting/
#   - frames/ (40 frames, ~80KB each)
#   - meeting.json (whisper transcript)
#   - meeting_enhanced.txt (~50KB, references frames/)

# LLM can now:
# 1. Read enhanced transcript
# 2. See timeline of audio + screen changes
# 3. Load individual frames as needed from frames/ directory
```
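
### Sketch: Resolving Frame References

To illustrate step 3 of the workflow above, here is a minimal sketch of how a consumer of the enhanced transcript might collect the `Frame:` references and resolve them against the output directory. It is not part of `meetus`; the function name `find_frame_references` and the two regular expressions are hypothetical. It assumes only that timestamps look like `[MM:SS]` and that each reference appears as `Frame: <relative path>`, either on the header line or on a following line.

```python
import re
from pathlib import Path

# Assumed formats, inferred from the transcript examples in this note.
TIMESTAMP_RE = re.compile(r"\[(\d+):(\d{2})\]")
FRAME_RE = re.compile(r"Frame:\s*(\S+)")


def find_frame_references(transcript_text, base_dir):
    """Return (seconds, path) pairs for every 'Frame:' reference.

    Each path is joined onto base_dir, the directory that holds the
    enhanced transcript. The timestamp is the most recent [MM:SS]
    marker seen at or before the reference (None if there was none).
    """
    base_dir = Path(base_dir)
    references = []
    last_seconds = None
    for line in transcript_text.splitlines():
        stamp = TIMESTAMP_RE.search(line)
        if stamp:
            last_seconds = int(stamp.group(1)) * 60 + int(stamp.group(2))
        frame = FRAME_RE.search(line)
        if frame:
            references.append((last_seconds, base_dir / frame.group(1)))
    return references


# Hypothetical excerpt in the referenced-frame format.
sample = """\
[01:00] SPEAKER: Mayormente en Scrapping lo que tenemos...

[01:15] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00487.jpg
"""

for seconds, path in find_frame_references(
    sample, "output/20251028-003-zaca-run-scrapers"
):
    print(seconds, path.as_posix())
```

Because the transcript stores relative paths, the caller has to supply the output directory. That keeps the transcript portable when the whole output folder is moved or shared.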