add whisperx support

2025-12-02 02:33:39 -03:00
parent 118ef04223
commit 7b919beda6
4 changed files with 155 additions and 38 deletions
--- a/def/05-reference-frames-instead-of-embedding.md
+++ b/def/05-reference-frames-instead-of-embedding.md
@@ -0,0 +1,124 @@
+# 05 - Reference Frame Files Instead of Embedding
+
+## Date
+2025-10-28
+
+## Context
+Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.
+
+## Problem
+- Enhanced transcript with embedded base64 images was 3.7MB
+- Large file size makes it slow to read/process
+- Difficult to inspect individual frames
+- Harder to share and version control
+
+## Solution: Reference Frame Paths
+Instead of embedding base64 image data, reference the frame files by their relative paths.
+
+### Before (Embedded):
+```
+[00:08] SCREEN CONTENT:
+  IMAGE (base64, 85KB):
+  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
+```
+File size: 3.7MB
+
+### After (Referenced):
+```
+[00:08] SCREEN CONTENT:
+  Frame: frames/zaca-run-scrapers_00257.jpg
+```
+File size: ~50KB
+
+## Implementation
+
+**Directory Structure:**
+```
+output/20251028-003-zaca-run-scrapers/
+├── frames/
+│   ├── zaca-run-scrapers_00257.jpg
+│   ├── zaca-run-scrapers_00487.jpg
+│   └── ...
+├── zaca-run-scrapers.json (whisper transcript)
+└── zaca-run-scrapers_enhanced.txt (references frames/ directory)
+```
+
+**Enhanced Transcript Format:**
+```
+================================================================================
+ENHANCED MEETING TRANSCRIPT
+Audio transcript + Screen frames
+================================================================================
+
+[00:08] SCREEN CONTENT:
+  Frame: frames/zaca-run-scrapers_00257.jpg
+
+[00:30] SPEAKER:
+  Bueno, te dio un tour para el proyecto...
+
+[01:00] SPEAKER:
+  Mayormente en Scrapping lo que tenemos...
+
+[01:15] SCREEN CONTENT:
+  Frame: frames/zaca-run-scrapers_00487.jpg
+  TEXT:
+  | Code snippet from screen (if OCR was used)
+```
+
+## Benefits
+
+- ✓ **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
+- ✓ **Easier to inspect**: Can view individual frames directly
+- ✓ **LLM can access images**: Frame paths allow LLM to load images on demand
+- ✓ **Better version control**: Text files are small and diffable
+- ✓ **Cleaner structure**: Frames organized in dedicated directory
+- ✓ **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)
+
+## Flags
+
+**`--embed-images`**: Skip OCR/vision analysis and only reference frame files (despite its name, the flag no longer embeds base64 image data)
+- Faster (no analysis needed)
+- Lets LLM analyze raw images
+- Enhanced transcript only contains frame references
+
+**Without `--embed-images`**: Run OCR/vision analysis
+- Extracts text from frames
+- Enhanced transcript includes both frame reference AND extracted text
+- Useful for code/dashboard analysis
+
+## Usage
+
+```bash
+# Reference frames only (no OCR, faster)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
+
+# Reference frames + OCR text extraction
+python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v
+
+# Adjust frame quality (smaller frame files)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
+```
+
+## Files Modified
+
+- `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
+- `process_meeting.py` - Updated help text and examples to reflect frame referencing
+- All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)
+
+## Workflow Example
+
+```bash
+# First run: Generate everything
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v
+
+# Result:
+# - output/20251028-004-meeting/
+#   - frames/ (40 frames, ~80KB each)
+#   - meeting.json (whisper transcript)
+#   - meeting_enhanced.txt (~50KB, references frames/)
+
+# LLM can now:
+# 1. Read enhanced transcript
+# 2. See timeline of audio + screen changes
+# 3. Load individual frames as needed from frames/ directory
+```
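
The referenced-frame format described in this document can be sketched roughly as follows. This is a minimal illustration, not the actual `_format_detailed()` code from `meetus/transcript_merger.py`: the `ScreenEntry` class and the `format_timestamp`, `format_screen_entry`, and `list_frame_references` helpers are hypothetical names introduced here. It assumes each result carries a `frame_path` located under the output directory, and it shows both sides of the idea: writing a relative `Frame:` line instead of base64 data, and reading those references back so a consumer can load frames on demand.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ScreenEntry:
    """One detected screen change (hypothetical structure, for illustration only)."""
    timestamp: float  # seconds from the start of the recording
    frame_path: str   # path to the saved frame, somewhere under the output directory
    text: str = ""    # optional OCR/vision output; empty when analysis is skipped


def format_timestamp(seconds: float) -> str:
    """Render seconds as the [MM:SS] marker used in the enhanced transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def format_screen_entry(entry: ScreenEntry, output_dir: str) -> str:
    """Render one entry with a relative frame path instead of base64 data."""
    # Store the path relative to the output directory so the transcript
    # stays valid if the whole directory is moved or shared.
    rel = Path(entry.frame_path).relative_to(output_dir).as_posix()
    lines = [f"{format_timestamp(entry.timestamp)} SCREEN CONTENT:", f"  Frame: {rel}"]
    if entry.text:
        # The TEXT section only appears when OCR/vision analysis produced output.
        lines.append("  TEXT:")
        lines.extend(f"  | {line}" for line in entry.text.splitlines())
    return "\n".join(lines)


# Matches lines such as "  Frame: frames/name_00257.jpg".
FRAME_RE = re.compile(r"^[ \t]*Frame:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def list_frame_references(transcript: str) -> list[str]:
    """Return every frame path referenced in an enhanced transcript, in order."""
    return FRAME_RE.findall(transcript)
```

A consumer could call `list_frame_references()` on the contents of the `_enhanced.txt` file and resolve each returned path against the output directory to open only the frames it needs.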