add whisperx support
def/05-reference-frames-instead-of-embedding.md
# 05 - Reference Frame Files Instead of Embedding

## Date
2025-10-28

## Context
Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.

## Problem
- Enhanced transcript with embedded base64 images was 3.7MB
- Large file size makes it slow to read/process
- Difficult to inspect individual frames
- Harder to share and version control

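Much of that size comes from base64 itself, which encodes every 3 raw bytes as 4 ASCII characters. The sketch below works through the arithmetic; the helper names and the 70 KB average frame size are illustrative assumptions, not measurements from this project.

```python
import base64
import math


def base64_size(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of `raw_bytes` bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def embedded_total(frame_count: int, avg_frame_bytes: int) -> int:
    """Approximate bytes added to a transcript by embedding every frame."""
    return frame_count * base64_size(avg_frame_bytes)


if __name__ == "__main__":
    # Sanity check the formula against the standard library on a dummy payload.
    payload = bytes(1000)
    assert len(base64.b64encode(payload)) == base64_size(1000)

    # Illustrative figures only (assumed): 40 frames of roughly 70 KB each.
    total = embedded_total(40, 70_000)
    print(f"~{total / 1_000_000:.1f} MB of base64 for 40 frames")
```

Whatever the exact frame sizes are, the encoded payload grows linearly with the number of frames, while a path reference stays a few dozen bytes per frame.
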
## Solution: Reference Frame Paths
Instead of embedding base64 image data, reference the frame files by their relative paths.

### Before (Embedded):
```
[00:08] SCREEN CONTENT:
IMAGE (base64, 85KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
```
File size: 3.7MB

### After (Referenced):
```
[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg
```
File size: ~50KB

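As a rough illustration of the "After" format, the sketch below builds one screen entry with a path relative to the transcript's directory. It is not the project's `_format_detailed()`; the function names, parameters, and the `| ` prefix handling are assumptions made for this example.

```python
import os
from pathlib import Path
from typing import Optional


def format_timestamp(seconds: float) -> str:
    """Render seconds as the [MM:SS] marker used in the transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def format_screen_entry(
    seconds: float,
    frame_path: str,
    transcript_dir: str,
    text: Optional[str] = None,
) -> str:
    """Build a SCREEN CONTENT entry that references the frame file (hypothetical helper)."""
    # Store the path relative to the transcript so the output folder stays portable.
    rel = Path(os.path.relpath(frame_path, transcript_dir)).as_posix()
    lines = [f"{format_timestamp(seconds)} SCREEN CONTENT:", f"Frame: {rel}"]
    if text:
        # Optional OCR/vision text goes in a TEXT section under the reference.
        lines.append("TEXT:")
        lines.extend(f"| {line}" for line in text.splitlines())
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        format_screen_entry(
            8,
            "output/20251028-003-zaca-run-scrapers/frames/zaca-run-scrapers_00257.jpg",
            "output/20251028-003-zaca-run-scrapers",
        )
    )
```

Keeping the path relative means the whole output directory can be moved or shared without rewriting the transcript.
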
## Implementation

**Directory Structure:**
```
output/20251028-003-zaca-run-scrapers/
├── frames/
│   ├── zaca-run-scrapers_00257.jpg
│   ├── zaca-run-scrapers_00487.jpg
│   └── ...
├── zaca-run-scrapers.json (whisper transcript)
└── zaca-run-scrapers_enhanced.txt (references frames/ directory)
```

**Enhanced Transcript Format:**
```
================================================================================
ENHANCED MEETING TRANSCRIPT
Audio transcript + Screen frames
================================================================================

[00:30] SPEAKER:
Bueno, te dio un tour para el proyecto...

[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg

[01:00] SPEAKER:
Mayormente en Scrapping lo que tenemos...

[01:15] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00487.jpg
TEXT:
| Code snippet from screen (if OCR was used)
```

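A consumer of this format needs to find the frame references again. The sketch below shows one way to pull them out with a regular expression; the pattern and the `extract_frame_refs` name are assumptions based only on the sample above, so a real transcript may need a more tolerant parser.

```python
import re
from typing import List, Tuple

# Matches a "[MM:SS] SCREEN CONTENT:" line followed directly by a "Frame: <path>" line.
ENTRY_RE = re.compile(
    r"^\[(\d+):(\d{2})\] SCREEN CONTENT:[ \t]*\nFrame: (\S+)",
    re.MULTILINE,
)


def extract_frame_refs(transcript: str) -> List[Tuple[int, str]]:
    """Return (timestamp_in_seconds, relative_frame_path) pairs in file order."""
    return [
        (int(minutes) * 60 + int(seconds), path)
        for minutes, seconds, path in ENTRY_RE.findall(transcript)
    ]


SAMPLE = """\
[00:30] SPEAKER:
(speech text)

[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg

[01:15] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00487.jpg
TEXT:
| Code snippet from screen (if OCR was used)
"""

if __name__ == "__main__":
    for seconds, path in extract_frame_refs(SAMPLE):
        print(seconds, path)
```
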
## Benefits

✓ **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
✓ **Easier to inspect**: Can view individual frames directly
✓ **LLM can access images**: Frame paths allow LLM to load images on demand
✓ **Better version control**: Text files are small and diffable
✓ **Cleaner structure**: Frames organized in dedicated directory
✓ **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)

## Flags

**`--embed-images`**: Skip OCR/vision analysis and only reference the frame files (despite its name, the flag no longer embeds image data)
- Faster (no analysis needed)
- Lets LLM analyze raw images
- Enhanced transcript only contains frame references

**Without `--embed-images`**: Run OCR/vision analysis
- Extracts text from frames
- Enhanced transcript includes both frame reference AND extracted text
- Useful for code/dashboard analysis

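The two modes can be summarized in a few lines of `argparse`. This is only a sketch of the behaviour described above: the real parser in `process_meeting.py` defines more options, and the help strings, the `analysis_mode` helper, and its return values are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal stand-in for the CLI; only the flag discussed here is modelled."""
    parser = argparse.ArgumentParser(prog="process_meeting.py")
    parser.add_argument("video", help="path to the meeting recording")
    parser.add_argument(
        "--embed-images",
        action="store_true",
        help="skip OCR/vision analysis and only reference frame files",
    )
    return parser


def analysis_mode(args: argparse.Namespace) -> str:
    """Describe what the enhanced transcript will contain (hypothetical labels)."""
    if args.embed_images:
        return "frame references only"
    return "frame references + extracted text"


if __name__ == "__main__":
    args = build_parser().parse_args(["samples/video.mkv", "--embed-images"])
    print(analysis_mode(args))
```
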
## Usage

```bash
# Reference frames only (no OCR, faster)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v

# Reference frames + OCR text extraction
python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v

# Adjust frame quality (smaller files)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
```

## Files Modified

- `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
- `process_meeting.py` - Updated help text and examples to reflect frame referencing
- All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)

## Workflow Example

```bash
# First run: Generate everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v

# Result:
# - output/20251028-004-meeting/
#   - frames/ (40 frames, ~80KB each)
#   - meeting.json (whisper transcript)
#   - meeting_enhanced.txt (~50KB, references frames/)

# LLM can now:
# 1. Read enhanced transcript
# 2. See timeline of audio + screen changes
# 3. Load individual frames as needed from frames/ directory
```

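Step 3 in the comments above, loading a frame only when it is needed, might look like the following. It is a sketch under the assumption that frame references are relative to the directory holding the enhanced transcript; `resolve_frame` and `frame_as_data_uri` are hypothetical helpers, not part of the project.

```python
import base64
from pathlib import Path


def resolve_frame(transcript_path: str, frame_ref: str) -> Path:
    """Turn a 'frames/...' reference into a path next to the transcript file."""
    return Path(transcript_path).parent / frame_ref


def frame_as_data_uri(transcript_path: str, frame_ref: str) -> str:
    """Read one frame from disk and encode it on demand as a JPEG data URI."""
    data = resolve_frame(transcript_path, frame_ref).read_bytes()
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"
```

Only the frames a reader actually asks for are encoded, so the transcript stays small while the images remain available.
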