03 - Embed Images for LLM Analysis
Date
2025-10-28
Context
The hybrid OCR approach was fast and accurate, but its output formatting was messy, and vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end user's LLM analyze them with full audio context.
Problem
- OCR/vision models either hallucinate or produce messy text
- Code formatting/indentation is hard to preserve
- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
- Need to keep file size reasonable (~200KB per image is too big)
Solution: Image Embedding
Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
- See the actual screen content (no hallucination)
- Understand code structure, layout, and formatting visually
- Have full audio transcript context for each frame
- Analyze dashboards, terminals, and editors without an intermediate text-extraction step
Implementation
Quality Optimization:
- Default JPEG quality: 80 (good tradeoff between size and readability)
- Configurable via --embed-quality (0-100)
- Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
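One way the --embed-quality range could be enforced at the CLI is with an argparse type callback. This is a minimal sketch, not the actual code in process_meeting.py: the helper names quality_value and build_parser are hypothetical, and only the flag names and the default of 80 come from this note.

```python
import argparse


def quality_value(text: str) -> int:
    """Parse a JPEG quality setting and reject values outside 0-100."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"quality must be an integer, got {text!r}"
        ) from None
    if not 0 <= value <= 100:
        raise argparse.ArgumentTypeError(
            f"quality must be in 0-100, got {value}"
        )
    return value


def build_parser() -> argparse.ArgumentParser:
    """Illustrative subset of the flags; the real parser has more options."""
    parser = argparse.ArgumentParser(description="Embed frames in a transcript")
    parser.add_argument("video")
    parser.add_argument("--embed-images", action="store_true")
    # Default mirrors the documented default quality of 80.
    parser.add_argument("--embed-quality", type=quality_value, default=80)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.embed_images, args.embed_quality)
```

Validating in the type callback means an out-of-range value fails at argument parsing time with a usage message, before any video processing starts.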
Format:
[MM:SS] SPEAKER:
Audio transcript text here
[MM:SS] SCREEN CONTENT:
IMAGE (base64, 52KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
TEXT:
| Optional OCR text for reference
Features:
- Base64 encoding for easy embedding
- Size tracking and reporting
- Optional text content alongside images
- Works with scene detection for smart frame selection
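The entry format and size tracking described above could be produced along these lines. This is a sketch under stated assumptions rather than the code in meetus/transcript_merger.py: the function names are hypothetical, the frame is assumed to be already JPEG-encoded, and the reported size is assumed to be the JPEG byte count (not the base64 length). The extract_images helper shows how a consumer might recover the frames from the transcript.

```python
import base64
import re

# Matches the <image>data:image/jpeg;base64,...</image> wrapper shown in the format.
IMAGE_PATTERN = re.compile(r"<image>data:image/jpeg;base64,([A-Za-z0-9+/=]+)</image>")


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_screen_entry(seconds: float, jpeg_bytes: bytes, ocr_text: str = "") -> str:
    """Build one SCREEN CONTENT entry with an embedded base64 JPEG."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = round(len(jpeg_bytes) / 1024)  # assumption: size of the raw JPEG
    lines = [
        f"[{format_timestamp(seconds)}] SCREEN CONTENT:",
        f"IMAGE (base64, {size_kb}KB):",
        f"<image>data:image/jpeg;base64,{encoded}</image>",
    ]
    if ocr_text:
        # Optional OCR text, one "| "-prefixed line per source line.
        lines.append("TEXT:")
        lines.extend(f"| {line}" for line in ocr_text.splitlines())
    return "\n".join(lines)


def extract_images(transcript: str) -> list[bytes]:
    """Decode every embedded JPEG found in a transcript."""
    return [base64.b64decode(m) for m in IMAGE_PATTERN.findall(transcript)]
```

Because the image travels as a data URI inside the text, the transcript stays a single file that can be pasted or uploaded as-is.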
Usage
# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v
# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v
# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v
# Iterate on scene threshold (reuse whisper)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
File Sizes
Example for 20 frames:
- Quality 60: ~30-50KB per image = 0.6-1MB total
- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
- Quality 90: ~80-120KB per image = 1.6-2.4MB total
- Original: ~200KB per image = 4MB total
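The totals above are per-image size times frame count, using 1MB = 1000KB. The sketch below reproduces that arithmetic; the function names are hypothetical. It also exposes the base64 expansion (every 3 bytes become 4 characters) as an option, since this note does not say whether the quoted sizes are measured before or after encoding.

```python
import math


def base64_length(n_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of n_bytes."""
    return 4 * math.ceil(n_bytes / 3)


def estimate_total_mb(frames: int, per_image_kb: float,
                      include_base64: bool = False) -> float:
    """Estimate total image payload in MB (1MB = 1000KB, as in the list above)."""
    total_kb = frames * per_image_kb
    if include_base64:
        # Base64 inflates the payload by a factor of 4/3.
        total_kb = total_kb * 4 / 3
    return total_kb / 1000


if __name__ == "__main__":
    for label, low, high in [("60", 30, 50), ("80", 40, 80), ("90", 80, 120)]:
        print(f"Quality {label}: "
              f"{estimate_total_mb(20, low)}-{estimate_total_mb(20, high)}MB")
```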
Benefits
✓ No hallucination: LLM sees actual pixels
✓ Perfect formatting: Code structure preserved visually
✓ Full context: Audio transcript + visual frame together
✓ User's choice: Use your preferred LLM (Claude, GPT, etc.)
✓ Reasonable size: Quality 80 gives roughly 2.5-5x smaller files than the ~200KB originals
✓ Simple workflow: One file contains everything
Use Cases
- Code walkthroughs: LLM can see actual code structure and indentation
- Dashboard analysis: Charts, graphs, metrics visible to LLM
- Terminal sessions: Commands and output in proper context
- UI reviews: Actual interface visible with audio commentary
Files Modified
- meetus/transcript_merger.py - Image encoding and embedding
- meetus/workflow.py - Wire through config
- process_meeting.py - CLI flags
- meetus/output_manager.py - Cleaner directory naming (date + increment)
Output Directory Naming
Also changed output directory format for clarity:
- Old: 20251028_054553-video (confusing timestamps)
- New: 20251028-001-video (clear date + run number)
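A date-plus-increment name like the new format could be derived as follows. This is only a sketch: next_run_dir_name is a hypothetical helper, not the function in meetus/output_manager.py, and it assumes the run number counts all runs on the same date regardless of which video they processed.

```python
import re
from datetime import date
from pathlib import Path


def next_run_dir_name(existing: list[str], video_path: str, day: date) -> str:
    """Return 'YYYYMMDD-NNN-stem' using the next free run number for that day."""
    stem = Path(video_path).stem
    prefix = day.strftime("%Y%m%d")
    pattern = re.compile(rf"^{prefix}-(\d{{3}})-")
    used = []
    for name in existing:
        match = pattern.match(name)
        if match:
            used.append(int(match.group(1)))
    # Assumption: numbering restarts at 001 each day.
    run = max(used, default=0) + 1
    return f"{prefix}-{run:03d}-{stem}"
```

For example, with no earlier runs on 2025-10-28, samples/video.mkv maps to 20251028-001-video, matching the new format above.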