# 03 - Embed Images for LLM Analysis

## Date

2025-10-28

## Context

The hybrid OCR approach was fast and accurate, but its output formatting was messy. Vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.

## Problem

- OCR/vision models either hallucinate or produce messy text
- Code formatting/indentation is hard to preserve
- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
- Need to keep file size reasonable (~200KB per image is too big)

## Solution: Image Embedding

Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:

- See the actual screen content (no hallucinated OCR text)
- Understand code structure, layout, and formatting visually
- Have full audio transcript context for each frame
- Analyze dashboards, terminals, and editors from the actual pixels rather than from lossy extracted text

## Implementation

**Quality Optimization:**

- Default JPEG quality: 80 (good tradeoff between size and readability)
- Configurable via `--embed-quality` (0-100)
- Typical sizes at quality 80: ~40-80KB per image (vs ~200KB original)

**Format:**

```
[MM:SS] SPEAKER: Audio transcript text here

[MM:SS] SCREEN CONTENT:
IMAGE (base64, 52KB):
data:image/jpeg;base64,/9j/4AAQSkZJRg...
TEXT: |
  Optional OCR text for reference
```

**Features:**

- Base64 encoding for easy embedding
- Size tracking and reporting
- Optional text content alongside images
- Works with scene detection for smart frame selection

## Usage

```bash
# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v

# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v

# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v

# Iterate on scene threshold (reuse whisper)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
```

## File Sizes

**Example for 20 frames:**

- Quality 60: ~30-50KB per image = 0.6-1MB total
- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
- Quality 90: ~80-120KB per image = 1.6-2.4MB total
- Original: ~200KB per image = 4MB total

## Benefits

✓ **No hallucination**: LLM sees actual pixels
✓ **Preserved formatting**: Code structure and indentation are kept visually
✓ **Full context**: Audio transcript + visual frame together
✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
✓ **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller images vs original (~40-80KB vs ~200KB)
✓ **Simple workflow**: One file contains everything

## Use Cases

**Code walkthroughs:** LLM can see actual code structure and indentation
**Dashboard analysis:** Charts, graphs, metrics visible to LLM
**Terminal sessions:** Commands and output in proper context
**UI reviews:** Actual interface visible with audio commentary

## Files Modified

- `meetus/transcript_merger.py` - Image encoding and embedding
- `meetus/workflow.py` - Wire through config
- `process_meeting.py` - CLI flags
- `meetus/output_manager.py` - Cleaner directory naming (date + increment)

## Output Directory Naming

Also changed output directory format for clarity:

- Old: `20251028_054553-video` (confusing timestamps)
- New: `20251028-001-video` (clear date + run number)
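
The new `date + run number` naming could be produced along the following lines. This is a minimal sketch under stated assumptions, not the code in `meetus/output_manager.py`: the function name `next_run_dir` is hypothetical, and it assumes the run counter is shared by every video processed on the same date and is zero-padded to three digits as in `20251028-001-video`.

```python
import re
import tempfile
from datetime import date
from pathlib import Path


def next_run_dir(base: Path, video_stem: str, today: date) -> Path:
    """Return the next free '<YYYYMMDD>-<NNN>-<video_stem>' path under base.

    Hypothetical helper: the path is only computed, not created.
    """
    prefix = today.strftime("%Y%m%d")
    pattern = re.compile(rf"^{prefix}-(\d{{3}})-")

    # Collect run numbers already used for this date (assumed per-date counter).
    used = []
    if base.exists():
        for entry in base.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                used.append(int(match.group(1)))

    run = max(used, default=0) + 1
    return base / f"{prefix}-{run:03d}-{video_stem}"


# Example: two consecutive runs on the same date.
with tempfile.TemporaryDirectory() as tmp:
    out_root = Path(tmp)
    run_date = date(2025, 10, 28)

    first = next_run_dir(out_root, "video", run_date)
    first.mkdir()
    second = next_run_dir(out_root, "video", run_date)

    first_name, second_name = first.name, second.name
    print(first_name)
    print(second_name)
```

Scanning existing directories instead of storing a counter keeps the scheme stateless, at the cost of a possible race if two runs start at the same moment.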
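
To make the transcript format from the Implementation section concrete, here is a minimal sketch of how one `SCREEN CONTENT` entry might be assembled from JPEG bytes that have already been re-encoded at the chosen quality. The function name `format_screen_block` is hypothetical, the exact line layout is an assumption, and so is the choice to report the size of the JPEG bytes rather than of the base64 text; the actual logic in `meetus/transcript_merger.py` may differ.

```python
import base64
from typing import Optional


def format_screen_block(timestamp: str, jpeg_bytes: bytes,
                        ocr_text: Optional[str] = None) -> str:
    """Build one SCREEN CONTENT entry (hypothetical helper, assumed layout)."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = len(jpeg_bytes) / 1024  # size before base64 (assumption)

    lines = [
        f"[{timestamp}] SCREEN CONTENT:",
        f"IMAGE (base64, {size_kb:.0f}KB):",
        f"data:image/jpeg;base64,{encoded}",
    ]
    if ocr_text:
        # Optional OCR text, indented under a YAML-style block marker.
        lines.append("TEXT: |")
        lines.extend(f"  {line}" for line in ocr_text.splitlines())
    return "\n".join(lines)


def base64_length(raw_size: int) -> int:
    """Number of base64 characters for raw_size bytes (with padding)."""
    return 4 * ((raw_size + 2) // 3)


# Stand-in payload: a JPEG/JFIF header followed by zero bytes, 1024 bytes total.
fake_jpeg = b"\xff\xd8\xff\xe0\x00\x10JFIF" + b"\x00" * 1014
block = format_screen_block("01:23", fake_jpeg, "def main():\n    pass")
print(block[:120])
```

Base64 encodes every 3 bytes as 4 characters, so the embedded text is about one third larger than the JPEG itself: a 60KB image occupies roughly 80KB in the transcript. If the per-image figures in this note are JPEG sizes, the final transcript will be correspondingly larger.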