embed images

2025-10-28 08:02:45 -03:00
parent b1e1daf278
commit 118ef04223
12 changed files with 1016 additions and 61 deletions
--- a/def/03-embed-images-for-llm.md
+++ b/def/03-embed-images-for-llm.md
@@ -0,0 +1,100 @@
+# 03 - Embed Images for LLM Analysis
+
+## Date
+2025-10-28
+
+## Context
+The hybrid OCR approach was fast and accurate, but its output formatting was messy, and vision models hallucinated text. Rather than fighting with text extraction, embed the actual frame images in the enhanced transcript and let the end user's LLM analyze them with the full audio context.
+
+## Problem
+- OCR/vision models either hallucinate or produce messy text
+- Code formatting/indentation is hard to preserve
+- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
+- Need to keep file size reasonable (~200KB per image is too big)
+
+## Solution: Image Embedding
+
+Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
+- See the actual screen content, with no intermediate OCR or vision step that could hallucinate text
+- Understand code structure, layout, and formatting visually
+- Have full audio transcript context for each frame
+- Read dashboards, terminals, and editors directly from the pixels
+
+## Implementation
+
+**Quality Optimization:**
+- Default JPEG quality: 80 (good tradeoff between size and readability)
+- Configurable via `--embed-quality` (0-100)
+- Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
+
+**Format:**
+```
+[MM:SS] SPEAKER:
+  Audio transcript text here
+
+[MM:SS] SCREEN CONTENT:
+  IMAGE (base64, 52KB):
+  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
+
+  TEXT:
+  | Optional OCR text for reference
+```
+
+**Features:**
+- Base64 encoding for easy embedding
+- Size tracking and reporting
+- Optional text content alongside images
+- Works with scene detection for smart frame selection
+
+## Usage
+
+```bash
+# Basic: Embed images at quality 80 (default)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v
+
+# Lower quality for smaller files (still readable)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v
+
+# Higher quality for detailed code
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v
+
+# Iterate on the scene threshold (reuses the existing Whisper transcript instead of re-running it)
+python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
+```
+
+## File Sizes
+
+**Example for 20 frames:**
+- Quality 60: ~30-50KB per image = 0.6-1MB total
+- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
+- Quality 90: ~80-120KB per image = 1.6-2.4MB total
+- Original: ~200KB per image = 4MB total
+
+## Benefits
+
+✓ **No OCR hallucination**: LLM sees actual pixels
+✓ **Formatting preserved**: Code structure stays visible as it appeared on screen
+✓ **Full context**: Audio transcript + visual frame together
+✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
+✓ **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller files vs original (~40-80KB vs ~200KB per image)
+✓ **Simple workflow**: One file contains everything
+
+## Use Cases
+
+- **Code walkthroughs:** LLM can see actual code structure and indentation
+- **Dashboard analysis:** Charts, graphs, metrics visible to LLM
+- **Terminal sessions:** Commands and output in proper context
+- **UI reviews:** Actual interface visible with audio commentary
+
+## Files Modified
+
+- `meetus/transcript_merger.py` - Image encoding and embedding
+- `meetus/workflow.py` - Wire through config
+- `process_meeting.py` - CLI flags
+- `meetus/output_manager.py` - Cleaner directory naming (date + increment)
+
+## Output Directory Naming
+
+The output directory name format also changed for clarity:
+- Old: `20251028_054553-video` (date plus time-of-day timestamp, hard to read at a glance)
+- New: `20251028-001-video` (date plus run number)
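
The date-plus-run-number scheme above could be produced along the following lines. This is only a sketch, not the code in `meetus/output_manager.py`: the function name `next_run_dir`, the three-digit zero padding, and the choice to restart the counter for each date are assumptions inferred from the single example `20251028-001-video`.

```python
import re
from datetime import date
from pathlib import Path


def next_run_dir(base: Path, video_stem: str, today: date) -> Path:
    """Return the next free `YYYYMMDD-NNN-<stem>` path under `base`.

    Hypothetical helper: the run counter is assumed to restart each day
    and to be shared by all videos processed on that date.
    """
    prefix = today.strftime("%Y%m%d")
    # Matches same-day run directories such as 20251028-001-video.
    pattern = re.compile(rf"^{prefix}-(\d{{3}})-")

    highest = 0
    if base.is_dir():
        for entry in base.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                highest = max(highest, int(match.group(1)))

    # The path is only computed here; the caller creates the directory.
    return base / f"{prefix}-{highest + 1:03d}-{video_stem}"
```

Calling `next_run_dir(Path("output"), "video", date(2025, 10, 28))` when `output/` already holds `20251028-001-video` would return `output/20251028-002-video`. Taking the maximum existing number rather than counting directories keeps names unique even if an earlier run was deleted.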