embed images
100
def/03-embed-images-for-llm.md
Normal file
@@ -0,0 +1,100 @@
# 03 - Embed Images for LLM Analysis

## Date
2025-10-28

## Context
The hybrid OCR approach was fast and accurate, but its output formatting was messy, and vision models hallucinated text. Rather than keep fighting with text extraction, this change embeds the actual frame images in the enhanced transcript and lets the end user's LLM analyze them with the full audio context.

## Problem
- OCR and vision models either hallucinate or produce messy text
- Code formatting and indentation are hard to preserve in extracted text
- Users want to analyze frames with their own LLM (Claude, GPT, etc.)
- File size has to stay reasonable (the original frames, at ~200KB per image, are too big)

## Solution: Image Embedding

Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
- See the actual screen content, with no intermediate OCR or vision step to hallucinate text
- Understand code structure, layout, and formatting visually
- Use the full audio transcript as context for each frame
- Analyze dashboards, terminals, and editors directly from the pixels

## Implementation

**Quality Optimization:**
- Default JPEG quality: 80 (a good tradeoff between size and readability)
- Configurable via `--embed-quality` (0-100)
- Typical size at quality 80: ~40-80KB per image (vs ~200KB for the original frame)

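As an illustration of the quality setting described above, here is a minimal `argparse` sketch that models only the two embedding flags. The helper names `quality_type` and `build_parser` are hypothetical and not taken from `process_meeting.py`; the default of 80 and the 0-100 range come from the notes above, and rejecting out-of-range values is an assumption about how the real CLI might behave.

```python
import argparse


def quality_type(value: str) -> int:
    """Parse a JPEG quality value and reject anything outside 0-100."""
    try:
        quality = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"quality must be an integer, got {value!r}")
    if not 0 <= quality <= 100:
        raise argparse.ArgumentTypeError(f"quality must be in 0-100, got {quality}")
    return quality


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real script accepts more flags than these.
    parser = argparse.ArgumentParser(description="Embed frame images in the transcript")
    parser.add_argument("video", help="path to the input video")
    parser.add_argument("--embed-images", action="store_true",
                        help="embed frames as base64 JPEG in the transcript")
    parser.add_argument("--embed-quality", type=quality_type, default=80,
                        help="JPEG quality for embedded frames (0-100)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"embed_images={args.embed_images} embed_quality={args.embed_quality}")
```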
**Format:**
```
[MM:SS] SPEAKER:
Audio transcript text here

[MM:SS] SCREEN CONTENT:
IMAGE (base64, 52KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>

TEXT:
| Optional OCR text for reference
```

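The format above could be produced along the following lines. This is a sketch using only the standard library, not the code in `meetus/transcript_merger.py`; the function names are hypothetical, the input is assumed to be already-encoded JPEG bytes, and reporting the size of the base64 payload (rather than the raw JPEG) is an assumption, since the note does not say which one the `52KB` figure refers to.

```python
import base64


def format_timestamp(seconds: float) -> str:
    """Render a time offset as [MM:SS], the timestamp style used in the transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def render_screen_block(seconds: float, jpeg_bytes: bytes, ocr_text: str = "") -> str:
    """Build one SCREEN CONTENT entry with the frame embedded as a data URI."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = len(encoded) / 1024  # assumption: size of the base64 text, not the raw JPEG
    lines = [
        f"{format_timestamp(seconds)} SCREEN CONTENT:",
        f"IMAGE (base64, {size_kb:.0f}KB):",
        f"<image>data:image/jpeg;base64,{encoded}</image>",
    ]
    if ocr_text:
        # Optional OCR text, prefixed with "| " as in the format example.
        lines += ["", "TEXT:"] + [f"| {line}" for line in ocr_text.splitlines()]
    return "\n".join(lines)
```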
**Features:**
- Base64 encoding for easy embedding
- Size tracking and reporting
- Optional text content alongside images
- Works with scene detection for smart frame selection

## Usage

```bash
# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v

# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v

# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v

# Iterate on the scene threshold (reuses the cached Whisper transcript)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
```

## File Sizes

**Example for 20 frames:**
- Quality 60: ~30-50KB per image = 0.6-1.0MB total
- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
- Quality 90: ~80-120KB per image = 1.6-2.4MB total
- Original: ~200KB per image = 4MB total

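The totals above are the frame count multiplied by the per-image range, with 1MB taken as 1000KB. The sketch below reproduces that arithmetic and also shows the overhead of base64 itself, which emits 4 characters for every 3 input bytes. The note does not say whether the per-image figures are raw JPEG sizes or base64 sizes; if they are raw sizes, the embedded text is roughly a third larger. Both function names are hypothetical.

```python
def total_range_mb(frames: int, low_kb: float, high_kb: float) -> tuple[float, float]:
    """Total size range in MB for a number of frames (1MB = 1000KB, as in the list above)."""
    return frames * low_kb / 1000, frames * high_kb / 1000


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 encoding: 4 characters per 3 input bytes, rounded up."""
    return 4 * ((raw_bytes + 2) // 3)


if __name__ == "__main__":
    for quality, low, high in [(60, 30, 50), (80, 40, 80), (90, 80, 120)]:
        lo_mb, hi_mb = total_range_mb(20, low, high)
        print(f"quality {quality}: {lo_mb:.1f}-{hi_mb:.1f}MB for 20 frames")
    print(f"a 60000-byte JPEG becomes {base64_length(60000)} base64 characters")
```

If file size matters more than detail, the same calculation can be used to pick a quality setting for a given frame count before running the full pipeline.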
## Benefits

✓ **No OCR hallucination**: LLM sees actual pixels
✓ **Formatting preserved**: Code structure stays intact visually
✓ **Full context**: Audio transcript + visual frame together
✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
✓ **Reasonable size**: Quality 80 gives files roughly 2.5-5x smaller than the ~200KB originals
✓ **Simple workflow**: One file contains everything

## Use Cases

**Code walkthroughs:** LLM can see actual code structure and indentation
**Dashboard analysis:** Charts, graphs, metrics visible to LLM
**Terminal sessions:** Commands and output in proper context
**UI reviews:** Actual interface visible with audio commentary

## Files Modified

- `meetus/transcript_merger.py` - Image encoding and embedding
- `meetus/workflow.py` - Wire through config
- `process_meeting.py` - CLI flags
- `meetus/output_manager.py` - Cleaner directory naming (date + increment)

## Output Directory Naming

The output directory format also changed for clarity:
- Old: `20251028_054553-video` (confusing timestamp)
- New: `20251028-001-video` (clear date + run number)
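
One way to generate names in the new format is to scan the output root for directories that start with today's date and take the next free run number. The sketch below is an illustration, not the code in `meetus/output_manager.py`; the function name `next_run_dir` is hypothetical, and counting runs per date across all videos (rather than per video) is an assumption.

```python
import re
from datetime import date
from pathlib import Path


def next_run_dir(base: Path, video: Path, today: date) -> Path:
    """Return base/YYYYMMDD-NNN-<video stem> using the next unused run number for that date."""
    prefix = today.strftime("%Y%m%d")
    pattern = re.compile(rf"^{prefix}-(\d{{3}})-")
    highest = 0
    for entry in base.iterdir():
        match = pattern.match(entry.name)
        if entry.is_dir() and match:
            highest = max(highest, int(match.group(1)))
    # Assumption: run numbers are shared by all videos processed on the same date.
    return base / f"{prefix}-{highest + 1:03d}-{video.stem}"
```

With an empty output root, `next_run_dir(root, Path("samples/video.mkv"), date(2025, 10, 28))` yields a directory named `20251028-001-video`, matching the new format above.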