embed images

2025-10-28 08:02:45 -03:00
parent b1e1daf278
commit 118ef04223
12 changed files with 1016 additions and 61 deletions
--- a/def/02-hybrid-opencv-ocr-llm.md
+++ b/def/02-hybrid-opencv-ocr-llm.md
@@ -0,0 +1,111 @@
+# 02 - Hybrid OpenCV + OCR + LLM Approach
+
+## Date
+2025-10-28
+
+## Context
+Vision models (llava) hallucinated text badly: they showed HTML code where there was none and invented text that did not exist. Pure OCR was fast and accurate but lost code formatting and structure.
+
+## Problem
+- **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
+- **Pure OCR**: Accurate text but messy output, lost indentation/formatting
+- **Need**: Accurate text extraction + preserved code structure
+
+## Solution: Three-Stage Hybrid Approach
+
+### Stage 1: OpenCV Text Detection
+Use morphological operations to find text regions:
+- Adaptive thresholding (handles varying lighting)
+- Dilation with horizontal kernel to connect text lines
+- Contour detection to find bounding boxes
+- Filter by area and aspect ratio
+- Merge overlapping regions
+
+### Stage 2: Region-Based OCR
+- Sort regions by reading order (top-to-bottom, left-to-right)
+- Crop each region from original image
+- Run OCR on cropped regions (more accurate than full frame)
+- Tesseract with PSM 6 mode to preserve layout
+- Preserve indentation in cleaning step
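+
+The merge and reading-order steps can be sketched in plain Python. This is a minimal illustration, not the code in `hybrid_processor.py`: the names `Box`, `merge_overlapping`, `reading_order`, and the `row_tolerance` parameter are assumptions made for the example. Boxes use the `(x, y, w, h)` shape that `cv2.boundingRect` returns.
+
+```python
+from typing import List, Tuple
+
+Box = Tuple[int, int, int, int]  # (x, y, w, h); hypothetical alias for this sketch
+
+
+def overlaps(a: Box, b: Box) -> bool:
+    """Return True if the two boxes share any area."""
+    ax, ay, aw, ah = a
+    bx, by, bw, bh = b
+    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah
+
+
+def union(a: Box, b: Box) -> Box:
+    """Return the smallest box that contains both inputs."""
+    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
+    x2 = max(a[0] + a[2], b[0] + b[2])
+    y2 = max(a[1] + a[3], b[1] + b[3])
+    return (x1, y1, x2 - x1, y2 - y1)
+
+
+def merge_overlapping(boxes: List[Box]) -> List[Box]:
+    """Fuse overlapping boxes, repeating until no pair overlaps."""
+    merged = list(boxes)
+    changed = True
+    while changed:
+        changed = False
+        result: List[Box] = []
+        for box in merged:
+            for i, kept in enumerate(result):
+                if overlaps(box, kept):
+                    result[i] = union(box, kept)
+                    changed = True
+                    break
+            else:
+                result.append(box)
+        merged = result
+    return merged
+
+
+def reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
+    """Sort top-to-bottom by row band, then left-to-right within a band."""
+    return sorted(boxes, key=lambda b: (b[1] // row_tolerance, b[0]))
+```
+
+Grouping top edges into `row_tolerance`-pixel bands keeps two boxes on the same text line together even when their `y` values differ by a few pixels. The band width would need tuning per video resolution.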
+
+### Stage 3: Optional LLM Cleanup
+- Take accurate OCR output (no hallucination)
+- Use lightweight LLM (llama3.2:3b for speed) to:
+  - Fix obvious OCR errors (l→1, O→0)
+  - Restore code indentation and structure
+  - Preserve exact text content
+  - No added explanations or hallucinated content
+
+## Benefits
+✓ **Accurate**: OCR reads actual pixels, no hallucination
+✓ **Fast**: OpenCV detection takes negligible time, and OCR on cropped regions is quick (~2-3s per frame overall)
+✓ **Structured**: Regions separated with headers showing position
+✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
+✓ **Deterministic**: Same input = same output (unlike vision models)
+
+## Implementation
+
+**New file:** `meetus/hybrid_processor.py`
+- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
+- Region sorting for proper reading order
+- Visual separators between regions
+
+**CLI flags:**
+```bash
+--use-hybrid                # Enable hybrid mode
+--hybrid-llm-cleanup        # Add LLM post-processing (optional)
+--hybrid-llm-model MODEL    # LLM model (default: llama3.2:3b)
+```
+
+**OCR improvements:**
+- Tesseract PSM 6 mode for better layout preservation
+- Modified text cleaning to keep indentation
+- `preserve_layout` parameter
+
+## Usage
+
+```bash
+# Basic hybrid (OpenCV + OCR)
+python process_meeting.py samples/video.mkv --use-hybrid --scene-detection
+
+# With LLM cleanup for best code formatting
+python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v
+
+# Iterate on threshold
+python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
+```
+
+## Output Format
+
+```
+[Region 1 at y=120]
+function calculateTotal(items) {
+  return items.reduce((sum, item) => sum + item.price, 0);
+}
+
+============================================================
+
+[Region 2 at y=450]
+const result = calculateTotal(cartItems);
+console.log('Total:', result);
+```
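+
+A layout like the one above could be produced by a small formatter. This is a sketch under assumptions: `format_regions` and `SEPARATOR` are hypothetical names, and the 60-character separator width is inferred from the sample rather than taken from the implementation.
+
+```python
+from typing import List, Tuple
+
+SEPARATOR = "=" * 60  # width assumed from the sample output
+
+
+def format_regions(regions: List[Tuple[int, str]]) -> str:
+    """Join (y, text) pairs into labelled blocks, ordered by y position."""
+    blocks = []
+    ordered = sorted(regions, key=lambda r: r[0])
+    for index, (y, text) in enumerate(ordered, start=1):
+        # rstrip() drops trailing whitespace only, so indentation is kept
+        blocks.append(f"[Region {index} at y={y}]\n{text.rstrip()}")
+    return f"\n\n{SEPARATOR}\n\n".join(blocks)
+```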
+
+## Performance
+- **Without LLM cleanup**: Very fast (~2-3s per frame)
+- **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
+- **Accuracy**: Text comes from OCR on actual pixels, so it avoids the invented content that vision models produced
+
+## When to Use What
+
+| Method | Best For | Pros | Cons |
+|--------|----------|------|------|
+| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
+| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
+| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
+| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
+
+## Files Modified
+- `meetus/hybrid_processor.py` - New hybrid processor
+- `meetus/ocr_processor.py` - Layout preservation
+- `meetus/workflow.py` - Hybrid mode integration
+- `process_meeting.py` - CLI flags and examples
--- a/def/03-embed-images-for-llm.md
+++ b/def/03-embed-images-for-llm.md
@@ -0,0 +1,100 @@
+# 03 - Embed Images for LLM Analysis
+
+## Date
+2025-10-28
+
+## Context
+Hybrid OCR approach was fast and accurate but formatting was messy. Vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.
+
+## Problem
+- Vision models hallucinate text, and OCR produces messy text
+- Code formatting/indentation is hard to preserve
+- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
+- Need to keep file size reasonable (~200KB per image is too big)
+
+## Solution: Image Embedding
+
+Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
+- See the actual screen content (no hallucination)
+- Understand code structure, layout, and formatting visually
+- Have full audio transcript context for each frame
+- Analyze dashboards, terminals, editors with perfect accuracy
+
+## Implementation
+
+**Quality Optimization:**
+- Default JPEG quality: 80 (good tradeoff between size and readability)
+- Configurable via `--embed-quality` (0-100)
+- Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
+
+**Format:**
+```
+[MM:SS] SPEAKER:
+  Audio transcript text here
+
+[MM:SS] SCREEN CONTENT:
+  IMAGE (base64, 52KB):
+  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
+
+  TEXT:
+  | Optional OCR text for reference
+```
+
+**Features:**
+- Base64 encoding for easy embedding
+- Size tracking and reporting
+- Optional text content alongside images
+- Works with scene detection for smart frame selection
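+
+The embedding step can be sketched with the standard library alone. This is an illustration, not the code in `transcript_merger.py`: `image_block` is a hypothetical helper, it takes already-compressed JPEG bytes, and it assumes the reported size is the length of the base64 string.
+
+```python
+import base64
+
+
+def image_block(timestamp: str, jpeg_bytes: bytes, ocr_text: str = "") -> str:
+    """Render one SCREEN CONTENT entry with an embedded base64 JPEG."""
+    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
+    size_kb = len(encoded) // 1024  # assumption: size refers to the encoded text
+    lines = [
+        f"[{timestamp}] SCREEN CONTENT:",
+        f"  IMAGE (base64, {size_kb}KB):",
+        f"  <image>data:image/jpeg;base64,{encoded}</image>",
+    ]
+    if ocr_text:
+        # Optional OCR text, each line prefixed with "| " as in the format above
+        lines += ["", "  TEXT:"]
+        lines += [f"  | {line}" for line in ocr_text.splitlines()]
+    return "\n".join(lines)
+```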
+
+## Usage
+
+```bash
+# Basic: Embed images at quality 80 (default)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v
+
+# Lower quality for smaller files (still readable)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v
+
+# Higher quality for detailed code
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v
+
+# Iterate on scene threshold (reuse whisper)
+python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
+```
+
+## File Sizes
+
+**Example for 20 frames:**
+- Quality 60: ~30-50KB per image = 0.6-1MB total
+- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
+- Quality 90: ~80-120KB per image = 1.6-2.4MB total
+- Original: ~200KB per image = 4MB total
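+
+If the per-image figures above are JPEG byte sizes before encoding (the notes do not say), the transcript grows by about a third more, because base64 emits 4 characters for every 3 bytes. A rough estimate, with hypothetical helper names and 1MB taken as 1000KB to match the totals above:
+
+```python
+import math
+
+
+def base64_len(n_bytes: int) -> int:
+    """Length of the padded base64 encoding of n_bytes of data."""
+    return 4 * math.ceil(n_bytes / 3)
+
+
+def embedded_total_mb(frames: int, kb_per_image: float) -> float:
+    """Estimated size in MB of all embedded images after base64 encoding."""
+    raw_bytes = int(kb_per_image * 1000)
+    return frames * base64_len(raw_bytes) / 1_000_000
+```
+
+For example, `embedded_total_mb(20, 80)` estimates the upper end of the quality 80 range after encoding.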
+
+## Benefits
+
+✓ **No hallucination**: LLM sees actual pixels
+✓ **Perfect formatting**: Code structure preserved visually
+✓ **Full context**: Audio transcript + visual frame together
+✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
+✓ **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller images than the original (40-80KB vs ~200KB)
+✓ **Simple workflow**: One file contains everything
+
+## Use Cases
+
+- **Code walkthroughs:** LLM can see actual code structure and indentation
+- **Dashboard analysis:** Charts, graphs, metrics visible to LLM
+- **Terminal sessions:** Commands and output in proper context
+- **UI reviews:** Actual interface visible with audio commentary
+
+## Files Modified
+
+- `meetus/transcript_merger.py` - Image encoding and embedding
+- `meetus/workflow.py` - Wire through config
+- `process_meeting.py` - CLI flags
+- `meetus/output_manager.py` - Cleaner directory naming (date + increment)
+
+## Output Directory Naming
+
+Also changed output directory format for clarity:
+- Old: `20251028_054553-video` (confusing timestamps)
+- New: `20251028-001-video` (clear date + run number)
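+
+One way to produce the new names is to probe for the first unused run number. This is a sketch, not the code in `output_manager.py`: `next_run_dir` is a hypothetical helper, and it assumes the counter is tracked per date and per video name.
+
+```python
+from datetime import date
+from pathlib import Path
+
+
+def next_run_dir(base: Path, video_stem: str, today: date) -> Path:
+    """Return base/YYYYMMDD-NNN-stem using the first unused run number."""
+    prefix = today.strftime("%Y%m%d")
+    run = 1
+    # Increment until no directory with that run number exists
+    while (base / f"{prefix}-{run:03d}-{video_stem}").exists():
+        run += 1
+    return base / f"{prefix}-{run:03d}-{video_stem}"
+```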
--- a/def/04-fix-whisper-cache-loading.md
+++ b/def/04-fix-whisper-cache-loading.md
@@ -0,0 +1,78 @@
+# 04 - Fix Whisper Cache Loading
+
+## Date
+2025-10-28
+
+## Problem
+Enhanced transcript was not including the audio segments from cached whisper transcripts when running without the `--run-whisper` flag.
+
+Example command that failed:
+```bash
+python process_meeting.py samples/zaca-run-scrapers.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
+```
+
+Result: Enhanced transcript only contained embedded images, no audio segments (0 "SPEAKER" entries).
+
+## Root Cause
+In `workflow.py`, the `_run_whisper()` method was checking the `run_whisper` flag **before** checking the cache:
+
+```python
+def _run_whisper(self) -> Optional[str]:
+    if not self.config.run_whisper:
+        return self.config.transcript_path  # Returns None if --transcript not specified
+
+    # Cache check NEVER REACHED if run_whisper is False
+    cached = self.cache_mgr.get_whisper_cache()
+    if cached:
+        return str(cached)
+```
+
+This meant:
+- User runs command without `--run-whisper`
+- Method returns None immediately
+- Cached whisper transcript is never discovered
+- No audio segments in enhanced output
+
+## Solution
+Reorder the logic to check cache **first**, regardless of flags:
+
+```python
+def _run_whisper(self) -> Optional[str]:
+    """Run Whisper transcription if requested, or use cached/provided transcript."""
+    # First, check cache (regardless of run_whisper flag)
+    cached = self.cache_mgr.get_whisper_cache()
+    if cached:
+        return str(cached)
+
+    # If no cache and not running whisper, use provided transcript path (if any)
+    if not self.config.run_whisper:
+        return self.config.transcript_path
+
+    # If no cache and run_whisper is True, run whisper transcription
+    # ... rest of whisper code
+```
+
+## New Behavior
+1. The cache is checked first, regardless of the `--run-whisper` flag
+2. If a cached whisper transcript exists, it is used
+3. If there is no cache and `--run-whisper` is not specified, the `--transcript` path is used (`None` if it was not given)
+4. If there is no cache and `--run-whisper` is specified, whisper runs
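+
+The resolution order above can be exercised in isolation. This is a self-contained sketch: `Config` and `resolve_transcript` are hypothetical stand-ins for the real config object and `_run_whisper()`, with the cache lookup and the transcription call passed in as arguments.
+
+```python
+from dataclasses import dataclass
+from typing import Callable, Optional
+
+
+@dataclass
+class Config:  # hypothetical stand-in for the workflow config
+    run_whisper: bool = False
+    transcript_path: Optional[str] = None
+
+
+def resolve_transcript(
+    config: Config,
+    cached: Optional[str],
+    transcribe: Callable[[], str],
+) -> Optional[str]:
+    """Pick the transcript source: cache, then --transcript, then whisper."""
+    if cached:
+        return cached  # steps 1-2: cache wins regardless of flags
+    if not config.run_whisper:
+        return config.transcript_path  # step 3: may be None
+    return transcribe()  # step 4: no cache and whisper requested
+```
+
+Passing the transcription step as a callable makes it easy to confirm that whisper is not invoked when a cached transcript exists.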
+
+## Benefits
+✓ Cached whisper transcripts are always discovered and used
+✓ User can iterate on frame extraction/analysis without re-running whisper
+✓ Enhanced transcripts now properly include both audio + visual content
+✓ Granular cache flags (`--skip-cache-frames`, `--skip-cache-whisper`) work as expected
+
+## Use Case
+```bash
+# First run: Generate whisper transcript + extract frames
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
+
+# Second run: Iterate on scene threshold without re-running whisper
+python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
+# Now correctly includes cached whisper transcript in enhanced output!
+```
+
+## Files Modified
+- `meetus/workflow.py` - Reordered logic in `_run_whisper()` method (lines 172-181)