embed images

2025-10-28 08:02:45 -03:00
parent b1e1daf278
commit 118ef04223
12 changed files with 1016 additions and 61 deletions
--- a/def/02-hybrid-opencv-ocr-llm.md
+++ b/def/02-hybrid-opencv-ocr-llm.md
@@ -0,0 +1,111 @@
+# 02 - Hybrid OpenCV + OCR + LLM Approach
+
+## Date
+2025-10-28
+
+## Context
+Vision models (llava) hallucinated text content badly: they showed HTML code where there was none and invented text that did not exist. Pure OCR was fast and accurate but lost code formatting and structure.
+
+## Problem
+- **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
+- **Pure OCR**: Accurate text but messy output, lost indentation/formatting
+- **Need**: Accurate text extraction + preserved code structure
+
+## Solution: Three-Stage Hybrid Approach
+
+### Stage 1: OpenCV Text Detection
+Use morphological operations to find text regions:
+- Adaptive thresholding (handles varying lighting)
+- Dilation with horizontal kernel to connect text lines
+- Contour detection to find bounding boxes
+- Filter by area and aspect ratio
+- Merge overlapping regions
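
Review note: the detection steps listed above can be sketched as follows. This is a minimal sketch, not the code in `meetus/hybrid_processor.py`. Every function name, kernel size, and threshold here is a hypothetical value chosen for illustration, and the two-value return of `cv2.findContours` assumes OpenCV 4.x.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def find_candidate_boxes(image_bgr) -> List[Box]:
    """Threshold, dilate, and contour a frame; return raw bounding boxes."""
    import cv2  # imported lazily so the pure-Python helpers below work without OpenCV

    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Adaptive thresholding copes with uneven lighting; text becomes white on black.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 15, 10
    )
    # A wide, short kernel joins characters on the same line into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    dilated = cv2.dilate(binary, kernel, iterations=2)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [tuple(cv2.boundingRect(c)) for c in contours]


def filter_boxes(boxes: List[Box], min_area: int = 500, min_aspect: float = 0.5) -> List[Box]:
    """Drop boxes that are too small or too tall and narrow to be text lines."""
    return [
        (x, y, w, h)
        for x, y, w, h in boxes
        if h > 0 and w * h >= min_area and w / h >= min_aspect
    ]


def overlaps(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share any interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeatedly replace overlapping boxes with their union until none overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, other in enumerate(result):
                if overlaps(box, other):
                    x1, y1 = min(box[0], other[0]), min(box[1], other[1])
                    x2 = max(box[0] + box[2], other[0] + other[2])
                    y2 = max(box[1] + box[3], other[1] + other[3])
                    result[i] = (x1, y1, x2 - x1, y2 - y1)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged
```

Merging loops until a pass makes no change, because a union can itself start overlapping a box it did not touch before.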
+
+### Stage 2: Region-Based OCR
+- Sort regions by reading order (top-to-bottom, left-to-right)
+- Crop each region from original image
+- Run OCR on cropped regions (more accurate than full frame)
+- Run Tesseract in PSM 6 mode (assume a single uniform block of text) to preserve layout
+- Preserve indentation in cleaning step
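
Review note: a minimal sketch of the reading-order sort and per-region OCR described above. It assumes `pytesseract` is the Tesseract binding in use and that frames are NumPy arrays. The helper names and the `row_tolerance` default are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sort_reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
    """Order boxes top-to-bottom, then left-to-right within each row.

    Boxes whose top edges lie within row_tolerance pixels of the first box
    in a row are treated as belonging to that row.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def ocr_regions(image, boxes: List[Box]) -> List[Tuple[int, str]]:
    """Crop each region from the original frame and OCR it with PSM 6."""
    import pytesseract  # imported lazily so the sort helper works without Tesseract

    results = []
    for x, y, w, h in sort_reading_order(boxes):
        crop = image[y:y + h, x:x + w]  # NumPy slicing: rows first, then columns
        text = pytesseract.image_to_string(crop, config="--psm 6")
        results.append((y, text))
    return results
```

Sorting by raw `(y, x)` alone would misorder two boxes on the same visual line whose top edges differ by a pixel or two, which is why rows are grouped with a tolerance first.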
+
+### Stage 3: Optional LLM Cleanup
+- Take accurate OCR output (no hallucination)
+- Use lightweight LLM (llama3.2:3b for speed) to:
+  - Fix obvious OCR errors (l→1, O→0)
+  - Restore code indentation and structure
+  - Preserve exact text content
+  - No added explanations or hallucinated content
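
Review note: one way the cleanup call could look, assuming the model is served by a local Ollama instance on its default port and reached through the `/api/generate` endpoint. The prompt wording, the function names, and the fall-back-to-raw-OCR behaviour are illustrative assumptions, not taken from the commit.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_cleanup_prompt(ocr_text: str) -> str:
    """Build a prompt that restricts the model to conservative fixes."""
    return (
        "You are cleaning up OCR output of source code.\n"
        "- Fix obvious character confusions (for example l vs 1, O vs 0).\n"
        "- Restore indentation and structure.\n"
        "- Keep the text content exactly; do not add, remove, or explain anything.\n"
        "Return only the cleaned text.\n\n"
        f"OCR text:\n{ocr_text}"
    )


def cleanup_with_llm(ocr_text: str, model: str = "llama3.2:3b", timeout: float = 60.0) -> str:
    """Send OCR text to Ollama; return the raw OCR text if the call fails."""
    payload = json.dumps({
        "model": model,
        "prompt": build_cleanup_prompt(ocr_text),
        "stream": False,                 # one JSON object instead of a token stream
        "options": {"temperature": 0},   # reduce run-to-run variation
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["response"]
    except OSError:
        # Network or server failure: the accurate OCR text is still usable as-is.
        return ocr_text
```

Falling back to the untouched OCR text keeps the stage optional in practice: a missing or slow Ollama server degrades formatting, not accuracy.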
+
+## Benefits
+✓ **Accurate**: OCR reads actual pixels, no hallucination
+✓ **Fast**: OpenCV detection adds negligible time, and OCR on cropped regions is quick
+✓ **Structured**: Regions separated with headers showing position
+✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
+✓ **Deterministic**: Same input = same output for the OpenCV + OCR stages (unlike vision models); the optional LLM cleanup may vary between runs
+
+## Implementation
+
+**New file:** `meetus/hybrid_processor.py`
+- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
+- Region sorting for proper reading order
+- Visual separators between regions
+
+**CLI flags:**
+```bash
+--use-hybrid                # Enable hybrid mode
+--hybrid-llm-cleanup        # Add LLM post-processing (optional)
+--hybrid-llm-model MODEL    # LLM model (default: llama3.2:3b)
+```
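
Review note: the flags above map naturally onto `argparse`. The sketch below is an assumption about how `process_meeting.py` might declare them. The flag names and the default model come from the list above, while the positional `video` argument, the help strings, and `build_parser` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the hybrid-mode flags; other options of the real CLI are omitted."""
    parser = argparse.ArgumentParser(description="Process a meeting recording")
    parser.add_argument("video", help="path to the input video")
    parser.add_argument("--use-hybrid", action="store_true",
                        help="enable hybrid mode (OpenCV + OCR)")
    parser.add_argument("--hybrid-llm-cleanup", action="store_true",
                        help="add optional LLM post-processing")
    parser.add_argument("--hybrid-llm-model", metavar="MODEL", default="llama3.2:3b",
                        help="LLM model used for cleanup")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```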
+
+**OCR improvements:**
+- Tesseract PSM 6 mode for better layout preservation
+- Modified text cleaning to keep indentation
+- `preserve_layout` parameter
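
Review note: a sketch of what indentation-preserving cleaning could look like. Only the `preserve_layout` parameter name comes from the list above. The function name and the exact cleaning rules (strip trailing whitespace, collapse runs of blank lines) are assumptions.

```python
def clean_ocr_text(text: str, preserve_layout: bool = True) -> str:
    """Tidy OCR output while optionally keeping leading indentation.

    With preserve_layout=True only trailing whitespace is removed, so code
    indentation survives. With preserve_layout=False each line's whitespace
    is collapsed, which suits plain prose.
    """
    lines = []
    for line in text.splitlines():
        line = line.rstrip()
        if not preserve_layout:
            line = " ".join(line.split())
        # Keep a blank line only when it follows a non-blank line.
        if line or (lines and lines[-1]):
            lines.append(line)
    # Drop a trailing blank line left over from the rule above.
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)
```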
+
+## Usage
+
+```bash
+# Basic hybrid (OpenCV + OCR)
+python process_meeting.py samples/video.mkv --use-hybrid --scene-detection
+
+# With LLM cleanup for best code formatting
+python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v
+
+# Iterate on the scene-detection threshold
+python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
+```
+
+## Output Format
+
+```
+[Region 1 at y=120]
+function calculateTotal(items) {
+  return items.reduce((sum, item) => sum + item.price, 0);
+}
+
+============================================================
+
+[Region 2 at y=450]
+const result = calculateTotal(cartItems);
+console.log('Total:', result);
+```
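
Review note: the output layout above can be reproduced with a small formatter. This is a sketch that assumes each region arrives as a `(y, text)` pair already in reading order and that the separator is a 60-character rule of `=`. `format_regions` is a hypothetical name.

```python
from typing import List, Tuple

SEPARATOR = "=" * 60  # assumed width, matching the sample output


def format_regions(regions: List[Tuple[int, str]]) -> str:
    """Join OCR regions with position headers and visual separators."""
    blocks = [
        f"[Region {index} at y={y}]\n{text}"
        for index, (y, text) in enumerate(regions, start=1)
    ]
    return f"\n\n{SEPARATOR}\n\n".join(blocks)
```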
+
+## Performance
+- **Without LLM cleanup**: Very fast (~2-3s per frame)
+- **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
+- **Accuracy**: Much better than vision model hallucinations
+
+## When to Use What
+
+| Method | Best For | Pros | Cons |
+|--------|----------|------|------|
+| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
+| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
+| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
+| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
+
+## Files Modified
+- `meetus/hybrid_processor.py` - New hybrid processor
+- `meetus/ocr_processor.py` - Layout preservation
+- `meetus/workflow.py` - Hybrid mode integration
+- `process_meeting.py` - CLI flags and examples