# 02 - Hybrid OpenCV + OCR + LLM Approach

## Date
2025-10-28

## Context
Vision models (llava) were hallucinating text content badly - showing HTML code when there was none, inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.

## Problem
- **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
- **Pure OCR**: Accurate text but messy output, lost indentation/formatting
- **Need**: Accurate text extraction + preserved code structure

## Solution: Three-Stage Hybrid Approach

### Stage 1: OpenCV Text Detection
Use morphological operations to find text regions:
- Adaptive thresholding (handles varying lighting)
- Dilation with horizontal kernel to connect text lines
- Contour detection to find bounding boxes
- Filter by area and aspect ratio
- Merge overlapping regions

### Stage 2: Region-Based OCR
- Sort regions by reading order (top-to-bottom, left-to-right)
- Crop each region from original image
- Run OCR on cropped regions (more accurate than full frame)
- Tesseract with PSM 6 mode to preserve layout
- Preserve indentation in cleaning step

### Stage 3: Optional LLM Cleanup
- Take accurate OCR output (no hallucination)
- Use lightweight LLM (llama3.2:3b for speed) to:
  - Fix obvious OCR errors (l→1, O→0)
  - Restore code indentation and structure
  - Preserve exact text content
  - No added explanations or hallucinated content

## Benefits
✓ **Accurate**: OCR reads actual pixels, no hallucination
✓ **Fast**: OpenCV detection is instant, focused OCR is quick
✓ **Structured**: Regions separated with headers showing position
✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
✓ **Deterministic**: Same input = same output (unlike vision models)

## Implementation

**New file:** `meetus/hybrid_processor.py`
- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
- Region sorting for proper reading order
- Visual separators between regions

**CLI flags:**
```bash
--use-hybrid              # Enable hybrid mode
--hybrid-llm-cleanup      # Add LLM post-processing (optional)
--hybrid-llm-model MODEL  # LLM model (default: llama3.2:3b)
```

**OCR improvements:**
- Tesseract PSM 6 mode for better layout preservation
- Modified text cleaning to keep indentation
- `preserve_layout` parameter

## Usage

```bash
# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection

# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v

# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
```

## Output Format

```
[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}

============================================================

[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
```

## Performance
- **Without LLM cleanup**: Very fast (~2-3s per frame)
- **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
- **Accuracy**: Much better than vision model hallucinations

## When to Use What

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |

## Files Modified
- `meetus/hybrid_processor.py` - New hybrid processor
- `meetus/ocr_processor.py` - Layout preservation
- `meetus/workflow.py` - Hybrid mode integration
- `process_meeting.py` - CLI flags and examples
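
For reference, the region handling described in Stage 1 and Stage 2 (merging overlapping boxes, sorting into reading order) and the layout shown under Output Format can be sketched in plain Python. This is an illustration only, not the code in `meetus/hybrid_processor.py`: the function names, the `(x, y, w, h)` box layout, and the separator width are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box layout: (x, y, width, height), as cv2.boundingRect returns.
Box = Tuple[int, int, int, int]

SEPARATOR = "=" * 60  # assumed width; hypothetical constant


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeat merge passes until no two boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, kept in enumerate(result):
                if overlaps(box, kept):
                    result[i] = union(box, kept)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


def sort_reading_order(boxes: List[Box]) -> List[Box]:
    """Top-to-bottom first, then left-to-right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def format_regions(regions: List[Tuple[Box, str]]) -> str:
    """Join already-sorted (box, text) pairs with position headers."""
    blocks = [
        f"[Region {i} at y={box[1]}]\n{text}"
        for i, (box, text) in enumerate(regions, start=1)
    ]
    return f"\n\n{SEPARATOR}\n\n".join(blocks)


if __name__ == "__main__":
    boxes = [(300, 450, 200, 40), (10, 120, 100, 30), (90, 130, 80, 30)]
    ordered = sort_reading_order(merge_overlapping(boxes))
    texts = ["first region text", "second region text"]  # placeholder OCR output
    print(format_regions(list(zip(ordered, texts))))
```

Sorting strictly by `y` then `x` is the simplest reading of "top-to-bottom, left-to-right"; side-by-side regions whose top edges differ by a few pixels would need a row tolerance, which this sketch leaves out.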