mitus/def/02-hybrid-opencv-ocr-llm.md
Mariano Gabriel 118ef04223 embed images
2025-10-28 08:02:45 -03:00


02 - Hybrid OpenCV + OCR + LLM Approach

Date

2025-10-28

Context

Vision models (llava) were badly hallucinating text content: showing HTML code when there was none and inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.

Problem

  • Vision models: Hallucinate text content, can't be trusted for accurate extraction
  • Pure OCR: Accurate text but messy output, lost indentation/formatting
  • Need: Accurate text extraction + preserved code structure

Solution: Three-Stage Hybrid Approach

Stage 1: OpenCV Text Detection

Use morphological operations to find text regions:

  • Adaptive thresholding (handles varying lighting)
  • Dilation with horizontal kernel to connect text lines
  • Contour detection to find bounding boxes
  • Filter by area and aspect ratio
  • Merge overlapping regions

Stage 2: Region-Based OCR

  • Sort regions by reading order (top-to-bottom, left-to-right)
  • Crop each region from original image
  • Run OCR on cropped regions (more accurate than full frame)
  • Tesseract with PSM 6 mode to preserve layout
  • Preserve indentation in cleaning step

Stage 3: Optional LLM Cleanup

  • Take accurate OCR output (no hallucination)
  • Use lightweight LLM (llama3.2:3b for speed) to:
    • Fix obvious OCR errors (l→1, O→0)
    • Restore code indentation and structure
    • Preserve exact text content
    • No added explanations or hallucinated content

Benefits

✓ Accurate: OCR reads actual pixels, no hallucination
✓ Fast: OpenCV detection is instant, focused OCR is quick
✓ Structured: Regions separated with headers showing position
✓ Formatted: Optional LLM cleanup preserves/restores code structure
✓ Deterministic: Same input = same output (unlike vision models)

Implementation

New file: meetus/hybrid_processor.py

  • HybridProcessor class with OpenCV detection + OCR + optional LLM
  • Region sorting for proper reading order
  • Visual separators between regions

CLI flags:

--use-hybrid                # Enable hybrid mode
--hybrid-llm-cleanup        # Add LLM post-processing (optional)
--hybrid-llm-model MODEL    # LLM model (default: llama3.2:3b)

OCR improvements:

  • Tesseract PSM 6 mode for better layout preservation
  • Modified text cleaning to keep indentation
  • preserve_layout parameter
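The layout-preserving cleaning step can be sketched like this (clean_ocr_text and its exact behavior are illustrative, not the meetus implementation): with preserve_layout on, leading whitespace (code indentation) survives and only trailing junk is stripped.

```python
def clean_ocr_text(text: str, preserve_layout: bool = True) -> str:
    """Normalize OCR output while optionally keeping indentation."""
    lines = []
    for line in text.splitlines():
        if preserve_layout:
            # Keep leading whitespace (indentation); drop trailing junk only.
            cleaned = line.rstrip()
        else:
            # Legacy behavior: collapse all runs of whitespace.
            cleaned = " ".join(line.split())
        lines.append(cleaned)
    # Collapse runs of blank lines left behind by empty OCR regions.
    out = []
    for line in lines:
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip("\n")
```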

Usage

# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection

# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v

# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

Output Format

[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}

============================================================

[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
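The format above (a position header per region, '=' separators between regions) can be produced by a small helper like this hypothetical format_regions, taking the (y, text) pairs that region-based OCR yields:

```python
def format_regions(region_texts, sep_width=60):
    """Join (y, text) pairs into the documented output format:
    '[Region N at y=...]' headers with '=' separator lines between regions."""
    blocks = []
    for i, (y, text) in enumerate(region_texts, start=1):
        blocks.append(f"[Region {i} at y={y}]\n{text}")
    separator = "\n\n" + "=" * sep_width + "\n\n"
    return separator.join(blocks)
```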

Performance

  • Without LLM cleanup: Very fast (~2-3s per frame)
  • With LLM cleanup: Slower but still faster than vision models (~5-8s per frame)
  • Accuracy: far higher than vision models, which hallucinate content

When to Use What

| Method       | Best For                         | Pros                             | Cons                            |
|--------------|----------------------------------|----------------------------------|---------------------------------|
| Hybrid       | Code/terminal text extraction    | Accurate, fast, no hallucination | Formatting may be messy         |
| Hybrid + LLM | Code with preserved structure    | Accurate + formatted             | Slower, needs Ollama            |
| Vision       | Understanding layout/context     | Semantic understanding           | Hallucinates text               |
| Pure OCR     | Simple text, no structure needed | Fast, simple                     | Full-frame, no region detection |

Files Modified

  • meetus/hybrid_processor.py - New hybrid processor
  • meetus/ocr_processor.py - Layout preservation
  • meetus/workflow.py - Hybrid mode integration
  • process_meeting.py - CLI flags and examples