mitus/def/02-hybrid-opencv-ocr-llm.md
Mariano Gabriel 118ef04223 embed images
2025-10-28 08:02:45 -03:00


02 - Hybrid OpenCV + OCR + LLM Approach

Date

2025-10-28

Context

Vision models (llava) were badly hallucinating text content: showing HTML code when there was none and inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.

Problem

  • Vision models: Hallucinate text content, can't be trusted for accurate extraction
  • Pure OCR: Accurate text but messy output, lost indentation/formatting
  • Need: Accurate text extraction + preserved code structure

Solution: Three-Stage Hybrid Approach

Stage 1: OpenCV Text Detection

Use morphological operations to find text regions:

  • Adaptive thresholding (handles varying lighting)
  • Dilation with horizontal kernel to connect text lines
  • Contour detection to find bounding boxes
  • Filter by area and aspect ratio
  • Merge overlapping regions

Stage 2: Region-Based OCR

  • Sort regions by reading order (top-to-bottom, left-to-right)
  • Crop each region from original image
  • Run OCR on cropped regions (more accurate than full frame)
  • Tesseract with PSM 6 mode to preserve layout
  • Preserve indentation in cleaning step

Stage 3: Optional LLM Cleanup

  • Take accurate OCR output (no hallucination)
  • Use lightweight LLM (llama3.2:3b for speed) to:
    • Fix obvious OCR errors (l→1, O→0)
    • Restore code indentation and structure
    • Preserve exact text content
    • No added explanations or hallucinated content

Benefits

✓ Accurate: OCR reads actual pixels, no hallucination
✓ Fast: OpenCV detection is instant, focused OCR is quick
✓ Structured: Regions separated with headers showing position
✓ Formatted: Optional LLM cleanup preserves/restores code structure
✓ Deterministic: Same input = same output (unlike vision models)

Implementation

New file: meetus/hybrid_processor.py

  • HybridProcessor class with OpenCV detection + OCR + optional LLM
  • Region sorting for proper reading order
  • Visual separators between regions

CLI flags:

--use-hybrid                # Enable hybrid mode
--hybrid-llm-cleanup        # Add LLM post-processing (optional)
--hybrid-llm-model MODEL    # LLM model (default: llama3.2:3b)

OCR improvements:

  • Tesseract PSM 6 mode for better layout preservation
  • Modified text cleaning to keep indentation
  • preserve_layout parameter
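The layout-preserving cleaning step can be sketched like this (clean_ocr_text and its exact behavior are illustrative, not the meetus implementation): with preserve_layout on, leading whitespace (code indentation) survives and only trailing junk is stripped.

```python
def clean_ocr_text(text: str, preserve_layout: bool = True) -> str:
    """Normalize OCR output while optionally keeping indentation."""
    lines = []
    for line in text.splitlines():
        if preserve_layout:
            # Keep leading whitespace (indentation); drop trailing junk only.
            cleaned = line.rstrip()
        else:
            # Legacy behavior: collapse all runs of whitespace.
            cleaned = " ".join(line.split())
        lines.append(cleaned)
    # Collapse runs of blank lines left behind by empty OCR regions.
    out = []
    for line in lines:
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip("\n")
```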

Usage

# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection

# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v

# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

Output Format

[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}

============================================================

[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
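The format above (a position header per region, '=' separators between regions) can be produced by a small helper like this hypothetical format_regions, taking the (y, text) pairs that region-based OCR yields:

```python
def format_regions(region_texts, sep_width=60):
    """Join (y, text) pairs into the documented output format:
    '[Region N at y=...]' headers with '=' separator lines between regions."""
    blocks = []
    for i, (y, text) in enumerate(region_texts, start=1):
        blocks.append(f"[Region {i} at y={y}]\n{text}")
    separator = "\n\n" + "=" * sep_width + "\n\n"
    return separator.join(blocks)
```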

Performance

  • Without LLM cleanup: Very fast (~2-3s per frame)
  • With LLM cleanup: Slower but still faster than vision models (~5-8s per frame)
  • Accuracy: far higher than vision models, which hallucinate content

When to Use What

| Method       | Best For                         | Pros                             | Cons                            |
|--------------|----------------------------------|----------------------------------|---------------------------------|
| Hybrid       | Code/terminal text extraction    | Accurate, fast, no hallucination | Formatting may be messy         |
| Hybrid + LLM | Code with preserved structure    | Accurate + formatted             | Slower, needs Ollama            |
| Vision       | Understanding layout/context     | Semantic understanding           | Hallucinates text               |
| Pure OCR     | Simple text, no structure needed | Fast, simple                     | Full-frame, no region detection |

Files Modified

  • meetus/hybrid_processor.py - New hybrid processor
  • meetus/ocr_processor.py - Layout preservation
  • meetus/workflow.py - Hybrid mode integration
  • process_meeting.py - CLI flags and examples