02 - Hybrid OpenCV + OCR + LLM Approach
Date
2025-10-28
Context
Vision models (llava) hallucinated text badly: they showed HTML code where there was none and invented text that did not exist. Pure OCR was fast and accurate but lost code formatting and structure.
Problem
- Vision models: Hallucinate text content, can't be trusted for accurate extraction
- Pure OCR: Accurate text but messy output, lost indentation/formatting
- Need: Accurate text extraction + preserved code structure
Solution: Three-Stage Hybrid Approach
Stage 1: OpenCV Text Detection
Use morphological operations to find text regions:
- Adaptive thresholding (handles varying lighting)
- Dilation with horizontal kernel to connect text lines
- Contour detection to find bounding boxes
- Filter by area and aspect ratio
- Merge overlapping regions
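The detection steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `meetus/hybrid_processor.py`: the function names, kernel size, and area/aspect thresholds are all assumptions, and the two-value return of `cv2.findContours` assumes OpenCV 4.x.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def keep_box(box: Box, min_area: int = 500, min_aspect: float = 1.5) -> bool:
    """Filter by area and aspect ratio (thresholds are illustrative guesses)."""
    _, _, w, h = box
    return h > 0 and w * h >= min_area and w / h >= min_aspect


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share any interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeatedly union pairs of overlapping boxes until none overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j]):
                    ax, ay, aw, ah = merged[i]
                    bx, by, bw, bh = merged[j]
                    x0, y0 = min(ax, bx), min(ay, by)
                    x1, y1 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
                    merged[i] = (x0, y0, x1 - x0, y1 - y0)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


def detect_text_regions(image) -> List[Box]:
    """image: BGR numpy array, e.g. from cv2.imread or a decoded video frame."""
    import cv2  # imported here so the pure-Python helpers work without OpenCV

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Adaptive threshold copes with uneven lighting; text becomes white on black.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 15, 10
    )
    # A wide, short kernel joins characters on the same line into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    dilated = cv2.dilate(binary, kernel, iterations=2)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [tuple(cv2.boundingRect(c)) for c in contours]
    return merge_overlapping([b for b in boxes if keep_box(b)])
```

Keeping the filter and merge steps free of OpenCV calls makes them easy to unit-test on plain tuples.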
Stage 2: Region-Based OCR
- Sort regions by reading order (top-to-bottom, left-to-right)
- Crop each region from original image
- Run OCR on cropped regions (more accurate than full frame)
- Tesseract with PSM 6 mode to preserve layout
- Preserve indentation in cleaning step
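A rough sketch of the sorting and per-region OCR steps is shown below. The helper names and the `row_tolerance` value are hypothetical; `pytesseract.image_to_string` with `--psm 6` (treat the crop as a single uniform block of text) is the only Tesseract call assumed here.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sort_reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
    """Order boxes top-to-bottom, then left-to-right within the same row.

    Boxes whose top edge lies within row_tolerance pixels of the first box
    in the current row are treated as belonging to that row.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def ocr_regions(image, boxes: List[Box]) -> List[Tuple[Box, str]]:
    """Crop each region from a numpy image and OCR it separately."""
    import pytesseract  # requires the Tesseract binary to be installed

    results = []
    for x, y, w, h in sort_reading_order(boxes):
        crop = image[y:y + h, x:x + w]
        text = pytesseract.image_to_string(crop, config="--psm 6")
        results.append(((x, y, w, h), text))
    return results
```

Sorting before OCR keeps the output in the order a reader would follow on screen, even when contours are returned in arbitrary order.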
Stage 3: Optional LLM Cleanup
- Take accurate OCR output (no hallucination)
- Use lightweight LLM (llama3.2:3b for speed) to:
  - Fix obvious OCR errors (l→1, O→0)
  - Restore code indentation and structure
  - Preserve exact text content
  - No added explanations or hallucinated content
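One way the cleanup call might look, assuming a local Ollama server on its default port and its `/api/generate` endpoint. The prompt wording and function names are illustrative only, not the project's actual prompt.

```python
import json
import urllib.request


def build_cleanup_prompt(ocr_text: str) -> str:
    """Wrap OCR output in instructions that mirror the rules listed above."""
    return (
        "You are cleaning up OCR output of code shown on screen.\n"
        "Rules:\n"
        "- Fix obvious OCR errors (for example l vs 1, O vs 0).\n"
        "- Restore code indentation and structure.\n"
        "- Preserve the exact text content.\n"
        "- Do not add explanations or any content that is not in the input.\n"
        "Return only the cleaned text.\n\n"
        "OCR output:\n"
        f"{ocr_text}"
    )


def cleanup_with_llm(
    ocr_text: str,
    model: str = "llama3.2:3b",
    host: str = "http://localhost:11434",  # assumed default Ollama address
) -> str:
    """Send the prompt to Ollama and return the model's reply text."""
    payload = {
        "model": model,
        "prompt": build_cleanup_prompt(ocr_text),
        "stream": False,                  # one JSON object instead of a stream
        "options": {"temperature": 0},    # reduce run-to-run variation
    }
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Because the LLM only sees text that OCR already extracted, any content it invents can be spotted by diffing its reply against the OCR input.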
Benefits
✓ Accurate: OCR reads actual pixels, no hallucination
✓ Fast: OpenCV detection is instant, focused OCR is quick
✓ Structured: Regions separated with headers showing position
✓ Formatted: Optional LLM cleanup preserves/restores code structure
✓ Deterministic: Same input = same output (unlike vision models)
Implementation
New file: meetus/hybrid_processor.py
- HybridProcessor class with OpenCV detection + OCR + optional LLM
- Region sorting for proper reading order
- Visual separators between regions
CLI flags:
--use-hybrid # Enable hybrid mode
--hybrid-llm-cleanup # Add LLM post-processing (optional)
--hybrid-llm-model MODEL # LLM model (default: llama3.2:3b)
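For reference, the three flags could be declared with `argparse` along these lines. The positional `video` argument and the `build_parser` name are assumptions; the real `process_meeting.py` defines further options (such as `--scene-detection`) that are omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare only the hybrid-mode flags described above."""
    parser = argparse.ArgumentParser(description="Process a meeting recording")
    parser.add_argument("video", help="path to the input video (assumed name)")
    parser.add_argument(
        "--use-hybrid", action="store_true", help="Enable hybrid mode"
    )
    parser.add_argument(
        "--hybrid-llm-cleanup",
        action="store_true",
        help="Add LLM post-processing (optional)",
    )
    parser.add_argument(
        "--hybrid-llm-model",
        metavar="MODEL",
        default="llama3.2:3b",
        help="LLM model (default: llama3.2:3b)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```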
OCR improvements:
- Tesseract PSM 6 mode for better layout preservation
- Modified text cleaning to keep indentation
- preserve_layout parameter
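A minimal sketch of what an indentation-preserving cleaning step could look like. Only the `preserve_layout` parameter name comes from this note; the `clean_text` function and its exact behavior (dropping empty lines, trimming trailing whitespace) are assumptions.

```python
def clean_text(text: str, preserve_layout: bool = True) -> str:
    """Tidy OCR output line by line.

    With preserve_layout=True, leading whitespace (indentation) is kept and
    only trailing whitespace is removed. With preserve_layout=False, all runs
    of whitespace are collapsed to single spaces. Empty lines are dropped in
    both modes.
    """
    cleaned = []
    for line in text.splitlines():
        line = line.rstrip()
        if not preserve_layout:
            line = " ".join(line.split())
        if line.strip():
            cleaned.append(line)
    return "\n".join(cleaned)
```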
Usage
# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection
# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v
# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
Output Format
[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}
============================================================
[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
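The layout above could be produced by a small formatter such as the one below. The function name and the input shape (a list of `(y, text)` pairs already in reading order) are assumptions, and the separator is assumed to be a 60-character rule.

```python
from typing import List, Tuple

SEPARATOR = "=" * 60  # assumed width of the rule between regions


def format_regions(regions: List[Tuple[int, str]]) -> str:
    """Join OCR results into one string with a position header per region."""
    blocks = [
        f"[Region {index} at y={y}]\n{text}"
        for index, (y, text) in enumerate(regions, start=1)
    ]
    return f"\n{SEPARATOR}\n".join(blocks)


if __name__ == "__main__":
    print(format_regions([(120, "first region text"), (450, "second region text")]))
```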
Performance
- Without LLM cleanup: Very fast (~2-3s per frame)
- With LLM cleanup: Slower but still faster than vision models (~5-8s per frame)
- Accuracy: Much better than vision models, which hallucinated text that was not on screen
When to Use What
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Hybrid | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
| Hybrid + LLM | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
| Vision | Understanding layout/context | Semantic understanding | Hallucinates text |
| Pure OCR | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
Files Modified
- meetus/hybrid_processor.py - New hybrid processor
- meetus/ocr_processor.py - Layout preservation
- meetus/workflow.py - Hybrid mode integration
- process_meeting.py - CLI flags and examples