# 02 - Hybrid OpenCV + OCR + LLM Approach

## Date

2025-10-28

## Context

Vision models (llava) hallucinated text badly: they showed HTML code where there was none and invented text that was not on screen. Pure OCR was fast and accurate but lost code formatting and structure.

## Problem

- **Vision models**: Hallucinate text content and can't be trusted for accurate extraction
- **Pure OCR**: Accurate text but messy output; indentation and formatting are lost
- **Need**: Accurate text extraction with preserved code structure

## Solution: Three-Stage Hybrid Approach

### Stage 1: OpenCV Text Detection

Use morphological operations to find text regions:

- Adaptive thresholding (handles varying lighting)
- Dilation with horizontal kernel to connect text lines
- Contour detection to find bounding boxes
- Filter by area and aspect ratio
- Merge overlapping regions
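
The detection steps above could be sketched roughly as follows. This is an illustration, not the code in `meetus/hybrid_processor.py`: the function names, kernel size, threshold block size, and the `min_area` / `min_aspect` defaults are all assumptions chosen for readability. `THRESH_BINARY_INV` assumes dark text on a light background; a dark-theme screen would likely need `THRESH_BINARY` instead.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeatedly union overlapping boxes until none overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        out: List[Box] = []
        for box in merged:
            for i, other in enumerate(out):
                if boxes_overlap(box, other):
                    x = min(box[0], other[0])
                    y = min(box[1], other[1])
                    x2 = max(box[0] + box[2], other[0] + other[2])
                    y2 = max(box[1] + box[3], other[1] + other[3])
                    out[i] = (x, y, x2 - x, y2 - y)
                    changed = True
                    break
            else:
                out.append(box)
        merged = out
    return merged


def detect_text_regions(image_bgr, min_area: int = 500, min_aspect: float = 1.0) -> List[Box]:
    """Hypothetical detector: threshold, dilate, find contours, filter, merge."""
    import cv2  # imported lazily so the pure-Python helpers work without OpenCV

    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Adaptive threshold copes with uneven lighting; block size 15 and C=10 are guesses.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 15, 10
    )
    # A wide, short kernel joins characters on the same line into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    dilated = cv2.dilate(binary, kernel, iterations=2)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    boxes: List[Box] = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h >= min_area and w / h >= min_aspect:
            boxes.append((x, y, w, h))
    return merge_overlapping(boxes)
```

Keeping the merge step free of OpenCV makes it easy to unit-test separately from the image-processing code.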

### Stage 2: Region-Based OCR

- Sort regions by reading order (top-to-bottom, left-to-right)
- Crop each region from original image
- Run OCR on cropped regions (more accurate than full frame)
- Run Tesseract with `--psm 6` (treat the crop as a single uniform block of text) to preserve layout
- Preserve indentation in cleaning step
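
A minimal sketch of the sorting and per-region OCR, assuming regions are `(x, y, width, height)` tuples and the frame is a NumPy image array. The function names and the `row_tolerance` value are hypothetical; only the `--psm 6` setting comes from this document.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sort_reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
    """Group boxes into rows by similar y, then order each row left to right."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Same row if the top edge is within row_tolerance of the row's first box.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def ocr_regions(image, boxes: List[Box]) -> List[Tuple[int, str]]:
    """Hypothetical helper: OCR each cropped region, returning (y, text) pairs."""
    import pytesseract  # imported lazily; requires the Tesseract binary

    results = []
    for x, y, w, h in sort_reading_order(boxes):
        crop = image[y:y + h, x:x + w]  # NumPy slicing: rows first, then columns
        # PSM 6 treats the crop as one uniform block of text.
        # Note: OpenCV images are BGR while pytesseract assumes RGB;
        # convert first if colour order matters for your input.
        text = pytesseract.image_to_string(crop, config="--psm 6")
        results.append((y, text))
    return results
```

Grouping into rows before sorting by `x` avoids a box that sits a pixel or two lower on the same line being pushed after everything to its right.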

### Stage 3: Optional LLM Cleanup

- Take accurate OCR output (no hallucination)
- Use lightweight LLM (llama3.2:3b for speed) to:
  - Fix obvious OCR errors (l→1, O→0)
  - Restore code indentation and structure
  - Preserve exact text content
  - Add no explanations or hallucinated content
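
One way the cleanup stage might look, sketched under several assumptions: the prompt wording, the function names, and the similarity guard are illustrative and not taken from the project. The request targets Ollama's local `/api/generate` endpoint on its default port. The guard falls back to the raw OCR text when the model's answer diverges too far from it, which is one possible way to enforce "preserve exact text content".

```python
import difflib
import json
import urllib.request

# Hypothetical prompt; the project's actual wording is not shown in this document.
PROMPT_TEMPLATE = (
    "Fix obvious OCR errors (such as l vs 1, O vs 0) and restore code "
    "indentation in the text below. Keep the text content exactly as is. "
    "Do not add explanations or new content. Return only the cleaned text.\n\n"
    "{text}"
)


def build_cleanup_prompt(ocr_text: str) -> str:
    return PROMPT_TEMPLATE.format(text=ocr_text)


def accept_cleanup(original: str, cleaned: str, min_ratio: float = 0.9) -> bool:
    """Compare the texts with whitespace removed, so re-indentation is free
    but added or dropped content lowers the similarity ratio."""
    a = "".join(original.split())
    b = "".join(cleaned.split())
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio() >= min_ratio


def llm_cleanup(
    ocr_text: str,
    model: str = "llama3.2:3b",
    url: str = "http://localhost:11434/api/generate",
) -> str:
    """Send the OCR text to a local Ollama server; fall back to the input on drift."""
    payload = json.dumps({
        "model": model,
        "prompt": build_cleanup_prompt(ocr_text),
        "stream": False,
        "options": {"temperature": 0},  # reduce run-to-run variation
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        cleaned = json.loads(response.read())["response"]
    return cleaned if accept_cleanup(ocr_text, cleaned) else ocr_text
```

The 0.9 threshold is a guess; very short snippets may need a looser value, because a single corrected character changes the ratio a lot.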

## Benefits

✓ **Accurate**: OCR reads actual pixels, no hallucination
✓ **Fast**: OpenCV detection is instant, focused OCR is quick
✓ **Structured**: Regions separated with headers showing position
✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
✓ **Deterministic**: Same input = same output (unlike vision models)

## Implementation

**New file:** `meetus/hybrid_processor.py`

- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
- Region sorting for proper reading order
- Visual separators between regions

**CLI flags:**

```bash
--use-hybrid              # Enable hybrid mode
--hybrid-llm-cleanup      # Add LLM post-processing (optional)
--hybrid-llm-model MODEL  # LLM model (default: llama3.2:3b)
```
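
For reference, the three flags above could be declared with `argparse` along these lines. This is a sketch only: the `build_parser` name and help strings are assumptions, and the real `process_meeting.py` defines further options (such as `--scene-detection`) that are omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the hybrid-mode flags."""
    parser = argparse.ArgumentParser(prog="process_meeting.py")
    parser.add_argument("video", help="Path to the input video")
    parser.add_argument(
        "--use-hybrid", action="store_true", help="Enable hybrid mode"
    )
    parser.add_argument(
        "--hybrid-llm-cleanup",
        action="store_true",
        help="Add LLM post-processing (optional)",
    )
    parser.add_argument(
        "--hybrid-llm-model",
        metavar="MODEL",
        default="llama3.2:3b",
        help="LLM model used for cleanup",
    )
    return parser
```

`argparse` maps the dashes to underscores, so the parsed values are available as `args.use_hybrid`, `args.hybrid_llm_cleanup`, and `args.hybrid_llm_model`.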

**OCR improvements:**

- Tesseract PSM 6 mode for better layout preservation
- Modified text cleaning to keep indentation
- `preserve_layout` parameter
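
The indentation-preserving cleaning might work something like the sketch below. The `preserve_layout` name comes from this document, but its exact semantics in `meetus/ocr_processor.py` are not described here, so the behavior shown (strip trailing whitespace, collapse runs of blank lines, keep leading whitespace only when `preserve_layout` is true) is an assumption.

```python
def clean_ocr_text(text: str, preserve_layout: bool = True) -> str:
    """Hypothetical cleaner: tidy OCR output, optionally keeping indentation."""
    lines = []
    for line in text.splitlines():
        line = line.rstrip()  # trailing whitespace is never meaningful
        if not preserve_layout:
            # Old-style cleaning: collapse all whitespace, losing indentation.
            line = " ".join(line.split())
        # Keep the line unless it would start the text with a blank line
        # or extend a run of blank lines.
        if line or (lines and lines[-1]):
            lines.append(line)
    return "\n".join(lines).strip("\n")
```

With `preserve_layout=True`, the input `"def f():\n    return 1  \n\n\n"` keeps its four-space indent and loses only the trailing spaces and blank lines.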

## Usage

```bash
# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection

# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v

# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
```

## Output Format

```
[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}

============================================================

[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
```
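
Output in this shape could be produced by a small formatter like the one below. It is a sketch: the `format_regions` name and the `(y, text)` input shape are assumptions, and the 60-character separator width is inferred from the sample above.

```python
from typing import List, Tuple

SEPARATOR = "=" * 60  # width assumed from the sample output


def format_regions(regions: List[Tuple[int, str]]) -> str:
    """Join OCR'd regions with position headers and visual separators.

    `regions` is a list of (y, text) pairs, already in reading order.
    """
    blocks = [
        f"[Region {index} at y={y}]\n{text.rstrip()}"
        for index, (y, text) in enumerate(regions, start=1)
    ]
    return f"\n\n{SEPARATOR}\n\n".join(blocks)
```

Only trailing whitespace is stripped from each region's text, so any leading indentation recovered by OCR survives into the output.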

## Performance

- **Without LLM cleanup**: ~2-3 s per frame
- **With LLM cleanup**: ~5-8 s per frame; slower, but still faster than vision models
- **Accuracy**: Much better than vision models, since OCR text is read from the frame rather than hallucinated


## When to Use What

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |

## Files Modified

- `meetus/hybrid_processor.py` - New hybrid processor
- `meetus/ocr_processor.py` - Layout preservation
- `meetus/workflow.py` - Hybrid mode integration
- `process_meeting.py` - CLI flags and examples