embed images

def/02-hybrid-opencv-ocr-llm.md
# 02 - Hybrid OpenCV + OCR + LLM Approach

## Date

2025-10-28

## Context

Vision models (llava) were hallucinating text content badly: showing HTML code where there was none and inventing text that did not exist. Pure OCR was fast and accurate but lost code formatting and structure.

## Problem

- **Vision models**: Hallucinate text content; cannot be trusted for accurate extraction
- **Pure OCR**: Accurate text but messy output; loses indentation/formatting
- **Need**: Accurate text extraction plus preserved code structure

## Solution: Three-Stage Hybrid Approach
### Stage 1: OpenCV Text Detection

Use morphological operations to find text regions (a sketch follows the list):

- Adaptive thresholding (handles varying lighting)
- Dilation with a horizontal kernel to connect text lines
- Contour detection to find bounding boxes
- Filter by area and aspect ratio
- Merge overlapping regions
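A minimal sketch of what this stage could look like with OpenCV; the function name, kernel sizes, and area/aspect-ratio thresholds are illustrative assumptions rather than the exact values in `meetus/hybrid_processor.py`, and the region-merging step is omitted for brevity:

```python
import cv2
import numpy as np

def detect_text_regions(image: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Return candidate text regions as (x, y, w, h) bounding boxes."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Adaptive thresholding copes with uneven lighting across the frame
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 25, 15,
    )
    # A wide, short kernel dilates characters into connected text lines
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    dilated = cv2.dilate(binary, kernel, iterations=2)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    regions = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        # Hypothetical filters: drop tiny blobs and extreme aspect ratios
        if w * h > 500 and 0.2 < w / max(h, 1) < 50:
            regions.append((x, y, w, h))
    return regions
```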
### Stage 2: Region-Based OCR

The detected regions are then OCR'd individually (see the sketch below):

- Sort regions by reading order (top-to-bottom, left-to-right)
- Crop each region from the original image
- Run OCR on the cropped regions (more accurate than a full frame)
- Tesseract with PSM 6 mode to preserve layout
- Preserve indentation in the cleaning step
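A minimal sketch of this stage, assuming `pytesseract` is installed and the regions come from the Stage 1 detector above; the row-grouping tolerance is an illustrative assumption:

```python
import pytesseract

def ocr_regions(image, regions, row_tolerance: int = 20) -> list[str]:
    """OCR each region in reading order: top-to-bottom, then left-to-right."""
    # Quantizing y groups regions on roughly the same line into one "row"
    ordered = sorted(regions, key=lambda r: (r[1] // row_tolerance, r[0]))
    texts = []
    for x, y, w, h in ordered:
        crop = image[y:y + h, x:x + w]
        # PSM 6 tells Tesseract to assume a single uniform block of text,
        # which keeps line breaks and indentation closer to the original
        text = pytesseract.image_to_string(crop, config="--psm 6")
        texts.append(f"[Region at y={y}]\n{text.rstrip()}")
    return texts
```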
### Stage 3: Optional LLM Cleanup

- Take the accurate OCR output (no hallucination)
- Use a lightweight LLM (llama3.2:3b for speed) to:
  - Fix obvious OCR errors (l→1, O→0)
  - Restore code indentation and structure
  - Preserve the exact text content
- No added explanations or hallucinated content

A sketch of this cleanup call is shown below.
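A minimal sketch of the optional cleanup call against Ollama's local REST API; the prompt wording and the `llm_cleanup` name are illustrative assumptions, not the project's exact implementation:

```python
import requests

CLEANUP_PROMPT = (
    "Fix obvious OCR errors (such as l->1 and O->0) and restore code "
    "indentation in the text below. Preserve the exact content. Do not "
    "add explanations or any text that is not present in the input.\n\n{text}"
)

def llm_cleanup(ocr_text: str, model: str = "llama3.2:3b") -> str:
    """Post-process OCR output with a lightweight local LLM via Ollama."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": CLEANUP_PROMPT.format(text=ocr_text),
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```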
## Benefits

✓ **Accurate**: OCR reads actual pixels, no hallucination
✓ **Fast**: OpenCV detection is near-instant, and focused OCR is quick
✓ **Structured**: Regions are separated by headers showing their position
✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
✓ **Deterministic**: Same input = same output (unlike vision models)
## Implementation

**New file:** `meetus/hybrid_processor.py`

- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
- Region sorting for proper reading order
- Visual separators between regions
**CLI flags:**

```bash
--use-hybrid               # Enable hybrid mode
--hybrid-llm-cleanup       # Add LLM post-processing (optional)
--hybrid-llm-model MODEL   # LLM model (default: llama3.2:3b)
```
**OCR improvements:**

- Tesseract PSM 6 mode for better layout preservation
- Modified text cleaning to keep indentation (sketched below)
- `preserve_layout` parameter
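A minimal sketch of indentation-preserving cleanup; `clean_ocr_text` is an illustrative name, and the exact behavior of the project's `preserve_layout` parameter may differ:

```python
def clean_ocr_text(text: str, preserve_layout: bool = False) -> str:
    """Clean OCR output, optionally keeping leading whitespace intact."""
    cleaned = []
    for line in text.splitlines():
        if preserve_layout:
            cleaned.append(line.rstrip())  # keep indentation, drop trailing spaces
        else:
            cleaned.append(" ".join(line.split()))  # collapse all whitespace runs
    # Blank lines are structural in code, so keep them in layout mode
    return "\n".join(l for l in cleaned if l or preserve_layout)
```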
## Usage

```bash
# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection

# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v

# Iterate on the scene threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
```
## Output Format

```
[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}

============================================================

[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
```
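For illustration, per-region texts like those above might be joined with headers and a separator line; `format_regions` is a hypothetical helper, with the 60-character rule matching the example output:

```python
def format_regions(region_texts: list[str]) -> str:
    """Join per-region OCR texts with the visual separator shown above."""
    separator = "\n\n" + "=" * 60 + "\n\n"
    return separator.join(region_texts)
```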
## Performance

- **Without LLM cleanup**: Very fast (~2-3s per frame)
- **With LLM cleanup**: Slower, but still faster than vision models (~5-8s per frame)
- **Accuracy**: Far better than hallucination-prone vision models
## When to Use What

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
## Files Modified

- `meetus/hybrid_processor.py` - New hybrid processor
- `meetus/ocr_processor.py` - Layout preservation
- `meetus/workflow.py` - Hybrid mode integration
- `process_meeting.py` - CLI flags and examples
def/03-embed-images-for-llm.md
# 03 - Embed Images for LLM Analysis

## Date

2025-10-28

## Context

The hybrid OCR approach was fast and accurate, but its formatting was messy, and vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.

## Problem

- OCR/vision models either hallucinate or produce messy text
- Code formatting/indentation is hard to preserve
- Users want to analyze frames with their own LLM (Claude, GPT, etc.)
- File size must stay reasonable (~200KB per image is too big)
## Solution: Image Embedding

Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:

- See the actual screen content (no hallucination)
- Understand code structure, layout, and formatting visually
- Have full audio transcript context for each frame
- Analyze dashboards, terminals, and editors with perfect accuracy
## Implementation

**Quality Optimization:**

- Default JPEG quality: 80 (a good tradeoff between size and readability; see the encoding sketch below)
- Configurable via `--embed-quality` (0-100)
- Typical sizes at quality 80: ~40-80KB per image (vs ~200KB original)
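A minimal sketch of encoding a frame into an embeddable base64 JPEG, assuming OpenCV for compression; `encode_frame` is an illustrative name, not necessarily the project's function:

```python
import base64
import cv2

def encode_frame(image, quality: int = 80) -> tuple[str, int]:
    """Compress a frame to JPEG at the given quality and base64-encode it."""
    ok, buffer = cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, quality])
    if not ok:
        raise ValueError("JPEG encoding failed")
    data = base64.b64encode(buffer.tobytes()).decode("ascii")
    size_kb = len(buffer) // 1024  # reported next to the image in the transcript
    return f"data:image/jpeg;base64,{data}", size_kb
```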
**Format:**

```
[MM:SS] SPEAKER:
Audio transcript text here

[MM:SS] SCREEN CONTENT:
IMAGE (base64, 52KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>

TEXT:
| Optional OCR text for reference
```
**Features:**

- Base64 encoding for easy embedding
- Size tracking and reporting
- Optional text content alongside images
- Works with scene detection for smart frame selection
## Usage

```bash
# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v

# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v

# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v

# Iterate on the scene threshold (reuses cached whisper output)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
```
## File Sizes

**Example for 20 frames:**

- Quality 60: ~30-50KB per image = 0.6-1MB total
- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
- Quality 90: ~80-120KB per image = 1.6-2.4MB total
- Original: ~200KB per image = 4MB total
## Benefits

✓ **No hallucination**: The LLM sees actual pixels
✓ **Perfect formatting**: Code structure preserved visually
✓ **Full context**: Audio transcript + visual frame together
✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
✓ **Reasonable size**: Quality 80 yields files roughly 4x smaller than the originals
✓ **Simple workflow**: One file contains everything
## Use Cases

- **Code walkthroughs:** The LLM can see actual code structure and indentation
- **Dashboard analysis:** Charts, graphs, and metrics are visible to the LLM
- **Terminal sessions:** Commands and output in proper context
- **UI reviews:** The actual interface is visible alongside audio commentary
## Files Modified

- `meetus/transcript_merger.py` - Image encoding and embedding
- `meetus/workflow.py` - Wire through config
- `process_meeting.py` - CLI flags
- `meetus/output_manager.py` - Cleaner directory naming (date + increment)
## Output Directory Naming

The output directory format also changed for clarity (a naming sketch follows):

- Old: `20251028_054553-video` (confusing timestamps)
- New: `20251028-001-video` (clear date + run number)
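A minimal sketch of the date-plus-increment scheme; `next_output_dir` and the directory root are illustrative assumptions:

```python
from datetime import date
from pathlib import Path

def next_output_dir(root: Path, video_name: str) -> Path:
    """Pick the first unused YYYYMMDD-NNN-<name> directory for today."""
    today = date.today().strftime("%Y%m%d")
    for run in range(1, 1000):
        candidate = root / f"{today}-{run:03d}-{video_name}"
        if not candidate.exists():
            return candidate
    raise RuntimeError("too many runs for one day")
```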
def/04-fix-whisper-cache-loading.md
# 04 - Fix Whisper Cache Loading

## Date

2025-10-28

## Problem

The enhanced transcript was not including audio segments from cached whisper transcripts when running without the `--run-whisper` flag.

Example command that failed:

```bash
python process_meeting.py samples/zaca-run-scrapers.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
```

Result: The enhanced transcript contained only embedded images and no audio segments (0 "SPEAKER" entries).
## Root Cause

In `workflow.py`, the `_run_whisper()` method checked the `run_whisper` flag **before** checking the cache:

```python
def _run_whisper(self) -> Optional[str]:
    if not self.config.run_whisper:
        return self.config.transcript_path  # Returns None if --transcript not specified

    # Cache check NEVER REACHED if run_whisper is False
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)
```
This meant:

- The user runs a command without `--run-whisper`
- The method returns None immediately
- The cached whisper transcript is never discovered
- No audio segments appear in the enhanced output
## Solution

Reorder the logic to check the cache **first**, regardless of flags:

```python
def _run_whisper(self) -> Optional[str]:
    """Run Whisper transcription if requested, or use cached/provided transcript."""
    # First, check cache (regardless of run_whisper flag)
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)

    # If no cache and not running whisper, use provided transcript path (if any)
    if not self.config.run_whisper:
        return self.config.transcript_path

    # If no cache and run_whisper is True, run whisper transcription
    # ... rest of whisper code
```
## New Behavior

1. The cache is checked first (regardless of the `--run-whisper` flag)
2. If a cached whisper transcript exists, use it
3. If there is no cache and `--run-whisper` is not specified, use the `--transcript` path (or None)
4. If there is no cache and `--run-whisper` is specified, run whisper
## Benefits

✓ Cached whisper transcripts are always discovered and used
✓ Users can iterate on frame extraction/analysis without re-running whisper
✓ Enhanced transcripts now properly include both audio and visual content
✓ Granular cache flags (`--skip-cache-frames`, `--skip-cache-whisper`) work as expected
## Use Case

```bash
# First run: Generate whisper transcript + extract frames
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v

# Second run: Iterate on the scene threshold without re-running whisper
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
# Now correctly includes the cached whisper transcript in the enhanced output
```
## Files Modified

- `meetus/workflow.py` - Reordered logic in the `_run_whisper()` method (lines 172-181)