add vision processor

README.md

# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.

## Overview

This tool enhances meeting transcripts by combining:

- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)

### Vision Analysis vs OCR

- **Vision Models** (recommended): Uses a local LLaVA model via Ollama to understand context - great for dashboards, code, and consoles
- **OCR**: Traditional text extraction - faster but less context-aware

The result is a rich, timestamped transcript that provides full context for AI summarization.

### 1. System Dependencies

**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for a lighter model:
ollama pull llava:7b
```

**FFmpeg** (for scene detection):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Arch Linux
sudo pacman -S tesseract
```

### 2. Python Dependencies

### 4. Optional: Install Alternative OCR Engines

If you prefer OCR over vision analysis:
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr

# PaddleOCR
pip install paddleocr
```

## Quick Start

### Recommended: Vision Analysis (Best for Code/Dashboards)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
```

This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use the LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder
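
The merge step boils down to interleaving Whisper segments and frame analyses by timestamp. As a rough sketch (the dict shapes and `[Ns] AUDIO/SCREEN` formatting here are illustrative, not the tool's actual schema):

```python
# Sketch: align frame analyses with Whisper segments by timestamp.
# The segment/frame dict shapes below are illustrative, not the tool's schema.

def merge_timeline(segments, frames):
    """Interleave transcript segments and frame analyses chronologically."""
    events = []
    for seg in segments:
        events.append((seg["start"], f"[{seg['start']:.0f}s] AUDIO: {seg['text']}"))
    for frame in frames:
        events.append((frame["time"], f"[{frame['time']:.0f}s] SCREEN: {frame['summary']}"))
    events.sort(key=lambda e: e[0])  # chronological order
    return "\n".join(text for _, text in events)

segments = [
    {"start": 0.0, "end": 4.2, "text": "Let's look at the dashboard."},
    {"start": 4.2, "end": 9.0, "text": "The error rate spiked at nine."},
]
frames = [{"time": 5.0, "summary": "Grafana panel: error_rate 4.2%"}]
print(merge_timeline(segments, frames))
```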

### Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision

# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```

### Traditional OCR (Faster, Less Context-Aware)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper
```

## Usage Examples

### Vision Analysis with Context Hints
```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code

# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard

# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
```

### Different Vision Models
```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b

# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```

### Extract frames at different intervals
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```

### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```
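
Under the hood, scene detection can be done with FFmpeg's `select` filter, which keeps a frame only when the inter-frame `scene` change score exceeds a threshold. A sketch of how such a command might be assembled (the 0.3 threshold and output filename pattern are illustrative assumptions, not necessarily what `process_meeting.py` uses):

```python
# Sketch: build an FFmpeg scene-detection command line (not executed here).
# The threshold and output pattern are illustrative assumptions.

def scene_detect_cmd(video, out_dir, threshold=0.3):
    """FFmpeg argv that saves a frame whenever the scene-change score exceeds threshold."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select='gt(scene,{threshold})'",  # keep only scene-change frames
        "-vsync", "vfr",                           # one output image per selected frame
        f"{out_dir}/frame_%04d.jpg",
    ]

print(" ".join(scene_detect_cmd("samples/meeting.mkv", "frames")))
```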

### Traditional OCR (if you prefer)
```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper

# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr

# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
```

### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv

# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```

### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
```

### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
```

## Output Files

All output files are saved to the `output/` directory by default:

- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
- **`frames/`** - Extracted video frames (JPG files)

### Caching Behavior

The tool automatically caches intermediate results to speed up re-runs:
- **Whisper transcript**: Cached as `output/<video>.json`
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`

Re-running with the same video will use cached results unless `--no-cache` is specified.
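
The cache check itself is just "reuse the artifact if it exists, unless forced". A minimal sketch of that pattern (the `load_or_compute` helper and its signature are hypothetical, not the script's actual API):

```python
from pathlib import Path

# Sketch: reuse a cached artifact unless --no-cache was requested.
# This helper is hypothetical; the script's internals may differ.
def load_or_compute(path, compute, no_cache=False):
    """Return cached file contents if present; otherwise compute and cache."""
    p = Path(path)
    if p.exists() and not no_cache:
        return p.read_text()  # cache hit: skip the expensive step
    result = compute()        # cache miss (or forced): do the work
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(result)
    return result
```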

## Workflow for Meeting Analysis

### Complete Workflow (One Command!)

```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection

# Output will be in output/alo-intro1_enhanced.txt
```

### Typical Iterative Workflow

```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code

# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b

# All runs reuse the cached frames and transcript!
```

### Traditional Workflow (Separate Steps)

```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output

# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection

# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
```

### Example Prompt for Claude

## Tips for Best Results

### Vision vs OCR: When to Use Each

**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ You need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for the local model)

**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ You need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware; may miss visual relationships

### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
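
Internally, a context hint amounts to selecting a different prompt for the vision model. A hypothetical version of that mapping (these prompt strings are illustrative, not the script's actual wording):

```python
# Sketch: map --vision-context values to vision-model prompts.
# The prompt strings are hypothetical, not the script's actual wording.

VISION_PROMPTS = {
    "meeting": "Describe what is shown in this meeting screen share.",
    "code": "Transcribe the code on screen, preserving formatting; note the language and file names.",
    "dashboard": "List the panel names, key metrics, and any visible trends or alerts.",
    "console": "Transcribe the commands, their output, and any error messages.",
}

def prompt_for(context="meeting"):
    """Fall back to the general-purpose prompt for unknown contexts."""
    return VISION_PROMPTS.get(context, VISION_PROMPTS["meeting"])
```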

### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.

### OCR Engine Selection
- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.

### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
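
With the `ollama` Python package, analyzing one frame is a single chat call with the image attached to the message. The sketch below only builds the payload so it can run without a server; passing it to `ollama.chat(**request)` would perform the actual call:

```python
# Sketch: the payload for one frame-analysis call via the ollama package.
# Building it requires nothing; sending it requires a running Ollama server.

def vision_request(frame_path, prompt, model="llava:13b"):
    """Payload for ollama.chat(): one user message with an attached image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [frame_path],  # the ollama client accepts image file paths
        }],
    }

request = vision_request("frames/meeting_0001.jpg", "Describe this screen share.")
```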

### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly
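
Deduplication can be as simple as fingerprinting consecutive frames and dropping repeats. A dependency-free sketch using a difference hash over a small grayscale grid (the real implementation operates on decoded frames and may use a tolerance rather than exact hash equality):

```python
# Sketch: drop consecutive frames whose difference hash matches the previous one.
# Frames are tiny grayscale grids here for illustration only.

def dhash(grid):
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    return "".join(
        "1" if left > right else "0"
        for row in grid
        for left, right in zip(row, row[1:])
    )

def deduplicate(frames):
    """Keep a frame only when its hash differs from the previously kept frame."""
    kept, last = [], None
    for frame in frames:
        h = dhash(frame)
        if h != last:
            kept.append(frame)
            last = h
    return kept

slide_a = [[10, 20, 30], [30, 20, 10]]
slide_b = [[90, 10, 50], [5, 80, 40]]
print(len(deduplicate([slide_a, slide_a, slide_b])))  # → 2 (duplicate slide_a dropped)
```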

## Troubleshooting

### Vision Model Issues

**"ollama package not installed"**
```bash
pip install ollama
```

**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```

**Vision analysis is slow**
- Use a lighter model: `--vision-model llava:7b`
- Reduce the frame count: `--scene-detection` or `--interval 10`
- Check whether Ollama is using the GPU (much faster)

**Poor vision analysis results**
- Try a different context hint: `--vision-context code` or `--vision-context dashboard`
- Use a larger model: `--vision-model llava:13b`
- Ensure frames are clear (check the video resolution)

### OCR Issues

**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr  # Don't forget the system package!
```

**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try a different OCR engine: `--ocr-engine easyocr`
- Check that the video resolution is sufficient
- Use `--no-deduplicate` to keep more frames

### General Issues

**"No frames extracted"**
- Check the video file is valid: `ffmpeg -i video.mkv`
- Try a lower interval: `--interval 3`
- Check disk space in the frames directory

**Scene detection not working**
- The tool falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try a manual interval: `--interval 5`

**Cache not being used**
- Ensure you're using the same video filename
- Check that the output directory contains the cached files
- Use `--verbose` to see what's being cached/loaded

## Project Structure