add vision processor

Mariano Gabriel
2025-10-19 22:58:28 -03:00
parent ae89564373
commit a999bc9093
4 changed files with 511 additions and 107 deletions

README.md

@@ -1,12 +1,17 @@
# Meeting Processor
Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
## Overview
This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)
### Vision Analysis vs OCR
- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
- **OCR**: Traditional text extraction - faster but less context-aware
The result is a rich, timestamped transcript that provides full context for AI summarization.
@@ -14,16 +19,13 @@ The result is a rich, timestamped transcript that provides full context for AI s
### 1. System Dependencies
**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for lighter model:
ollama pull llava:7b
```
**FFmpeg** (for scene detection):
@@ -35,6 +37,18 @@ sudo apt-get install ffmpeg
brew install ffmpeg
```
**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Arch Linux
sudo pacman -S tesseract
```
### 2. Python Dependencies
```bash
@@ -49,6 +63,7 @@ pip install openai-whisper
### 4. Optional: Install Alternative OCR Engines
If you prefer OCR over vision analysis:
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr
@@ -59,118 +74,173 @@ pip install paddleocr
## Quick Start
### Recommended: Vision Analysis (Best for Code/Dashboards)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
```
This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder
### Re-run with Cached Results
Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision
# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
### Traditional OCR (Faster, Less Context-Aware)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper
```
## Usage Examples
### Vision Analysis with Context Hints
```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
```
### Different Vision Models
```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```
### Extract frames at different intervals
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```
### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```
### Traditional OCR (if you prefer)
```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper
# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
```
### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv
# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
```
### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
```
## Output Files
All output files are saved to the `output/` directory by default:
- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
- **`frames/`** - Extracted video frames (JPG files)
### Caching Behavior
The tool automatically caches intermediate results to speed up re-runs:
- **Whisper transcript**: Cached as `output/<video>.json`
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
Re-running with the same video will use cached results unless `--no-cache` is specified.
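As a quick sanity check, you can see which cached artifacts already exist before re-running (a minimal sketch using the default paths above; `meeting` is a placeholder for your video's basename):

```shell
# List which cached artifacts exist for a given video basename
# ("meeting" is a placeholder; adjust to your file's name)
VIDEO=meeting
for f in "output/${VIDEO}.json" \
         "output/${VIDEO}_vision.json" \
         "output/${VIDEO}_ocr.json"; do
  if [ -e "$f" ]; then
    echo "cached:  $f"
  else
    echo "missing: $f"
  fi
done
```

Any file reported missing will be regenerated on the next run; everything else is reused unless you pass `--no-cache`.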
## Workflow for Meeting Analysis
### Complete Workflow (One Command!)
```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
# Output will be in output/alo-intro1_enhanced.txt
```
### Typical Iterative Workflow
```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
# All use cached frames and transcript!
```
### Traditional Workflow (Separate Steps)
```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
--transcript output/alo-intro1.json \
--use-vision \
--scene-detection
# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
```
### Example Prompt for Claude
@@ -217,42 +287,99 @@ Options:
## Tips for Best Results
### Vision vs OCR: When to Use Each
**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower, since it runs a local model on your GPU or CPU
**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
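To get a feel for the volume difference, here is a back-of-the-envelope frame count for interval extraction (the duration is an example value; the 5-second interval is the default mentioned in the Quick Start):

```shell
# Frames produced by interval extraction for a 1-hour meeting
DURATION=3600   # meeting length in seconds (example value)
INTERVAL=5      # seconds between frames (the tool's default)
echo "$(( DURATION / INTERVAL )) frames to analyze"
```

That is 720 frames to OCR or run through the vision model; scene detection instead emits roughly one frame per slide or screen change, which is why it is usually far cheaper for slide-based meetings.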
### OCR Engine Selection
- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.
### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly
## Troubleshooting
### "pytesseract not installed"
### Vision Model Issues
**"ollama package not installed"**
```bash
pip install ollama
```
**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```
**Vision analysis is slow**
- Use lighter model: `--vision-model llava:7b`
- Reduce frame count: `--scene-detection` or `--interval 10`
- Check if Ollama is using GPU (much faster)
**Poor vision analysis results**
- Try different context hint: `--vision-context code` or `--vision-context dashboard`
- Use larger model: `--vision-model llava:13b`
- Ensure frames are clear (check video resolution)
### OCR Issues
**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr # Don't forget system package!
```
### "No frames extracted"
**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try different OCR engine: `--ocr-engine easyocr`
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames
### General Issues
**"No frames extracted"**
- Check video file is valid: `ffmpeg -i video.mkv`
- Try lower interval: `--interval 3`
- Check disk space in frames directory
**Scene detection not working**
- Falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try manual interval: `--interval 5`
**Cache not being used**
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use `--verbose` to see what's being cached/loaded
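If `--no-cache` reprocesses more than you want, deleting individual cached files is a manual alternative (a sketch assuming the default output paths; `meeting` is a placeholder basename, and only the deleted stage should be regenerated on the next run):

```shell
# Invalidate only the analysis cache, keeping transcript and frames
# ("meeting" is a placeholder for your video's basename)
VIDEO=meeting
rm -f "output/${VIDEO}_vision.json" "output/${VIDEO}_ocr.json"
# output/${VIDEO}.json and frames/${VIDEO}_*.jpg remain cached
```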
## Project Structure
```