add vision processor

README.md

# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.

## Overview

This tool enhances meeting transcripts by combining:

- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)

### Vision Analysis vs OCR

- **Vision Models** (recommended): Uses a local LLaVA model via Ollama to understand context - great for dashboards, code, and consoles
- **OCR**: Traditional text extraction - faster but less context-aware

The result is a rich, timestamped transcript that provides full context for AI summarization.

### 1. System Dependencies

**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for a lighter model:
ollama pull llava:7b
```

**FFmpeg** (for scene detection):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Arch Linux
sudo pacman -S tesseract
```

### 2. Python Dependencies

### 4. Optional: Install Alternative OCR Engines

If you prefer OCR over vision analysis:
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr

# PaddleOCR
pip install paddleocr
```

## Quick Start

### Recommended: Vision Analysis (Best for Code/Dashboards)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
```

This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use the LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder
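
The merge step boils down to interleaving Whisper segments and frame analyses by timestamp. As a rough sketch (the dict shapes and `[Ns] AUDIO/SCREEN` formatting here are illustrative, not the tool's actual schema):

```python
# Sketch: align frame analyses with Whisper segments by timestamp.
# The segment/frame dict shapes below are illustrative, not the tool's schema.

def merge_timeline(segments, frames):
    """Interleave transcript segments and frame analyses chronologically."""
    events = []
    for seg in segments:
        events.append((seg["start"], f"[{seg['start']:.0f}s] AUDIO: {seg['text']}"))
    for frame in frames:
        events.append((frame["time"], f"[{frame['time']:.0f}s] SCREEN: {frame['summary']}"))
    events.sort(key=lambda e: e[0])  # chronological order
    return "\n".join(text for _, text in events)

segments = [
    {"start": 0.0, "end": 4.2, "text": "Let's look at the dashboard."},
    {"start": 4.2, "end": 9.0, "text": "The error rate spiked at nine."},
]
frames = [{"time": 5.0, "summary": "Grafana panel: error_rate 4.2%"}]
print(merge_timeline(segments, frames))
```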

### Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision

# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```

### Traditional OCR (Faster, Less Context-Aware)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper
```

## Usage Examples

### Vision Analysis with Context Hints
```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code

# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard

# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
```

### Different Vision Models
```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b

# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```

### Extract frames at different intervals
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```

### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```
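
Under the hood, scene detection can be done with FFmpeg's `select` filter, which keeps a frame only when the inter-frame `scene` change score exceeds a threshold. A sketch of how such a command might be assembled (the 0.3 threshold and output filename pattern are illustrative assumptions, not necessarily what `process_meeting.py` uses):

```python
# Sketch: build an FFmpeg scene-detection command line (not executed here).
# The threshold and output pattern are illustrative assumptions.

def scene_detect_cmd(video, out_dir, threshold=0.3):
    """FFmpeg argv that saves a frame whenever the scene-change score exceeds threshold."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select='gt(scene,{threshold})'",  # keep only scene-change frames
        "-vsync", "vfr",                           # one output image per selected frame
        f"{out_dir}/frame_%04d.jpg",
    ]

print(" ".join(scene_detect_cmd("samples/meeting.mkv", "frames")))
```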

### Traditional OCR (if you prefer)
```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper

# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr

# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
```

### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv

# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```

### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
```

### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
```

## Output Files

All output files are saved to the `output/` directory by default:

- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
- **`frames/`** - Extracted video frames (JPG files)

### Caching Behavior

The tool automatically caches intermediate results to speed up re-runs:
- **Whisper transcript**: Cached as `output/<video>.json`
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`

Re-running with the same video will use cached results unless `--no-cache` is specified.
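
The cache check itself is just "reuse the artifact if it exists, unless forced". A minimal sketch of that pattern (the `load_or_compute` helper and its signature are hypothetical, not the script's actual API):

```python
from pathlib import Path

# Sketch: reuse a cached artifact unless --no-cache was requested.
# This helper is hypothetical; the script's internals may differ.
def load_or_compute(path, compute, no_cache=False):
    """Return cached file contents if present; otherwise compute and cache."""
    p = Path(path)
    if p.exists() and not no_cache:
        return p.read_text()  # cache hit: skip the expensive step
    result = compute()        # cache miss (or forced): do the work
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(result)
    return result
```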

## Workflow for Meeting Analysis

### Complete Workflow (One Command!)

```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection

# Output will be in output/alo-intro1_enhanced.txt
```

### Typical Iterative Workflow

```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code

# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b

# All runs reuse the cached frames and transcript!
```

### Traditional Workflow (Separate Steps)

```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output

# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection

# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
```

### Example Prompt for Claude

## Tips for Best Results

### Vision vs OCR: When to Use Each

**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ You need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for the local model)

**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ You need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware; may miss visual relationships

### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
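
Internally, a context hint amounts to selecting a different prompt for the vision model. A hypothetical version of that mapping (these prompt strings are illustrative, not the script's actual wording):

```python
# Sketch: map --vision-context values to vision-model prompts.
# The prompt strings are hypothetical, not the script's actual wording.

VISION_PROMPTS = {
    "meeting": "Describe what is shown in this meeting screen share.",
    "code": "Transcribe the code on screen, preserving formatting; note the language and file names.",
    "dashboard": "List the panel names, key metrics, and any visible trends or alerts.",
    "console": "Transcribe the commands, their output, and any error messages.",
}

def prompt_for(context="meeting"):
    """Fall back to the general-purpose prompt for unknown contexts."""
    return VISION_PROMPTS.get(context, VISION_PROMPTS["meeting"])
```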

### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.

### OCR Engine Selection
- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.

### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
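
With the `ollama` Python package, analyzing one frame is a single chat call with the image attached to the message. The sketch below only builds the payload so it can run without a server; passing it to `ollama.chat(**request)` would perform the actual call:

```python
# Sketch: the payload for one frame-analysis call via the ollama package.
# Building it requires nothing; sending it requires a running Ollama server.

def vision_request(frame_path, prompt, model="llava:13b"):
    """Payload for ollama.chat(): one user message with an attached image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [frame_path],  # the ollama client accepts image file paths
        }],
    }

request = vision_request("frames/meeting_0001.jpg", "Describe this screen share.")
```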

### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly
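
Deduplication can be as simple as fingerprinting consecutive frames and dropping repeats. A dependency-free sketch using a difference hash over a small grayscale grid (the real implementation operates on decoded frames and may use a tolerance rather than exact hash equality):

```python
# Sketch: drop consecutive frames whose difference hash matches the previous one.
# Frames are tiny grayscale grids here for illustration only.

def dhash(grid):
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    return "".join(
        "1" if left > right else "0"
        for row in grid
        for left, right in zip(row, row[1:])
    )

def deduplicate(frames):
    """Keep a frame only when its hash differs from the previously kept frame."""
    kept, last = [], None
    for frame in frames:
        h = dhash(frame)
        if h != last:
            kept.append(frame)
            last = h
    return kept

slide_a = [[10, 20, 30], [30, 20, 10]]
slide_b = [[90, 10, 50], [5, 80, 40]]
print(len(deduplicate([slide_a, slide_a, slide_b])))  # → 2 (duplicate slide_a dropped)
```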

## Troubleshooting

### Vision Model Issues

**"ollama package not installed"**
```bash
pip install ollama
```

**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```

**Vision analysis is slow**
- Use a lighter model: `--vision-model llava:7b`
- Reduce the frame count: `--scene-detection` or `--interval 10`
- Check whether Ollama is using the GPU (much faster)

**Poor vision analysis results**
- Try a different context hint: `--vision-context code` or `--vision-context dashboard`
- Use a larger model: `--vision-model llava:13b`
- Ensure frames are clear (check the video resolution)

### OCR Issues

**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr  # Don't forget the system package!
```

**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try a different OCR engine: `--ocr-engine easyocr`
- Check that the video resolution is sufficient
- Use `--no-deduplicate` to keep more frames

### General Issues

**"No frames extracted"**
- Check the video file is valid: `ffmpeg -i video.mkv`
- Try a lower interval: `--interval 3`
- Check disk space in the frames directory

**Scene detection not working**
- The tool falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try a manual interval: `--interval 5`

**Cache not being used**
- Ensure you're using the same video filename
- Check that the output directory contains the cached files
- Use `--verbose` to see what's being cached/loaded

## Project Structure