Meeting Processor
Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
Overview
This tool enhances meeting transcripts by combining:
- Audio transcription (from Whisper)
- Screen content analysis (Vision models or OCR)
Vision Analysis vs OCR
- Vision Models (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
- OCR: Traditional text extraction - faster but less context-aware
The result is a rich, timestamped transcript that provides full context for AI summarization.
Installation
1. System Dependencies
Ollama (required for vision analysis):
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for lighter model:
ollama pull llava:7b
FFmpeg (for scene detection):
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
Tesseract OCR (optional, if not using vision):
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Arch Linux
sudo pacman -S tesseract
2. Python Dependencies
pip install -r requirements.txt
3. Whisper (for audio transcription)
pip install openai-whisper
4. Optional: Install Alternative OCR Engines
If you prefer OCR over vision analysis:
# EasyOCR (better for rotated/handwritten text)
pip install easyocr
# PaddleOCR (better for code/terminal screens)
pip install paddleocr
Quick Start
Recommended: Vision Analysis (Best for Code/Dashboards)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
This will:
- Run Whisper transcription (audio → text)
- Extract frames every 5 seconds
- Use LLaVA vision model to analyze frames with context
- Merge audio + screen content
- Save everything to
output/folder
Re-run with Cached Results
Already ran it once? Re-run instantly using cached results:
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision
# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
Traditional OCR (Faster, Less Context-Aware)
python process_meeting.py samples/meeting.mkv --run-whisper
Usage Examples
Vision Analysis with Context Hints
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
Different Vision Models
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
Extract frames at different intervals
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
Use scene detection (smarter, fewer frames)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
Traditional OCR (if you prefer)
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper
# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
Caching Examples
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv
# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
Custom output location
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
Enable verbose logging
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
Output Files
All output files are saved to the output/ directory by default:
output/<video>_enhanced.txt- Enhanced transcript ready for AI summarizationoutput/<video>.json- Whisper transcript (if--run-whisperwas used)output/<video>_vision.json- Vision analysis results with timestamps (if--use-vision)output/<video>_ocr.json- OCR results with timestamps (if using OCR)frames/- Extracted video frames (JPG files)
Caching Behavior
The tool automatically caches intermediate results to speed up re-runs:
- Whisper transcript: Cached as
output/<video>.json - Extracted frames: Cached in
frames/<video>_*.jpg - Analysis results: Cached as
output/<video>_vision.jsonoroutput/<video>_ocr.json
Re-running with the same video will use cached results unless --no-cache is specified.
Workflow for Meeting Analysis
Complete Workflow (One Command!)
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
# Output will be in output/alo-intro1_enhanced.txt
Typical Iterative Workflow
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
# All use cached frames and transcript!
Traditional Workflow (Separate Steps)
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
--transcript output/alo-intro1.json \
--use-vision \
--scene-detection
# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
Example Prompt for Claude
Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented
[Paste enhanced transcript here]
Command Reference
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
[--whisper-model {tiny,base,small,medium,large}]
[--output OUTPUT] [--output-dir OUTPUT_DIR]
[--frames-dir FRAMES_DIR] [--interval INTERVAL]
[--scene-detection]
[--ocr-engine {tesseract,easyocr,paddleocr}]
[--no-deduplicate] [--extract-only]
[--format {detailed,compact}] [--verbose]
video
Options:
video Path to video file
--transcript, -t Path to Whisper transcript (JSON or TXT)
--run-whisper Run Whisper transcription before processing
--whisper-model Whisper model: tiny, base, small, medium, large (default: base)
--output, -o Output file for enhanced transcript
--output-dir Directory for output files (default: output/)
--frames-dir Directory to save extracted frames (default: frames/)
--interval Extract frame every N seconds (default: 5)
--scene-detection Use scene detection instead of interval extraction
--ocr-engine OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
--no-deduplicate Disable text deduplication
--extract-only Only extract frames and OCR, skip transcript merging
--format Output format: detailed or compact (default: detailed)
--verbose, -v Enable verbose logging (DEBUG level)
Tips for Best Results
Vision vs OCR: When to Use Each
Use Vision Models (--use-vision) when:
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
Use OCR when:
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
Context Hints for Vision Analysis
--vision-context meeting: General purpose (default)--vision-context code: Optimized for code screenshots, preserves formatting--vision-context dashboard: Extracts metrics, trends, panel names--vision-context console: Captures commands, output, error messages
Scene Detection vs Interval
- Scene detection: Better for presentations with distinct slides. More efficient.
- Interval extraction: Better for continuous screen sharing (coding, browsing). More thorough.
Vision Model Selection
llava:7b: Faster, lower memory (~4GB RAM), good qualityllava:13b: Better quality, slower, needs ~8GB RAM (default)bakllava: Alternative with different strengths
Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with
--no-deduplicateif slides/screens change subtly
Troubleshooting
Vision Model Issues
"ollama package not installed"
pip install ollama
"Ollama not found" or connection errors
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
Vision analysis is slow
- Use lighter model:
--vision-model llava:7b - Reduce frame count:
--scene-detectionor--interval 10 - Check if Ollama is using GPU (much faster)
Poor vision analysis results
- Try different context hint:
--vision-context codeor--vision-context dashboard - Use larger model:
--vision-model llava:13b - Ensure frames are clear (check video resolution)
OCR Issues
"pytesseract not installed"
pip install pytesseract
sudo apt-get install tesseract-ocr # Don't forget system package!
Poor OCR quality
- Solution: Switch to vision analysis with
--use-vision - Or try different OCR engine:
--ocr-engine easyocr - Check if video resolution is sufficient
- Use
--no-deduplicateto keep more frames
General Issues
"No frames extracted"
- Check video file is valid:
ffmpeg -i video.mkv - Try lower interval:
--interval 3 - Check disk space in frames directory
Scene detection not working
- Fallback to interval extraction automatically
- Ensure FFmpeg is installed
- Try manual interval:
--interval 5
Cache not being used
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use
--verboseto see what's being cached/loaded
Project Structure
meetus/
├── meetus/ # Main package
│ ├── __init__.py
│ ├── frame_extractor.py # Video frame extraction
│ ├── ocr_processor.py # OCR processing
│ └── transcript_merger.py # Transcript merging
├── process_meeting.py # Main CLI script
├── requirements.txt # Python dependencies
└── README.md # This file
License
For personal use.