Meeting Processor
Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.
Overview
This tool enhances meeting transcripts by combining:
- Audio transcription (Whisper or WhisperX with speaker diarization)
- Screen content extraction via FFmpeg scene detection
- Frame embedding for direct LLM analysis
The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.
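As a rough illustration of what "merging" means here, the sketch below interleaves transcript segments and frame references by timestamp. The `Segment` and `Frame` types, the `merge` function, and the `[FRAME: ...]` marker are hypothetical names chosen for this example; the actual format written by the tool may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    text: str


@dataclass
class Frame:
    timestamp: float  # seconds from the start of the recording
    path: str


def format_time(seconds: float) -> str:
    """Render seconds as MM:SS for display in the transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def merge(segments: list[Segment], frames: list[Frame]) -> str:
    """Interleave transcript lines and frame references in time order."""
    events = [(s.start, 1, f"[{format_time(s.start)}] {s.text}") for s in segments]
    events += [
        (f.timestamp, 0, f"[{format_time(f.timestamp)}] [FRAME: {f.path}]")
        for f in frames
    ]
    # Sort by time; on a tie the frame (0) is listed before the speech (1).
    events.sort(key=lambda event: (event[0], event[1]))
    return "\n".join(line for _, _, line in events)
```

Placing the frame before the speech on a tie is a design choice for this sketch: it lets the reader (or the LLM) see the screen before the words that describe it.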
Installation
1. System Dependencies
FFmpeg (required for scene detection and frame extraction):
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
2. Python Dependencies
pip install -r requirements.txt
3. Whisper or WhisperX (for audio transcription)
Standard Whisper:
pip install openai-whisper
WhisperX (recommended - includes speaker diarization):
pip install whisperx
For speaker diarization, you'll need a HuggingFace token with access to pyannote models.
Quick Start
Recommended: Embed Frames with Scene Detection
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
This will:
- Run Whisper transcription (audio → text)
- Extract frames at scene changes (smarter than fixed intervals)
- Embed frame references in the transcript for LLM analysis
- Save everything to the output/ folder
With Speaker Diarization (WhisperX)
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
This uses WhisperX to identify different speakers in the transcript.
Re-run with Cached Results
Already ran it once? Re-run instantly using cached results:
# Uses cached transcript and frames
python process_meeting.py samples/meeting.mkv --embed-images
# Skip only specific cached items
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
Usage Examples
Scene Detection Options
# Default scene detection (threshold: 15)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
# More sensitive (more frames captured, threshold: 5)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5
# Less sensitive (fewer frames, threshold: 30)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30
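For readers curious how a threshold might translate into an FFmpeg call, here is a minimal sketch. FFmpeg's `select` filter exposes a `scene` score between 0 and 1, so this example assumes the tool's 0-100 threshold is simply divided by 100; that mapping, the function name `scene_detect_command`, and the output file pattern are assumptions for illustration, not the tool's confirmed behavior.

```python
def scene_detect_command(video: str, out_dir: str, threshold: float = 15.0) -> list[str]:
    """Build an FFmpeg command that saves one JPEG per detected scene change."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # Assumed mapping: the 0-100 CLI value onto FFmpeg's 0-1 scene score.
    score = threshold / 100.0
    return [
        "ffmpeg", "-i", video,
        # Keep only frames whose scene-change score exceeds the threshold;
        # showinfo logs each kept frame's timestamp to stderr.
        "-vf", f"select='gt(scene,{score:.2f})',showinfo",
        # Do not duplicate frames to fill gaps (newer FFmpeg prefers -fps_mode vfr).
        "-vsync", "vfr",
        f"{out_dir}/frame_%05d.jpg",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. A lower threshold produces a lower score cutoff, which is why lower values capture more frames.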
Fixed Interval Extraction (alternative to scene detection)
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10
# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3
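Fixed-interval extraction amounts to picking evenly spaced timestamps. The sketch below is a guess at that logic: it assumes the first frame is taken at `interval` seconds rather than at 0, which matches the example file names `frame_00001_5.00s.jpg` and `frame_00002_10.00s.jpg` in Output Files. The helper names are hypothetical.

```python
def interval_timestamps(duration: float, interval: float) -> list[float]:
    """Return the capture times (in seconds) for a video of the given duration."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    count = int(duration // interval)
    # Multiply instead of accumulating to avoid floating-point drift.
    return [i * interval for i in range(1, count + 1)]


def frame_name(index: int, timestamp: float) -> str:
    """Build a file name in the frame_<index>_<seconds>s.jpg style."""
    return f"frame_{index:05d}_{timestamp:.2f}s.jpg"
```

For a 12-second clip with `--interval 5`, this yields captures at 5 and 10 seconds.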
Frame Quality Options
# Default quality (80)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
# Lower quality for smaller files (60)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60
Caching Examples
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
# Iterate on scene threshold (reuse whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
# Re-run whisper only
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
Custom output location
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/
Enable verbose logging
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose
Output Files
Each video gets its own timestamped output directory:
output/
└── 20241019_143022-meeting/
    ├── manifest.json          # Processing configuration
    ├── meeting_enhanced.txt   # Enhanced transcript for AI
    ├── meeting.json           # Whisper/WhisperX transcript
    └── frames/                # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
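If you want to post-process the frames yourself, the file names carry both the frame index and the timestamp. The following sketch recovers them with a regular expression; it assumes every frame follows the `frame_<index>_<seconds>s.jpg` pattern shown above, and `parse_frame_name` is a hypothetical helper rather than part of the tool.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Pattern inferred from the example names, e.g. frame_00002_10.00s.jpg
FRAME_NAME = re.compile(r"^frame_(\d+)_(\d+(?:\.\d+)?)s\.jpg$")


def parse_frame_name(path: str) -> Optional[Tuple[int, float]]:
    """Return (index, timestamp in seconds), or None if the name does not match."""
    match = FRAME_NAME.match(Path(path).name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))
```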
Caching Behavior
The tool automatically reuses the most recent output directory for the same video:
- First run: Creates a new timestamped directory (e.g., 20241019_143022-meeting/)
- Subsequent runs: Reuses the same directory and cached results
- Cached items: Whisper transcript, extracted frames, analysis results
Fine-grained cache control:
- --no-cache: Force complete reprocessing
- --skip-cache-frames: Re-extract frames only
- --skip-cache-whisper: Re-run transcription only
- --skip-cache-analysis: Re-run analysis only
This allows you to iterate on scene detection thresholds without re-running Whisper!
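The "most recent output directory" lookup can be pictured as follows. This is a sketch under the assumption that directories are named `YYYYMMDD_HHMMSS-<video stem>`, as in the example above; `latest_run_dir` is a hypothetical function and the tool's real cache logic may use other criteria (for instance the manifest).

```python
import re
from typing import Iterable, Optional


def latest_run_dir(dir_names: Iterable[str], video_stem: str) -> Optional[str]:
    """Pick the newest YYYYMMDD_HHMMSS-<stem> directory name, or None."""
    pattern = re.compile(rf"^\d{{8}}_\d{{6}}-{re.escape(video_stem)}$")
    matches = [name for name in dir_names if pattern.match(name)]
    # The timestamp prefix is fixed-width, so string order is chronological.
    return max(matches, default=None)
```

In practice the names could come from `[p.name for p in Path("output").iterdir() if p.is_dir()]`, with `Path("samples/meeting.mkv").stem` as the stem. Matching the full name avoids confusing `meeting` with, say, `team-meeting`.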
Workflow for Meeting Analysis
Complete Workflow (One Command!)
# Process everything in one step with scene detection
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
# With speaker diarization
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
Typical Iterative Workflow
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
# Adjust scene threshold (keeps cached whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis
# Try different frame quality
python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis
Example Prompt for Claude
Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented
[Paste enhanced transcript here]
Command Reference
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
[--whisper-model {tiny,base,small,medium,large}]
[--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
[--interval INTERVAL] [--scene-detection]
[--scene-threshold SCENE_THRESHOLD]
[--embed-images] [--embed-quality EMBED_QUALITY]
[--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
[--skip-cache-analysis] [--no-deduplicate]
[--extract-only] [--format {detailed,compact}]
[--verbose] video
Main Options:
video Path to video file
--run-whisper Run Whisper transcription before processing
--whisper-model Whisper model: tiny, base, small, medium, large (default: medium)
--diarize Use WhisperX with speaker diarization
--embed-images Embed frame references for LLM analysis (recommended)
--embed-quality JPEG quality for frames (default: 80)
Frame Extraction:
--scene-detection Use FFmpeg scene detection (recommended)
--scene-threshold Detection sensitivity 0-100 (default: 15, lower=more sensitive)
--interval Extract frame every N seconds (alternative to scene detection)
Caching:
--no-cache Force complete reprocessing
--skip-cache-frames Re-extract frames only
--skip-cache-whisper Re-run transcription only
--skip-cache-analysis Re-run analysis only
Other:
--transcript, -t Path to existing Whisper transcript (JSON or TXT)
--output, -o Output file for enhanced transcript
--output-dir Directory for output files (default: output/)
--verbose, -v Enable verbose logging
Tips for Best Results
Scene Detection vs Interval
- Scene detection (--scene-detection): Recommended. Captures frames when content changes. More efficient.
- Interval extraction (--interval N): Alternative for continuous content. Captures every N seconds.
Scene Detection Threshold
- Lower values (5-10): More sensitive, captures more frames
- Default (15): Good balance for most meetings
- Higher values (20-30): Less sensitive, fewer frames
Whisper vs WhisperX
- Whisper (--run-whisper): Standard transcription, fast
- WhisperX (--run-whisper --diarize): Adds speaker identification, requires HuggingFace token
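To show the practical difference in the output, the sketch below formats one transcript segment with or without a speaker label. It assumes segments are dictionaries with `start` and `text` keys (as in Whisper's JSON output) and an optional `speaker` key (as WhisperX adds after diarization); `format_segment` and the line layout are hypothetical.

```python
def format_segment(segment: dict) -> str:
    """Render one transcript segment, prefixing the speaker when present."""
    speaker = segment.get("speaker")
    prefix = f"{speaker}: " if speaker else ""
    # Segment text often starts with a space, so strip it for display.
    return f"[{segment['start']:.2f}s] {prefix}{segment['text'].strip()}"
```

A plain Whisper segment renders as `[3.50s] Hello everyone.`, while a diarized one renders as `[3.50s] SPEAKER_00: Hello everyone.`.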
Frame Quality
- Default quality (80) works well for most cases
- Use --embed-quality 60 for smaller files if storage is a concern
Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with --no-deduplicate if slides/screens change subtly
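The idea behind deduplication can be sketched as comparing each frame with the last one kept. The example below uses a mean absolute pixel difference on flat grayscale lists; this metric, the `max_diff` cutoff, and the function names are illustrative assumptions, and the tool may use a different similarity measure.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average absolute difference between two equally sized grayscale frames."""
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def deduplicate(frames: list[list[int]], max_diff: float = 2.0) -> list[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: list[int] = []
    for index, pixels in enumerate(frames):
        if not kept or mean_abs_diff(frames[kept[-1]], pixels) > max_diff:
            kept.append(index)
    return kept
```

This also shows why subtle slide changes can be lost: a small edit may stay under the cutoff, which is the case `--no-deduplicate` is meant for.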
Troubleshooting
Frame Extraction Issues
"No frames extracted"
- Check video file is valid: ffmpeg -i video.mkv
- Try lower scene threshold: --scene-threshold 5
- Try interval extraction: --interval 3
- Check disk space in output directory
Scene detection not working
- Ensure FFmpeg is installed
- If scene detection fails, the tool falls back to interval extraction automatically
- Try a manual interval: --interval 5
Whisper/WhisperX Issues
WhisperX diarization not working
- Ensure you have a HuggingFace token set
- Token needs access to pyannote models
- Fall back to standard Whisper without --diarize
Cache Issues
Cache not being used
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use --verbose to see what's being cached/loaded
Want to re-run specific steps
- --skip-cache-frames: Re-extract frames
- --skip-cache-whisper: Re-run transcription
- --skip-cache-analysis: Re-run analysis
- --no-cache: Force complete reprocessing
Experimental Features
OCR and Vision Analysis
OCR (--ocr-engine) and Vision analysis (--use-vision) options are available but experimental. The recommended approach is to use --embed-images which embeds frame references directly in the transcript, letting your LLM analyze the images.
# Experimental: OCR extraction
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
# Experimental: Vision model analysis
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Experimental: Hybrid OpenCV + OCR
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
Project Structure
meetus/
├── meetus/ # Main package
│ ├── __init__.py
│ ├── workflow.py # Processing orchestrator
│ ├── output_manager.py # Output directory & manifest management
│ ├── cache_manager.py # Caching logic
│ ├── frame_extractor.py # Video frame extraction (FFmpeg scene detection)
│ ├── vision_processor.py # Vision model analysis (experimental)
│ ├── ocr_processor.py # OCR processing (experimental)
│ └── transcript_merger.py # Transcript merging
├── process_meeting.py # Main CLI script
├── requirements.txt # Python dependencies
├── output/ # Timestamped output directories
│ └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video
├── samples/ # Sample videos (gitignored)
└── README.md # This file
License
For personal use.