# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.

## Overview

This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)

### Vision Analysis vs OCR

- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
- **OCR**: Traditional text extraction - faster but less context-aware

The result is a rich, timestamped transcript that provides full context for AI summarization.

## Installation

### 1. System Dependencies

**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for lighter model:
ollama pull llava:7b
```

**FFmpeg** (for scene detection):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Arch Linux
sudo pacman -S tesseract
```

### 2. Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Whisper (for audio transcription)

```bash
pip install openai-whisper
```

### 4. Optional: Install Alternative OCR Engines

If you prefer OCR over vision analysis:
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr

# PaddleOCR (better for code/terminal screens)
pip install paddleocr
```

## Quick Start

### Recommended: Vision Analysis (Best for Code/Dashboards)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
```

This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder

### Re-run with Cached Results

Already ran it once?
Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision

# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```

### Traditional OCR (Faster, Less Context-Aware)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper
```

## Usage Examples

### Vision Analysis with Context Hints

```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code

# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard

# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
```

### Different Vision Models

```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b

# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```

### Extract frames at different intervals

```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```

### Use scene detection (smarter, fewer frames)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```

### Traditional OCR (if you prefer)

```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper

# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr

# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
```

### Caching Examples

```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv

# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```

### Custom output location

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
```

### Enable verbose logging

```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
```

## Output Files

Each video gets its own timestamped output directory:

```
output/
└── 20241019_143022-meeting/
    ├── manifest.json              # Processing configuration
    ├── meeting_enhanced.txt       # Enhanced transcript for AI
    ├── meeting.json               # Whisper transcript
    ├── meeting_vision.json        # Vision analysis results
    └── frames/                    # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
```

### Manifest File

Each processing run creates a `manifest.json` that tracks:
- Video information (name, path)
- Processing timestamp
- Configuration used (Whisper model, vision settings, etc.)
- Output file locations

Example manifest:
```json
{
  "video": {
    "name": "meeting.mkv",
    "path": "/full/path/to/meeting.mkv"
  },
  "processed_at": "2024-10-19T14:30:22",
  "configuration": {
    "whisper": {"enabled": true, "model": "base"},
    "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"}
  }
}
```

### Caching Behavior

The tool automatically reuses the most recent output directory for the same video:
- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Subsequent runs**: Reuses the same directory and cached results
- **Cached items**: Whisper transcript, extracted frames, analysis results
- **Force new run**: Use `--no-cache` to create a fresh directory

This means you can switch between OCR and vision analysis without re-extracting frames.

## Workflow for Meeting Analysis

### Complete Workflow (One Command!)

```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection

# Output will be in output/<timestamp>-alo-intro1/alo-intro1_enhanced.txt
```

### Typical Iterative Workflow

```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code

# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b

# All use cached frames and transcript!
```

### Traditional Workflow (Separate Steps)

```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output

# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection

# 3. Use the enhanced transcript with AI
# Copy the content from output/<timestamp>-alo-intro1/alo-intro1_enhanced.txt and paste into Claude or your LLM
```

### Example Prompt for Claude

```
Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented

[Paste enhanced transcript here]
```

## Command Reference

```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
                          [--scene-detection]
                          [--ocr-engine {tesseract,easyocr,paddleocr}]
                          [--no-deduplicate] [--extract-only]
                          [--format {detailed,compact}] [--verbose]
                          video

Options:
  video                 Path to video file
  --transcript, -t      Path to Whisper transcript (JSON or TXT)
  --run-whisper         Run Whisper transcription before processing
  --whisper-model       Whisper model: tiny, base, small, medium, large (default: base)
  --output, -o          Output file for enhanced transcript
  --output-dir          Directory for output files (default: output/)
  --frames-dir          Directory to save extracted frames (default: frames/)
  --interval            Extract frame every N seconds (default: 5)
  --scene-detection     Use scene detection instead of interval extraction
  --ocr-engine          OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
  --no-deduplicate      Disable text deduplication
  --extract-only        Only extract frames and OCR, skip transcript merging
  --format              Output format: detailed or compact (default: detailed)
  --verbose, -v         Enable verbose logging (DEBUG level)
```

This listing omits the vision and caching flags used elsewhere in this README: `--use-vision`, `--vision-model` (default: `llava:13b`), `--vision-context` (default: `meeting`), and `--no-cache`.

## Tips for Best Results

### Vision vs OCR: When to Use Each

**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)

**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships

### Context Hints for Vision Analysis

- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages

**Customizing Prompts:**
Prompts are stored as editable text files in `meetus/prompts/`:
- `meeting.txt` - General meeting analysis
- `code.txt` - Code screenshot analysis
- `dashboard.txt` - Dashboard/monitoring analysis
- `console.txt` - Terminal/console analysis

Just edit these files to customize how the vision model analyzes your frames!

### Scene Detection vs Interval

- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
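
The two extraction modes above can be sketched as FFmpeg invocations. This is a minimal illustration, not the tool's actual implementation: `build_ffmpeg_args` and its `scene_threshold` parameter are hypothetical names, and the `0.3` threshold used below is an arbitrary example value. It assumes FFmpeg's `fps` filter for interval sampling and its `select` filter with the `scene` score for scene detection.

```python
from pathlib import Path


def build_ffmpeg_args(video, frames_dir, interval=5.0, scene_threshold=None):
    """Return an ffmpeg argument list for one of the two extraction modes.

    Hypothetical helper for illustration; not part of process_meeting.py.
    """
    pattern = str(Path(frames_dir) / "frame_%05d.jpg")

    if scene_threshold is not None:
        # Scene detection: keep only frames whose scene-change score
        # (0.0 to 1.0) exceeds the threshold, so static slides yield few frames.
        video_filter = f"select='gt(scene,{scene_threshold})'"
        # "-vsync vfr" keeps ffmpeg from duplicating frames to fill the gaps
        # (newer ffmpeg releases prefer "-fps_mode vfr").
        return ["ffmpeg", "-i", str(video), "-vf", video_filter,
                "-vsync", "vfr", pattern]

    # Interval extraction: one frame every `interval` seconds,
    # regardless of whether the screen changed.
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={1 / interval}", pattern]


if __name__ == "__main__":
    print(build_ffmpeg_args("samples/meeting.mkv", "frames", interval=10))
    print(build_ffmpeg_args("samples/meeting.mkv", "frames", scene_threshold=0.3))
```

Passing the returned list to `subprocess.run(args, check=True)` would run the extraction. The `frame_%05d.jpg` pattern does not embed timestamps the way the `frame_00001_5.00s.jpg` names shown under Output Files do; that would need a separate renaming step.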

### Vision Model Selection

- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths

### Deduplication

- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly

## Troubleshooting

### Vision Model Issues

**"ollama package not installed"**
```bash
pip install ollama
```

**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```

**Vision analysis is slow**
- Use a lighter model: `--vision-model llava:7b`
- Reduce frame count: `--scene-detection` or `--interval 10`
- Check if Ollama is using GPU (much faster)

**Poor vision analysis results**
- Try a different context hint: `--vision-context code` or `--vision-context dashboard`
- Use a larger model: `--vision-model llava:13b`
- Ensure frames are clear (check video resolution)

### OCR Issues

**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr  # Don't forget system package!
```

**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try a different OCR engine: `--ocr-engine easyocr`
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames

### General Issues

**"No frames extracted"**
- Check that the video file is valid: `ffmpeg -i video.mkv`
- Try a lower interval: `--interval 3`
- Check disk space in frames directory

**Scene detection not working**
- The tool falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try manual interval: `--interval 5`

**Cache not being used**
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use `--verbose` to see what's being cached/loaded

## Project Structure

```
meetus/
├── meetus/                        # Main package
│   ├── __init__.py
│   ├── workflow.py                # Processing orchestrator
│   ├── output_manager.py          # Output directory & manifest management
│   ├── cache_manager.py           # Caching logic
│   ├── frame_extractor.py         # Video frame extraction
│   ├── vision_processor.py        # Vision model analysis (Ollama/LLaVA)
│   ├── ocr_processor.py           # OCR processing
│   ├── transcript_merger.py       # Transcript merging
│   └── prompts/                   # Vision analysis prompts (editable!)
│       ├── meeting.txt            # General meeting analysis
│       ├── code.txt               # Code screenshot analysis
│       ├── dashboard.txt          # Dashboard/monitoring analysis
│       └── console.txt            # Terminal/console analysis
├── process_meeting.py             # Main CLI script (thin wrapper)
├── requirements.txt               # Python dependencies
├── output/                        # Timestamped output directories
│   ├── .gitkeep
│   └── YYYYMMDD_HHMMSS-video/     # Auto-generated per video
├── samples/                       # Sample videos (gitignored)
└── README.md                      # This file
```

The code is modular and easy to extend - each module has a single responsibility.

## License

For personal use.