From 331cccb15f33e2c5d058f9a02d6a89ba0a0d6f0d Mon Sep 17 00:00:00 2001 From: Mariano Gabriel Date: Thu, 4 Dec 2025 20:15:16 -0300 Subject: [PATCH] updated readme --- README.md | 428 ++++++++++++++++++++---------------------------------- 1 file changed, 161 insertions(+), 267 deletions(-) diff --git a/README.md b/README.md index 4ce3915..c1570c9 100644 --- a/README.md +++ b/README.md @@ -1,34 +1,21 @@ # Meeting Processor -Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization. +Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization. ## Overview This tool enhances meeting transcripts by combining: -- **Audio transcription** (from Whisper) -- **Screen content analysis** (Vision models or OCR) +- **Audio transcription** (Whisper or WhisperX with speaker diarization) +- **Screen content extraction** via FFmpeg scene detection +- **Frame embedding** for direct LLM analysis -### Vision Analysis vs OCR - -- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles -- **OCR**: Traditional text extraction - faster but less context-aware - -The result is a rich, timestamped transcript that provides full context for AI summarization. +The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization. ## Installation ### 1. 
System Dependencies -**Ollama** (required for vision analysis): -```bash -# Install from https://ollama.ai/download -# Then pull a vision model: -ollama pull llava:13b -# or for lighter model: -ollama pull llava:7b -``` - -**FFmpeg** (for scene detection): +**FFmpeg** (required for scene detection and frame extraction): ```bash # Ubuntu/Debian sudo apt-get install ffmpeg @@ -37,149 +24,119 @@ sudo apt-get install ffmpeg brew install ffmpeg ``` -**Tesseract OCR** (optional, if not using vision): -```bash -# Ubuntu/Debian -sudo apt-get install tesseract-ocr - -# macOS -brew install tesseract - -# Arch Linux -sudo pacman -S tesseract -``` - ### 2. Python Dependencies ```bash pip install -r requirements.txt ``` -### 3. Whisper (for audio transcription) +### 3. Whisper or WhisperX (for audio transcription) +**Standard Whisper:** ```bash pip install openai-whisper ``` -### 4. Optional: Install Alternative OCR Engines - -If you prefer OCR over vision analysis: +**WhisperX** (recommended - includes speaker diarization): ```bash -# EasyOCR (better for rotated/handwritten text) -pip install easyocr - -# PaddleOCR (better for code/terminal screens) -pip install paddleocr +pip install whisperx ``` +For speaker diarization, you'll need a HuggingFace token with access to pyannote models. + ## Quick Start -### Recommended: Vision Analysis (Best for Code/Dashboards) +### Recommended: Embed Frames with Scene Detection ```bash -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection ``` This will: 1. Run Whisper transcription (audio → text) -2. Extract frames every 5 seconds -3. Use LLaVA vision model to analyze frames with context -4. Merge audio + screen content -5. Save everything to `output/` folder +2. Extract frames at scene changes (smarter than fixed intervals) +3. Embed frame references in the transcript for LLM analysis +4. 
Save everything to `output/` folder + +### With Speaker Diarization (WhisperX) + +```bash +python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection +``` + +This uses WhisperX to identify different speakers in the transcript. ### Re-run with Cached Results Already ran it once? Re-run instantly using cached results: ```bash -# Uses cached transcript, frames, and analysis -python process_meeting.py samples/meeting.mkv --use-vision +# Uses cached transcript and frames +python process_meeting.py samples/meeting.mkv --embed-images -# Force reprocessing -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache -``` +# Skip only specific cached items +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis -### Traditional OCR (Faster, Less Context-Aware) - -```bash -python process_meeting.py samples/meeting.mkv --run-whisper +# Force complete reprocessing +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache ``` ## Usage Examples -### Vision Analysis with Context Hints +### Scene Detection Options ```bash -# For code-heavy meetings -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code +# Default scene detection (threshold: 15) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# For dashboard/monitoring meetings (Grafana, GCP, etc.) 
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard +# More sensitive (more frames captured, threshold: 5) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5 -# For console/terminal sessions -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console +# Less sensitive (fewer frames, threshold: 30) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30 ``` -### Different Vision Models -```bash -# Lighter/faster model (7B parameters) -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b - -# Default model (13B parameters, better quality) -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b - -# Alternative models -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava -``` - -### Extract frames at different intervals +### Fixed Interval Extraction (alternative to scene detection) ```bash # Every 10 seconds -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10 +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10 # Every 3 seconds (more detailed) -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3 +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3 ``` -### Use scene detection (smarter, fewer frames) +### Frame Quality Options ```bash -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection -``` +# Default quality (80) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -### Traditional OCR (if you prefer) -```bash -# Tesseract (default) -python process_meeting.py samples/meeting.mkv --run-whisper - -# EasyOCR 
-python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr - -# PaddleOCR -python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr +# Lower quality for smaller files (60) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60 ``` ### Caching Examples ```bash # First run - processes everything -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# Second run - uses cached transcript and frames, only re-merges -python process_meeting.py samples/meeting.mkv +# Iterate on scene threshold (reuse whisper transcript) +python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -# Switch from OCR to vision using existing frames -python process_meeting.py samples/meeting.mkv --use-vision +# Re-run whisper only +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper # Force complete reprocessing -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache ``` ### Custom output location ```bash -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/ +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/ ``` ### Enable verbose logging ```bash -# Show detailed debug information -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose ``` ## Output Files @@ -191,87 +148,51 @@ output/ └── 20241019_143022-meeting/ ├── manifest.json # Processing configuration ├── meeting_enhanced.txt # Enhanced transcript for AI - ├── meeting.json 
# Whisper transcript - ├── meeting_vision.json # Vision analysis results + ├── meeting.json # Whisper/WhisperX transcript └── frames/ # Extracted video frames ├── frame_00001_5.00s.jpg ├── frame_00002_10.00s.jpg └── ... ``` -### Manifest File - -Each processing run creates a `manifest.json` that tracks: -- Video information (name, path) -- Processing timestamp -- Configuration used (Whisper model, vision settings, etc.) -- Output file locations - -Example manifest: -```json -{ - "video": { - "name": "meeting.mkv", - "path": "/full/path/to/meeting.mkv" - }, - "processed_at": "2024-10-19T14:30:22", - "configuration": { - "whisper": {"enabled": true, "model": "base"}, - "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"} - } -} -``` - ### Caching Behavior The tool automatically reuses the most recent output directory for the same video: - **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`) - **Subsequent runs**: Reuses the same directory and cached results - **Cached items**: Whisper transcript, extracted frames, analysis results -- **Force new run**: Use `--no-cache` to create a fresh directory -This means you can instantly switch between OCR and vision analysis without re-extracting frames! +**Fine-grained cache control:** +- `--no-cache`: Force complete reprocessing +- `--skip-cache-frames`: Re-extract frames only +- `--skip-cache-whisper`: Re-run transcription only +- `--skip-cache-analysis`: Re-run analysis only + +This allows you to iterate on scene detection thresholds without re-running Whisper! ## Workflow for Meeting Analysis ### Complete Workflow (One Command!) 
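Before the one-command run, it can help to confirm the prerequisites are in place. A minimal sanity check (a sketch; the package names follow the install steps above, and `whisperx` is only needed for `--diarize`):

```shell
# Check the tools this workflow relies on; prints ok/MISSING per dependency
command -v ffmpeg >/dev/null 2>&1 && echo "ffmpeg: ok" || echo "ffmpeg: MISSING"
python -c "import whisper" >/dev/null 2>&1 && echo "whisper: ok" || echo "whisper: missing (pip install openai-whisper)"
python -c "import whisperx" >/dev/null 2>&1 && echo "whisperx: ok" || echo "whisperx: optional (pip install whisperx)"
```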
```bash -# Process everything in one step with vision analysis -python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection +# Process everything in one step with scene detection +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# Output will be in output/alo-intro1_enhanced.txt +# With speaker diarization +python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection ``` ### Typical Iterative Workflow ```bash # First run - full processing -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# Review results, then re-run with different context if needed -python process_meeting.py samples/meeting.mkv --use-vision --vision-context code +# Adjust scene threshold (keeps cached whisper transcript) +python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis -# Or switch to a different vision model -python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b - -# All use cached frames and transcript! -``` - -### Traditional Workflow (Separate Steps) - -```bash -# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper) -whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output - -# 2. Process video to extract screen content with vision -python process_meeting.py samples/alo-intro1.mkv \ - --transcript output/alo-intro1.json \ - --use-vision \ - --scene-detection - -# 3. 
Use the enhanced transcript with AI -# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM +# Try different frame quality +python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis ``` ### Example Prompt for Claude @@ -291,73 +212,59 @@ Please summarize this meeting transcript. Pay special attention to: ``` usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper] [--whisper-model {tiny,base,small,medium,large}] - [--output OUTPUT] [--output-dir OUTPUT_DIR] - [--frames-dir FRAMES_DIR] [--interval INTERVAL] - [--scene-detection] - [--ocr-engine {tesseract,easyocr,paddleocr}] - [--no-deduplicate] [--extract-only] - [--format {detailed,compact}] [--verbose] - video + [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR] + [--interval INTERVAL] [--scene-detection] + [--scene-threshold SCENE_THRESHOLD] + [--embed-images] [--embed-quality EMBED_QUALITY] + [--no-cache] [--skip-cache-frames] [--skip-cache-whisper] + [--skip-cache-analysis] [--no-deduplicate] + [--extract-only] [--format {detailed,compact}] + [--verbose] video -Options: - video Path to video file - --transcript, -t Path to Whisper transcript (JSON or TXT) - --run-whisper Run Whisper transcription before processing - --whisper-model Whisper model: tiny, base, small, medium, large (default: base) - --output, -o Output file for enhanced transcript - --output-dir Directory for output files (default: output/) - --frames-dir Directory to save extracted frames (default: frames/) - --interval Extract frame every N seconds (default: 5) - --scene-detection Use scene detection instead of interval extraction - --ocr-engine OCR engine: tesseract, easyocr, paddleocr (default: tesseract) - --no-deduplicate Disable text deduplication - --extract-only Only extract frames and OCR, skip transcript merging - --format Output format: detailed or compact (default: detailed) - --verbose, -v Enable verbose logging 
(DEBUG level) +Main Options: + video Path to video file + --run-whisper Run Whisper transcription before processing + --whisper-model Whisper model: tiny, base, small, medium, large (default: medium) + --diarize Use WhisperX with speaker diarization + --embed-images Embed frame references for LLM analysis (recommended) + --embed-quality JPEG quality for frames (default: 80) + +Frame Extraction: + --scene-detection Use FFmpeg scene detection (recommended) + --scene-threshold Detection sensitivity 0-100 (default: 15, lower=more sensitive) + --interval Extract frame every N seconds (alternative to scene detection) + +Caching: + --no-cache Force complete reprocessing + --skip-cache-frames Re-extract frames only + --skip-cache-whisper Re-run transcription only + --skip-cache-analysis Re-run analysis only + +Other: + --transcript, -t Path to existing Whisper transcript (JSON or TXT) + --output, -o Output file for enhanced transcript + --output-dir Directory for output files (default: output/) + --verbose, -v Enable verbose logging ``` ## Tips for Best Results -### Vision vs OCR: When to Use Each - -**Use Vision Models (`--use-vision`) when:** -- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools) -- ✅ Code walkthroughs or debugging sessions -- ✅ Complex layouts with mixed content -- ✅ Need contextual understanding, not just text extraction -- ✅ Working with charts, graphs, or visualizations -- ⚠️ Trade-off: Slower (requires GPU/CPU for local model) - -**Use OCR when:** -- ✅ Simple text extraction from slides or documents -- ✅ Need maximum speed -- ✅ Limited computational resources -- ✅ Presentations with mostly text -- ⚠️ Trade-off: Less context-aware, may miss visual relationships - -### Context Hints for Vision Analysis -- **`--vision-context meeting`**: General purpose (default) -- **`--vision-context code`**: Optimized for code screenshots, preserves formatting -- **`--vision-context dashboard`**: Extracts metrics, trends, panel names -- 
**`--vision-context console`**: Captures commands, output, error messages - -**Customizing Prompts:** -Prompts are stored as editable text files in `meetus/prompts/`: -- `meeting.txt` - General meeting analysis -- `code.txt` - Code screenshot analysis -- `dashboard.txt` - Dashboard/monitoring analysis -- `console.txt` - Terminal/console analysis - -Just edit these files to customize how the vision model analyzes your frames! - ### Scene Detection vs Interval -- **Scene detection**: Better for presentations with distinct slides. More efficient. -- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough. +- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient. +- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds. -### Vision Model Selection -- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality -- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default) -- **`bakllava`**: Alternative with different strengths +### Scene Detection Threshold +- Lower values (5-10): More sensitive, captures more frames +- Default (15): Good balance for most meetings +- Higher values (20-30): Less sensitive, fewer frames + +### Whisper vs WhisperX +- **Whisper** (`--run-whisper`): Standard transcription, fast +- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token + +### Frame Quality +- Default quality (80) works well for most cases +- Use `--embed-quality 60` for smaller files if storage is a concern ### Deduplication - Enabled by default - removes similar consecutive frames @@ -365,61 +272,56 @@ Just edit these files to customize how the vision model analyzes your frames! 
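### Under the Hood: Scene Detection

Scene detection is driven by FFmpeg's `select` filter and its per-frame `scene` change score. A standalone approximation of what `--scene-detection` does (a sketch: it assumes the tool's 0-100 threshold maps onto FFmpeg's 0.0-1.0 scene score, so the default of 15 corresponds to `gt(scene,0.15)`; the tool's actual filter chain may differ):

```shell
# Hypothetical equivalent of: --scene-detection --scene-threshold 15
# (assumption: threshold/100 becomes the FFmpeg scene-score cutoff)
mkdir -p frames
ffmpeg -i samples/meeting.mkv \
  -vf "select='gt(scene,0.15)'" \
  -vsync vfr -q:v 2 frames/frame_%05d.jpg
```

Lowering the cutoff (e.g. `gt(scene,0.05)`) keeps more frames, matching the more sensitive `--scene-threshold 5` behaviour described above.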
## Troubleshooting -### Vision Model Issues - -**"ollama package not installed"** -```bash -pip install ollama -``` - -**"Ollama not found" or connection errors** -```bash -# Install Ollama first: https://ollama.ai/download -# Then pull a vision model: -ollama pull llava:13b -``` - -**Vision analysis is slow** -- Use lighter model: `--vision-model llava:7b` -- Reduce frame count: `--scene-detection` or `--interval 10` -- Check if Ollama is using GPU (much faster) - -**Poor vision analysis results** -- Try different context hint: `--vision-context code` or `--vision-context dashboard` -- Use larger model: `--vision-model llava:13b` -- Ensure frames are clear (check video resolution) - -### OCR Issues - -**"pytesseract not installed"** -```bash -pip install pytesseract -sudo apt-get install tesseract-ocr # Don't forget system package! -``` - -**Poor OCR quality** -- **Solution**: Switch to vision analysis with `--use-vision` -- Or try different OCR engine: `--ocr-engine easyocr` -- Check if video resolution is sufficient -- Use `--no-deduplicate` to keep more frames - -### General Issues +### Frame Extraction Issues **"No frames extracted"** - Check video file is valid: `ffmpeg -i video.mkv` -- Try lower interval: `--interval 3` -- Check disk space in frames directory +- Try lower scene threshold: `--scene-threshold 5` +- Try interval extraction: `--interval 3` +- Check disk space in output directory **Scene detection not working** -- Fallback to interval extraction automatically - Ensure FFmpeg is installed +- Falls back to interval extraction automatically - Try manual interval: `--interval 5` +### Whisper/WhisperX Issues + +**WhisperX diarization not working** +- Ensure you have a HuggingFace token set +- Token needs access to pyannote models +- Fall back to standard Whisper without `--diarize` + +### Cache Issues + **Cache not being used** - Ensure you're using the same video filename - Check that output directory contains cached files - Use `--verbose` to see 
what's being cached/loaded +**Want to re-run specific steps** +- `--skip-cache-frames`: Re-extract frames +- `--skip-cache-whisper`: Re-run transcription +- `--skip-cache-analysis`: Re-run analysis +- `--no-cache`: Force complete reprocessing + +## Experimental Features + +### OCR and Vision Analysis + +OCR (`--ocr-engine`) and Vision analysis (`--use-vision`) options are available but experimental. The recommended approach is to use `--embed-images` which embeds frame references directly in the transcript, letting your LLM analyze the images. + +```bash +# Experimental: OCR extraction +python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract + +# Experimental: Vision model analysis +python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b + +# Experimental: Hybrid OpenCV + OCR +python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid +``` + ## Project Structure ``` @@ -429,26 +331,18 @@ meetus/ │ ├── workflow.py # Processing orchestrator │ ├── output_manager.py # Output directory & manifest management │ ├── cache_manager.py # Caching logic -│ ├── frame_extractor.py # Video frame extraction -│ ├── vision_processor.py # Vision model analysis (Ollama/LLaVA) -│ ├── ocr_processor.py # OCR processing -│ ├── transcript_merger.py # Transcript merging -│ └── prompts/ # Vision analysis prompts (editable!) 
-│ ├── meeting.txt # General meeting analysis -│ ├── code.txt # Code screenshot analysis -│ ├── dashboard.txt # Dashboard/monitoring analysis -│ └── console.txt # Terminal/console analysis -├── process_meeting.py # Main CLI script (thin wrapper) +│ ├── frame_extractor.py # Video frame extraction (FFmpeg scene detection) +│ ├── vision_processor.py # Vision model analysis (experimental) +│ ├── ocr_processor.py # OCR processing (experimental) +│ └── transcript_merger.py # Transcript merging +├── process_meeting.py # Main CLI script ├── requirements.txt # Python dependencies ├── output/ # Timestamped output directories -│ ├── .gitkeep │ └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video ├── samples/ # Sample videos (gitignored) └── README.md # This file ``` -The code is modular and easy to extend - each module has a single responsibility. - ## License For personal use.