updated readme

2025-12-04 20:15:16 -03:00
parent 7d7ec15ff7
commit 331cccb15f
1 changed file with 161 additions and 267 deletions
--- a/README.md
+++ b/README.md
@@ -1,34 +1,21 @@
 # Meeting Processor
-Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
+Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.
 ## Overview
 This tool enhances meeting transcripts by combining:
-- **Audio transcription** (from Whisper)
+- **Audio transcription** (Whisper or WhisperX with speaker diarization)
-- **Screen content analysis** (Vision models or OCR)
+- **Screen content extraction** via FFmpeg scene detection
 - **Frame embedding** for direct LLM analysis
-### Vision Analysis vs OCR
+The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.
 - **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
 - **OCR**: Traditional text extraction - faster but less context-aware
 The result is a rich, timestamped transcript that provides full context for AI summarization.
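A minimal sketch of how audio and screen content could be interleaved by timestamp. It assumes transcript segments are `(start_seconds, text)` pairs and frames are `(capture_seconds, path)` pairs; the function name `merge_timeline` and the `[FRAME: ...]` marker are illustrative, not the tool's actual output format.

```python
from typing import List, Tuple


def merge_timeline(segments: List[Tuple[float, str]],
                   frames: List[Tuple[float, str]]) -> List[str]:
    """Interleave transcript segments and frame references by timestamp."""
    # Tag each event with a rank so frames sort before speech at equal times.
    events = [(t, 1, f"[{t:7.2f}s] {text}") for t, text in segments]
    events += [(t, 0, f"[{t:7.2f}s] [FRAME: {path}]") for t, path in frames]
    events.sort(key=lambda e: (e[0], e[1]))
    return [line for _, _, line in events]


if __name__ == "__main__":
    segments = [(0.0, "Hello everyone"), (6.0, "Look at this dashboard")]
    frames = [(5.0, "frames/frame_00001_5.00s.jpg")]
    for line in merge_timeline(segments, frames):
        print(line)
```

Placing each frame reference just before the speech that follows it keeps the screen context next to the words that discuss it.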
 ## Installation
 ### 1. System Dependencies
-**Ollama** (required for vision analysis):
+**FFmpeg** (required for scene detection and frame extraction):
 ```bash
 # Install from https://ollama.ai/download
 # Then pull a vision model:
 ollama pull llava:13b
 # or for lighter model:
 ollama pull llava:7b
 ```
 **FFmpeg** (for scene detection):
 ```bash
 # Ubuntu/Debian
 sudo apt-get install ffmpeg
@@ -37,149 +24,119 @@ sudo apt-get install ffmpeg
 brew install ffmpeg
 ```
 **Tesseract OCR** (optional, if not using vision):
 ```bash
 # Ubuntu/Debian
 sudo apt-get install tesseract-ocr
 # macOS
 brew install tesseract
 # Arch Linux
 sudo pacman -S tesseract
 ```
 ### 2. Python Dependencies
 ```bash
 pip install -r requirements.txt
 ```
-### 3. Whisper (for audio transcription)
+### 3. Whisper or WhisperX (for audio transcription)
 **Standard Whisper:**
 ```bash
 pip install openai-whisper
 ```
-### 4. Optional: Install Alternative OCR Engines
+**WhisperX** (recommended - includes speaker diarization):
 If you prefer OCR over vision analysis:
 ```bash
-# EasyOCR (better for rotated/handwritten text)
+pip install whisperx
 pip install easyocr
 # PaddleOCR (better for code/terminal screens)
 pip install paddleocr
 ```
 For speaker diarization, you'll need a HuggingFace token with access to pyannote models.
 ## Quick Start
-### Recommended: Vision Analysis (Best for Code/Dashboards)
+### Recommended: Embed Frames with Scene Detection
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
 ```
 This will:
 1. Run Whisper transcription (audio → text)
-2. Extract frames every 5 seconds
+2. Extract frames at scene changes (smarter than fixed intervals)
-3. Use LLaVA vision model to analyze frames with context
+3. Embed frame references in the transcript for LLM analysis
-4. Merge audio + screen content
+4. Save everything to `output/` folder
-5. Save everything to `output/` folder
+
 ### With Speaker Diarization (WhisperX)
 ```bash
 python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
 ```
 This uses WhisperX to identify different speakers in the transcript.
 ### Re-run with Cached Results
 Already ran it once? Re-run instantly using cached results:
 ```bash
-# Uses cached transcript, frames, and analysis
+# Uses cached transcript and frames
-python process_meeting.py samples/meeting.mkv --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images
-# Force reprocessing
+# Skip only specific cached items
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
-```
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
 python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis
-### Traditional OCR (Faster, Less Context-Aware)
+# Force complete reprocessing
-
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
 ```bash
 python process_meeting.py samples/meeting.mkv --run-whisper
 ```
 ## Usage Examples
-### Vision Analysis with Context Hints
+### Scene Detection Options
 ```bash
-# For code-heavy meetings
+# Default scene detection (threshold: 15)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
-# For dashboard/monitoring meetings (Grafana, GCP, etc.)
+# More sensitive (more frames captured, threshold: 5)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5
-# For console/terminal sessions
+# Less sensitive (fewer frames, threshold: 30)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30
 ```
-### Different Vision Models
+### Fixed Interval Extraction (alternative to scene detection)
 ```bash
 # Lighter/faster model (7B parameters)
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
 # Default model (13B parameters, better quality)
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
 # Alternative models
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
 ```
 ### Extract frames at different intervals
 ```bash
 # Every 10 seconds
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10
 # Every 3 seconds (more detailed)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3
 ```
-### Use scene detection (smarter, fewer frames)
+### Frame Quality Options
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
+# Default quality (80)
-```
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
-### Traditional OCR (if you prefer)
+# Lower quality for smaller files (60)
-```bash
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60
 # Tesseract (default)
 python process_meeting.py samples/meeting.mkv --run-whisper
 # EasyOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
 # PaddleOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
 ```
 ### Caching Examples
 ```bash
 # First run - processes everything
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
-# Second run - uses cached transcript and frames, only re-merges
+# Iterate on scene threshold (reuse whisper transcript)
-python process_meeting.py samples/meeting.mkv
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
-# Switch from OCR to vision using existing frames
+# Re-run whisper only
-python process_meeting.py samples/meeting.mkv --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
 # Force complete reprocessing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
 ```
 ### Custom output location
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/
 ```
 ### Enable verbose logging
 ```bash
-# Show detailed debug information
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
 ```
 ## Output Files
@@ -191,87 +148,51 @@ output/
 └── 20241019_143022-meeting/
    ├── manifest.json                    # Processing configuration
    ├── meeting_enhanced.txt             # Enhanced transcript for AI
-    ├── meeting.json                     # Whisper transcript
+    ├── meeting.json                     # Whisper/WhisperX transcript
    ├── meeting_vision.json              # Vision analysis results
    └── frames/                          # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
 ```
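The frame names in the tree above appear to encode a sequence number and a capture time. Assuming the pattern `frame_<index>_<seconds>s.jpg` holds for every frame, a small helper (hypothetical, not part of the tool) could recover both values:

```python
import re
from typing import Optional, Tuple

# Assumed naming pattern, inferred from the example tree only.
FRAME_RE = re.compile(r"^frame_(\d+)_(\d+(?:\.\d+)?)s\.jpg$")


def parse_frame_name(name: str) -> Optional[Tuple[int, float]]:
    """Return (index, seconds) for a frame file name, or None if it does not match."""
    match = FRAME_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))
```

This is handy when sorting frames by time or lining them up with transcript segments.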
 ### Manifest File
 Each processing run creates a `manifest.json` that tracks:
 - Video information (name, path)
 - Processing timestamp
 - Configuration used (Whisper model, vision settings, etc.)
 - Output file locations
 Example manifest:
 ```json
 {
  "video": {
    "name": "meeting.mkv",
    "path": "/full/path/to/meeting.mkv"
  },
  "processed_at": "2024-10-19T14:30:22",
  "configuration": {
    "whisper": {"enabled": true, "model": "base"},
    "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"}
  }
 }
 ```
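As a sketch, a manifest shaped like the example could be summarized as follows. The helper name `summarize_manifest` is hypothetical, and the keys are taken from the example only; real manifests may contain more fields.

```python
import json
from typing import Any, Dict


def summarize_manifest(manifest: Dict[str, Any]) -> str:
    """Build a one-line summary from a manifest dictionary."""
    video = manifest.get("video", {}).get("name", "unknown")
    config = manifest.get("configuration", {})
    whisper = config.get("whisper", {})
    analysis = config.get("analysis", {})
    model = whisper.get("model", "n/a") if whisper.get("enabled") else "disabled"
    return f"{video}: whisper={model}, analysis={analysis.get('method', 'n/a')}"


example = json.loads("""
{
  "video": {"name": "meeting.mkv", "path": "/full/path/to/meeting.mkv"},
  "processed_at": "2024-10-19T14:30:22",
  "configuration": {
    "whisper": {"enabled": true, "model": "base"},
    "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"}
  }
}
""")
print(summarize_manifest(example))  # meeting.mkv: whisper=base, analysis=vision
```

To read a real run, load the file with `json.loads(Path("output/<run>/manifest.json").read_text())` and pass the result to the same function.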
 ### Caching Behavior
 The tool automatically reuses the most recent output directory for the same video:
 - **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
 - **Subsequent runs**: Reuses the same directory and cached results
 - **Cached items**: Whisper transcript, extracted frames, analysis results
 - **Force new run**: Use `--no-cache` to create a fresh directory
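One way the "most recent directory for the same video" lookup could work is sketched below. It assumes run directories are named `YYYYMMDD_HHMMSS-<video stem>` as in the example; `find_latest_run` is a hypothetical helper, not the tool's actual cache manager.

```python
import re
from pathlib import Path
from typing import Optional


def find_latest_run(output_dir: Path, video_path: Path) -> Optional[Path]:
    """Return the newest run directory for this video, or None if there is none."""
    if not output_dir.is_dir():
        return None
    pattern = re.compile(r"\d{8}_\d{6}-" + re.escape(video_path.stem))
    runs = [d for d in output_dir.iterdir()
            if d.is_dir() and pattern.fullmatch(d.name)]
    # Timestamp prefixes in this format sort chronologically as plain strings.
    return max(runs, key=lambda d: d.name, default=None)
```

Matching on the video stem is also why renaming the video file would cause a cache miss.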
-This means you can instantly switch between OCR and vision analysis without re-extracting frames!
+**Fine-grained cache control:**
 - `--no-cache`: Force complete reprocessing
 - `--skip-cache-frames`: Re-extract frames only
 - `--skip-cache-whisper`: Re-run transcription only
 - `--skip-cache-analysis`: Re-run analysis only
 This allows you to iterate on scene detection thresholds without re-running Whisper!
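The cache flags can be read as a mapping from flags to the stages that get recomputed. The sketch below illustrates that reading; the stage names and the function `stages_to_rerun` are assumptions for illustration, and the real tool presumably also runs any stage that has no cached result yet.

```python
from typing import Set

# Assumed stage names, mirroring the three cached items listed above.
STAGES = ("whisper", "frames", "analysis")


def stages_to_rerun(no_cache: bool = False,
                    skip_cache_frames: bool = False,
                    skip_cache_whisper: bool = False,
                    skip_cache_analysis: bool = False) -> Set[str]:
    """Return the set of stages whose cached results should be ignored."""
    if no_cache:
        return set(STAGES)
    rerun = set()
    if skip_cache_whisper:
        rerun.add("whisper")
    if skip_cache_frames:
        rerun.add("frames")
    if skip_cache_analysis:
        rerun.add("analysis")
    return rerun
```

For example, combining `--skip-cache-frames` and `--skip-cache-analysis` leaves the Whisper transcript cached, which matches the threshold-tuning use case.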
 ## Workflow for Meeting Analysis
 ### Complete Workflow (One Command!)
 ```bash
-# Process everything in one step with vision analysis
+# Process everything in one step with scene detection
-python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
-# Output will be in output/alo-intro1_enhanced.txt
+# With speaker diarization
 python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
 ```
 ### Typical Iterative Workflow
 ```bash
 # First run - full processing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
-# Review results, then re-run with different context if needed
+# Adjust scene threshold (keeps cached whisper transcript)
-python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis
-# Or switch to a different vision model
+# Try different frame quality
-python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
+python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis
 # Both commands reuse the cached Whisper transcript; frames are re-extracted
 ```
 ### Traditional Workflow (Separate Steps)
 ```bash
 # 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
 whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
 # 2. Process video to extract screen content with vision
 python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection
 # 3. Use the enhanced transcript with AI
 # Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
 ```
 ### Example Prompt for Claude
@@ -291,73 +212,59 @@ Please summarize this meeting transcript. Pay special attention to:
 ```
 usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
-                          [--output OUTPUT] [--output-dir OUTPUT_DIR]
+                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
-                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
+                          [--interval INTERVAL] [--scene-detection]
-                          [--scene-detection]
+                          [--scene-threshold SCENE_THRESHOLD]
-                          [--ocr-engine {tesseract,easyocr,paddleocr}]
+                          [--embed-images] [--embed-quality EMBED_QUALITY]
-                          [--no-deduplicate] [--extract-only]
+                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
-                          [--format {detailed,compact}] [--verbose]
+                          [--skip-cache-analysis] [--no-deduplicate]
-                          video
+                          [--extract-only] [--format {detailed,compact}]
                          [--verbose] video
-Options:
+Main Options:
  video                   Path to video file
  --transcript, -t      Path to Whisper transcript (JSON or TXT)
  --run-whisper           Run Whisper transcription before processing
-  --whisper-model       Whisper model: tiny, base, small, medium, large (default: base)
+  --whisper-model         Whisper model: tiny, base, small, medium, large (default: medium)
  --diarize               Use WhisperX with speaker diarization
  --embed-images          Embed frame references for LLM analysis (recommended)
  --embed-quality         JPEG quality for frames (default: 80)
 Frame Extraction:
  --scene-detection       Use FFmpeg scene detection (recommended)
  --scene-threshold       Detection sensitivity 0-100 (default: 15, lower=more sensitive)
  --interval              Extract frame every N seconds (alternative to scene detection)
 Caching:
  --no-cache              Force complete reprocessing
  --skip-cache-frames     Re-extract frames only
  --skip-cache-whisper    Re-run transcription only
  --skip-cache-analysis   Re-run analysis only
 Other:
  --transcript, -t        Path to existing Whisper transcript (JSON or TXT)
  --output, -o            Output file for enhanced transcript
  --output-dir            Directory for output files (default: output/)
-  --frames-dir          Directory to save extracted frames (default: frames/)
+  --verbose, -v           Enable verbose logging
  --interval            Extract frame every N seconds (default: 5)
  --scene-detection     Use scene detection instead of interval extraction
  --ocr-engine          OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
  --no-deduplicate      Disable text deduplication
  --extract-only        Only extract frames and OCR, skip transcript merging
  --format              Output format: detailed or compact (default: detailed)
  --verbose, -v         Enable verbose logging (DEBUG level)
 ```
 ## Tips for Best Results
 ### Vision vs OCR: When to Use Each
 **Use Vision Models (`--use-vision`) when:**
 - ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
 - ✅ Code walkthroughs or debugging sessions
 - ✅ Complex layouts with mixed content
 - ✅ Need contextual understanding, not just text extraction
 - ✅ Working with charts, graphs, or visualizations
 - ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
 **Use OCR when:**
 - ✅ Simple text extraction from slides or documents
 - ✅ Need maximum speed
 - ✅ Limited computational resources
 - ✅ Presentations with mostly text
 - ⚠️ Trade-off: Less context-aware, may miss visual relationships
 ### Context Hints for Vision Analysis
 - **`--vision-context meeting`**: General purpose (default)
 - **`--vision-context code`**: Optimized for code screenshots, preserves formatting
 - **`--vision-context dashboard`**: Extracts metrics, trends, panel names
 - **`--vision-context console`**: Captures commands, output, error messages
 **Customizing Prompts:**
 Prompts are stored as editable text files in `meetus/prompts/`:
 - `meeting.txt` - General meeting analysis
 - `code.txt` - Code screenshot analysis
 - `dashboard.txt` - Dashboard/monitoring analysis
 - `console.txt` - Terminal/console analysis
 Just edit these files to customize how the vision model analyzes your frames!
 ### Scene Detection vs Interval
-- **Scene detection**: Better for presentations with distinct slides. More efficient.
+- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
-- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
+- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds.
-### Vision Model Selection
+### Scene Detection Threshold
-- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
+- Lower values (5-10): More sensitive, captures more frames
-- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
+- Default (15): Good balance for most meetings
-- **`bakllava`**: Alternative with different strengths
+- Higher values (20-30): Less sensitive, fewer frames
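To illustrate what the threshold controls, here is a sketch that builds an FFmpeg scene-detection command. FFmpeg's `select` filter scores scene changes between 0 and 1, so this sketch assumes the 0-100 threshold is simply divided by 100; that mapping, the function name, and the output file pattern are assumptions, not the tool's confirmed behavior.

```python
from typing import List


def scene_detect_command(video: str, out_dir: str, threshold: float = 15.0) -> List[str]:
    """Build an ffmpeg argument list that keeps only frames at scene changes."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    score = threshold / 100.0  # assumed mapping onto FFmpeg's 0-1 scene score
    return [
        "ffmpeg", "-i", video,
        # Keep a frame only when its scene-change score exceeds the cutoff.
        "-vf", f"select='gt(scene,{score:.2f})',showinfo",
        # Emit only the selected frames; newer FFmpeg builds prefer -fps_mode vfr.
        "-vsync", "vfr",
        f"{out_dir}/frame_%05d.jpg",
    ]


if __name__ == "__main__":
    print(" ".join(scene_detect_command("samples/meeting.mkv", "frames", 5)))
```

A lower cutoff lets smaller visual changes through, which is why lower thresholds yield more frames.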
 ### Whisper vs WhisperX
 - **Whisper** (`--run-whisper`): Standard transcription, fast
 - **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token
 ### Frame Quality
 - Default quality (80) works well for most cases
 - Use `--embed-quality 60` for smaller files if storage is a concern
 ### Deduplication
 - Enabled by default - removes similar consecutive frames
@@ -365,61 +272,56 @@ Just edit these files to customize how the vision model analyzes your frames!
 ## Troubleshooting
-### Vision Model Issues
+### Frame Extraction Issues
 **"ollama package not installed"**
 ```bash
 pip install ollama
 ```
 **"Ollama not found" or connection errors**
 ```bash
 # Install Ollama first: https://ollama.ai/download
 # Then pull a vision model:
 ollama pull llava:13b
 ```
 **Vision analysis is slow**
 - Use lighter model: `--vision-model llava:7b`
 - Reduce frame count: `--scene-detection` or `--interval 10`
 - Check if Ollama is using GPU (much faster)
 **Poor vision analysis results**
 - Try different context hint: `--vision-context code` or `--vision-context dashboard`
 - Use larger model: `--vision-model llava:13b`
 - Ensure frames are clear (check video resolution)
 ### OCR Issues
 **"pytesseract not installed"**
 ```bash
 pip install pytesseract
 sudo apt-get install tesseract-ocr  # Don't forget system package!
 ```
 **Poor OCR quality**
 - **Solution**: Switch to vision analysis with `--use-vision`
 - Or try different OCR engine: `--ocr-engine easyocr`
 - Check if video resolution is sufficient
 - Use `--no-deduplicate` to keep more frames
 ### General Issues
 **"No frames extracted"**
 - Check video file is valid: `ffmpeg -i video.mkv`
-- Try lower interval: `--interval 3`
+- Try lower scene threshold: `--scene-threshold 5`
-- Check disk space in frames directory
+- Try interval extraction: `--interval 3`
 - Check disk space in output directory
 **Scene detection not working**
 - Fallback to interval extraction automatically
 - Ensure FFmpeg is installed
 - Falls back to interval extraction automatically
 - Try manual interval: `--interval 5`
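A quick way to confirm FFmpeg is on the `PATH` before a run is a small preflight check. The helper name `missing_tools` is hypothetical; `shutil.which` is from the Python standard library.

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["ffmpeg"])
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("FFmpeg found.")
```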
 ### Whisper/WhisperX Issues
 **WhisperX diarization not working**
 - Ensure you have a HuggingFace token set
 - Token needs access to pyannote models
 - Fall back to standard Whisper without `--diarize`
 ### Cache Issues
 **Cache not being used**
 - Ensure you're using the same video filename
 - Check that output directory contains cached files
 - Use `--verbose` to see what's being cached/loaded
 **Want to re-run specific steps**
 - `--skip-cache-frames`: Re-extract frames
 - `--skip-cache-whisper`: Re-run transcription
 - `--skip-cache-analysis`: Re-run analysis
 - `--no-cache`: Force complete reprocessing
 ## Experimental Features
 ### OCR and Vision Analysis
 The OCR (`--ocr-engine`) and vision analysis (`--use-vision`) options are available but experimental. The recommended approach is `--embed-images`, which embeds frame references directly in the transcript and lets your LLM analyze the images.
 ```bash
 # Experimental: OCR extraction
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
 # Experimental: Vision model analysis
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
 # Experimental: Hybrid OpenCV + OCR
 python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
 ```
 ## Project Structure
 ```
@@ -429,26 +331,18 @@ meetus/
 │   ├── workflow.py             # Processing orchestrator
 │   ├── output_manager.py       # Output directory & manifest management
 │   ├── cache_manager.py        # Caching logic
-│   ├── frame_extractor.py      # Video frame extraction
+│   ├── frame_extractor.py      # Video frame extraction (FFmpeg scene detection)
-│   ├── vision_processor.py     # Vision model analysis (Ollama/LLaVA)
+│   ├── vision_processor.py     # Vision model analysis (experimental)
-│   ├── ocr_processor.py        # OCR processing
+│   ├── ocr_processor.py        # OCR processing (experimental)
-│   ├── transcript_merger.py    # Transcript merging
+│   └── transcript_merger.py    # Transcript merging
-│   └── prompts/                # Vision analysis prompts (editable!)
+├── process_meeting.py          # Main CLI script
 │       ├── meeting.txt         # General meeting analysis
 │       ├── code.txt            # Code screenshot analysis
 │       ├── dashboard.txt       # Dashboard/monitoring analysis
 │       └── console.txt         # Terminal/console analysis
 ├── process_meeting.py          # Main CLI script (thin wrapper)
 ├── requirements.txt            # Python dependencies
 ├── output/                     # Timestamped output directories
 │   ├── .gitkeep
 │   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
 ├── samples/                    # Sample videos (gitignored)
 └── README.md                   # This file
 ```
 The code is modular and easy to extend - each module has a single responsibility.
 ## License
 For personal use.