From 331cccb15f33e2c5d058f9a02d6a89ba0a0d6f0d Mon Sep 17 00:00:00 2001 From: Mariano Gabriel Date: Thu, 4 Dec 2025 20:15:16 -0300 Subject: [PATCH] updated readme --- README.md | 428 ++++++++++++++++++++---------------------------------- 1 file changed, 161 insertions(+), 267 deletions(-) diff --git a/README.md b/README.md index 4ce3915..c1570c9 100644 --- a/README.md +++ b/README.md @@ -1,34 +1,21 @@ # Meeting Processor -Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization. +Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization. ## Overview This tool enhances meeting transcripts by combining: -- **Audio transcription** (from Whisper) -- **Screen content analysis** (Vision models or OCR) +- **Audio transcription** (Whisper or WhisperX with speaker diarization) +- **Screen content extraction** via FFmpeg scene detection +- **Frame embedding** for direct LLM analysis -### Vision Analysis vs OCR - -- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles -- **OCR**: Traditional text extraction - faster but less context-aware - -The result is a rich, timestamped transcript that provides full context for AI summarization. +The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization. ## Installation ### 1. 
System Dependencies -**Ollama** (required for vision analysis): -```bash -# Install from https://ollama.ai/download -# Then pull a vision model: -ollama pull llava:13b -# or for lighter model: -ollama pull llava:7b -``` - -**FFmpeg** (for scene detection): +**FFmpeg** (required for scene detection and frame extraction): ```bash # Ubuntu/Debian sudo apt-get install ffmpeg @@ -37,149 +24,119 @@ sudo apt-get install ffmpeg brew install ffmpeg ``` -**Tesseract OCR** (optional, if not using vision): -```bash -# Ubuntu/Debian -sudo apt-get install tesseract-ocr - -# macOS -brew install tesseract - -# Arch Linux -sudo pacman -S tesseract -``` - ### 2. Python Dependencies ```bash pip install -r requirements.txt ``` -### 3. Whisper (for audio transcription) +### 3. Whisper or WhisperX (for audio transcription) +**Standard Whisper:** ```bash pip install openai-whisper ``` -### 4. Optional: Install Alternative OCR Engines - -If you prefer OCR over vision analysis: +**WhisperX** (recommended - includes speaker diarization): ```bash -# EasyOCR (better for rotated/handwritten text) -pip install easyocr - -# PaddleOCR (better for code/terminal screens) -pip install paddleocr +pip install whisperx ``` +For speaker diarization, you'll need a HuggingFace token with access to pyannote models. + ## Quick Start -### Recommended: Vision Analysis (Best for Code/Dashboards) +### Recommended: Embed Frames with Scene Detection ```bash -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection ``` This will: 1. Run Whisper transcription (audio → text) -2. Extract frames every 5 seconds -3. Use LLaVA vision model to analyze frames with context -4. Merge audio + screen content -5. Save everything to `output/` folder +2. Extract frames at scene changes (smarter than fixed intervals) +3. Embed frame references in the transcript for LLM analysis +4. 
Save everything to `output/` folder + +### With Speaker Diarization (WhisperX) + +```bash +python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection +``` + +This uses WhisperX to identify different speakers in the transcript. ### Re-run with Cached Results Already ran it once? Re-run instantly using cached results: ```bash -# Uses cached transcript, frames, and analysis -python process_meeting.py samples/meeting.mkv --use-vision +# Uses cached transcript and frames +python process_meeting.py samples/meeting.mkv --embed-images -# Force reprocessing -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache -``` +# Skip only specific cached items +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis -### Traditional OCR (Faster, Less Context-Aware) - -```bash -python process_meeting.py samples/meeting.mkv --run-whisper +# Force complete reprocessing +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache ``` ## Usage Examples -### Vision Analysis with Context Hints +### Scene Detection Options ```bash -# For code-heavy meetings -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code +# Default scene detection (threshold: 15) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# For dashboard/monitoring meetings (Grafana, GCP, etc.) 
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard +# More sensitive (more frames captured, threshold: 5) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5 -# For console/terminal sessions -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console +# Less sensitive (fewer frames, threshold: 30) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30 ``` -### Different Vision Models -```bash -# Lighter/faster model (7B parameters) -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b - -# Default model (13B parameters, better quality) -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b - -# Alternative models -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava -``` - -### Extract frames at different intervals +### Fixed Interval Extraction (alternative to scene detection) ```bash # Every 10 seconds -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10 +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10 # Every 3 seconds (more detailed) -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3 +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3 ``` -### Use scene detection (smarter, fewer frames) +### Frame Quality Options ```bash -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection -``` +# Default quality (80) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -### Traditional OCR (if you prefer) -```bash -# Tesseract (default) -python process_meeting.py samples/meeting.mkv --run-whisper - -# EasyOCR 
-python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr - -# PaddleOCR -python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr +# Lower quality for smaller files (60) +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60 ``` ### Caching Examples ```bash # First run - processes everything -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# Second run - uses cached transcript and frames, only re-merges -python process_meeting.py samples/meeting.mkv +# Iterate on scene threshold (reuse whisper transcript) +python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -# Switch from OCR to vision using existing frames -python process_meeting.py samples/meeting.mkv --use-vision +# Re-run whisper only +python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper # Force complete reprocessing -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache ``` ### Custom output location ```bash -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/ +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/ ``` ### Enable verbose logging ```bash -# Show detailed debug information -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose ``` ## Output Files @@ -191,87 +148,51 @@ output/ └── 20241019_143022-meeting/ ├── manifest.json # Processing configuration ├── meeting_enhanced.txt # Enhanced transcript for AI - ├── meeting.json 
# Whisper transcript - ├── meeting_vision.json # Vision analysis results + ├── meeting.json # Whisper/WhisperX transcript └── frames/ # Extracted video frames ├── frame_00001_5.00s.jpg ├── frame_00002_10.00s.jpg └── ... ``` -### Manifest File - -Each processing run creates a `manifest.json` that tracks: -- Video information (name, path) -- Processing timestamp -- Configuration used (Whisper model, vision settings, etc.) -- Output file locations - -Example manifest: -```json -{ - "video": { - "name": "meeting.mkv", - "path": "/full/path/to/meeting.mkv" - }, - "processed_at": "2024-10-19T14:30:22", - "configuration": { - "whisper": {"enabled": true, "model": "base"}, - "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"} - } -} -``` - ### Caching Behavior The tool automatically reuses the most recent output directory for the same video: - **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`) - **Subsequent runs**: Reuses the same directory and cached results - **Cached items**: Whisper transcript, extracted frames, analysis results -- **Force new run**: Use `--no-cache` to create a fresh directory -This means you can instantly switch between OCR and vision analysis without re-extracting frames! +**Fine-grained cache control:** +- `--no-cache`: Force complete reprocessing +- `--skip-cache-frames`: Re-extract frames only +- `--skip-cache-whisper`: Re-run transcription only +- `--skip-cache-analysis`: Re-run analysis only + +This allows you to iterate on scene detection thresholds without re-running Whisper! ## Workflow for Meeting Analysis ### Complete Workflow (One Command!) 
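Before the one-command run, it can help to confirm the prerequisites are in place. A minimal sanity check (a sketch; the package names follow the install steps above, and `whisperx` is only needed for `--diarize`):

```shell
# Check the tools this workflow relies on; prints ok/MISSING per dependency
command -v ffmpeg >/dev/null 2>&1 && echo "ffmpeg: ok" || echo "ffmpeg: MISSING"
python -c "import whisper" >/dev/null 2>&1 && echo "whisper: ok" || echo "whisper: missing (pip install openai-whisper)"
python -c "import whisperx" >/dev/null 2>&1 && echo "whisperx: ok" || echo "whisperx: optional (pip install whisperx)"
```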
```bash -# Process everything in one step with vision analysis -python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection +# Process everything in one step with scene detection +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# Output will be in output/alo-intro1_enhanced.txt +# With speaker diarization +python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection ``` ### Typical Iterative Workflow ```bash # First run - full processing -python process_meeting.py samples/meeting.mkv --run-whisper --use-vision +python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -# Review results, then re-run with different context if needed -python process_meeting.py samples/meeting.mkv --use-vision --vision-context code +# Adjust scene threshold (keeps cached whisper transcript) +python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis -# Or switch to a different vision model -python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b - -# All use cached frames and transcript! -``` - -### Traditional Workflow (Separate Steps) - -```bash -# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper) -whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output - -# 2. Process video to extract screen content with vision -python process_meeting.py samples/alo-intro1.mkv \ - --transcript output/alo-intro1.json \ - --use-vision \ - --scene-detection - -# 3. 
Use the enhanced transcript with AI -# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM +# Try different frame quality +python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis ``` ### Example Prompt for Claude @@ -291,73 +212,59 @@ Please summarize this meeting transcript. Pay special attention to: ``` usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper] [--whisper-model {tiny,base,small,medium,large}] - [--output OUTPUT] [--output-dir OUTPUT_DIR] - [--frames-dir FRAMES_DIR] [--interval INTERVAL] - [--scene-detection] - [--ocr-engine {tesseract,easyocr,paddleocr}] - [--no-deduplicate] [--extract-only] - [--format {detailed,compact}] [--verbose] - video + [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR] + [--interval INTERVAL] [--scene-detection] + [--scene-threshold SCENE_THRESHOLD] + [--embed-images] [--embed-quality EMBED_QUALITY] + [--no-cache] [--skip-cache-frames] [--skip-cache-whisper] + [--skip-cache-analysis] [--no-deduplicate] + [--extract-only] [--format {detailed,compact}] + [--verbose] video -Options: - video Path to video file - --transcript, -t Path to Whisper transcript (JSON or TXT) - --run-whisper Run Whisper transcription before processing - --whisper-model Whisper model: tiny, base, small, medium, large (default: base) - --output, -o Output file for enhanced transcript - --output-dir Directory for output files (default: output/) - --frames-dir Directory to save extracted frames (default: frames/) - --interval Extract frame every N seconds (default: 5) - --scene-detection Use scene detection instead of interval extraction - --ocr-engine OCR engine: tesseract, easyocr, paddleocr (default: tesseract) - --no-deduplicate Disable text deduplication - --extract-only Only extract frames and OCR, skip transcript merging - --format Output format: detailed or compact (default: detailed) - --verbose, -v Enable verbose logging 
(DEBUG level) +Main Options: + video Path to video file + --run-whisper Run Whisper transcription before processing + --whisper-model Whisper model: tiny, base, small, medium, large (default: medium) + --diarize Use WhisperX with speaker diarization + --embed-images Embed frame references for LLM analysis (recommended) + --embed-quality JPEG quality for frames (default: 80) + +Frame Extraction: + --scene-detection Use FFmpeg scene detection (recommended) + --scene-threshold Detection sensitivity 0-100 (default: 15, lower=more sensitive) + --interval Extract frame every N seconds (alternative to scene detection) + +Caching: + --no-cache Force complete reprocessing + --skip-cache-frames Re-extract frames only + --skip-cache-whisper Re-run transcription only + --skip-cache-analysis Re-run analysis only + +Other: + --transcript, -t Path to existing Whisper transcript (JSON or TXT) + --output, -o Output file for enhanced transcript + --output-dir Directory for output files (default: output/) + --verbose, -v Enable verbose logging ``` ## Tips for Best Results -### Vision vs OCR: When to Use Each - -**Use Vision Models (`--use-vision`) when:** -- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools) -- ✅ Code walkthroughs or debugging sessions -- ✅ Complex layouts with mixed content -- ✅ Need contextual understanding, not just text extraction -- ✅ Working with charts, graphs, or visualizations -- ⚠️ Trade-off: Slower (requires GPU/CPU for local model) - -**Use OCR when:** -- ✅ Simple text extraction from slides or documents -- ✅ Need maximum speed -- ✅ Limited computational resources -- ✅ Presentations with mostly text -- ⚠️ Trade-off: Less context-aware, may miss visual relationships - -### Context Hints for Vision Analysis -- **`--vision-context meeting`**: General purpose (default) -- **`--vision-context code`**: Optimized for code screenshots, preserves formatting -- **`--vision-context dashboard`**: Extracts metrics, trends, panel names -- 
**`--vision-context console`**: Captures commands, output, error messages - -**Customizing Prompts:** -Prompts are stored as editable text files in `meetus/prompts/`: -- `meeting.txt` - General meeting analysis -- `code.txt` - Code screenshot analysis -- `dashboard.txt` - Dashboard/monitoring analysis -- `console.txt` - Terminal/console analysis - -Just edit these files to customize how the vision model analyzes your frames! - ### Scene Detection vs Interval -- **Scene detection**: Better for presentations with distinct slides. More efficient. -- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough. +- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient. +- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds. -### Vision Model Selection -- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality -- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default) -- **`bakllava`**: Alternative with different strengths +### Scene Detection Threshold +- Lower values (5-10): More sensitive, captures more frames +- Default (15): Good balance for most meetings +- Higher values (20-30): Less sensitive, fewer frames + +### Whisper vs WhisperX +- **Whisper** (`--run-whisper`): Standard transcription, fast +- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token + +### Frame Quality +- Default quality (80) works well for most cases +- Use `--embed-quality 60` for smaller files if storage is a concern ### Deduplication - Enabled by default - removes similar consecutive frames @@ -365,61 +272,56 @@ Just edit these files to customize how the vision model analyzes your frames! 
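### Under the Hood: Scene Detection

Scene detection is driven by FFmpeg's `select` filter and its per-frame `scene` change score. A standalone approximation of what `--scene-detection` does (a sketch: it assumes the tool's 0-100 threshold maps onto FFmpeg's 0.0-1.0 scene score, so the default of 15 corresponds to `gt(scene,0.15)`; the tool's actual filter chain may differ):

```shell
# Hypothetical equivalent of: --scene-detection --scene-threshold 15
# (assumption: threshold/100 becomes the FFmpeg scene-score cutoff)
mkdir -p frames
ffmpeg -i samples/meeting.mkv \
  -vf "select='gt(scene,0.15)'" \
  -vsync vfr -q:v 2 frames/frame_%05d.jpg
```

Lowering the cutoff (e.g. `gt(scene,0.05)`) keeps more frames, matching the more sensitive `--scene-threshold 5` behaviour described above.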
## Troubleshooting -### Vision Model Issues - -**"ollama package not installed"** -```bash -pip install ollama -``` - -**"Ollama not found" or connection errors** -```bash -# Install Ollama first: https://ollama.ai/download -# Then pull a vision model: -ollama pull llava:13b -``` - -**Vision analysis is slow** -- Use lighter model: `--vision-model llava:7b` -- Reduce frame count: `--scene-detection` or `--interval 10` -- Check if Ollama is using GPU (much faster) - -**Poor vision analysis results** -- Try different context hint: `--vision-context code` or `--vision-context dashboard` -- Use larger model: `--vision-model llava:13b` -- Ensure frames are clear (check video resolution) - -### OCR Issues - -**"pytesseract not installed"** -```bash -pip install pytesseract -sudo apt-get install tesseract-ocr # Don't forget system package! -``` - -**Poor OCR quality** -- **Solution**: Switch to vision analysis with `--use-vision` -- Or try different OCR engine: `--ocr-engine easyocr` -- Check if video resolution is sufficient -- Use `--no-deduplicate` to keep more frames - -### General Issues +### Frame Extraction Issues **"No frames extracted"** - Check video file is valid: `ffmpeg -i video.mkv` -- Try lower interval: `--interval 3` -- Check disk space in frames directory +- Try lower scene threshold: `--scene-threshold 5` +- Try interval extraction: `--interval 3` +- Check disk space in output directory **Scene detection not working** -- Fallback to interval extraction automatically - Ensure FFmpeg is installed +- Falls back to interval extraction automatically - Try manual interval: `--interval 5` +### Whisper/WhisperX Issues + +**WhisperX diarization not working** +- Ensure you have a HuggingFace token set +- Token needs access to pyannote models +- Fall back to standard Whisper without `--diarize` + +### Cache Issues + **Cache not being used** - Ensure you're using the same video filename - Check that output directory contains cached files - Use `--verbose` to see 
what's being cached/loaded +**Want to re-run specific steps** +- `--skip-cache-frames`: Re-extract frames +- `--skip-cache-whisper`: Re-run transcription +- `--skip-cache-analysis`: Re-run analysis +- `--no-cache`: Force complete reprocessing + +## Experimental Features + +### OCR and Vision Analysis + +OCR (`--ocr-engine`) and Vision analysis (`--use-vision`) options are available but experimental. The recommended approach is to use `--embed-images` which embeds frame references directly in the transcript, letting your LLM analyze the images. + +```bash +# Experimental: OCR extraction +python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract + +# Experimental: Vision model analysis +python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b + +# Experimental: Hybrid OpenCV + OCR +python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid +``` + ## Project Structure ``` @@ -429,26 +331,18 @@ meetus/ │ ├── workflow.py # Processing orchestrator │ ├── output_manager.py # Output directory & manifest management │ ├── cache_manager.py # Caching logic -│ ├── frame_extractor.py # Video frame extraction -│ ├── vision_processor.py # Vision model analysis (Ollama/LLaVA) -│ ├── ocr_processor.py # OCR processing -│ ├── transcript_merger.py # Transcript merging -│ └── prompts/ # Vision analysis prompts (editable!) 
-│ ├── meeting.txt # General meeting analysis -│ ├── code.txt # Code screenshot analysis -│ ├── dashboard.txt # Dashboard/monitoring analysis -│ └── console.txt # Terminal/console analysis -├── process_meeting.py # Main CLI script (thin wrapper) +│ ├── frame_extractor.py # Video frame extraction (FFmpeg scene detection) +│ ├── vision_processor.py # Vision model analysis (experimental) +│ ├── ocr_processor.py # OCR processing (experimental) +│ └── transcript_merger.py # Transcript merging +├── process_meeting.py # Main CLI script ├── requirements.txt # Python dependencies ├── output/ # Timestamped output directories -│ ├── .gitkeep │ └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video ├── samples/ # Sample videos (gitignored) └── README.md # This file ``` -The code is modular and easy to extend - each module has a single responsibility. - ## License For personal use.