updated readme

add whisperx support
2025-12-04 20:24:52 -03:00 · 2025-12-04 20:15:16 -03:00 · 2025-12-03 06:48:45 -03:00 · 2025-12-02 02:33:39 -03:00 · 2025-10-28 08:02:45 -03:00 · 2025-10-28 05:52:31 -03:00
21 changed files with 2331 additions and 606 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -2,10 +2,11 @@
 samples/*
 !samples/.gitkeep

-# Output files
+# Output directories (timestamped folders for each video)
 output/*
 !output/.gitkeep

-# Extracted frames
-frames/
-__pycache__
+# Python cache
+__pycache__
+*.pyc
+.pytest_cache/
--- a/README.md
+++ b/README.md
@@ -1,34 +1,21 @@
 # Meeting Processor

-Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
+Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.

 ## Overview

 This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)
+- **Audio transcription** (Whisper or WhisperX with speaker diarization)
+- **Screen content extraction** via FFmpeg scene detection
+- **Frame embedding** for direct LLM analysis

-### Vision Analysis vs OCR
-
- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
- **OCR**: Traditional text extraction - faster but less context-aware
-
-The result is a rich, timestamped transcript that provides full context for AI summarization.
+The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.

 ## Installation

 ### 1. System Dependencies

-**Ollama** (required for vision analysis):
-```bash
-# Install from https://ollama.ai/download
-# Then pull a vision model:
-ollama pull llava:13b
-# or for lighter model:
-ollama pull llava:7b
-```
-
-**FFmpeg** (for scene detection):
+**FFmpeg** (required for scene detection and frame extraction):
 ```bash
 # Ubuntu/Debian
 sudo apt-get install ffmpeg
@@ -37,210 +24,152 @@ sudo apt-get install ffmpeg
 brew install ffmpeg
 ```

-**Tesseract OCR** (optional, if not using vision):
-```bash
-# Ubuntu/Debian
-sudo apt-get install tesseract-ocr
-
-# macOS
-brew install tesseract
-
-# Arch Linux
-sudo pacman -S tesseract
-```
-
 ### 2. Python Dependencies

 ```bash
 pip install -r requirements.txt
 ```

-### 3. Whisper (for audio transcription)
+### 3. Whisper or WhisperX (for audio transcription)

+**Standard Whisper:**
 ```bash
 pip install openai-whisper
 ```

-### 4. Optional: Install Alternative OCR Engines
-
-If you prefer OCR over vision analysis:
+**WhisperX** (recommended - includes speaker diarization):
 ```bash
-# EasyOCR (better for rotated/handwritten text)
-pip install easyocr
-
-# PaddleOCR (better for code/terminal screens)
-pip install paddleocr
+pip install whisperx
 ```

+For speaker diarization, you'll need a HuggingFace token with access to pyannote models.
+
 ## Quick Start

-### Recommended: Vision Analysis (Best for Code/Dashboards)
+### Recommended Usage

 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
 ```

 This will:
-1. Run Whisper transcription (audio → text)
-2. Extract frames every 5 seconds
-3. Use LLaVA vision model to analyze frames with context
-4. Merge audio + screen content
-5. Save everything to `output/` folder
+1. Run WhisperX transcription with speaker diarization
+2. Extract frames at scene changes (threshold 10 = moderately sensitive)
+3. Create an enhanced transcript with frame file references
+4. Save everything to a timestamped folder under `output/`
+
+The `--embed-images` flag adds frame paths to the transcript (e.g., `Frame: frames/video_00257.jpg`), which keeps the transcript small while the frames stay in the `frames/` folder for LLM access.

 ### Re-run with Cached Results

 Already ran it once? Re-run instantly using cached results:
 ```bash
-# Uses cached transcript, frames, and analysis
-python process_meeting.py samples/meeting.mkv --use-vision
+# Uses cached transcript and frames
+python process_meeting.py samples/meeting.mkv --embed-images

-# Force reprocessing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
-```
+# Skip only specific cached items
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

-### Traditional OCR (Faster, Less Context-Aware)
-
-```bash
-python process_meeting.py samples/meeting.mkv --run-whisper
+# Force complete reprocessing
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
 ```

 ## Usage Examples

-### Vision Analysis with Context Hints
+### Scene Detection Options
 ```bash
-# For code-heavy meetings
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+# Default threshold (15)
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize

-# For dashboard/monitoring meetings (Grafana, GCP, etc.)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
+# More sensitive (more frames, threshold: 5)
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --diarize

-# For console/terminal sessions
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
+# Less sensitive (fewer frames, threshold: 30)
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 30 --diarize
 ```

-### Different Vision Models
-```bash
-# Lighter/faster model (7B parameters)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
-
-# Default model (13B parameters, better quality)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
-
-# Alternative models
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
-```
-
-### Extract frames at different intervals
+### Fixed Interval Extraction (alternative to scene detection)
 ```bash
 # Every 10 seconds
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
+python process_meeting.py samples/meeting.mkv --embed-images --interval 10 --diarize

 # Every 3 seconds (more detailed)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
-```
-
-### Use scene detection (smarter, fewer frames)
-```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
-```
-
-### Traditional OCR (if you prefer)
-```bash
-# Tesseract (default)
-python process_meeting.py samples/meeting.mkv --run-whisper
-
-# EasyOCR
-python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
-
-# PaddleOCR
-python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
+python process_meeting.py samples/meeting.mkv --embed-images --interval 3 --diarize
 ```

 ### Caching Examples
 ```bash
 # First run - processes everything
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize

-# Second run - uses cached transcript and frames, only re-merges
-python process_meeting.py samples/meeting.mkv
+# Iterate on scene threshold (reuse whisper transcript)
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

-# Switch from OCR to vision using existing frames
-python process_meeting.py samples/meeting.mkv --use-vision
+# Re-run whisper only
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

 # Force complete reprocessing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
 ```

 ### Custom output location
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --output-dir my_outputs/
 ```

 ### Enable verbose logging
 ```bash
-# Show detailed debug information
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --verbose
 ```

 ## Output Files

-All output files are saved to the `output/` directory by default:
+Each video gets its own timestamped output directory:

- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
- **`frames/`** - Extracted video frames (JPG files)
+```
+output/
+└── 20241019_143022-meeting/
+    ├── manifest.json                    # Processing configuration
+    ├── meeting_enhanced.txt             # Enhanced transcript for AI
+    ├── meeting.json                     # Whisper/WhisperX transcript
+    └── frames/                          # Extracted video frames
+        ├── frame_00001_5.00s.jpg
+        ├── frame_00002_10.00s.jpg
+        └── ...
+```

 ### Caching Behavior

-The tool automatically caches intermediate results to speed up re-runs:
- **Whisper transcript**: Cached as `output/<video>.json`
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
+The tool automatically reuses the most recent output directory for the same video:
+- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
+- **Subsequent runs**: Reuses the same directory and cached results
+- **Cached items**: Whisper transcript, extracted frames, analysis results

-Re-running with the same video will use cached results unless `--no-cache` is specified.
+**Fine-grained cache control:**
+- `--no-cache`: Force complete reprocessing
+- `--skip-cache-frames`: Re-extract frames only
+- `--skip-cache-whisper`: Re-run transcription only
+- `--skip-cache-analysis`: Re-run analysis only
+
+This allows you to iterate on scene detection thresholds without re-running Whisper!

 ## Workflow for Meeting Analysis

 ### Complete Workflow (One Command!)

 ```bash
-# Process everything in one step with vision analysis
-python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
-
-# Output will be in output/alo-intro1_enhanced.txt
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
 ```

 ### Typical Iterative Workflow

 ```bash
 # First run - full processing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize

-# Review results, then re-run with different context if needed
-python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
-
-# Or switch to a different vision model
-python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
-
-# All use cached frames and transcript!
-```
-
-### Traditional Workflow (Separate Steps)
-
-```bash
-# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
-whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
-
-# 2. Process video to extract screen content with vision
-python process_meeting.py samples/alo-intro1.mkv \
-    --transcript output/alo-intro1.json \
-    --use-vision \
-    --scene-detection
-
-# 3. Use the enhanced transcript with AI
-# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
+# Adjust scene threshold (keeps cached whisper transcript)
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
 ```

 ### Example Prompt for Claude
@@ -260,64 +189,54 @@ Please summarize this meeting transcript. Pay special attention to:
 ```
 usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
-                          [--output OUTPUT] [--output-dir OUTPUT_DIR]
-                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
-                          [--scene-detection]
-                          [--ocr-engine {tesseract,easyocr,paddleocr}]
-                          [--no-deduplicate] [--extract-only]
-                          [--format {detailed,compact}] [--verbose]
-                          video
+                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
+                          [--interval INTERVAL] [--scene-detection]
+                          [--scene-threshold SCENE_THRESHOLD]
+                          [--embed-images] [--embed-quality EMBED_QUALITY]
+                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
+                          [--skip-cache-analysis] [--no-deduplicate]
+                          [--extract-only] [--format {detailed,compact}]
+                          [--verbose] video

-Options:
-  video                 Path to video file
-  --transcript, -t      Path to Whisper transcript (JSON or TXT)
-  --run-whisper         Run Whisper transcription before processing
-  --whisper-model       Whisper model: tiny, base, small, medium, large (default: base)
-  --output, -o          Output file for enhanced transcript
-  --output-dir          Directory for output files (default: output/)
-  --frames-dir          Directory to save extracted frames (default: frames/)
-  --interval            Extract frame every N seconds (default: 5)
-  --scene-detection     Use scene detection instead of interval extraction
-  --ocr-engine          OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
-  --no-deduplicate      Disable text deduplication
-  --extract-only        Only extract frames and OCR, skip transcript merging
-  --format              Output format: detailed or compact (default: detailed)
-  --verbose, -v         Enable verbose logging (DEBUG level)
+Main Options:
+  video                   Path to video file
+  --diarize               Use WhisperX with speaker diarization
+  --embed-images          Add frame file references to transcript (recommended)
+
+Frame Extraction:
+  --scene-detection       Use FFmpeg scene detection (recommended)
+  --scene-threshold       Detection sensitivity 0-100 (default: 15, lower=more sensitive)
+  --interval              Extract frame every N seconds (alternative to scene detection)
+
+Caching:
+  --no-cache              Force complete reprocessing
+  --skip-cache-frames     Re-extract frames only
+  --skip-cache-whisper    Re-run transcription only
+  --skip-cache-analysis   Re-run analysis only
+
+Other:
+  --run-whisper           Run Whisper (without diarization)
+  --whisper-model         Whisper model: tiny, base, small, medium, large (default: medium)
+  --transcript, -t        Path to existing Whisper transcript (JSON or TXT)
+  --output, -o            Output file for enhanced transcript
+  --output-dir            Directory for output files (default: output/)
+  --verbose, -v           Enable verbose logging
 ```

 ## Tips for Best Results

-### Vision vs OCR: When to Use Each
-
-**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
-
-**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
-
-### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
-
 ### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
+- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
+- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds.

-### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
+### Scene Detection Threshold
+- Lower values (5-10): More sensitive, captures more frames
+- Default (15): Good balance for most meetings
+- Higher values (20-30): Less sensitive, fewer frames
+
+### Whisper vs WhisperX
+- **Whisper** (`--run-whisper`): Standard transcription, fast
+- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token

 ### Deduplication
 - Enabled by default - removes similar consecutive frames
@@ -325,73 +244,75 @@ Options:

 ## Troubleshooting

-### Vision Model Issues
-
-**"ollama package not installed"**
-```bash
-pip install ollama
-```
-
-**"Ollama not found" or connection errors**
-```bash
-# Install Ollama first: https://ollama.ai/download
-# Then pull a vision model:
-ollama pull llava:13b
-```
-
-**Vision analysis is slow**
- Use lighter model: `--vision-model llava:7b`
- Reduce frame count: `--scene-detection` or `--interval 10`
- Check if Ollama is using GPU (much faster)
-
-**Poor vision analysis results**
- Try different context hint: `--vision-context code` or `--vision-context dashboard`
- Use larger model: `--vision-model llava:13b`
- Ensure frames are clear (check video resolution)
-
-### OCR Issues
-
-**"pytesseract not installed"**
-```bash
-pip install pytesseract
-sudo apt-get install tesseract-ocr  # Don't forget system package!
-```
-
-**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try different OCR engine: `--ocr-engine easyocr`
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames
-
-### General Issues
+### Frame Extraction Issues

 **"No frames extracted"**
 - Check video file is valid: `ffmpeg -i video.mkv`
- Try lower interval: `--interval 3`
- Check disk space in frames directory
+- Try lower scene threshold: `--scene-threshold 5`
+- Try interval extraction: `--interval 3`
+- Check disk space in output directory

 **Scene detection not working**
- Fallback to interval extraction automatically
 - Ensure FFmpeg is installed
+- Falls back to interval extraction automatically
 - Try manual interval: `--interval 5`

+### Whisper/WhisperX Issues
+
+**WhisperX diarization not working**
+- Ensure you have a HuggingFace token set
+- Token needs access to pyannote models
+- Fall back to standard Whisper without `--diarize`
+
+### Cache Issues
+
 **Cache not being used**
 - Ensure you're using the same video filename
 - Check that output directory contains cached files
 - Use `--verbose` to see what's being cached/loaded

+**Want to re-run specific steps**
+- `--skip-cache-frames`: Re-extract frames
+- `--skip-cache-whisper`: Re-run transcription
+- `--skip-cache-analysis`: Re-run analysis
+- `--no-cache`: Force complete reprocessing
+
+## Experimental Features
+
+### OCR and Vision Analysis
+
+OCR (`--ocr-engine`) and vision analysis (`--use-vision`) options are available but experimental. The recommended approach is `--embed-images`, which adds frame references directly to the transcript and lets your LLM analyze the images.
+
+```bash
+# Experimental: OCR extraction
+python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
+
+# Experimental: Vision model analysis
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
+
+# Experimental: Hybrid OpenCV + OCR
+python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
+```
+
 ## Project Structure

 ```
 meetus/
-├── meetus/                  # Main package
+├── meetus/                     # Main package
 │   ├── __init__.py
-│   ├── frame_extractor.py   # Video frame extraction
-│   ├── ocr_processor.py     # OCR processing
-│   └── transcript_merger.py # Transcript merging
-├── process_meeting.py       # Main CLI script
-├── requirements.txt         # Python dependencies
-└── README.md               # This file
+│   ├── workflow.py             # Processing orchestrator
+│   ├── output_manager.py       # Output directory & manifest management
+│   ├── cache_manager.py        # Caching logic
+│   ├── frame_extractor.py      # Video frame extraction (FFmpeg scene detection)
+│   ├── vision_processor.py     # Vision model analysis (experimental)
+│   ├── ocr_processor.py        # OCR processing (experimental)
+│   └── transcript_merger.py    # Transcript merging
+├── process_meeting.py          # Main CLI script
+├── requirements.txt            # Python dependencies
+├── output/                     # Timestamped output directories
+│   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
+├── samples/                    # Sample videos (gitignored)
+└── README.md                   # This file
 ```

 ## License
--- a/def/01-scene-detection-quality-caching.md
+++ b/def/01-scene-detection-quality-caching.md
@@ -0,0 +1,80 @@
+# 01 - Scene Detection Sensitivity, Image Quality, and Granular Caching
+
+## Date
+2025-10-28
+
+## Context
+The last run on the zaca-run-scrapers sample (a Zed editor walkthrough) detected only 19 frames, with gaps of 7+ minutes between them. Whisper did not run because its flag was not passed. JPEG compression quality was too poor for code/text readability.
+
+## Problems Identified
+1. **Scene detection too conservative** - Default threshold of 30.0 missed file switches and scrolling in clean UI (Zed vs VS Code)
+2. **No whisper transcription** - User expected it to run but `--run-whisper` is opt-in
+3. **Poor JPEG quality** - Default compression made code/text hard to read for OCR/vision
+4. **Subprocess-based FFmpeg** - Using shell commands instead of Python library
+5. **All-or-nothing caching** - `--no-cache` regenerates everything including slow whisper transcription
+
+## Changes Made
+
+### 1. Scene Detection Sensitivity
+**Files:** `meetus/frame_extractor.py`, `process_meeting.py`, `meetus/workflow.py`
+
+- Lowered default threshold: `30.0` → `15.0` (more sensitive for clean UIs)
+- Added `--scene-threshold` CLI argument (0-100, lower = more sensitive)
+- Added threshold to manifest for tracking
+- Updated docstring with usage guidelines:
+  - 15.0: Good for clean UIs like Zed
+  - 20-30: Busy UIs like VS Code
+  - 5-10: Very subtle changes
+
+### 2. JPEG Quality Improvements
+**Files:** `meetus/frame_extractor.py`
+
+- **Interval extraction**: Added `cv2.IMWRITE_JPEG_QUALITY, 95` (line 60)
+- **Scene detection**: Added `-q:v 2` to FFmpeg (highest-quality end of the usual 2-31 JPEG scale, line 94)
+
+### 3. Migration to ffmpeg-python
+**Files:** `meetus/frame_extractor.py`, `requirements.txt`
+
+- Replaced `subprocess.run()` with `ffmpeg-python` library
+- Cleaner, more Pythonic API
+- Better error handling with `ffmpeg.Error`
+- Added to requirements.txt
+
+### 4. Granular Cache Control
+**Files:** `process_meeting.py`, `meetus/workflow.py`, `meetus/cache_manager.py`
+
+Added three new flags for selective cache invalidation:
+- `--skip-cache-frames`: Regenerate frames (useful when tuning scene threshold)
+- `--skip-cache-whisper`: Rerun whisper transcription
+- `--skip-cache-analysis`: Rerun OCR/vision analysis
+
+**Key design:**
+- `--no-cache`: Still works as before (new directory + regenerate everything)
+- New flags: Reuse existing output directory but selectively invalidate caches
+- Frames are cleaned up when regenerating to avoid stale data
+
+## Typical Workflow
+
+```bash
+# First run - generate everything including whisper (expensive, once)
+python process_meeting.py samples/video.mkv --run-whisper --scene-detection --use-vision
+
+# Iterate on scene threshold without re-running whisper
+python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 10 --use-vision --skip-cache-frames --skip-cache-analysis
+
+# Try even more sensitive
+python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 5 --use-vision --skip-cache-frames --skip-cache-analysis
+```
+
+## Notes
+- Whisper is the most expensive step and its output is stable → always cache it during iteration
+- Scene detection needs tuning per UI style (Zed vs VS Code)
+- Vision analysis should regenerate when frames change
+- Walking through code (file switches, scrolling) should trigger scene changes
+
+## Files Modified
+- `meetus/frame_extractor.py` - Scene threshold, quality, ffmpeg-python
+- `meetus/workflow.py` - Cache flags, frame cleanup
+- `meetus/cache_manager.py` - Granular cache checks
+- `process_meeting.py` - CLI arguments
+- `requirements.txt` - Added ffmpeg-python
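The granular cache rules described in `def/01` can be sketched as a small decision function. This is a minimal illustration, not the code in `meetus/cache_manager.py`: the `CacheFlags` class, the `stages_to_regenerate` function, and the stage names are hypothetical. It also assumes that regenerating frames forces analysis to rerun, following the note that vision analysis should regenerate when frames change.

```python
from dataclasses import dataclass

# Hypothetical stage names; the real pipeline may label these differently.
ALL_STAGES = frozenset({"frames", "whisper", "analysis"})


@dataclass(frozen=True)
class CacheFlags:
    """Hypothetical container mirroring the four CLI cache flags."""
    no_cache: bool = False
    skip_cache_frames: bool = False
    skip_cache_whisper: bool = False
    skip_cache_analysis: bool = False


def stages_to_regenerate(flags: CacheFlags, cached: set) -> set:
    """Return the stages to recompute, given which stages have results on disk."""
    if flags.no_cache:
        return set(ALL_STAGES)  # --no-cache: regenerate everything

    to_run = set(ALL_STAGES) - set(cached)  # anything never computed must run
    if flags.skip_cache_frames:
        to_run.add("frames")
    if flags.skip_cache_whisper:
        to_run.add("whisper")
    if flags.skip_cache_analysis:
        to_run.add("analysis")

    # Assumption: analysis built on old frames is stale once frames change.
    if "frames" in to_run:
        to_run.add("analysis")
    return to_run


if __name__ == "__main__":
    flags = CacheFlags(skip_cache_frames=True, skip_cache_analysis=True)
    everything = {"frames", "whisper", "analysis"}
    print(sorted(stages_to_regenerate(flags, everything)))
```

With every stage already cached, the two `--skip-cache-*` flags from the typical workflow leave the Whisper transcript untouched, which is the point of the design.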
--- a/def/02-hybrid-opencv-ocr-llm.md
+++ b/def/02-hybrid-opencv-ocr-llm.md
@@ -0,0 +1,111 @@
+# 02 - Hybrid OpenCV + OCR + LLM Approach
+
+## Date
+2025-10-28
+
+## Context
+Vision models (llava) were hallucinating text content badly - showing HTML code when there was none, inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.
+
+## Problem
+- **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
+- **Pure OCR**: Accurate text but messy output, lost indentation/formatting
+- **Need**: Accurate text extraction + preserved code structure
+
+## Solution: Three-Stage Hybrid Approach
+
+### Stage 1: OpenCV Text Detection
+Use morphological operations to find text regions:
+- Adaptive thresholding (handles varying lighting)
+- Dilation with horizontal kernel to connect text lines
+- Contour detection to find bounding boxes
+- Filter by area and aspect ratio
+- Merge overlapping regions
+
+### Stage 2: Region-Based OCR
+- Sort regions by reading order (top-to-bottom, left-to-right)
+- Crop each region from original image
+- Run OCR on cropped regions (more accurate than full frame)
+- Tesseract with PSM 6 mode to preserve layout
+- Preserve indentation in cleaning step
+
+### Stage 3: Optional LLM Cleanup
+- Take accurate OCR output (no hallucination)
+- Use lightweight LLM (llama3.2:3b for speed) to:
+  - Fix obvious OCR errors (l→1, O→0)
+  - Restore code indentation and structure
+  - Preserve exact text content
+  - No added explanations or hallucinated content
+
+## Benefits
+- ✓ **Accurate**: OCR reads actual pixels, no hallucination
+- ✓ **Fast**: OpenCV detection is instant, focused OCR is quick
+- ✓ **Structured**: Regions separated with headers showing position
+- ✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
+- ✓ **Deterministic**: Same input = same output (unlike vision models)
+
+## Implementation
+
+**New file:** `meetus/hybrid_processor.py`
+- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
+- Region sorting for proper reading order
+- Visual separators between regions
+
+**CLI flags:**
+```bash
+--use-hybrid                # Enable hybrid mode
+--hybrid-llm-cleanup        # Add LLM post-processing (optional)
+--hybrid-llm-model MODEL    # LLM model (default: llama3.2:3b)
+```
+
+**OCR improvements:**
+- Tesseract PSM 6 mode for better layout preservation
+- Modified text cleaning to keep indentation
+- `preserve_layout` parameter
+
+## Usage
+
+```bash
+# Basic hybrid (OpenCV + OCR)
+python process_meeting.py samples/video.mkv --use-hybrid --scene-detection
+
+# With LLM cleanup for best code formatting
+python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v
+
+# Iterate on threshold
+python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
+```
+
+## Output Format
+
+```
+[Region 1 at y=120]
+function calculateTotal(items) {
+  return items.reduce((sum, item) => sum + item.price, 0);
+}
+
+============================================================
+
+[Region 2 at y=450]
+const result = calculateTotal(cartItems);
+console.log('Total:', result);
+```
+
+## Performance
+- **Without LLM cleanup**: Very fast (~2-3s per frame)
+- **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
+- **Accuracy**: Much better than vision model hallucinations
+
+## When to Use What
+
+| Method | Best For | Pros | Cons |
+|--------|----------|------|------|
+| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
+| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
+| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
+| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
+
+## Files Modified
+- `meetus/hybrid_processor.py` - New hybrid processor
+- `meetus/ocr_processor.py` - Layout preservation
+- `meetus/workflow.py` - Hybrid mode integration
+- `process_meeting.py` - CLI flags and examples
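The region handling in `def/02` (merge overlapping boxes, sort into reading order, print with separators) can be sketched in plain Python. This is only an illustration under stated assumptions, not the contents of `meetus/hybrid_processor.py`: boxes are taken to be `(x, y, width, height)` tuples, the function names are hypothetical, and the separator width of 60 is inferred from the sample output.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)

SEPARATOR = "=" * 60  # assumed width, matching the sample output


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def union(a: Box, b: Box) -> Box:
    """Smallest box that covers both inputs."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Merge boxes until no two results overlap."""
    merged: List[Box] = []
    for box in boxes:
        changed = True
        while changed:  # the grown box may now touch other merged boxes
            changed = False
            for other in merged:
                if overlaps(box, other):
                    merged.remove(other)
                    box = union(box, other)
                    changed = True
                    break  # restart the scan; the list was modified
        merged.append(box)
    return merged


def reading_order(boxes: List[Box]) -> List[Box]:
    """Sort top-to-bottom, then left-to-right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def format_regions(regions: List[Tuple[Box, str]]) -> str:
    """Render (box, text) pairs with position headers and separators."""
    blocks = [
        f"[Region {i} at y={box[1]}]\n{text}"
        for i, (box, text) in enumerate(regions, start=1)
    ]
    return f"\n\n{SEPARATOR}\n\n".join(blocks)


if __name__ == "__main__":
    boxes = reading_order(merge_overlapping([(0, 450, 300, 40), (0, 120, 300, 60)]))
    print(format_regions([(b, "<ocr text>") for b in boxes]))
```

A plain `(y, x)` sort is the simplest reading of "top-to-bottom, left-to-right". Boxes on the same text line with slightly different `y` values would need row grouping, which this sketch leaves out.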
--- a/def/03-embed-images-for-llm.md
+++ b/def/03-embed-images-for-llm.md
@@ -0,0 +1,100 @@
+# 03 - Embed Images for LLM Analysis
+
+## Date
+2025-10-28
+
+## Context
+Hybrid OCR approach was fast and accurate but formatting was messy. Vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.
+
+## Problem
+- OCR/vision models either hallucinate or produce messy text
+- Code formatting/indentation is hard to preserve
+- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
+- Need to keep file size reasonable (~200KB per image is too big)
+
+## Solution: Image Embedding
+
+Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
+- See the actual screen content (no hallucination)
+- Understand code structure, layout, and formatting visually
+- Have full audio transcript context for each frame
+- Analyze dashboards, terminals, editors with perfect accuracy
+
+## Implementation
+
+**Quality Optimization:**
+- Default JPEG quality: 80 (good tradeoff between size and readability)
+- Configurable via `--embed-quality` (0-100)
+- Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
+
+**Format:**
+```
+[MM:SS] SPEAKER:
+  Audio transcript text here
+
+[MM:SS] SCREEN CONTENT:
+  IMAGE (base64, 52KB):
+  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
+
+  TEXT:
+  | Optional OCR text for reference
+```
+
+**Features:**
+- Base64 encoding for easy embedding
+- Size tracking and reporting
+- Optional text content alongside images
+- Works with scene detection for smart frame selection
+
+## Usage
+
+```bash
+# Basic: Embed images at quality 80 (default)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v
+
+# Lower quality for smaller files (still readable)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v
+
+# Higher quality for detailed code
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v
+
+# Iterate on scene threshold (reuse whisper)
+python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
+```
+
+## File Sizes
+
+**Example for 20 frames:**
+- Quality 60: ~30-50KB per image = 0.6-1MB total
+- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
+- Quality 90: ~80-120KB per image = 1.6-2.4MB total
+- Original: ~200KB per image = 4MB total
+
+## Benefits
+
+✓ **No hallucination**: LLM sees actual pixels
+✓ **Perfect formatting**: Code structure preserved visually
+✓ **Full context**: Audio transcript + visual frame together
+✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
+✓ **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller files than the original (40-80KB vs ~200KB)
+✓ **Simple workflow**: One file contains everything
+
+## Use Cases
+
+**Code walkthroughs:** LLM can see actual code structure and indentation
+**Dashboard analysis:** Charts, graphs, metrics visible to LLM
+**Terminal sessions:** Commands and output in proper context
+**UI reviews:** Actual interface visible with audio commentary
+
+## Files Modified
+
+- `meetus/transcript_merger.py` - Image encoding and embedding
+- `meetus/workflow.py` - Wire through config
+- `process_meeting.py` - CLI flags
+- `meetus/output_manager.py` - Cleaner directory naming (date + increment)
+
+## Output Directory Naming
+
+Also changed output directory format for clarity:
+- Old: `20251028_054553-video` (confusing timestamps)
+- New: `20251028-001-video` (clear date + run number)
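A minimal sketch of the SCREEN CONTENT block described under **Format** in `03-embed-images-for-llm.md`, assuming the frame is already available as JPEG bytes. The helper name `format_screen_block` and the choice to report the base64-encoded size (rather than the raw JPEG size) are assumptions for illustration; the real logic lives in `meetus/transcript_merger.py` and may differ.

```python
import base64
from typing import Optional


def format_screen_block(timestamp_s: float, jpeg_bytes: bytes,
                        ocr_text: Optional[str] = None) -> str:
    """Render one SCREEN CONTENT entry with an inline base64 JPEG (hypothetical helper)."""
    minutes, seconds = divmod(int(timestamp_s), 60)
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = len(encoded) // 1024  # assumption: report the encoded size, not the raw size

    lines = [
        f"[{minutes:02d}:{seconds:02d}] SCREEN CONTENT:",
        f"  IMAGE (base64, {size_kb}KB):",
        f"  <image>data:image/jpeg;base64,{encoded}</image>",
    ]
    if ocr_text:
        # Optional OCR text goes below the image, one "| " prefixed line per source line
        lines += ["", "  TEXT:"]
        lines += [f"  | {line}" for line in ocr_text.splitlines()]
    return "\n".join(lines)
```

Base64 inflates the payload by about a third, so the size that ends up in the transcript is larger than the on-disk JPEG sizes listed under **File Sizes**.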
--- a/def/04-fix-whisper-cache-loading.md
+++ b/def/04-fix-whisper-cache-loading.md
@@ -0,0 +1,78 @@
+# 04 - Fix Whisper Cache Loading
+
+## Date
+2025-10-28
+
+## Problem
+Enhanced transcript was not including the audio segments from cached whisper transcripts when running without the `--run-whisper` flag.
+
+Example command that failed:
+```bash
+python process_meeting.py samples/zaca-run-scrapers.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
+```
+
+Result: Enhanced transcript only contained embedded images, no audio segments (0 "SPEAKER" entries).
+
+## Root Cause
+In `workflow.py`, the `_run_whisper()` method was checking the `run_whisper` flag **before** checking the cache:
+
+```python
+def _run_whisper(self) -> Optional[str]:
+    if not self.config.run_whisper:
+        return self.config.transcript_path  # Returns None if --transcript not specified
+
+    # Cache check NEVER REACHED if run_whisper is False
+    cached = self.cache_mgr.get_whisper_cache()
+    if cached:
+        return str(cached)
+```
+
+This meant:
+- User runs command without `--run-whisper`
+- Method returns None immediately
+- Cached whisper transcript is never discovered
+- No audio segments in enhanced output
+
+## Solution
+Reorder the logic to check cache **first**, regardless of flags:
+
+```python
+def _run_whisper(self) -> Optional[str]:
+    """Run Whisper transcription if requested, or use cached/provided transcript."""
+    # First, check cache (regardless of run_whisper flag)
+    cached = self.cache_mgr.get_whisper_cache()
+    if cached:
+        return str(cached)
+
+    # If no cache and not running whisper, use provided transcript path (if any)
+    if not self.config.run_whisper:
+        return self.config.transcript_path
+
+    # If no cache and run_whisper is True, run whisper transcription
+    # ... rest of whisper code
+```
+
+## New Behavior
+1. Cache is checked first (regardless of `--run-whisper` flag)
+2. If cached whisper exists, use it
+3. If no cache and `--run-whisper` not specified, use `--transcript` path (or None)
+4. If no cache and `--run-whisper` specified, run whisper
+
+## Benefits
+✓ Cached whisper transcripts are always discovered and used
+✓ User can iterate on frame extraction/analysis without re-running whisper
+✓ Enhanced transcripts now properly include both audio + visual content
+✓ Granular cache flags (`--skip-cache-frames`, `--skip-cache-whisper`) work as expected
+
+## Use Case
+```bash
+# First run: Generate whisper transcript + extract frames
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
+
+# Second run: Iterate on scene threshold without re-running whisper
+python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
+# Now correctly includes cached whisper transcript in enhanced output!
+```
+
+## Files Modified
+- `meetus/workflow.py` - Reordered logic in `_run_whisper()` method (lines 172-181)
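The four steps under **New Behavior** in `04-fix-whisper-cache-loading.md` can be sketched as a standalone function. This is an illustration only: `resolve_transcript` and its parameters are hypothetical names, and the real method reads these values from `self.config` and `self.cache_mgr`.

```python
from typing import Callable, Optional


def resolve_transcript(cached: Optional[str], run_whisper: bool,
                       transcript_path: Optional[str],
                       transcribe: Callable[[], str]) -> Optional[str]:
    """Pick the transcript source in the order listed under 'New Behavior'."""
    if cached:             # steps 1-2: a cached transcript wins regardless of the flag
        return cached
    if not run_whisper:    # step 3: fall back to --transcript (may be None)
        return transcript_path
    return transcribe()    # step 4: no cache and --run-whisper was given
```

Passing the transcription step in as a callable keeps the ordering testable without invoking Whisper.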
--- a/def/05-reference-frames-instead-of-embedding.md
+++ b/def/05-reference-frames-instead-of-embedding.md
@@ -0,0 +1,124 @@
+# 05 - Reference Frame Files Instead of Embedding
+
+## Date
+2025-10-28
+
+## Context
+Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.
+
+## Problem
+- Enhanced transcript with embedded base64 images was 3.7MB
+- Large file size makes it slow to read/process
+- Difficult to inspect individual frames
+- Harder to share and version control
+
+## Solution: Reference Frame Paths
+Instead of embedding base64 image data, reference the frame files by their relative paths.
+
+### Before (Embedded):
+```
+[00:08] SCREEN CONTENT:
+  IMAGE (base64, 85KB):
+  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
+```
+File size: 3.7MB
+
+### After (Referenced):
+```
+[00:08] SCREEN CONTENT:
+  Frame: frames/zaca-run-scrapers_00257.jpg
+```
+File size: ~50KB
+
+## Implementation
+
+**Directory Structure:**
+```
+output/20251028-003-zaca-run-scrapers/
+├── frames/
+│   ├── zaca-run-scrapers_00257.jpg
+│   ├── zaca-run-scrapers_00487.jpg
+│   └── ...
+├── zaca-run-scrapers.json (whisper transcript)
+└── zaca-run-scrapers_enhanced.txt (references frames/ directory)
+```
+
+**Enhanced Transcript Format:**
+```
+================================================================================
+ENHANCED MEETING TRANSCRIPT
+Audio transcript + Screen frames
+================================================================================
+
+[00:30] SPEAKER:
+  Bueno, te dio un tour para el proyecto...
+
+[00:08] SCREEN CONTENT:
+  Frame: frames/zaca-run-scrapers_00257.jpg
+
+[01:00] SPEAKER:
+  Mayormente en Scrapping lo que tenemos...
+
+[01:15] SCREEN CONTENT:
+  Frame: frames/zaca-run-scrapers_00487.jpg
+  TEXT:
+  | Code snippet from screen (if OCR was used)
+```
+
+## Benefits
+
+✓ **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
+✓ **Easier to inspect**: Can view individual frames directly
+✓ **LLM can access images**: Frame paths allow LLM to load images on demand
+✓ **Better version control**: Text files are small and diffable
+✓ **Cleaner structure**: Frames organized in dedicated directory
+✓ **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)
+
+## Flags
+
+**`--embed-images`**: Skip OCR/vision analysis and just reference frame files (despite the name, nothing is embedded anymore)
+- Faster (no analysis needed)
+- Lets LLM analyze raw images
+- Enhanced transcript only contains frame references
+
+**Without `--embed-images`**: Run OCR/vision analysis
+- Extracts text from frames
+- Enhanced transcript includes both frame reference AND extracted text
+- Useful for code/dashboard analysis
+
+## Usage
+
+```bash
+# Reference frames only (no OCR, faster)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
+
+# Reference frames + OCR text extraction
+python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v
+
+# Adjust frame quality (smaller files)
+python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
+```
+
+## Files Modified
+
+- `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
+- `process_meeting.py` - Updated help text and examples to reflect frame referencing
+- All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)
+
+## Workflow Example
+
+```bash
+# First run: Generate everything
+python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v
+
+# Result:
+# - output/20251028-004-meeting/
+#   - frames/ (40 frames, ~80KB each)
+#   - meeting.json (whisper transcript)
+#   - meeting_enhanced.txt (~50KB, references frames/)
+
+# LLM can now:
+# 1. Read enhanced transcript
+# 2. See timeline of audio + screen changes
+# 3. Load individual frames as needed from frames/ directory
+```
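To illustrate the "load individual frames as needed" step from `05-reference-frames-instead-of-embedding.md`, here is a minimal sketch of how a consumer might collect the frame references from an enhanced transcript. It assumes the `[MM:SS] SCREEN CONTENT:` / `Frame: <path>` layout shown above and frame paths without spaces; `list_frame_references` is a hypothetical helper, not part of the tool.

```python
import re
from typing import List, Tuple

ENTRY_RE = re.compile(r"^\[(\d+):(\d{2})\] SCREEN CONTENT:\s*$")
FRAME_RE = re.compile(r"^\s*Frame:\s*(\S+)\s*$")


def list_frame_references(transcript: str) -> List[Tuple[int, str]]:
    """Return (seconds, relative_frame_path) pairs in file order."""
    refs = []
    current_seconds = None
    for line in transcript.splitlines():
        entry = ENTRY_RE.match(line)
        if entry:
            # Remember the timestamp until the matching "Frame:" line shows up
            current_seconds = int(entry.group(1)) * 60 + int(entry.group(2))
            continue
        frame = FRAME_RE.match(line)
        if frame and current_seconds is not None:
            refs.append((current_seconds, frame.group(1)))
            current_seconds = None
        elif line.startswith("["):
            # A different entry type (e.g. SPEAKER) started; drop any pending timestamp
            current_seconds = None
    return refs
```

The returned paths are relative to the run's output directory, so a caller would join them with that directory before opening the images.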
--- a/meetus/cache_manager.py
+++ b/meetus/cache_manager.py
@@ -0,0 +1,162 @@
+"""
+Manage caching for frames, transcripts, and analysis results.
+"""
+from pathlib import Path
+import json
+import logging
+from typing import List, Tuple, Dict, Optional
+
+logger = logging.getLogger(__name__)
+
+
+class CacheManager:
+    """Manage caching of intermediate processing results."""
+
+    def __init__(self, output_dir: Path, frames_dir: Path, video_name: str, use_cache: bool = True,
+                 skip_cache_frames: bool = False, skip_cache_whisper: bool = False,
+                 skip_cache_analysis: bool = False):
+        """
+        Initialize cache manager.
+
+        Args:
+            output_dir: Output directory for cached files
+            frames_dir: Directory for cached frames
+            video_name: Name of the video (stem)
+            use_cache: Whether to use caching globally
+            skip_cache_frames: Skip cached frames specifically
+            skip_cache_whisper: Skip cached whisper specifically
+            skip_cache_analysis: Skip cached analysis specifically
+        """
+        self.output_dir = output_dir
+        self.frames_dir = frames_dir
+        self.video_name = video_name
+        self.use_cache = use_cache
+        self.skip_cache_frames = skip_cache_frames
+        self.skip_cache_whisper = skip_cache_whisper
+        self.skip_cache_analysis = skip_cache_analysis
+
+    def get_whisper_cache(self) -> Optional[Path]:
+        """
+        Check for cached Whisper transcript.
+
+        Returns:
+            Path to cached transcript or None
+        """
+        if not self.use_cache or self.skip_cache_whisper:
+            return None
+
+        cache_path = self.output_dir / f"{self.video_name}.json"
+        if cache_path.exists():
+            logger.info(f"✓ Found cached Whisper transcript: {cache_path.name}")
+
+            # Debug: Show cached transcript info
+            try:
+                import json
+                with open(cache_path, 'r', encoding='utf-8') as f:
+                    data = json.load(f)
+                if 'segments' in data:
+                    logger.debug(f"Cached transcript has {len(data['segments'])} segments")
+            except Exception as e:
+                logger.debug(f"Could not parse cached whisper for debug: {e}")
+
+            return cache_path
+
+        return None
+
+    def get_frames_cache(self) -> Optional[List[Tuple[str, float]]]:
+        """
+        Check for cached frames.
+
+        Returns:
+            List of (frame_path, timestamp) tuples or None
+        """
+        if not self.use_cache or self.skip_cache_frames or not self.frames_dir.exists():
+            return None
+
+        existing_frames = list(self.frames_dir.glob("*.jpg"))
+
+        if not existing_frames:
+            return None
+
+        logger.info(f"✓ Found {len(existing_frames)} cached frames in {self.frames_dir.name}/")
+        logger.debug(f"Frame filenames: {[f.name for f in sorted(existing_frames)[:3]]}...")
+
+        # Build frames_info from existing files
+        frames_info = []
+        for frame_path in sorted(existing_frames):
+            # Try to extract timestamp from filename (e.g., frame_00001_12.34s.jpg)
+            try:
+                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
+                timestamp = float(timestamp_str)
+            except ValueError:
+                timestamp = 0.0
+            frames_info.append((str(frame_path), timestamp))
+
+        return frames_info
+
+    def get_analysis_cache(self, analysis_type: str) -> Optional[List[Dict]]:
+        """
+        Check for cached analysis results.
+
+        Args:
+            analysis_type: 'vision' or 'ocr'
+
+        Returns:
+            List of analysis results or None
+        """
+        if not self.use_cache or self.skip_cache_analysis:
+            return None
+
+        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"
+
+        if cache_path.exists():
+            logger.info(f"✓ Found cached {analysis_type} analysis: {cache_path.name}")
+            with open(cache_path, 'r', encoding='utf-8') as f:
+                results = json.load(f)
+            logger.info(f"✓ Loaded {len(results)} analyzed frames from cache")
+
+            # Debug: Show first cached result
+            if results:
+                logger.debug(f"First cached result: timestamp={results[0].get('timestamp')}, text_length={len(results[0].get('text', ''))}")
+
+            return results
+
+        return None
+
+    def save_analysis(self, analysis_type: str, results: List[Dict]):
+        """
+        Save analysis results to cache.
+
+        Args:
+            analysis_type: 'vision' or 'ocr'
+            results: Analysis results to save
+        """
+        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"
+
+        with open(cache_path, 'w', encoding='utf-8') as f:
+            json.dump(results, f, indent=2, ensure_ascii=False)
+
+        logger.info(f"✓ Saved {analysis_type} analysis to: {cache_path.name}")
+
+    def cache_exists(self, analysis_type: Optional[str] = None) -> Dict[str, bool]:
+        """
+        Check what caches exist.
+
+        Args:
+            analysis_type: Optional specific analysis type to check
+
+        Returns:
+            Dictionary of cache status
+        """
+        status = {
+            "whisper": (self.output_dir / f"{self.video_name}.json").exists(),
+            "frames": any(self.frames_dir.glob("*.jpg")) if self.frames_dir.exists() else False,
+        }
+
+        if analysis_type:
+            status[analysis_type] = (self.output_dir / f"{self.video_name}_{analysis_type}.json").exists()
+        else:
+            status["vision"] = (self.output_dir / f"{self.video_name}_vision.json").exists()
+            status["ocr"] = (self.output_dir / f"{self.video_name}_ocr.json").exists()
+
+        return status
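`get_frames_cache()` recovers timestamps by parsing the last underscore-separated token of each filename. That matches interval frames (`frame_00001_12.34s.jpg`), but scene-detection frames appear to be written as `{video_name}_%05d.jpg`, so the same parse would turn `zaca-run-scrapers_00257.jpg` into `257.0`, which looks like a frame PTS or index rather than seconds. A minimal sketch of a stricter parser, under the hypothetical name `timestamp_from_filename`, that accepts only the `<seconds>s` suffix and returns `None` otherwise so the caller can decide how to recover the real timestamp:

```python
import re
from pathlib import Path
from typing import Optional

# Interval frames are named like frame_00001_12.34s.jpg (see extract_by_interval)
INTERVAL_NAME_RE = re.compile(r"_(\d+(?:\.\d+)?)s$")


def timestamp_from_filename(frame_path: str) -> Optional[float]:
    """Return the seconds encoded in an interval-style name, else None."""
    match = INTERVAL_NAME_RE.search(Path(frame_path).stem)
    return float(match.group(1)) if match else None
```

Returning `None` instead of `0.0` or a misparsed number makes a missing timestamp visible to the caller rather than silently placing frames at the wrong point in the transcript.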
--- a/meetus/frame_extractor.py
+++ b/meetus/frame_extractor.py
@@ -6,9 +6,9 @@ import cv2
 import os
 from pathlib import Path
 from typing import List, Tuple, Optional
-import subprocess
 import json
 import logging
+import re

 logger = logging.getLogger(__name__)

@@ -16,17 +16,19 @@ logger = logging.getLogger(__name__)
 class FrameExtractor:
    """Extract frames from video files."""

-    def __init__(self, video_path: str, output_dir: str = "frames"):
+    def __init__(self, video_path: str, output_dir: str = "frames", quality: int = 75):
        """
        Initialize frame extractor.

        Args:
            video_path: Path to video file
            output_dir: Directory to save extracted frames
+            quality: JPEG quality for saved frames (0-100)
        """
        self.video_path = video_path
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
+        self.quality = quality

    def extract_by_interval(self, interval_seconds: int = 5) -> List[Tuple[str, float]]:
        """
@@ -56,7 +58,16 @@ class FrameExtractor:
                frame_filename = f"frame_{saved_count:05d}_{timestamp:.2f}s.jpg"
                frame_path = self.output_dir / frame_filename

-                cv2.imwrite(str(frame_path), frame)
+                # Downscale to 1600px width for smaller file size (but still readable)
+                height, width = frame.shape[:2]
+                if width > 1600:
+                    ratio = 1600 / width
+                    new_width = 1600
+                    new_height = int(height * ratio)
+                    frame = cv2.resize(frame, (new_width, new_height), interpolation=cv2.INTER_LANCZOS4)
+
+                # Save with configured quality (matches embed quality)
+                cv2.imwrite(str(frame_path), frame, [cv2.IMWRITE_JPEG_QUALITY, self.quality])
                frames_info.append((str(frame_path), timestamp))
                saved_count += 1

@@ -66,48 +77,80 @@ class FrameExtractor:
        logger.info(f"Extracted {saved_count} frames at {interval_seconds}s intervals")
        return frames_info

-    def extract_scene_changes(self, threshold: float = 30.0) -> List[Tuple[str, float]]:
+    def extract_scene_changes(self, threshold: float = 15.0) -> List[Tuple[str, float]]:
        """
        Extract frames only on scene changes using FFmpeg.
        More efficient than interval-based extraction.

        Args:
            threshold: Scene change detection threshold (0-100, lower = more sensitive)
+                      Default: 15.0 (good for clean UIs like Zed)
+                      Higher values (20-30) for busy UIs like VS Code
+                      Lower values (5-10) for very subtle changes

        Returns:
            List of (frame_path, timestamp) tuples
        """
+        try:
+            import ffmpeg
+        except ImportError:
+            raise ImportError("ffmpeg-python not installed. Run: pip install ffmpeg-python")
+
        video_name = Path(self.video_path).stem
        output_pattern = self.output_dir / f"{video_name}_%05d.jpg"

-        # Use FFmpeg's scene detection filter
-        cmd = [
-            'ffmpeg',
-            '-i', self.video_path,
-            '-vf', f'select=gt(scene\\,{threshold/100}),showinfo',
-            '-vsync', 'vfr',
-            '-frame_pts', '1',
-            str(output_pattern),
-            '-loglevel', 'info'
-        ]
-
        try:
-            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+            # Use FFmpeg's scene detection filter with downscaling
+            stream = ffmpeg.input(self.video_path)
+            stream = ffmpeg.filter(stream, 'select', f'gt(scene,{threshold/100})')
+            stream = ffmpeg.filter(stream, 'showinfo')
+            # Scale to 1600px width (maintains aspect ratio, still readable)
+            # Use simple conditional: if width > 1600, scale to 1600, else keep original
+            stream = ffmpeg.filter(stream, 'scale', w='min(1600,iw)', h=-1)

-            # Parse output to get frame timestamps
+            # Convert JPEG quality (0-100) to FFmpeg qscale (2-31, lower=better)
+            # Rough mapping: qscale ≈ (100 - quality) / 10 + 2, clamped to 2-31
+            qscale = max(2, min(31, int((100 - self.quality) / 10 + 2)))
+
+            stream = ffmpeg.output(
+                stream,
+                str(output_pattern),
+                vsync='vfr',
+                frame_pts=1,
+                **{'q:v': str(qscale)}  # Matches configured quality
+            )
+
+            # Run with stderr capture to get showinfo output
+            _, stderr = ffmpeg.run(stream, capture_stderr=True, overwrite_output=True)
+            stderr = stderr.decode('utf-8')
+
+            # Parse FFmpeg output to get frame timestamps from showinfo filter
            frames_info = []
-            for img in sorted(self.output_dir.glob(f"{video_name}_*.jpg")):
-                # Extract timestamp from filename or use FFprobe
-                frames_info.append((str(img), 0.0))  # Timestamp extraction can be enhanced
+
+            # Extract timestamps from stderr (showinfo outputs there)
+            timestamp_pattern = r'pts_time:([\d.]+)'
+            timestamps = re.findall(timestamp_pattern, stderr)
+
+            # Match frames to timestamps
+            frame_files = sorted(self.output_dir.glob(f"{video_name}_*.jpg"))
+
+            for idx, img in enumerate(frame_files):
+                # Use extracted timestamp or fallback to index-based estimate
+                timestamp = float(timestamps[idx]) if idx < len(timestamps) else idx * 5.0
+                frames_info.append((str(img), timestamp))

            logger.info(f"Extracted {len(frames_info)} frames at scene changes")
            return frames_info

-        except subprocess.CalledProcessError as e:
-            logger.error(f"FFmpeg error: {e.stderr}")
+        except ffmpeg.Error as e:
+            logger.error(f"FFmpeg error: {e.stderr.decode() if e.stderr else str(e)}")
            # Fallback to interval extraction
            logger.warning("Falling back to interval extraction...")
            return self.extract_by_interval()
+        except Exception as e:
+            logger.error(f"Unexpected error during scene extraction: {e}")
+            logger.warning("Falling back to interval extraction...")
+            return self.extract_by_interval()

    def get_video_duration(self) -> float:
        """Get video duration in seconds."""
--- a/meetus/hybrid_processor.py
+++ b/meetus/hybrid_processor.py
@@ -0,0 +1,355 @@
+"""
+Hybrid frame analysis: OpenCV text detection + OCR for accurate extraction.
+Better than pure vision models which tend to hallucinate text content.
+"""
+from typing import List, Tuple, Dict, Optional
+from pathlib import Path
+import logging
+import cv2
+import numpy as np
+from difflib import SequenceMatcher
+
+logger = logging.getLogger(__name__)
+
+
+class HybridProcessor:
+    """Combine OpenCV text detection with OCR for accurate text extraction."""
+
+    def __init__(self, ocr_engine: str = "tesseract", min_confidence: float = 0.5,
+                 use_llm_cleanup: bool = False, llm_model: Optional[str] = None):
+        """
+        Initialize hybrid processor.
+
+        Args:
+            ocr_engine: OCR engine to use ('tesseract', 'easyocr', 'paddleocr')
+            min_confidence: Minimum confidence for text detection (0-1)
+            use_llm_cleanup: Use LLM to clean up OCR output and preserve formatting
+            llm_model: Ollama model for cleanup (default: llama3.2:3b for speed)
+        """
+        from .ocr_processor import OCRProcessor
+
+        self.ocr = OCRProcessor(engine=ocr_engine)
+        self.min_confidence = min_confidence
+        self.use_llm_cleanup = use_llm_cleanup
+        self.llm_model = llm_model or "llama3.2:3b"
+        self._llm_client = None
+
+        if use_llm_cleanup:
+            self._init_llm()
+
+    def _init_llm(self):
+        """Initialize Ollama client for LLM cleanup."""
+        try:
+            import ollama
+            self._llm_client = ollama
+            logger.info(f"LLM cleanup enabled using {self.llm_model}")
+        except ImportError:
+            logger.warning("ollama package not installed. LLM cleanup disabled.")
+            self.use_llm_cleanup = False
+
+    def _cleanup_with_llm(self, raw_text: str) -> str:
+        """
+        Use LLM to clean up OCR output and preserve code formatting.
+
+        Args:
+            raw_text: Raw OCR output
+
+        Returns:
+            Cleaned up text with proper formatting
+        """
+        if not self.use_llm_cleanup or not self._llm_client:
+            return raw_text
+
+        prompt = """You are cleaning up OCR output from a code editor screenshot.
+
+Your task:
+1. Fix any obvious OCR errors (l→1, O→0, etc.)
+2. Preserve or restore code indentation and structure
+3. Keep the exact text content - don't add explanations or comments
+4. If it's code, maintain proper spacing and formatting
+5. Return ONLY the cleaned text, nothing else
+
+OCR Text:
+"""
+
+        try:
+            response = self._llm_client.generate(
+                model=self.llm_model,
+                prompt=prompt + raw_text,
+                options={"temperature": 0.1}  # Low temperature for accuracy
+            )
+            cleaned = response['response'].strip()
+            logger.debug(f"LLM cleanup: {len(raw_text)} → {len(cleaned)} chars")
+            return cleaned
+        except Exception as e:
+            logger.warning(f"LLM cleanup failed: {e}, using raw OCR output")
+            return raw_text
+
+    def detect_text_regions(self, image_path: str, min_area: int = 100) -> List[Tuple[int, int, int, int]]:
+        """
+        Detect text regions in image using OpenCV.
+
+        Args:
+            image_path: Path to image file
+            min_area: Minimum area for text region (pixels)
+
+        Returns:
+            List of bounding boxes (x, y, w, h)
+        """
+        # Read image
+        img = cv2.imread(image_path)
+        if img is None:
+            logger.warning(f"Could not read image: {image_path}")
+            return []
+
+        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+
+        # Method 1: Morphological operations to find text regions
+        # Works well for solid text blocks
+        regions = self._detect_by_morphology(gray, min_area)
+
+        if not regions:
+            logger.debug(f"No text regions detected in {Path(image_path).name}")
+
+        return regions
+
+    def _detect_by_morphology(self, gray: np.ndarray, min_area: int) -> List[Tuple[int, int, int, int]]:
+        """
+        Detect text regions using morphological operations.
+        Fast and works well for solid text blocks (code editors, terminals).
+
+        Args:
+            gray: Grayscale image
+            min_area: Minimum area for region
+
+        Returns:
+            List of bounding boxes (x, y, w, h)
+        """
+        # Apply adaptive threshold to handle varying lighting
+        binary = cv2.adaptiveThreshold(
+            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+            cv2.THRESH_BINARY_INV, 11, 2
+        )
+
+        # Morphological operations to connect text regions
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))  # Horizontal kernel for text lines
+        dilated = cv2.dilate(binary, kernel, iterations=2)
+
+        # Find contours
+        contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
+
+        # Filter and extract bounding boxes
+        regions = []
+        for contour in contours:
+            x, y, w, h = cv2.boundingRect(contour)
+            area = w * h
+
+            # Filter by area and minimum dimensions
+            if area > min_area and w > 20 and h > 10:  # Reasonable text dimensions
+                regions.append((x, y, w, h))
+
+        # Merge overlapping regions
+        regions = self._merge_overlapping_regions(regions)
+
+        logger.debug(f"Detected {len(regions)} text regions using morphology")
+        return regions
+
+    def _merge_overlapping_regions(
+        self, regions: List[Tuple[int, int, int, int]],
+        overlap_threshold: float = 0.3
+    ) -> List[Tuple[int, int, int, int]]:
+        """
+        Merge overlapping bounding boxes.
+
+        Args:
+            regions: List of (x, y, w, h) tuples
+            overlap_threshold: Minimum overlap ratio to merge
+
+        Returns:
+            Merged regions
+        """
+        if not regions:
+            return []
+
+        # Sort by y-coordinate (top to bottom)
+        regions = sorted(regions, key=lambda r: r[1])
+
+        merged = []
+        current = list(regions[0])
+
+        for region in regions[1:]:
+            x, y, w, h = region
+            cx, cy, cw, ch = current
+
+            # Check for overlap
+            x_overlap = max(0, min(cx + cw, x + w) - max(cx, x))
+            y_overlap = max(0, min(cy + ch, y + h) - max(cy, y))
+            overlap_area = x_overlap * y_overlap
+
+            current_area = cw * ch
+            region_area = w * h
+            min_area = min(current_area, region_area)
+
+            if overlap_area / min_area > overlap_threshold:
+                # Merge regions
+                new_x = min(cx, x)
+                new_y = min(cy, y)
+                new_x2 = max(cx + cw, x + w)
+                new_y2 = max(cy + ch, y + h)
+                current = [new_x, new_y, new_x2 - new_x, new_y2 - new_y]
+            else:
+                merged.append(tuple(current))
+                current = list(region)
+
+        merged.append(tuple(current))
+        return merged
+
+    def extract_text_from_region(self, image_path: str, region: Tuple[int, int, int, int]) -> str:
+        """
+        Extract text from a specific region using OCR.
+
+        Args:
+            image_path: Path to image file
+            region: Bounding box (x, y, w, h)
+
+        Returns:
+            Extracted text
+        """
+        from PIL import Image
+
+        # Load image and crop region
+        img = Image.open(image_path)
+        x, y, w, h = region
+        cropped = img.crop((x, y, x + w, y + h))
+
+        import tempfile
+        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
+            tmp_path = Path(tmp.name)  # Reserve a temp path; the handle closes before reuse
+        try:
+            cropped.save(tmp_path)
+            text = self.ocr.extract_text(str(tmp_path))
+        finally:
+            tmp_path.unlink()  # Always clean up, even if OCR raises
+
+        return text
+
+    def analyze_frame(self, image_path: str) -> str:
+        """
+        Analyze a frame: detect text regions and OCR them.
+
+        Args:
+            image_path: Path to image file
+
+        Returns:
+            Combined text from all detected regions
+        """
+        # Detect text regions
+        regions = self.detect_text_regions(image_path)
+
+        if not regions:
+            # Fallback to full-frame OCR if no regions detected
+            logger.debug(f"No regions detected, using full-frame OCR for {Path(image_path).name}")
+            raw_text = self.ocr.extract_text(image_path)
+            return self._cleanup_with_llm(raw_text) if self.use_llm_cleanup else raw_text
+
+        # Sort regions by reading order (top-to-bottom, left-to-right)
+        regions = self._sort_regions_by_reading_order(regions)
+
+        # Extract text from each region
+        texts = []
+        for idx, region in enumerate(regions):
+            x, y, w, h = region
+            text = self.extract_text_from_region(image_path, region)
+            if text.strip():
+                # Add visual separator with region info
+                section_header = f"[Region {idx+1} at y={y}]"
+                texts.append(f"{section_header}\n{text.strip()}")
+                logger.debug(f"Region {idx+1}/{len(regions)} (y={y}): Extracted {len(text)} chars")
+
+        combined = ("\n\n" + "="*60 + "\n\n").join(texts)
+        logger.debug(f"Total extracted from {len(regions)} regions: {len(combined)} chars")
+
+        # Apply LLM cleanup if enabled
+        if self.use_llm_cleanup:
+            combined = self._cleanup_with_llm(combined)
+
+        return combined
+
+    def _sort_regions_by_reading_order(self, regions: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int, int, int]]:
+        """
+        Sort regions in reading order (top-to-bottom, left-to-right).
+
+        Args:
+            regions: List of (x, y, w, h) tuples
+
+        Returns:
+            Sorted regions
+        """
+        # Sort top to bottom in 20px-tall bands, then left to right within each band,
+        # so regions on roughly the same text line keep their horizontal order
+        sorted_regions = sorted(regions, key=lambda r: (r[1] // 20, r[0]))
+        return sorted_regions
+
+    def process_frames(
+        self,
+        frames_info: List[Tuple[str, float]],
+        deduplicate: bool = True,
+        similarity_threshold: float = 0.85
+    ) -> List[Dict]:
+        """
+        Process multiple frames with hybrid analysis.
+
+        Args:
+            frames_info: List of (frame_path, timestamp) tuples
+            deduplicate: Whether to remove similar consecutive analyses
+            similarity_threshold: Threshold for considering analyses as duplicates (0-1)
+
+        Returns:
+            List of dicts with 'timestamp', 'text', and 'frame_path'
+        """
+        results = []
+        prev_text = ""
+
+        total = len(frames_info)
+        logger.info(f"Starting hybrid analysis of {total} frames...")
+
+        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
+            logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
+
+            text = self.analyze_frame(frame_path)
+
+            if not text:
+                logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
+                continue
+
+            # Debug: Show what was extracted
+            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
+            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
+
+            # Deduplicate similar consecutive frames
+            if deduplicate and prev_text:
+                similarity = self._text_similarity(prev_text, text)
+                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
+                if similarity > similarity_threshold:
+                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
+                    continue
+
+            results.append({
+                'timestamp': timestamp,
+                'text': text,
+                'frame_path': frame_path
+            })
+
+            prev_text = text
+
+        logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
+        return results
+
+    def _text_similarity(self, text1: str, text2: str) -> float:
+        """
+        Calculate similarity between two texts.
+
+        Returns:
+            Similarity score between 0 and 1
+        """
+        return SequenceMatcher(None, text1, text2).ratio()
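Note on `_sort_regions_by_reading_order`: the sort key buckets regions into fixed 20px rows rather than comparing neighbours, so two regions only 1px apart can land in different rows when they straddle a bucket edge. A minimal standalone sketch of the same key is below; the function name `sort_reading_order` and the `row_height` parameter are illustrative and not part of the diff.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x, y, w, h), as in the diff


def sort_reading_order(regions: List[Region], row_height: int = 20) -> List[Region]:
    """Sort by fixed-height row bucket (y // row_height), then by x."""
    return sorted(regions, key=lambda r: (r[1] // row_height, r[0]))


# y=12 and y=18 share bucket 0; y=41 and y=45 share bucket 2.
regions = [(300, 12, 50, 10), (10, 18, 50, 10), (10, 45, 50, 10), (200, 41, 50, 10)]
ordered = sort_reading_order(regions)

# y=19 is in bucket 0 and y=20 is in bucket 1, so the right-hand region
# sorts first even though both sit on the same visual line.
split = sort_reading_order([(300, 19, 50, 10), (10, 20, 50, 10)])
```

If this edge case matters for real screenshots, one option would be to cluster on the gap between consecutive y values instead of on fixed buckets.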
--- a/meetus/ocr_processor.py
+++ b/meetus/ocr_processor.py
@@ -53,20 +53,25 @@ class OCRProcessor:
        else:
            raise ValueError(f"Unknown OCR engine: {self.engine}")

-    def extract_text(self, image_path: str) -> str:
+    def extract_text(self, image_path: str, preserve_layout: bool = True) -> str:
        """
        Extract text from a single image.

        Args:
            image_path: Path to image file
+            preserve_layout: Try to preserve whitespace and layout

        Returns:
            Extracted text
        """
        if self.engine == "tesseract":
            from PIL import Image
+            import pytesseract
            image = Image.open(image_path)
-            text = self._ocr_engine.image_to_string(image)
+
+            # Use PSM 6 (uniform block of text) to preserve layout better
+            config = '--psm 6' if preserve_layout else ''
+            text = pytesseract.image_to_string(image, config=config)

        elif self.engine == "easyocr":
            result = self._ocr_engine.readtext(image_path, detail=0)
@@ -81,12 +86,31 @@ class OCRProcessor:

        return self._clean_text(text)

-    def _clean_text(self, text: str) -> str:
-        """Clean up OCR output."""
-        # Remove excessive whitespace
-        text = re.sub(r'\n\s*\n', '\n', text)
-        text = re.sub(r' +', ' ', text)
-        return text.strip()
+    def _clean_text(self, text: str, preserve_indentation: bool = True) -> str:
+        """
+        Clean up OCR output.
+
+        Args:
+            text: Raw OCR text
+            preserve_indentation: Keep leading whitespace on lines
+
+        Returns:
+            Cleaned text
+        """
+        if preserve_indentation:
+            # Remove excessive blank lines but preserve indentation
+            lines = text.split('\n')
+            cleaned_lines = []
+            for line in lines:
+                # Keep non-blank lines, plus at most one blank line after content
+                if line.strip() or (cleaned_lines and cleaned_lines[-1].strip()):
+                    cleaned_lines.append(line)
+            return '\n'.join(cleaned_lines).strip()
+        else:
+            # Original aggressive cleaning
+            text = re.sub(r'\n\s*\n', '\n', text)
+            text = re.sub(r' +', ' ', text)
+            return text.strip()

    def process_frames(
        self,
@@ -108,18 +132,24 @@ class OCRProcessor:
        results = []
        prev_text = ""

-        for frame_path, timestamp in frames_info:
-            logger.debug(f"Processing frame at {timestamp:.2f}s...")
+        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
+            logger.debug(f"Processing frame {idx}/{len(frames_info)} at {timestamp:.2f}s...")
            text = self.extract_text(frame_path)

            if not text:
+                logger.debug(f"No text extracted from frame at {timestamp:.2f}s")
                continue

+            # Debug: Show what was extracted
+            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
+            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
+
            # Deduplicate similar consecutive frames
-            if deduplicate:
+            if deduplicate and prev_text:
                similarity = self._text_similarity(prev_text, text)
+                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
                if similarity > similarity_threshold:
-                    logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
+                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue

            results.append({
--- a/meetus/output_manager.py
+++ b/meetus/output_manager.py
@@ -0,0 +1,155 @@
+"""
+Manage output directories and manifest files.
+Creates date-stamped, run-numbered folders (YYYYMMDD-NNN-videoname) for each video and tracks processing options.
+"""
+from pathlib import Path
+from datetime import datetime
+import json
+import logging
+from typing import Dict, Any, Optional
+
+logger = logging.getLogger(__name__)
+
+
+class OutputManager:
+    """Manage output directories and manifest files for video processing."""
+
+    def __init__(self, video_path: Path, base_output_dir: str = "output", use_cache: bool = True):
+        """
+        Initialize output manager.
+
+        Args:
+            video_path: Path to the video file being processed
+            base_output_dir: Base directory for all outputs
+            use_cache: Whether to use existing directories if found
+        """
+        self.video_path = video_path
+        self.base_output_dir = Path(base_output_dir)
+        self.use_cache = use_cache
+
+        # Find or create output directory
+        self.output_dir = self._get_or_create_output_dir()
+        self.frames_dir = self.output_dir / "frames"
+        self.frames_dir.mkdir(exist_ok=True)
+
+        logger.info(f"Output directory: {self.output_dir}")
+
+    def _get_or_create_output_dir(self) -> Path:
+        """
+        Get existing output directory or create a new one with incremental number.
+
+        Returns:
+            Path to output directory
+        """
+        video_name = self.video_path.stem
+
+        # Look for existing directories if caching is enabled
+        if self.use_cache and self.base_output_dir.exists():
+            existing_dirs = sorted([
+                d for d in self.base_output_dir.iterdir()
+                if d.is_dir() and d.name.endswith(f"-{video_name}")
+            ], reverse=True)  # Names start with YYYYMMDD-NNN, so newest sorts first
+
+            if existing_dirs:
+                logger.info(f"Found existing output: {existing_dirs[0].name}")
+                return existing_dirs[0]
+
+        # Create new directory with date + incremental number
+        date_str = datetime.now().strftime("%Y%m%d")
+
+        # Find existing runs for today
+        if self.base_output_dir.exists():
+            existing_today = [
+                d for d in self.base_output_dir.iterdir()
+                if d.is_dir() and d.name.startswith(date_str) and d.name.endswith(f"-{video_name}")
+            ]
+
+            # Extract run numbers and find max
+            run_numbers = []
+            for d in existing_today:
+                # Format: YYYYMMDD-NNN-videoname
+                parts = d.name.split('-')
+                if len(parts) >= 2 and parts[1].isdigit():
+                    run_numbers.append(int(parts[1]))
+
+            next_run = max(run_numbers) + 1 if run_numbers else 1
+        else:
+            next_run = 1
+
+        dir_name = f"{date_str}-{next_run:03d}-{video_name}"
+        output_dir = self.base_output_dir / dir_name
+        output_dir.mkdir(parents=True, exist_ok=True)
+        logger.info(f"Created new output directory: {dir_name}")
+
+        return output_dir
+
+    def get_path(self, filename: str) -> Path:
+        """Get full path for a file in the output directory."""
+        return self.output_dir / filename
+
+    def get_frames_path(self, filename: str) -> Path:
+        """Get full path for a file in the frames directory."""
+        return self.frames_dir / filename
+
+    def save_manifest(self, config: Dict[str, Any]):
+        """
+        Save processing configuration to manifest.json.
+
+        Args:
+            config: Dictionary of processing options
+        """
+        manifest_path = self.output_dir / "manifest.json"
+
+        manifest = {
+            "video": {
+                "name": self.video_path.name,
+                "path": str(self.video_path.absolute()),
+            },
+            "processed_at": datetime.now().isoformat(),
+            "configuration": config,
+            "outputs": {
+                "frames": str(self.frames_dir.relative_to(self.output_dir)),
+                "enhanced_transcript": f"{self.video_path.stem}_enhanced.txt",
+                "whisper_transcript": f"{self.video_path.stem}.json" if config.get("run_whisper") else None,
+                "analysis": f"{self.video_path.stem}_{'vision' if config.get('use_vision') else 'ocr'}.json"
+            }
+        }
+
+        with open(manifest_path, 'w', encoding='utf-8') as f:
+            json.dump(manifest, f, indent=2, ensure_ascii=False)
+
+        logger.info(f"Saved manifest: {manifest_path}")
+
+    def load_manifest(self) -> Optional[Dict[str, Any]]:
+        """
+        Load existing manifest if it exists.
+
+        Returns:
+            Manifest dictionary or None
+        """
+        manifest_path = self.output_dir / "manifest.json"
+
+        if manifest_path.exists():
+            with open(manifest_path, 'r', encoding='utf-8') as f:
+                return json.load(f)
+
+        return None
+
+    def list_outputs(self) -> Dict[str, Any]:
+        """
+        List all output files in the directory.
+
+        Returns:
+            Dictionary of output files and their status
+        """
+        video_name = self.video_path.stem
+
+        return {
+            "output_dir": str(self.output_dir),
+            "manifest": (self.output_dir / "manifest.json").exists(),
+            "enhanced_transcript": (self.output_dir / f"{video_name}_enhanced.txt").exists(),
+            "whisper_transcript": (self.output_dir / f"{video_name}.json").exists(),
+            "vision_analysis": (self.output_dir / f"{video_name}_vision.json").exists(),
+            "ocr_analysis": (self.output_dir / f"{video_name}_ocr.json").exists(),
+            "frames": len(list(self.frames_dir.glob("*.jpg"))) if self.frames_dir.exists() else 0
+        }
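Note on the `YYYYMMDD-NNN-videoname` scheme in `_get_or_create_output_dir`: the run-number logic can be exercised without touching the filesystem. The sketch below mirrors the prefix and suffix tests from the diff on a plain list of directory names; `next_run_dir` is a hypothetical helper written for illustration, not a function in the repository.

```python
from typing import Iterable


def next_run_dir(existing: Iterable[str], date_str: str, video_name: str) -> str:
    """Return the next directory name for today's runs of one video."""
    run_numbers = []
    for name in existing:
        # Same tests as the diff: date prefix and "-<video_name>" suffix
        if name.startswith(date_str) and name.endswith(f"-{video_name}"):
            parts = name.split("-")
            if len(parts) >= 2 and parts[1].isdigit():
                run_numbers.append(int(parts[1]))
    next_run = max(run_numbers) + 1 if run_numbers else 1
    return f"{date_str}-{next_run:03d}-{video_name}"


existing = [
    "20251204-001-standup",
    "20251204-002-standup",
    "20251203-007-standup",  # earlier date, ignored
    "20251204-001-demo",     # different video, ignored
]
third = next_run_dir(existing, "20251204", "standup")
first = next_run_dir([], "20251204", "standup")

# Caveat of the suffix test: a video named "team-standup" also ends with
# "-standup", so its runs are counted for "standup" as well.
shadowed = next_run_dir(["20251204-005-team-standup"], "20251204", "standup")
```

The same suffix match is used by the cache lookup, so under this scheme a run for `standup` could reuse an existing `team-standup` directory. Matching on the full `NNN-<video_name>` tail would be one way to avoid that.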
--- a/meetus/prompts/code.txt
+++ b/meetus/prompts/code.txt
@@ -0,0 +1,5 @@
+You are analyzing a code screenshot from a meeting recording.
+
+Provide a brief description of what's being shown (1-2 sentences about the context), then extract the visible code exactly as it appears, preserving all formatting, indentation, and structure.
+
+If there's no code visible, just describe what you see on screen.
--- a/meetus/prompts/console.txt
+++ b/meetus/prompts/console.txt
@@ -0,0 +1,5 @@
+You are analyzing console/terminal output from a meeting recording.
+
+Provide a brief description of what's happening (1-2 sentences), then extract the visible commands and output exactly as shown, preserving formatting.
+
+Include any error messages, warnings, or important status information.
--- a/meetus/prompts/dashboard.txt
+++ b/meetus/prompts/dashboard.txt
@@ -0,0 +1,5 @@
+You are analyzing a dashboard/monitoring panel from a meeting recording.
+
+Provide a brief description of what's being monitored (1-2 sentences), then list the key panels, metrics, and their current values. Include any alerts, warnings, or notable trends.
+
+Keep it concise and focused on the important information.
--- a/meetus/prompts/meeting.txt
+++ b/meetus/prompts/meeting.txt
@@ -0,0 +1,10 @@
+You are analyzing a screen capture from a meeting recording.
+
+Provide a brief description of what's being shown (1-2 sentences about the context). Then extract the key information:
+- Any visible text, titles, or headings
+- Code (preserve exact formatting if present)
+- Metrics, data points, or dashboard information
+- Terminal/console commands and output
+- Application or UI elements
+
+Be concise but capture all important details that help understand what was being discussed.
--- a/meetus/transcript_merger.py
+++ b/meetus/transcript_merger.py
@@ -6,6 +6,8 @@ from typing import List, Dict, Optional
 import json
 from pathlib import Path
 import logging
+import base64
+from io import BytesIO

 logger = logging.getLogger(__name__)

@@ -13,11 +15,18 @@ logger = logging.getLogger(__name__)
 class TranscriptMerger:
    """Merge audio transcripts with screen OCR text."""

-    def __init__(self):
-        """Initialize transcript merger."""
-        pass
+    def __init__(self, embed_images: bool = False, embed_quality: int = 80):
+        """
+        Initialize transcript merger.

-    def load_whisper_transcript(self, transcript_path: str) -> List[Dict]:
+        Args:
+            embed_images: Whether to embed frame images as base64
+            embed_quality: JPEG quality for embedded images (0-100)
+        """
+        self.embed_images = embed_images
+        self.embed_quality = embed_quality
+
+    def load_whisper_transcript(self, transcript_path: str, group_interval: Optional[int] = None) -> List[Dict]:
        """
        Load Whisper transcript from file.

@@ -25,6 +34,7 @@ class TranscriptMerger:

        Args:
            transcript_path: Path to transcript file
+            group_interval: If specified, group audio segments into intervals (in seconds)

        Returns:
            List of dicts with 'timestamp' (optional) and 'text'
@@ -35,28 +45,39 @@ class TranscriptMerger:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)

-            # Handle different Whisper output formats
+            # Handle different Whisper/WhisperX output formats
+            segments = []
            if isinstance(data, dict) and 'segments' in data:
-                # Standard Whisper JSON format
-                return [
+                # Standard Whisper/WhisperX JSON format
+                segments = [
                    {
                        'timestamp': seg.get('start', 0),
                        'text': seg['text'].strip(),
+                        'speaker': seg.get('speaker'),  # WhisperX diarization
                        'type': 'audio'
                    }
                    for seg in data['segments']
                ]
            elif isinstance(data, list):
                # List of segments
-                return [
+                segments = [
                    {
                        'timestamp': seg.get('start', seg.get('timestamp', 0)),
                        'text': seg['text'].strip(),
+                        'speaker': seg.get('speaker'),  # WhisperX diarization
                        'type': 'audio'
                    }
                    for seg in data
                ]

+            # Group by interval if requested, but skip if we have speaker diarization
+            # (merge_transcripts will group by speaker instead)
+            has_speakers = any(seg.get('speaker') for seg in segments)
+            if group_interval and segments and not has_speakers:
+                segments = self.group_audio_by_intervals(segments, group_interval)
+
+            return segments
+
        else:
            # Plain text file - no timestamps
            with open(path, 'r', encoding='utf-8') as f:
@@ -68,6 +89,76 @@ class TranscriptMerger:
                'type': 'audio'
            }]

+    def group_audio_by_intervals(self, segments: List[Dict], interval_seconds: int = 30) -> List[Dict]:
+        """
+        Group audio segments into regular time intervals.
+
+        Instead of one entry per transcript segment, this creates fixed intervals
+        (e.g., every 30 seconds) and concatenates all text that starts within each one.
+
+        Args:
+            segments: List of audio segments with timestamps
+            interval_seconds: Duration of each interval in seconds
+
+        Returns:
+            List of grouped segments with interval timestamps
+        """
+        if not segments:
+            return []
+
+        # Find the max timestamp to determine how many intervals we need
+        max_timestamp = max(seg['timestamp'] for seg in segments)
+        num_intervals = int(max_timestamp / interval_seconds) + 1
+
+        # Create interval buckets
+        intervals = []
+        for i in range(num_intervals):
+            interval_start = i * interval_seconds
+            interval_end = (i + 1) * interval_seconds
+
+            # Collect all text in this interval
+            texts = []
+            for seg in segments:
+                if interval_start <= seg['timestamp'] < interval_end:
+                    texts.append(seg['text'])
+
+            # Only create interval if there's text
+            if texts:
+                intervals.append({
+                    'timestamp': interval_start,
+                    'text': ' '.join(texts),
+                    'type': 'audio'
+                })
+
+        logger.info(f"Grouped {len(segments)} segments into {len(intervals)} intervals of {interval_seconds}s")
+        return intervals
+
+    def _encode_image_base64(self, image_path: str) -> tuple[str, int]:
+        """
+        Encode image as base64 (image already at target quality/size).
+
+        Args:
+            image_path: Path to image file
+
+        Returns:
+            Tuple of (base64_string, size_in_bytes)
+        """
+        try:
+            # Read file directly (already at target quality/resolution)
+            with open(image_path, 'rb') as f:
+                img_bytes = f.read()
+
+            # Encode to base64
+            b64_string = base64.b64encode(img_bytes).decode('utf-8')
+
+            logger.debug(f"Encoded {Path(image_path).name}: {len(img_bytes)} bytes")
+
+            return b64_string, len(img_bytes)
+
+        except Exception as e:
+            logger.error(f"Failed to encode image {image_path}: {e}")
+            return "", 0
+
    def merge_transcripts(
        self,
        audio_segments: List[Dict],
@@ -75,13 +166,14 @@ class TranscriptMerger:
    ) -> List[Dict]:
        """
        Merge audio and screen transcripts by timestamp.
+        Groups consecutive audio from the same speaker until a screen frame interrupts.

        Args:
            audio_segments: List of audio transcript segments
            screen_segments: List of screen OCR segments

        Returns:
-            Merged list sorted by timestamp
+            Merged list sorted by timestamp, with audio grouped by speaker
        """
        # Mark segment types
        for seg in audio_segments:
@@ -93,7 +185,46 @@ class TranscriptMerger:
        all_segments = audio_segments + screen_segments
        all_segments.sort(key=lambda x: x['timestamp'])

-        return all_segments
+        # Group consecutive audio segments by speaker (screen frames break groups)
+        grouped = []
+        current_group = None
+
+        for seg in all_segments:
+            if seg['type'] == 'screen':
+                # Screen frame: flush current group and add frame
+                if current_group:
+                    grouped.append(current_group)
+                    current_group = None
+                grouped.append(seg)
+            else:
+                # Audio segment
+                speaker = seg.get('speaker')
+                if current_group is None:
+                    # Start new group
+                    current_group = {
+                        'timestamp': seg['timestamp'],
+                        'text': seg['text'],
+                        'speaker': speaker,
+                        'type': 'audio'
+                    }
+                elif speaker == current_group.get('speaker'):
+                    # Same speaker, append text
+                    current_group['text'] += ' ' + seg['text']
+                else:
+                    # Speaker changed, flush and start new group
+                    grouped.append(current_group)
+                    current_group = {
+                        'timestamp': seg['timestamp'],
+                        'text': seg['text'],
+                        'speaker': speaker,
+                        'type': 'audio'
+                    }
+
+        # Don't forget last group
+        if current_group:
+            grouped.append(current_group)
+
+        return grouped

    def format_for_claude(
        self,
@@ -120,7 +251,7 @@ class TranscriptMerger:
        lines = []
        lines.append("=" * 80)
        lines.append("ENHANCED MEETING TRANSCRIPT")
-        lines.append("Audio transcript + Screen content")
+        lines.append("Audio transcript + Screen frames")
        lines.append("=" * 80)
        lines.append("")

@@ -128,15 +259,27 @@ class TranscriptMerger:
            timestamp = self._format_timestamp(seg['timestamp'])

            if seg['type'] == 'audio':
-                lines.append(f"[{timestamp}] SPEAKER:")
+                speaker = seg.get('speaker') or 'SPEAKER'  # key may be present but None
+                lines.append(f"[{timestamp}] {speaker}:")
                lines.append(f"  {seg['text']}")
                lines.append("")

            else:  # screen
                lines.append(f"[{timestamp}] SCREEN CONTENT:")
-                # Indent screen text for visibility
-                screen_text = seg['text'].replace('\n', '\n  | ')
-                lines.append(f"  | {screen_text}")
+
+                # Show frame path if available
+                if 'frame_path' in seg:
+                    # Get just the filename relative to the enhanced transcript
+                    frame_path = Path(seg['frame_path'])
+                    relative_path = f"frames/{frame_path.name}"
+                    lines.append(f"  Frame: {relative_path}")
+
+                # Include text content if available (fallback or additional context)
+                if 'text' in seg and seg['text'].strip():
+                    screen_text = seg['text'].replace('\n', '\n  | ')
+                    lines.append("  TEXT:")
+                    lines.append(f"  | {screen_text}")
+
                lines.append("")

        return "\n".join(lines)
@@ -147,7 +290,10 @@ class TranscriptMerger:

        for seg in segments:
            timestamp = self._format_timestamp(seg['timestamp'])
-            prefix = "SPEAKER" if seg['type'] == 'audio' else "SCREEN"
+            if seg['type'] == 'audio':
+                prefix = seg.get('speaker') or 'SPEAKER'  # key may be present but None
+            else:
+                prefix = "SCREEN"
            text = seg['text'].replace('\n', ' ')[:200]  # Truncate long screen text
            lines.append(f"[{timestamp}] {prefix}: {text}")

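Note on `merge_transcripts`: grouping runs over the timestamp-sorted list, a screen frame always closes the current audio group, and every audio group carries a `speaker` key that is `None` when no diarization ran. The sketch below is a condensed standalone version of that loop, assuming segments shaped like the ones in the diff; `group_by_speaker` and `speaker_label` are illustrative names only.

```python
from typing import Dict, List, Optional


def group_by_speaker(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive audio segments from one speaker; screen frames break groups."""
    grouped: List[Dict] = []
    current: Optional[Dict] = None
    for seg in sorted(segments, key=lambda s: s["timestamp"]):
        if seg["type"] == "screen":
            if current:
                grouped.append(current)
                current = None
            grouped.append(seg)
        elif current is not None and seg.get("speaker") == current["speaker"]:
            current["text"] += " " + seg["text"]
        else:
            if current:
                grouped.append(current)
            current = {"timestamp": seg["timestamp"], "text": seg["text"],
                       "speaker": seg.get("speaker"), "type": "audio"}
    if current:
        grouped.append(current)
    return grouped


def speaker_label(seg: Dict) -> str:
    """dict.get's default only applies to a missing key, so handle None explicitly."""
    return seg.get("speaker") or "SPEAKER"


segments = [
    {"timestamp": 0.0, "text": "Hi", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 2.0, "text": "all", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 3.0, "text": "", "type": "screen", "frame_path": "frames/f1.jpg"},
    {"timestamp": 4.0, "text": "look here", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 6.0, "text": "ok", "speaker": None, "type": "audio"},
]
grouped = group_by_speaker(segments)
labels = [speaker_label(g) for g in grouped if g["type"] == "audio"]
```

Here the first two segments merge into one group, the frame at 3.0s splits the same speaker's speech into two groups, and the undiarized segment gets the generic label.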
--- a/meetus/vision_processor.py
+++ b/meetus/vision_processor.py
@@ -6,6 +6,7 @@ from typing import List, Tuple, Dict, Optional
 from pathlib import Path
 import logging
 from difflib import SequenceMatcher
+import os

 logger = logging.getLogger(__name__)

@@ -13,15 +14,24 @@ logger = logging.getLogger(__name__)
 class VisionProcessor:
    """Process frames using local vision models via Ollama."""

-    def __init__(self, model: str = "llava:13b"):
+    def __init__(self, model: str = "llava:13b", prompts_dir: Optional[str] = None):
        """
        Initialize vision processor.

        Args:
            model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
+            prompts_dir: Directory containing prompt files (default: meetus/prompts/)
        """
        self.model = model
        self._client = None
+
+        # Set prompts directory
+        if prompts_dir:
+            self.prompts_dir = Path(prompts_dir)
+        else:
+            # Default to meetus/prompts/ relative to this file
+            self.prompts_dir = Path(__file__).parent / "prompts"
+
        self._init_client()

    def _init_client(self):
@@ -53,61 +63,44 @@ class VisionProcessor:
                "Also install Ollama: https://ollama.ai/download"
            )

-    def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
+    def _load_prompt(self, context: str) -> str:
+        """
+        Load prompt from file.
+
+        Args:
+            context: Context name (meeting, dashboard, code, console)
+
+        Returns:
+            Prompt text
+        """
+        prompt_file = self.prompts_dir / f"{context}.txt"
+
+        if prompt_file.exists():
+            with open(prompt_file, 'r', encoding='utf-8') as f:
+                return f.read().strip()
+        else:
+            # Fallback to default prompt
+            logger.warning(f"Prompt file not found: {prompt_file}, using default")
+            return "Analyze this image and describe what you see in detail."
+
+    def analyze_frame(self, image_path: str, context: str = "meeting", audio_context: str = "") -> str:
        """
        Analyze a single frame using local vision model.

        Args:
            image_path: Path to image file
            context: Context hint for analysis (meeting, dashboard, code, console)
+            audio_context: Optional audio transcript around this timestamp for context

        Returns:
            Analyzed content description
        """
-        # Context-specific prompts
-        prompts = {
-            "meeting": """Analyze this screen capture from a meeting recording. Extract:
-1. Any visible text (titles, labels, headings)
-2. Key metrics, numbers, or data points shown
-3. Dashboard panels or visualizations (describe what they show)
-4. Code snippets (preserve formatting and context)
-5. Console/terminal output (commands and results)
-6. Application names or UI elements
+        # Load prompt from file
+        prompt = self._load_prompt(context)

-Focus on information that would help someone understand what was being discussed.
-Be concise but include all important details. If there's code, preserve it exactly.""",
-
-            "dashboard": """Analyze this dashboard/monitoring panel. Extract:
-1. Panel titles and metrics names
-2. Current values and units
-3. Trends (up/down/stable)
-4. Alerts or warnings
-5. Time ranges shown
-6. Any anomalies or notable patterns
-
-Format as structured data.""",
-
-            "code": """Analyze this code screenshot. Extract:
-1. Programming language
-2. File name or path (if visible)
-3. Code content (preserve exact formatting)
-4. Comments
-5. Function/class names
-6. Any error messages or warnings
-
-Preserve code exactly as shown.""",
-
-            "console": """Analyze this console/terminal output. Extract:
-1. Commands executed
-2. Output/results
-3. Error messages
-4. Warnings or status messages
-5. File paths or URLs
-
-Preserve formatting and structure."""
-        }
-
-        prompt = prompts.get(context, prompts["meeting"])
+        # Add audio context if available
+        if audio_context:
+            prompt = f"Audio context (what's being discussed around this time):\n{audio_context}\n\n{prompt}"

        try:
            # Use Ollama's chat API with vision
@@ -135,7 +128,8 @@ Preserve formatting and structure."""
        frames_info: List[Tuple[str, float]],
        context: str = "meeting",
        deduplicate: bool = True,
-        similarity_threshold: float = 0.85
+        similarity_threshold: float = 0.85,
+        audio_segments: Optional[List[Dict]] = None
    ) -> List[Dict]:
        """
        Process multiple frames with vision analysis.
@@ -158,17 +152,25 @@ Preserve formatting and structure."""
        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
            logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")

-            text = self.analyze_frame(frame_path, context)
+            # Get audio context around this timestamp (±30 seconds)
+            audio_context = self._get_audio_context(timestamp, audio_segments, window=30)
+
+            text = self.analyze_frame(frame_path, context, audio_context)

            if not text:
                logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
                continue

+            # Debug: Show what was extracted
+            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
+            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
+
            # Deduplicate similar consecutive frames
-            if deduplicate:
+            if deduplicate and prev_text:
                similarity = self._text_similarity(prev_text, text)
+                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
                if similarity > similarity_threshold:
-                    logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
+                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue

            results.append({
@@ -182,6 +184,29 @@ Preserve formatting and structure."""
        logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
        return results

+    def _get_audio_context(self, timestamp: float, audio_segments: Optional[List[Dict]], window: int = 30) -> str:
+        """
+        Get audio transcript around a given timestamp.
+
+        Args:
+            timestamp: Target timestamp in seconds
+            audio_segments: List of audio segments with 'timestamp' and 'text' keys
+            window: Time window in seconds (±window around timestamp)
+
+        Returns:
+            Concatenated audio text from the time window
+        """
+        if not audio_segments:
+            return ""
+
+        relevant = [seg for seg in audio_segments
+                    if abs(seg.get('timestamp', 0) - timestamp) <= window]
+
+        if not relevant:
+            return ""
+
+        return " ".join([seg['text'] for seg in relevant])
+
    def _text_similarity(self, text1: str, text2: str) -> float:
        """
        Calculate similarity between two texts.
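Note on `_get_audio_context`: the window is symmetric and inclusive, and it compares only each segment's start timestamp, so a long segment that began more than `window` seconds earlier is left out even if it is still being spoken when the frame appears. A minimal standalone sketch of the same filter, with the hypothetical name `audio_context`:

```python
from typing import Dict, List, Optional


def audio_context(timestamp: float, segments: Optional[List[Dict]], window: float = 30.0) -> str:
    """Join the text of segments whose start time is within ±window seconds."""
    if not segments:
        return ""
    relevant = [s for s in segments
                if abs(s.get("timestamp", 0) - timestamp) <= window]
    return " ".join(s["text"] for s in relevant)


segments = [
    {"timestamp": 10.0, "text": "intro"},
    {"timestamp": 50.0, "text": "metrics review"},
    {"timestamp": 95.0, "text": "wrap up"},
]
near_60 = audio_context(60.0, segments)  # only the 50s segment is within 30s
near_40 = audio_context(40.0, segments)  # 10s is exactly 30s away and is included
no_audio = audio_context(60.0, None)
```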
--- a/meetus/workflow.py
+++ b/meetus/workflow.py
@@ -0,0 +1,523 @@
+"""
+Orchestrate the video processing workflow.
+Coordinates frame extraction, analysis, and transcript merging.
+"""
+from pathlib import Path
+import logging
+import os
+import subprocess
+import shutil
+from typing import Dict, Any, Optional
+
+from .output_manager import OutputManager
+from .cache_manager import CacheManager
+from .frame_extractor import FrameExtractor
+from .ocr_processor import OCRProcessor
+from .vision_processor import VisionProcessor
+from .transcript_merger import TranscriptMerger
+
+logger = logging.getLogger(__name__)
+
+
+class WorkflowConfig:
+    """Configuration for the processing workflow."""
+
+    def __init__(self, **kwargs):
+        """Initialize configuration from keyword arguments."""
+        # Video and paths
+        self.video_path = Path(kwargs['video'])
+        self.transcript_path = kwargs.get('transcript')
+        self.output_dir = kwargs.get('output_dir', 'output')
+        self.custom_output = kwargs.get('output')
+
+        # Whisper options
+        self.run_whisper = kwargs.get('run_whisper', False)
+        self.whisper_model = kwargs.get('whisper_model', 'medium')
+        self.diarize = kwargs.get('diarize', False)
+
+        # Frame extraction
+        self.scene_detection = kwargs.get('scene_detection', False)
+        self.scene_threshold = kwargs.get('scene_threshold', 15.0)
+        self.interval = kwargs.get('interval', 5)
+
+        # Analysis options
+        self.use_vision = kwargs.get('use_vision', False)
+        self.use_hybrid = kwargs.get('use_hybrid', False)
+        self.hybrid_llm_cleanup = kwargs.get('hybrid_llm_cleanup', False)
+        self.hybrid_llm_model = kwargs.get('hybrid_llm_model', 'llama3.2:3b')
+        self.vision_model = kwargs.get('vision_model', 'llava:13b')
+        self.vision_context = kwargs.get('vision_context', 'meeting')
+        self.ocr_engine = kwargs.get('ocr_engine', 'tesseract')
+
+        # Validation: can't use both vision and hybrid
+        if self.use_vision and self.use_hybrid:
+            raise ValueError("Cannot use both --use-vision and --use-hybrid. Choose one.")
+
+        # Validation: LLM cleanup requires hybrid mode
+        if self.hybrid_llm_cleanup and not self.use_hybrid:
+            raise ValueError("--hybrid-llm-cleanup requires --use-hybrid")
+
+        # Processing options
+        self.no_deduplicate = kwargs.get('no_deduplicate', False)
+        self.no_cache = kwargs.get('no_cache', False)
+        self.skip_cache_frames = kwargs.get('skip_cache_frames', False)
+        self.skip_cache_whisper = kwargs.get('skip_cache_whisper', False)
+        self.skip_cache_analysis = kwargs.get('skip_cache_analysis', False)
+        self.extract_only = kwargs.get('extract_only', False)
+        self.format = kwargs.get('format', 'detailed')
+        self.embed_images = kwargs.get('embed_images', False)
+        self.embed_quality = kwargs.get('embed_quality', 80)
+
+    def to_dict(self) -> Dict[str, Any]:
+        """Convert config to dictionary for manifest."""
+        return {
+            "whisper": {
+                "enabled": self.run_whisper,
+                "model": self.whisper_model
+            },
+            "frame_extraction": {
+                "method": "scene_detection" if self.scene_detection else "interval",
+                "interval_seconds": self.interval if not self.scene_detection else None,
+                "scene_threshold": self.scene_threshold if self.scene_detection else None
+            },
+            "analysis": {
+                "method": "vision" if self.use_vision else ("hybrid" if self.use_hybrid else "ocr"),
+                "vision_model": self.vision_model if self.use_vision else None,
+                "vision_context": self.vision_context if self.use_vision else None,
+                "ocr_engine": self.ocr_engine if (not self.use_vision) else None,
+                "deduplication": not self.no_deduplicate
+            },
+            "output_format": self.format
+        }
+
+
+class ProcessingWorkflow:
+    """Orchestrate the complete video processing workflow."""
+
+    def __init__(self, config: WorkflowConfig):
+        """
+        Initialize workflow.
+
+        Args:
+            config: Workflow configuration
+        """
+        self.config = config
+        self.output_mgr = OutputManager(
+            config.video_path,
+            config.output_dir,
+            use_cache=not config.no_cache
+        )
+        self.cache_mgr = CacheManager(
+            self.output_mgr.output_dir,
+            self.output_mgr.frames_dir,
+            config.video_path.stem,
+            use_cache=not config.no_cache,
+            skip_cache_frames=config.skip_cache_frames,
+            skip_cache_whisper=config.skip_cache_whisper,
+            skip_cache_analysis=config.skip_cache_analysis
+        )
+
+    def run(self) -> Dict[str, Any]:
+        """
+        Run the complete processing workflow.
+
+        Returns:
+            Dictionary with output paths and status
+        """
+        logger.info("=" * 80)
+        logger.info("MEETING PROCESSOR")
+        logger.info("=" * 80)
+        logger.info(f"Video: {self.config.video_path.name}")
+
+        # Determine analysis method
+        if self.config.use_vision:
+            analysis_method = f"Vision Model ({self.config.vision_model})"
+            logger.info(f"Analysis: {analysis_method}")
+            logger.info(f"Context: {self.config.vision_context}")
+        elif self.config.use_hybrid:
+            analysis_method = f"Hybrid (OpenCV + {self.config.ocr_engine})"
+            logger.info(f"Analysis: {analysis_method}")
+        else:
+            analysis_method = f"OCR ({self.config.ocr_engine})"
+            logger.info(f"Analysis: {analysis_method}")
+
+        logger.info(f"Frame extraction: {'Scene detection' if self.config.scene_detection else f'Every {self.config.interval}s'}")
+        logger.info(f"Caching: {'Disabled' if self.config.no_cache else 'Enabled'}")
+        logger.info("=" * 80)
+
+        # Step 0: Whisper transcription
+        transcript_path = self._run_whisper()
+
+        # Step 1: Extract frames
+        frames_info = self._extract_frames()
+
+        if not frames_info:
+            logger.error("No frames extracted")
+            raise RuntimeError("Frame extraction failed")
+
+        # Step 2: Analyze frames
+        screen_segments = self._analyze_frames(frames_info)
+
+        if self.config.extract_only:
+            logger.info("Done! (extract-only mode)")
+            return self._build_result(transcript_path, screen_segments)
+
+        # Step 3: Merge with transcript
+        enhanced_transcript = self._merge_transcripts(transcript_path, screen_segments)
+
+        # Save manifest
+        self.output_mgr.save_manifest(self.config.to_dict())
+
+        # Build final result
+        return self._build_result(transcript_path, screen_segments, enhanced_transcript)
+
+    def _run_whisper(self) -> Optional[str]:
+        """Run Whisper transcription if requested, or use cached/provided transcript."""
+        # First, check cache (regardless of run_whisper flag)
+        cached = self.cache_mgr.get_whisper_cache()
+        if cached:
+            return str(cached)
+
+        # If no cache and not running whisper/diarize, use provided transcript path (if any)
+        if not self.config.run_whisper and not self.config.diarize:
+            return self.config.transcript_path
+
+        logger.info("=" * 80)
+        logger.info("STEP 0: Running Whisper Transcription")
+        logger.info("=" * 80)
+
+        # Determine which transcription tool to use
+        use_diarize = self.config.diarize
+
+        if use_diarize:
+            if not shutil.which("whisperx"):
+                logger.error("WhisperX is not installed. Install it with: pip install whisperx")
+                raise RuntimeError("WhisperX not installed (required for --diarize)")
+            transcribe_cmd = "whisperx"
+        else:
+            if not shutil.which("whisper"):
+                logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
+                raise RuntimeError("Whisper not installed")
+            transcribe_cmd = "whisper"
+
+        # Unload Ollama model to free GPU memory for Whisper (if using vision)
+        if self.config.use_vision:
+            logger.info("Freeing GPU memory for Whisper...")
+            try:
+                subprocess.run(["ollama", "stop", self.config.vision_model],
+                             capture_output=True, check=True)
+                logger.info("✓ Ollama model unloaded")
+            except Exception as e:
+                logger.warning(f"Could not unload Ollama model: {e}")
+
+        if use_diarize:
+            logger.info(f"Running WhisperX transcription with diarization (model: {self.config.whisper_model})...")
+        else:
+            logger.info(f"Running Whisper transcription (model: {self.config.whisper_model})...")
+        logger.info("This may take a few minutes depending on video length...")
+
+        # Build command
+        cmd = [
+            transcribe_cmd,
+            str(self.config.video_path),
+            "--model", self.config.whisper_model,
+            "--output_format", "json",
+            "--output_dir", str(self.output_mgr.output_dir),
+        ]
+        if use_diarize:
+            cmd.append("--diarize")
+
+        try:
+            # Set up environment with cuDNN library path for whisperx
+            env = os.environ.copy()
+            if use_diarize:
+                import site
+                site_packages = site.getsitepackages()[0]
+                cudnn_path = Path(site_packages) / "nvidia" / "cudnn" / "lib"
+                if cudnn_path.exists():
+                    env["LD_LIBRARY_PATH"] = os.pathsep.join(filter(None, [str(cudnn_path), env.get("LD_LIBRARY_PATH", "")]))
+
+            subprocess.run(cmd, check=True, capture_output=True, text=True, env=env)
+
+            transcript_path = self.output_mgr.get_path(f"{self.config.video_path.stem}.json")
+
+            if transcript_path.exists():
+                logger.info(f"✓ Whisper transcription completed: {transcript_path.name}")
+
+                # Debug: Show transcript preview
+                try:
+                    import json
+                    with open(transcript_path, 'r', encoding='utf-8') as f:
+                        whisper_data = json.load(f)
+
+                    if 'segments' in whisper_data:
+                        logger.debug(f"Whisper produced {len(whisper_data['segments'])} segments")
+                        if whisper_data['segments']:
+                            logger.debug(f"First segment: {whisper_data['segments'][0]}")
+                            logger.debug(f"Last segment: {whisper_data['segments'][-1]}")
+
+                    if 'text' in whisper_data:
+                        text_preview = whisper_data['text'][:200] + ("..." if len(whisper_data['text']) > 200 else "")
+                        logger.debug(f"Transcript preview: {text_preview}")
+                except Exception as e:
+                    logger.debug(f"Could not parse whisper output for debug: {e}")
+
+                logger.info("")
+                return str(transcript_path)
+            else:
+                logger.error("Whisper completed but transcript file not found")
+                raise RuntimeError("Whisper output missing")
+
+        except subprocess.CalledProcessError as e:
+            logger.error(f"Whisper failed: {e.stderr}")
+            raise
+
+    def _extract_frames(self):
+        """Extract frames from video."""
+        logger.info("Step 1: Extracting frames from video...")
+
+        # Check cache
+        cached_frames = self.cache_mgr.get_frames_cache()
+        if cached_frames:
+            return cached_frames
+
+        # Clean up old frames if regenerating
+        if self.config.skip_cache_frames and self.output_mgr.frames_dir.exists():
+            old_frames = list(self.output_mgr.frames_dir.glob("*.jpg"))
+            if old_frames:
+                logger.info(f"Cleaning up {len(old_frames)} old frames...")
+                for old_frame in old_frames:
+                    old_frame.unlink()
+                logger.info("✓ Cleanup complete")
+
+        # Extract frames (use embed quality so saved files match embedded images)
+        if self.config.scene_detection:
+            logger.info(f"Extracting frames with scene detection (threshold={self.config.scene_threshold})...")
+        else:
+            logger.info(f"Extracting frames every {self.config.interval}s...")
+
+        extractor = FrameExtractor(
+            str(self.config.video_path),
+            str(self.output_mgr.frames_dir),
+            quality=self.config.embed_quality
+        )
+
+        if self.config.scene_detection:
+            frames_info = extractor.extract_scene_changes(threshold=self.config.scene_threshold)
+        else:
+            frames_info = extractor.extract_by_interval(self.config.interval)
+
+        logger.info(f"✓ Extracted {len(frames_info)} frames")
+        return frames_info
+
+    def _analyze_frames(self, frames_info):
+        """Analyze frames with vision, hybrid, or OCR."""
+        # Skip analysis if just embedding images
+        if self.config.embed_images:
+            logger.info("Step 2: Skipping analysis (images will be embedded)")
+            # Create minimal segments with just frame paths and timestamps
+            screen_segments = [
+                {
+                    'timestamp': timestamp,
+                    'text': '',  # No text extraction needed
+                    'frame_path': frame_path
+                }
+                for frame_path, timestamp in frames_info
+            ]
+            logger.info(f"✓ Prepared {len(screen_segments)} frames for embedding")
+            return screen_segments
+
+        # Determine analysis type
+        if self.config.use_vision:
+            analysis_type = 'vision'
+        elif self.config.use_hybrid:
+            analysis_type = 'hybrid'
+        else:
+            analysis_type = 'ocr'
+
+        # Check cache
+        cached_analysis = self.cache_mgr.get_analysis_cache(analysis_type)
+        if cached_analysis:
+            return cached_analysis
+
+        if self.config.use_vision:
+            return self._run_vision_analysis(frames_info)
+        elif self.config.use_hybrid:
+            return self._run_hybrid_analysis(frames_info)
+        else:
+            return self._run_ocr_analysis(frames_info)
+
+    def _run_vision_analysis(self, frames_info):
+        """Run vision analysis on frames."""
+        logger.info("Step 2: Running vision analysis on extracted frames...")
+        logger.info(f"Loading vision model {self.config.vision_model} to GPU...")
+
+        # Load audio segments for context if transcript exists
+        audio_segments = []
+        transcript_path = self.config.transcript_path or self._get_cached_transcript()
+
+        if transcript_path:
+            transcript_file = Path(transcript_path)
+            if transcript_file.exists():
+                logger.info("Loading audio transcript for context...")
+                merger = TranscriptMerger()
+                audio_segments = merger.load_whisper_transcript(str(transcript_file))
+                logger.info(f"✓ Loaded {len(audio_segments)} audio segments for context")
+
+        try:
+            vision = VisionProcessor(model=self.config.vision_model)
+            screen_segments = vision.process_frames(
+                frames_info,
+                context=self.config.vision_context,
+                deduplicate=not self.config.no_deduplicate,
+                audio_segments=audio_segments
+            )
+            logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
+
+            # Debug: Show sample analysis results
+            if screen_segments:
+                logger.debug(f"First analysis result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
+                logger.debug(f"First analysis text preview: {screen_segments[0].get('text', '')[:200]}...")
+                if len(screen_segments) > 1:
+                    logger.debug(f"Last analysis result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
+
+            # Cache results
+            self.cache_mgr.save_analysis('vision', screen_segments)
+            return screen_segments
+
+        except ImportError as e:
+            logger.error(f"{e}")
+            raise
+
+    def _get_cached_transcript(self) -> Optional[str]:
+        """Get cached Whisper transcript if available."""
+        cached = self.cache_mgr.get_whisper_cache()
+        return str(cached) if cached else None
+
+    def _run_hybrid_analysis(self, frames_info):
+        """Run hybrid analysis on frames (OpenCV + OCR)."""
+        if self.config.hybrid_llm_cleanup:
+            logger.info("Step 2: Running hybrid analysis (OpenCV + OCR + LLM cleanup)...")
+        else:
+            logger.info("Step 2: Running hybrid analysis (OpenCV text detection + OCR)...")
+
+        try:
+            from .hybrid_processor import HybridProcessor
+
+            hybrid = HybridProcessor(
+                ocr_engine=self.config.ocr_engine,
+                use_llm_cleanup=self.config.hybrid_llm_cleanup,
+                llm_model=self.config.hybrid_llm_model
+            )
+            screen_segments = hybrid.process_frames(
+                frames_info,
+                deduplicate=not self.config.no_deduplicate
+            )
+            logger.info(f"✓ Processed {len(screen_segments)} frames with hybrid analysis")
+
+            # Debug: Show sample hybrid results
+            if screen_segments:
+                logger.debug(f"First hybrid result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
+                logger.debug(f"First hybrid text preview: {screen_segments[0].get('text', '')[:200]}...")
+                if len(screen_segments) > 1:
+                    logger.debug(f"Last hybrid result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
+
+            # Cache results
+            self.cache_mgr.save_analysis('hybrid', screen_segments)
+            return screen_segments
+
+        except ImportError as e:
+            logger.error(f"{e}")
+            raise
+
+    def _run_ocr_analysis(self, frames_info):
+        """Run OCR analysis on frames."""
+        logger.info("Step 2: Running OCR on extracted frames...")
+
+        try:
+            ocr = OCRProcessor(engine=self.config.ocr_engine)
+            screen_segments = ocr.process_frames(
+                frames_info,
+                deduplicate=not self.config.no_deduplicate
+            )
+            logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
+
+            # Debug: Show sample OCR results
+            if screen_segments:
+                logger.debug(f"First OCR result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
+                logger.debug(f"First OCR text preview: {screen_segments[0].get('text', '')[:200]}...")
+                if len(screen_segments) > 1:
+                    logger.debug(f"Last OCR result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
+
+            # Cache results
+            self.cache_mgr.save_analysis('ocr', screen_segments)
+            return screen_segments
+
+        except ImportError as e:
+            logger.error(f"{e}")
+            logger.error(f"To install {self.config.ocr_engine}:")
+            logger.error(f"  pip install {'pytesseract' if self.config.ocr_engine == 'tesseract' else self.config.ocr_engine}")
+            raise
+
+    def _merge_transcripts(self, transcript_path, screen_segments):
+        """Merge audio and screen transcripts."""
+        merger = TranscriptMerger(
+            embed_images=self.config.embed_images,
+            embed_quality=self.config.embed_quality
+        )
+
+        # Load audio transcript if available
+        audio_segments = []
+        if transcript_path:
+            logger.info("Step 3: Merging with Whisper transcript...")
+            transcript_file = Path(transcript_path)
+
+            if not transcript_file.exists():
+                logger.warning(f"Transcript not found: {transcript_path}")
+                logger.info("Proceeding with screen content only...")
+            else:
+                # Group audio into 30-second intervals for cleaner reference timestamps
+                audio_segments = merger.load_whisper_transcript(str(transcript_file), group_interval=30)
+                logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
+        else:
+            logger.info("No transcript provided, using screen content only...")
+
+        # Merge and format
+        merged = merger.merge_transcripts(audio_segments, screen_segments)
+        formatted = merger.format_for_claude(merged, format_style=self.config.format)
+
+        # Save output
+        if self.config.custom_output:
+            output_path = self.config.custom_output
+        else:
+            output_path = self.output_mgr.get_path(f"{self.config.video_path.stem}_enhanced.txt")
+
+        merger.save_transcript(formatted, str(output_path))
+
+        logger.info("=" * 80)
+        logger.info("✓ PROCESSING COMPLETE!")
+        logger.info("=" * 80)
+        logger.info(f"Output directory: {self.output_mgr.output_dir}")
+        logger.info(f"Enhanced transcript: {Path(output_path).name}")
+        logger.info("")
+
+        return output_path
+
+    def _build_result(self, transcript_path=None, screen_segments=None, enhanced_transcript=None):
+        """Build result dictionary."""
+        # Determine analysis filename
+        if self.config.use_vision:
+            analysis_type = 'vision'
+        elif self.config.use_hybrid:
+            analysis_type = 'hybrid'
+        else:
+            analysis_type = 'ocr'
+
+        return {
+            "output_dir": str(self.output_mgr.output_dir),
+            "transcript": transcript_path,
+            "analysis": f"{self.config.video_path.stem}_{analysis_type}.json",
+            "frames_count": len(screen_segments) if screen_segments else 0,
+            "enhanced_transcript": enhanced_transcript,
+            "manifest": str(self.output_mgr.get_path("manifest.json"))
+        }
--- a/process_meeting.py
+++ b/process_meeting.py
@@ -1,34 +1,19 @@
 #!/usr/bin/env python3
 """
 Process meeting recordings to extract audio + screen content.
-Combines Whisper transcripts with OCR from screen shares.
+Combines Whisper transcripts with vision analysis or OCR from screen shares.
 """
 import argparse
-from pathlib import Path
 import sys
-import json
 import logging
-import subprocess
-import shutil

-from meetus.frame_extractor import FrameExtractor
-from meetus.ocr_processor import OCRProcessor
-from meetus.vision_processor import VisionProcessor
-from meetus.transcript_merger import TranscriptMerger
-
-logger = logging.getLogger(__name__)
+from meetus.workflow import WorkflowConfig, ProcessingWorkflow


 def setup_logging(verbose: bool = False):
-    """
-    Configure logging for the application.
-
-    Args:
-        verbose: If True, set DEBUG level, otherwise INFO
-    """
+    """Configure logging for the application."""
    level = logging.DEBUG if verbose else logging.INFO

-    # Configure root logger
    logging.basicConfig(
        level=level,
        format='%(asctime)s - %(levelname)s - %(message)s',
@@ -41,158 +26,121 @@ def setup_logging(verbose: bool = False):
    logging.getLogger('paddleocr').setLevel(logging.WARNING)


-def run_whisper(video_path: Path, model: str = "base", output_dir: str = "output") -> Path:
-    """
-    Run Whisper transcription on video file.
-
-    Args:
-        video_path: Path to video file
-        model: Whisper model to use (tiny, base, small, medium, large)
-        output_dir: Directory to save output
-
-    Returns:
-        Path to generated JSON transcript
-    """
-    # Check if whisper is installed
-    if not shutil.which("whisper"):
-        logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
-        sys.exit(1)
-
-    logger.info(f"Running Whisper transcription (model: {model})...")
-    logger.info("This may take a few minutes depending on video length...")
-
-    # Run whisper command
-    cmd = [
-        "whisper",
-        str(video_path),
-        "--model", model,
-        "--output_format", "json",
-        "--output_dir", output_dir
-    ]
-
-    try:
-        result = subprocess.run(
-            cmd,
-            check=True,
-            capture_output=True,
-            text=True
-        )
-
-        # Whisper outputs to <output_dir>/<video_stem>.json
-        transcript_path = Path(output_dir) / f"{video_path.stem}.json"
-
-        if transcript_path.exists():
-            logger.info(f"✓ Whisper transcription completed: {transcript_path}")
-            return transcript_path
-        else:
-            logger.error("Whisper completed but transcript file not found")
-            sys.exit(1)
-
-    except subprocess.CalledProcessError as e:
-        logger.error(f"Whisper failed: {e.stderr}")
-        sys.exit(1)
-
-
 def main():
    parser = argparse.ArgumentParser(
        description="Extract screen content from meeting recordings and merge with transcripts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
-  # Run Whisper + vision analysis (recommended for code/dashboards)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+  # Reference frames for LLM analysis (recommended - transcript includes frame paths)
+  python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

-  # Use vision with specific context hint
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+  # Adjust frame extraction quality (lower = smaller files)
+  python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection

-  # Traditional OCR approach
-  python process_meeting.py samples/meeting.mkv --run-whisper
+  # Hybrid approach: OpenCV + OCR (extracts text from frames)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --scene-detection

-  # Re-run analysis using cached frames and transcript
-  python process_meeting.py samples/meeting.mkv --use-vision
+  # Hybrid + LLM cleanup (best for code formatting)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --hybrid-llm-cleanup --scene-detection

-  # Force reprocessing (ignore cache)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
-
-  # Use scene detection for fewer frames
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
+  # Iterate on scene threshold (reuse whisper transcript)
+  python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
        """
    )

+    # Required arguments
    parser.add_argument(
        'video',
        help='Path to video file'
    )

+    # Whisper options
    parser.add_argument(
        '--transcript', '-t',
        help='Path to Whisper transcript (JSON or TXT)',
        default=None
    )
-
    parser.add_argument(
        '--run-whisper',
        action='store_true',
        help='Run Whisper transcription before processing'
    )
-
    parser.add_argument(
        '--whisper-model',
        choices=['tiny', 'base', 'small', 'medium', 'large'],
-        help='Whisper model to use (default: base)',
-        default='base'
+        help='Whisper model to use (default: medium)',
+        default='medium'
+    )
+    parser.add_argument(
+        '--diarize',
+        action='store_true',
+        help='Use WhisperX with speaker diarization (requires whisperx and HuggingFace token)'
    )

+    # Output options
    parser.add_argument(
        '--output', '-o',
-        help='Output file for enhanced transcript (default: output/<video>_enhanced.txt)',
+        help='Output file for enhanced transcript (default: auto-generated in output directory)',
        default=None
    )
-
    parser.add_argument(
        '--output-dir',
-        help='Directory for output files (default: output/)',
+        help='Base directory for outputs (default: output/)',
        default='output'
    )

-    parser.add_argument(
-        '--frames-dir',
-        help='Directory to save extracted frames (default: frames/)',
-        default='frames'
-    )
-
+    # Frame extraction options
    parser.add_argument(
        '--interval',
        type=int,
        help='Extract frame every N seconds (default: 5)',
        default=5
    )
-
    parser.add_argument(
        '--scene-detection',
        action='store_true',
        help='Use scene detection instead of interval extraction'
    )
+    parser.add_argument(
+        '--scene-threshold',
+        type=float,
+        help='Scene detection threshold (0-100, lower=more sensitive, default: 15)',
+        default=15.0
+    )

+    # Analysis options
    parser.add_argument(
        '--ocr-engine',
        choices=['tesseract', 'easyocr', 'paddleocr'],
        help='OCR engine to use (default: tesseract)',
        default='tesseract'
    )
-
    parser.add_argument(
        '--use-vision',
        action='store_true',
        help='Use local vision model (Ollama) instead of OCR for better context understanding'
    )
-
+    parser.add_argument(
+        '--use-hybrid',
+        action='store_true',
+        help='Use hybrid approach: OpenCV text detection + OCR (more accurate than vision models)'
+    )
+    parser.add_argument(
+        '--hybrid-llm-cleanup',
+        action='store_true',
+        help='Use LLM to clean up OCR output and preserve code formatting (requires --use-hybrid)'
+    )
+    parser.add_argument(
+        '--hybrid-llm-model',
+        help='LLM model for cleanup (default: llama3.2:3b)',
+        default='llama3.2:3b'
+    )
    parser.add_argument(
        '--vision-model',
        help='Vision model to use with Ollama (default: llava:13b)',
        default='llava:13b'
    )
-
    parser.add_argument(
        '--vision-context',
        choices=['meeting', 'dashboard', 'code', 'console'],
@@ -200,31 +148,56 @@ Examples:
        default='meeting'
    )

+    # Processing options
    parser.add_argument(
        '--no-cache',
        action='store_true',
        help='Disable caching - reprocess everything even if outputs exist'
    )
-
+    parser.add_argument(
+        '--skip-cache-frames',
+        action='store_true',
+        help='Skip cached frames, re-extract from video (but keep whisper/analysis cache)'
+    )
+    parser.add_argument(
+        '--skip-cache-whisper',
+        action='store_true',
+        help='Skip cached whisper transcript, re-run transcription (but keep frames/analysis cache)'
+    )
+    parser.add_argument(
+        '--skip-cache-analysis',
+        action='store_true',
+        help='Skip cached analysis, re-run OCR/vision (but keep frames/whisper cache)'
+    )
    parser.add_argument(
        '--no-deduplicate',
        action='store_true',
        help='Disable text deduplication'
    )
-
    parser.add_argument(
        '--extract-only',
        action='store_true',
-        help='Only extract frames and OCR, skip transcript merging'
+        help='Only extract frames and analyze, skip transcript merging'
    )
-
    parser.add_argument(
        '--format',
        choices=['detailed', 'compact'],
        help='Output format style (default: detailed)',
        default='detailed'
    )
+    parser.add_argument(
+        '--embed-images',
+        action='store_true',
+        help='Skip OCR/vision analysis and reference frame files directly (faster, lets LLM analyze images)'
+    )
+    parser.add_argument(
+        '--embed-quality',
+        type=int,
+        help='JPEG quality for extracted frames (default: 80, lower = smaller files)',
+        default=80
+    )

+    # Logging
    parser.add_argument(
        '--verbose', '-v',
        action='store_true',
@@ -236,166 +209,38 @@ Examples:
    # Setup logging
    setup_logging(args.verbose)

-    # Validate video path
-    video_path = Path(args.video)
-    if not video_path.exists():
-        logger.error(f"Video file not found: {args.video}")
-        sys.exit(1)
+    try:
+        # Create workflow configuration
+        config = WorkflowConfig(**vars(args))

-    # Create output directory
-    output_dir = Path(args.output_dir)
-    output_dir.mkdir(parents=True, exist_ok=True)
+        # Run processing workflow
+        workflow = ProcessingWorkflow(config)
+        result = workflow.run()

-    # Set default output path
-    if args.output is None:
-        args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
+        # Print final summary
+        print("\n" + "=" * 80)
+        print("✓ SUCCESS!")
+        print("=" * 80)
+        print(f"Output directory: {result['output_dir']}")
+        if result.get('enhanced_transcript'):
+            print("Enhanced transcript ready for AI summarization!")
+        print("=" * 80)

-    # Define cache paths
-    whisper_cache = output_dir / f"{video_path.stem}.json"
-    analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
-    frames_cache_dir = Path(args.frames_dir)
+        return 0

-    # Check for cached Whisper transcript
-    if args.run_whisper:
-        if not args.no_cache and whisper_cache.exists():
-            logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
-            args.transcript = str(whisper_cache)
-        else:
-            logger.info("=" * 80)
-            logger.info("STEP 0: Running Whisper Transcription")
-            logger.info("=" * 80)
-            transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
-            args.transcript = str(transcript_path)
-            logger.info("")
-
-    logger.info("=" * 80)
-    logger.info("MEETING PROCESSOR")
-    logger.info("=" * 80)
-    logger.info(f"Video: {video_path.name}")
-    logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
-    if args.use_vision:
-        logger.info(f"Vision Model: {args.vision_model}")
-        logger.info(f"Context: {args.vision_context}")
-    logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
-    if args.transcript:
-        logger.info(f"Transcript: {args.transcript}")
-    logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
-    logger.info("=" * 80)
-
-    # Step 1: Extract frames (with caching)
-    logger.info("Step 1: Extracting frames from video...")
-
-    # Check if frames already exist
-    existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
-
-    if not args.no_cache and existing_frames and len(existing_frames) > 0:
-        logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
-        # Build frames_info from existing files
-        frames_info = []
-        for frame_path in sorted(existing_frames):
-            # Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
-            try:
-                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
-                timestamp = float(timestamp_str)
-            except:
-                timestamp = 0.0
-            frames_info.append((str(frame_path), timestamp))
-    else:
-        extractor = FrameExtractor(str(video_path), args.frames_dir)
-
-        if args.scene_detection:
-            frames_info = extractor.extract_scene_changes()
-        else:
-            frames_info = extractor.extract_by_interval(args.interval)
-
-        if not frames_info:
-            logger.error("No frames extracted")
-            sys.exit(1)
-
-        logger.info(f"✓ Extracted {len(frames_info)} frames")
-
-    # Step 2: Run analysis on frames (with caching)
-    if not args.no_cache and analysis_cache.exists():
-        logger.info(f"✓ Found cached analysis results: {analysis_cache}")
-        with open(analysis_cache, 'r', encoding='utf-8') as f:
-            screen_segments = json.load(f)
-        logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
-    else:
-        if args.use_vision:
-            # Use vision model
-            logger.info("Step 2: Running vision analysis on extracted frames...")
-            try:
-                vision = VisionProcessor(model=args.vision_model)
-                screen_segments = vision.process_frames(
-                    frames_info,
-                    context=args.vision_context,
-                    deduplicate=not args.no_deduplicate
-                )
-                logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
-
-            except ImportError as e:
-                logger.error(f"{e}")
-                sys.exit(1)
-        else:
-            # Use OCR
-            logger.info("Step 2: Running OCR on extracted frames...")
-            try:
-                ocr = OCRProcessor(engine=args.ocr_engine)
-                screen_segments = ocr.process_frames(
-                    frames_info,
-                    deduplicate=not args.no_deduplicate
-                )
-                logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
-
-            except ImportError as e:
-                logger.error(f"{e}")
-                logger.error(f"To install {args.ocr_engine}:")
-                logger.error(f"  pip install {args.ocr_engine}")
-                sys.exit(1)
-
-        # Save analysis results as JSON
-        with open(analysis_cache, 'w', encoding='utf-8') as f:
-            json.dump(screen_segments, f, indent=2, ensure_ascii=False)
-        logger.info(f"✓ Saved analysis results to: {analysis_cache}")
-
-    if args.extract_only:
-        logger.info("Done! (extract-only mode)")
-        return
-
-    # Step 3: Merge with transcript (if provided)
-    merger = TranscriptMerger()
-
-    if args.transcript:
-        logger.info("Step 3: Merging with Whisper transcript...")
-        transcript_path = Path(args.transcript)
-
-        if not transcript_path.exists():
-            logger.warning(f"Transcript not found: {args.transcript}")
-            logger.info("Proceeding with screen content only...")
-            audio_segments = []
-        else:
-            audio_segments = merger.load_whisper_transcript(str(transcript_path))
-            logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
-    else:
-        logger.info("No transcript provided, using screen content only...")
-        audio_segments = []
-
-    # Merge and format
-    merged = merger.merge_transcripts(audio_segments, screen_segments)
-    formatted = merger.format_for_claude(merged, format_style=args.format)
-
-    # Save output
-    merger.save_transcript(formatted, args.output)
-
-    logger.info("=" * 80)
-    logger.info("✓ PROCESSING COMPLETE!")
-    logger.info("=" * 80)
-    logger.info(f"Enhanced transcript: {args.output}")
-    logger.info(f"OCR data: {ocr_output}")
-    logger.info(f"Frames: {args.frames_dir}/")
-    logger.info("")
-    logger.info("You can now use the enhanced transcript with Claude for summarization!")
+    except FileNotFoundError as e:
+        logging.error(f"File not found: {e}")
+        return 1
+    except RuntimeError as e:
+        logging.error(f"Processing failed: {e}")
+        return 1
+    except KeyboardInterrupt:
+        logging.warning("\nProcessing interrupted by user")
+        return 130
+    except Exception as e:
+        logging.exception(f"Unexpected error: {e}")
+        return 1


 if __name__ == '__main__':
-    main()
+    sys.exit(main())
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,7 @@
 # Core dependencies
 opencv-python>=4.8.0
 Pillow>=10.0.0
+ffmpeg-python>=0.2.0

 # Vision analysis (recommended for better results)
 # Requires Ollama to be installed: https://ollama.ai/download
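
Note on the new entry point: `main()` now builds its configuration with `WorkflowConfig(**vars(args))`, so every argparse destination (for example `--skip-cache-frames` becoming `skip_cache_frames`) has to match a parameter the config class accepts. The sketch below illustrates that pattern along with one plausible way the cache flags could combine. `CacheConfig` and `use_cache` are hypothetical names invented for this illustration. The real `WorkflowConfig` is not shown in this part of the diff and may resolve the flags differently.

```python
import argparse
from dataclasses import dataclass


@dataclass
class CacheConfig:
    """Hypothetical stand-in for the cache-related fields of WorkflowConfig."""
    no_cache: bool = False
    skip_cache_frames: bool = False
    skip_cache_whisper: bool = False
    skip_cache_analysis: bool = False

    def use_cache(self, stage: str) -> bool:
        """Return True if cached output for `stage` may be reused.

        Assumed precedence: --no-cache disables everything, otherwise only
        the stage named by a --skip-cache-<stage> flag is reprocessed.
        """
        if self.no_cache:
            return False
        return not getattr(self, f"skip_cache_{stage}")


def build_parser() -> argparse.ArgumentParser:
    # Only the cache flags from the diff are reproduced here.
    parser = argparse.ArgumentParser()
    for flag in ("--no-cache", "--skip-cache-frames",
                 "--skip-cache-whisper", "--skip-cache-analysis"):
        parser.add_argument(flag, action="store_true")
    return parser


def config_from_argv(argv: list[str]) -> CacheConfig:
    args = build_parser().parse_args(argv)
    # argparse maps "--skip-cache-frames" to the attribute "skip_cache_frames",
    # so vars(args) can be unpacked straight into the dataclass. An argument
    # without a matching field would raise TypeError at this point.
    return CacheConfig(**vars(args))


if __name__ == "__main__":
    cfg = config_from_argv(["--skip-cache-whisper"])
    for stage in ("frames", "whisper", "analysis"):
        print(stage, cfg.use_cache(stage))
```

Because the keyword unpacking is strict, adding a new CLI flag without a corresponding config field fails at startup with a `TypeError`. In the `main()` shown above, that would be caught by the generic `except Exception` branch and reported as an unexpected error.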
Author	SHA1	Message	Date
Mariano Gabriel	eb8b1f4f11	updated readme	2025-12-04 20:24:52 -03:00
Mariano Gabriel	331cccb15f	updated readme	2025-12-04 20:15:16 -03:00
Mariano Gabriel	7d7ec15ff7	add whisperx support	2025-12-03 06:48:45 -03:00
Mariano Gabriel	7b919beda6	add whisperx support	2025-12-02 02:33:39 -03:00
Mariano Gabriel	118ef04223	embed images	2025-10-28 08:02:45 -03:00
Mariano Gabriel	b1e1daf278	scene detection quality and caching	2025-10-28 05:52:31 -03:00
Mariano Gabriel	c871af2def	group text	2025-10-23 14:49:14 -03:00
Mariano Gabriel	cdf7ad1199	update prompts	2025-10-20 17:36:31 -03:00
Mariano Gabriel	b9c3cbfbab	take turns using the GPU	2025-10-20 01:12:13 -03:00
Mariano Gabriel	cd7b0aed07	refactor	2025-10-20 00:03:41 -03:00