updated readme

add whisperx support
2025-12-04 20:24:52 -03:00 · 2025-12-04 20:15:16 -03:00 · 2025-12-03 06:48:45 -03:00 · 2025-12-02 02:33:39 -03:00 · 2025-10-28 08:02:45 -03:00 · 2025-10-28 05:52:31 -03:00
21 changed files with 2331 additions and 606 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -2,10 +2,11 @@
 samples/*
 !samples/.gitkeep
-# Output files
+# Output directories (timestamped folders for each video)
 output/*
 !output/.gitkeep
-# Extracted frames
+# Python cache
-frames/
+__pycache__
-__pycache__
+*.pyc
 .pytest_cache/
--- a/README.md
+++ b/README.md
@@ -1,34 +1,21 @@
 # Meeting Processor
-Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
+Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.
 ## Overview
 This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
+- **Audio transcription** (Whisper or WhisperX with speaker diarization)
- **Screen content analysis** (Vision models or OCR)
+- **Screen content extraction** via FFmpeg scene detection
 - **Frame embedding** for direct LLM analysis
-### Vision Analysis vs OCR
+The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.
 - **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
 - **OCR**: Traditional text extraction - faster but less context-aware
 The result is a rich, timestamped transcript that provides full context for AI summarization.
 ## Installation
 ### 1. System Dependencies
-**Ollama** (required for vision analysis):
+**FFmpeg** (required for scene detection and frame extraction):
 ```bash
 # Install from https://ollama.ai/download
 # Then pull a vision model:
 ollama pull llava:13b
 # or for lighter model:
 ollama pull llava:7b
 ```
 **FFmpeg** (for scene detection):
 ```bash
 # Ubuntu/Debian
 sudo apt-get install ffmpeg
@@ -37,210 +24,152 @@ sudo apt-get install ffmpeg
 brew install ffmpeg
 ```
 **Tesseract OCR** (optional, if not using vision):
 ```bash
 # Ubuntu/Debian
 sudo apt-get install tesseract-ocr
 # macOS
 brew install tesseract
 # Arch Linux
 sudo pacman -S tesseract
 ```
 ### 2. Python Dependencies
 ```bash
 pip install -r requirements.txt
 ```
-### 3. Whisper (for audio transcription)
+### 3. Whisper or WhisperX (for audio transcription)
 **Standard Whisper:**
 ```bash
 pip install openai-whisper
 ```
-### 4. Optional: Install Alternative OCR Engines
+**WhisperX** (recommended - includes speaker diarization):
 If you prefer OCR over vision analysis:
 ```bash
-# EasyOCR (better for rotated/handwritten text)
+pip install whisperx
 pip install easyocr
 # PaddleOCR (better for code/terminal screens)
 pip install paddleocr
 ```
 For speaker diarization, you'll need a HuggingFace token with access to pyannote models.
 ## Quick Start
-### Recommended: Vision Analysis (Best for Code/Dashboards)
+### Recommended Usage
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
 ```
 This will:
-1. Run Whisper transcription (audio → text)
+1. Run WhisperX transcription with speaker diarization
-2. Extract frames every 5 seconds
+2. Extract frames at scene changes (threshold 10 = moderately sensitive)
-3. Use LLaVA vision model to analyze frames with context
+3. Create an enhanced transcript with frame file references
-4. Merge audio + screen content
+4. Save everything to `output/` folder
-5. Save everything to `output/` folder
+
 The `--embed-images` flag adds frame paths to the transcript (e.g., `Frame: frames/video_00257.jpg`), which keeps the transcript small while the frames stay in the `frames/` folder for LLM access.
 ### Re-run with Cached Results
 Already ran it once? Re-run instantly using cached results:
 ```bash
-# Uses cached transcript, frames, and analysis
+# Uses cached transcript and frames
-python process_meeting.py samples/meeting.mkv --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images
-# Force reprocessing
+# Skip only specific cached items
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
-```
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
-### Traditional OCR (Faster, Less Context-Aware)
+# Force complete reprocessing
-
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
 ```bash
 python process_meeting.py samples/meeting.mkv --run-whisper
 ```
 ## Usage Examples
-### Vision Analysis with Context Hints
+### Scene Detection Options
 ```bash
-# For code-heavy meetings
+# Default threshold (15)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize
-# For dashboard/monitoring meetings (Grafana, GCP, etc.)
+# More sensitive (more frames, threshold: 5)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --diarize
-# For console/terminal sessions
+# Less sensitive (fewer frames, threshold: 30)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 30 --diarize
 ```
-### Different Vision Models
+### Fixed Interval Extraction (alternative to scene detection)
 ```bash
 # Lighter/faster model (7B parameters)
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
 # Default model (13B parameters, better quality)
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
 # Alternative models
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
 ```
 ### Extract frames at different intervals
 ```bash
 # Every 10 seconds
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
+python process_meeting.py samples/meeting.mkv --embed-images --interval 10 --diarize
 # Every 3 seconds (more detailed)
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
+python process_meeting.py samples/meeting.mkv --embed-images --interval 3 --diarize
 ```
 ### Use scene detection (smarter, fewer frames)
 ```bash
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
 ```
 ### Traditional OCR (if you prefer)
 ```bash
 # Tesseract (default)
 python process_meeting.py samples/meeting.mkv --run-whisper
 # EasyOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
 # PaddleOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
 ```
 ### Caching Examples
 ```bash
 # First run - processes everything
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
-# Second run - uses cached transcript and frames, only re-merges
+# Iterate on scene threshold (reuse whisper transcript)
-python process_meeting.py samples/meeting.mkv
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
-# Switch from OCR to vision using existing frames
+# Re-run whisper only
-python process_meeting.py samples/meeting.mkv --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
 # Force complete reprocessing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
 ```
 ### Custom output location
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --output-dir my_outputs/
 ```
 ### Enable verbose logging
 ```bash
-# Show detailed debug information
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --verbose
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
 ```
 ## Output Files
-All output files are saved to the `output/` directory by default:
+Each video gets its own timestamped output directory:
- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
+```
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
+output/
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
+└── 20241019_143022-meeting/
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
+    ├── manifest.json                    # Processing configuration
- **`frames/`** - Extracted video frames (JPG files)
+    ├── meeting_enhanced.txt             # Enhanced transcript for AI
    ├── meeting.json                     # Whisper/WhisperX transcript
    └── frames/                          # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
 ```
 ### Caching Behavior
-The tool automatically caches intermediate results to speed up re-runs:
+The tool automatically reuses the most recent output directory for the same video:
- **Whisper transcript**: Cached as `output/<video>.json`
+- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
+- **Subsequent runs**: Reuses the same directory and cached results
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
+- **Cached items**: Whisper transcript, extracted frames, analysis results
-Re-running with the same video will use cached results unless `--no-cache` is specified.
+**Fine-grained cache control:**
 - `--no-cache`: Force complete reprocessing
 - `--skip-cache-frames`: Re-extract frames only
 - `--skip-cache-whisper`: Re-run transcription only
 - `--skip-cache-analysis`: Re-run analysis only
 This allows you to iterate on scene detection thresholds without re-running Whisper!
 ## Workflow for Meeting Analysis
 ### Complete Workflow (One Command!)
 ```bash
-# Process everything in one step with vision analysis
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
 python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
 # Output will be in output/alo-intro1_enhanced.txt
 ```
 ### Typical Iterative Workflow
 ```bash
 # First run - full processing
-python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
-# Review results, then re-run with different context if needed
+# Adjust scene threshold (keeps cached whisper transcript)
-python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
+python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
 # Or switch to a different vision model
 python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
 # All use cached frames and transcript!
 ```
 ### Traditional Workflow (Separate Steps)
 ```bash
 # 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
 whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
 # 2. Process video to extract screen content with vision
 python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection
 # 3. Use the enhanced transcript with AI
 # Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
 ```
 ### Example Prompt for Claude
@@ -260,64 +189,54 @@ Please summarize this meeting transcript. Pay special attention to:
 ```
 usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
-                          [--output OUTPUT] [--output-dir OUTPUT_DIR]
+                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
-                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
+                          [--interval INTERVAL] [--scene-detection]
-                          [--scene-detection]
+                          [--scene-threshold SCENE_THRESHOLD]
-                          [--ocr-engine {tesseract,easyocr,paddleocr}]
+                          [--embed-images] [--embed-quality EMBED_QUALITY]
-                          [--no-deduplicate] [--extract-only]
+                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
-                          [--format {detailed,compact}] [--verbose]
+                          [--skip-cache-analysis] [--no-deduplicate]
-                          video
+                          [--extract-only] [--format {detailed,compact}]
                          [--verbose] video
-Options:
+Main Options:
-  video                 Path to video file
+  video                   Path to video file
-  --transcript, -t      Path to Whisper transcript (JSON or TXT)
+  --diarize               Use WhisperX with speaker diarization
-  --run-whisper         Run Whisper transcription before processing
+  --embed-images          Add frame file references to transcript (recommended)
-  --whisper-model       Whisper model: tiny, base, small, medium, large (default: base)
+
-  --output, -o          Output file for enhanced transcript
+Frame Extraction:
-  --output-dir          Directory for output files (default: output/)
+  --scene-detection       Use FFmpeg scene detection (recommended)
-  --frames-dir          Directory to save extracted frames (default: frames/)
+  --scene-threshold       Detection sensitivity 0-100 (default: 15, lower=more sensitive)
-  --interval            Extract frame every N seconds (default: 5)
+  --interval              Extract frame every N seconds (alternative to scene detection)
-  --scene-detection     Use scene detection instead of interval extraction
+
-  --ocr-engine          OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
+Caching:
-  --no-deduplicate      Disable text deduplication
+  --no-cache              Force complete reprocessing
-  --extract-only        Only extract frames and OCR, skip transcript merging
+  --skip-cache-frames     Re-extract frames only
-  --format              Output format: detailed or compact (default: detailed)
+  --skip-cache-whisper    Re-run transcription only
-  --verbose, -v         Enable verbose logging (DEBUG level)
+  --skip-cache-analysis   Re-run analysis only
 Other:
  --run-whisper           Run Whisper (without diarization)
  --whisper-model         Whisper model: tiny, base, small, medium, large (default: medium)
  --transcript, -t        Path to existing Whisper transcript (JSON or TXT)
  --output, -o            Output file for enhanced transcript
  --output-dir            Directory for output files (default: output/)
  --verbose, -v           Enable verbose logging
 ```
 ## Tips for Best Results
 ### Vision vs OCR: When to Use Each
 **Use Vision Models (`--use-vision`) when:**
 - ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
 - ✅ Code walkthroughs or debugging sessions
 - ✅ Complex layouts with mixed content
 - ✅ Need contextual understanding, not just text extraction
 - ✅ Working with charts, graphs, or visualizations
 - ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
 **Use OCR when:**
 - ✅ Simple text extraction from slides or documents
 - ✅ Need maximum speed
 - ✅ Limited computational resources
 - ✅ Presentations with mostly text
 - ⚠️ Trade-off: Less context-aware, may miss visual relationships
 ### Context Hints for Vision Analysis
 - **`--vision-context meeting`**: General purpose (default)
 - **`--vision-context code`**: Optimized for code screenshots, preserves formatting
 - **`--vision-context dashboard`**: Extracts metrics, trends, panel names
 - **`--vision-context console`**: Captures commands, output, error messages
 ### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
+- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
+- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds.
-### Vision Model Selection
+### Scene Detection Threshold
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
+- Lower values (5-10): More sensitive, captures more frames
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
+- Default (15): Good balance for most meetings
- **`bakllava`**: Alternative with different strengths
+- Higher values (20-30): Less sensitive, fewer frames
 ### Whisper vs WhisperX
 - **Whisper** (`--run-whisper`): Standard transcription, fast
 - **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token
 ### Deduplication
 - Enabled by default - removes similar consecutive frames
@@ -325,73 +244,75 @@ Options:
 ## Troubleshooting
-### Vision Model Issues
+### Frame Extraction Issues
 **"ollama package not installed"**
 ```bash
 pip install ollama
 ```
 **"Ollama not found" or connection errors**
 ```bash
 # Install Ollama first: https://ollama.ai/download
 # Then pull a vision model:
 ollama pull llava:13b
 ```
 **Vision analysis is slow**
 - Use lighter model: `--vision-model llava:7b`
 - Reduce frame count: `--scene-detection` or `--interval 10`
 - Check if Ollama is using GPU (much faster)
 **Poor vision analysis results**
 - Try different context hint: `--vision-context code` or `--vision-context dashboard`
 - Use larger model: `--vision-model llava:13b`
 - Ensure frames are clear (check video resolution)
 ### OCR Issues
 **"pytesseract not installed"**
 ```bash
 pip install pytesseract
 sudo apt-get install tesseract-ocr  # Don't forget system package!
 ```
 **Poor OCR quality**
 - **Solution**: Switch to vision analysis with `--use-vision`
 - Or try different OCR engine: `--ocr-engine easyocr`
 - Check if video resolution is sufficient
 - Use `--no-deduplicate` to keep more frames
 ### General Issues
 **"No frames extracted"**
 - Check video file is valid: `ffmpeg -i video.mkv`
- Try lower interval: `--interval 3`
+- Try lower scene threshold: `--scene-threshold 5`
- Check disk space in frames directory
+- Try interval extraction: `--interval 3`
 - Check disk space in output directory
 **Scene detection not working**
 - Fallback to interval extraction automatically
 - Ensure FFmpeg is installed
 - Falls back to interval extraction automatically
 - Try manual interval: `--interval 5`
 ### Whisper/WhisperX Issues
 **WhisperX diarization not working**
 - Ensure you have a HuggingFace token set
 - Token needs access to pyannote models
 - Fall back to standard Whisper without `--diarize`
 ### Cache Issues
 **Cache not being used**
 - Ensure you're using the same video filename
 - Check that output directory contains cached files
 - Use `--verbose` to see what's being cached/loaded
 **Want to re-run specific steps**
 - `--skip-cache-frames`: Re-extract frames
 - `--skip-cache-whisper`: Re-run transcription
 - `--skip-cache-analysis`: Re-run analysis
 - `--no-cache`: Force complete reprocessing
 ## Experimental Features
 ### OCR and Vision Analysis
 The OCR (`--ocr-engine`) and vision analysis (`--use-vision`) options are available but experimental. The recommended approach is `--embed-images`, which adds frame references directly to the transcript so that your LLM can analyze the images itself.
 ```bash
 # Experimental: OCR extraction
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
 # Experimental: Vision model analysis
 python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
 # Experimental: Hybrid OpenCV + OCR
 python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
 ```
 ## Project Structure
 ```
 meetus/
-├── meetus/                  # Main package
+├── meetus/                     # Main package
 │   ├── __init__.py
-│   ├── frame_extractor.py   # Video frame extraction
+│   ├── workflow.py             # Processing orchestrator
-│   ├── ocr_processor.py     # OCR processing
+│   ├── output_manager.py       # Output directory & manifest management
-│   └── transcript_merger.py # Transcript merging
+│   ├── cache_manager.py        # Caching logic
-├── process_meeting.py       # Main CLI script
+│   ├── frame_extractor.py      # Video frame extraction (FFmpeg scene detection)
-├── requirements.txt         # Python dependencies
+│   ├── vision_processor.py     # Vision model analysis (experimental)
-└── README.md               # This file
+│   ├── ocr_processor.py        # OCR processing (experimental)
 │   └── transcript_merger.py    # Transcript merging
 ├── process_meeting.py          # Main CLI script
 ├── requirements.txt            # Python dependencies
 ├── output/                     # Timestamped output directories
 │   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
 ├── samples/                    # Sample videos (gitignored)
 └── README.md                   # This file
 ```
 ## License
--- a/def/01-scene-detection-quality-caching.md
+++ b/def/01-scene-detection-quality-caching.md
@@ -0,0 +1,80 @@
 # 01 - Scene Detection Sensitivity, Image Quality, and Granular Caching
 ## Date
 2025-10-28
 ## Context
 The last run on the zaca-run-scrapers sample (a Zed editor walkthrough) detected only 19 frames, with gaps of 7+ minutes between them. Whisper did not run because the `--run-whisper` flag was not passed. JPEG compression also left code and text hard to read.
 ## Problems Identified
 1. **Scene detection too conservative** - Default threshold of 30.0 missed file switches and scrolling in clean UI (Zed vs VS Code)
 2. **No whisper transcription** - User expected it to run but `--run-whisper` is opt-in
 3. **Poor JPEG quality** - Default compression made code/text hard to read for OCR/vision
 4. **Subprocess-based FFmpeg** - Using shell commands instead of Python library
 5. **All-or-nothing caching** - `--no-cache` regenerates everything including slow whisper transcription
 ## Changes Made
 ### 1. Scene Detection Sensitivity
 **Files:** `meetus/frame_extractor.py`, `process_meeting.py`, `meetus/workflow.py`
 - Lowered default threshold: `30.0` → `15.0` (more sensitive for clean UIs)
 - Added `--scene-threshold` CLI argument (0-100, lower = more sensitive)
 - Added threshold to manifest for tracking
 - Updated docstring with usage guidelines:
  - 15.0: Good for clean UIs like Zed
  - 20-30: Busy UIs like VS Code
  - 5-10: Very subtle changes
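
As a rough illustration of how the threshold could reach FFmpeg, the sketch below builds the equivalent command-line arguments. It is not the project's code (which goes through `ffmpeg-python`): the helper name `build_scene_detect_cmd` is hypothetical, and the linear mapping from the 0-100 CLI value to FFmpeg's 0-1 scene score is an assumption.

```python
from typing import List


def build_scene_detect_cmd(video: str, out_pattern: str, threshold: float = 15.0) -> List[str]:
    """Build an FFmpeg command that saves one JPEG per detected scene change.

    Hypothetical helper: assumes the 0-100 CLI threshold maps linearly
    onto the 0-1 score used by FFmpeg's `scene` expression.
    """
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    score = threshold / 100.0
    return [
        "ffmpeg", "-i", video,
        # Keep only frames whose scene-change score exceeds the cutoff.
        "-vf", f"select='gt(scene,{score:.2f})'",
        # Emit selected frames as they occur instead of padding to a fixed rate
        # (newer FFmpeg releases prefer -fps_mode vfr).
        "-vsync", "vfr",
        # High JPEG quality; lower values mean better quality.
        "-q:v", "2",
        out_pattern,
    ]


if __name__ == "__main__":
    print(" ".join(build_scene_detect_cmd("samples/video.mkv", "frames/frame_%05d.jpg", 10)))
```

Under this mapping, a lower threshold lets frames with smaller scene scores through, which matches the "lower = more sensitive" behaviour described above.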
 ### 2. JPEG Quality Improvements
 **Files:** `meetus/frame_extractor.py`
 - **Interval extraction**: Added `cv2.IMWRITE_JPEG_QUALITY, 95` (line 60)
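 - **Scene detection**: Added `-q:v 2` to the FFmpeg call (line 94); for JPEG output this sits near the top of FFmpeg's quality scale, where lower values mean better quality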
 ### 3. Migration to ffmpeg-python
 **Files:** `meetus/frame_extractor.py`, `requirements.txt`
 - Replaced `subprocess.run()` with `ffmpeg-python` library
 - Cleaner, more Pythonic API
 - Better error handling with `ffmpeg.Error`
 - Added to requirements.txt
 ### 4. Granular Cache Control
 **Files:** `process_meeting.py`, `meetus/workflow.py`, `meetus/cache_manager.py`
 Added three new flags for selective cache invalidation:
 - `--skip-cache-frames`: Regenerate frames (useful when tuning scene threshold)
 - `--skip-cache-whisper`: Rerun whisper transcription
 - `--skip-cache-analysis`: Rerun OCR/vision analysis
 **Key design:**
 - `--no-cache`: Still works as before (new directory + regenerate everything)
 - New flags: Reuse existing output directory but selectively invalidate caches
 - Frames are cleaned up when regenerating to avoid stale data
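
The interaction between `--no-cache` and the three selective flags can be summarized in a few lines. This is a minimal sketch of the decision rule only: `CacheFlags` and `use_cached` are hypothetical names, not the actual contents of `meetus/cache_manager.py`, and it leaves out the directory handling described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheFlags:
    """Mirrors the CLI flags; field names are illustrative."""
    no_cache: bool = False
    skip_cache_frames: bool = False
    skip_cache_whisper: bool = False
    skip_cache_analysis: bool = False


def use_cached(step: str, flags: CacheFlags) -> bool:
    """Return True if the cached result for `step` may be reused.

    `step` is one of "frames", "whisper", or "analysis".
    """
    # --no-cache overrides everything: nothing is reused.
    if flags.no_cache:
        return False
    skip = {
        "frames": flags.skip_cache_frames,
        "whisper": flags.skip_cache_whisper,
        "analysis": flags.skip_cache_analysis,
    }
    return not skip[step]


if __name__ == "__main__":
    tuning = CacheFlags(skip_cache_frames=True, skip_cache_analysis=True)
    for step in ("whisper", "frames", "analysis"):
        print(step, "-> reuse cache" if use_cached(step, tuning) else "-> regenerate")
```

With `--skip-cache-frames --skip-cache-analysis`, only the Whisper transcript is reused, which is the combination used when tuning the scene threshold.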
 ## Typical Workflow
 ```bash
 # First run - generate everything including whisper (expensive, once)
 python process_meeting.py samples/video.mkv --run-whisper --scene-detection --use-vision
 # Iterate on scene threshold without re-running whisper
 python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 10 --use-vision --skip-cache-frames --skip-cache-analysis
 # Try even more sensitive
 python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 5 --use-vision --skip-cache-frames --skip-cache-analysis
 ```
 ## Notes
 - Whisper is the most expensive and reliable step → always cache it during iteration
 - Scene detection needs tuning per UI style (Zed vs VS Code)
 - Vision analysis should regenerate when frames change
 - Walking through code (file switches, scrolling) should trigger scene changes
 ## Files Modified
 - `meetus/frame_extractor.py` - Scene threshold, quality, ffmpeg-python
 - `meetus/workflow.py` - Cache flags, frame cleanup
 - `meetus/cache_manager.py` - Granular cache checks
 - `process_meeting.py` - CLI arguments
 - `requirements.txt` - Added ffmpeg-python
--- a/def/02-hybrid-opencv-ocr-llm.md
+++ b/def/02-hybrid-opencv-ocr-llm.md
@@ -0,0 +1,111 @@
 # 02 - Hybrid OpenCV + OCR + LLM Approach
 ## Date
 2025-10-28
 ## Context
 Vision models (llava) were hallucinating text content badly - showing HTML code when there was none, inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.
 ## Problem
 - **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
 - **Pure OCR**: Accurate text but messy output, lost indentation/formatting
 - **Need**: Accurate text extraction + preserved code structure
 ## Solution: Three-Stage Hybrid Approach
 ### Stage 1: OpenCV Text Detection
 Use morphological operations to find text regions:
 - Adaptive thresholding (handles varying lighting)
 - Dilation with horizontal kernel to connect text lines
 - Contour detection to find bounding boxes
 - Filter by area and aspect ratio
 - Merge overlapping regions
 ### Stage 2: Region-Based OCR
 - Sort regions by reading order (top-to-bottom, left-to-right)
 - Crop each region from original image
 - Run OCR on cropped regions (more accurate than full frame)
 - Tesseract with PSM 6 mode to preserve layout
 - Preserve indentation in cleaning step
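
The region bookkeeping from Stages 1 and 2 (merging overlapping boxes, then sorting into reading order) does not depend on OpenCV and can be sketched in plain Python. The function names below are hypothetical and not taken from `meetus/hybrid_processor.py`. The reading-order sort is deliberately simple: it orders strictly by `y` then `x`, with no tolerance for boxes that sit on the same visual row.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), as returned by cv2.boundingRect


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def union(a: Box, b: Box) -> Box:
    """Smallest rectangle that contains both boxes."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeat merge passes until no two boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, kept in enumerate(result):
                if overlaps(box, kept):
                    result[i] = union(box, kept)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


def reading_order(boxes: List[Box]) -> List[Box]:
    """Sort top-to-bottom, then left-to-right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))


if __name__ == "__main__":
    detected = [(100, 450, 200, 40), (10, 120, 300, 50), (250, 130, 100, 60)]
    print(reading_order(merge_overlapping(detected)))
```

Each box in the sorted result would then be cropped from the original frame and passed to OCR on its own.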
 ### Stage 3: Optional LLM Cleanup
 - Take accurate OCR output (no hallucination)
 - Use lightweight LLM (llama3.2:3b for speed) to:
  - Fix obvious OCR errors (l→1, O→0)
  - Restore code indentation and structure
  - Preserve exact text content
  - No added explanations or hallucinated content
 ## Benefits
 ✓ **Accurate**: OCR reads actual pixels, no hallucination
 ✓ **Fast**: OpenCV detection is instant, focused OCR is quick
 ✓ **Structured**: Regions separated with headers showing position
 ✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
 ✓ **Deterministic**: Same input = same output (unlike vision models)
 ## Implementation
 **New file:** `meetus/hybrid_processor.py`
 - `HybridProcessor` class with OpenCV detection + OCR + optional LLM
 - Region sorting for proper reading order
 - Visual separators between regions
 **CLI flags:**
 ```bash
 --use-hybrid                 # Enable hybrid mode
 --hybrid-llm-cleanup        # Add LLM post-processing (optional)
 --hybrid-llm-model MODEL    # LLM model (default: llama3.2:3b)
 ```
 **OCR improvements:**
 - Tesseract PSM 6 mode for better layout preservation
 - Modified text cleaning to keep indentation
 - `preserve_layout` parameter
 ## Usage
 ```bash
 # Basic hybrid (OpenCV + OCR)
 python process_meeting.py samples/video.mkv --use-hybrid --scene-detection
 # With LLM cleanup for best code formatting
 python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v
 # Iterate on threshold
 python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
 ```
 ## Output Format
 ```
 [Region 1 at y=120]
 function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
 }
 ============================================================
 [Region 2 at y=450]
 const result = calculateTotal(cartItems);
 console.log('Total:', result);
 ```
 ## Performance
 - **Without LLM cleanup**: Very fast (~2-3s per frame)
 - **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
 - **Accuracy**: Much better than vision model hallucinations
 ## When to Use What
 | Method | Best For | Pros | Cons |
 |--------|----------|------|------|
 | **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
 | **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
 | **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
 | **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
 ## Files Modified
 - `meetus/hybrid_processor.py` - New hybrid processor
 - `meetus/ocr_processor.py` - Layout preservation
 - `meetus/workflow.py` - Hybrid mode integration
 - `process_meeting.py` - CLI flags and examples
--- a/def/03-embed-images-for-llm.md
+++ b/def/03-embed-images-for-llm.md
@@ -0,0 +1,100 @@
 # 03 - Embed Images for LLM Analysis
 ## Date
 2025-10-28
 ## Context
 Hybrid OCR approach was fast and accurate but formatting was messy. Vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.
 ## Problem
 - OCR/vision models either hallucinate or produce messy text
 - Code formatting/indentation is hard to preserve
 - User wants to analyze frames with their own LLM (Claude, GPT, etc.)
 - Need to keep file size reasonable (~200KB per image is too big)
 ## Solution: Image Embedding
 Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
 - See the actual screen content (no hallucination)
 - Understand code structure, layout, and formatting visually
 - Have full audio transcript context for each frame
 - Analyze dashboards, terminals, editors with perfect accuracy
 ## Implementation
 **Quality Optimization:**
 - Default JPEG quality: 80 (good tradeoff between size and readability)
 - Configurable via `--embed-quality` (0-100)
 - Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
 **Format:**
 ```
 [MM:SS] SPEAKER:
  Audio transcript text here
 [MM:SS] SCREEN CONTENT:
  IMAGE (base64, 52KB):
  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
  TEXT:
  | Optional OCR text for reference
 ```
 **Features:**
 - Base64 encoding for easy embedding
 - Size tracking and reporting
 - Optional text content alongside images
 - Works with scene detection for smart frame selection
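
A minimal sketch of the encoding step, using only the standard library, might look like the following. The function name `embed_image_block` is hypothetical, and reporting the size of the encoded payload (rather than the raw JPEG) in whole kilobytes is an assumption; the note above does not say which size is shown.

```python
import base64


def embed_image_block(jpeg_bytes: bytes) -> str:
    """Format JPEG bytes as the IMAGE lines of a SCREEN CONTENT entry."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    # Assumption: the size shown is the base64 payload size in whole KB.
    size_kb = len(encoded) // 1024
    return (
        f"  IMAGE (base64, {size_kb}KB):\n"
        f"  <image>data:image/jpeg;base64,{encoded}</image>"
    )


if __name__ == "__main__":
    # Stand-in bytes beginning with the JPEG magic number; a real caller
    # would read a frame that was re-encoded at the chosen --embed-quality.
    fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 3068
    print(embed_image_block(fake_jpeg)[:80])
```

Base64 output is about a third larger than the binary input, so the embedded size is somewhat higher than the JPEG's size on disk.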
 ## Usage
 ```bash
 # Basic: Embed images at quality 80 (default)
 python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v
 # Lower quality for smaller files (still readable)
 python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v
 # Higher quality for detailed code
 python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v
 # Iterate on scene threshold (reuse whisper)
 python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
 ```
 ## File Sizes
 **Example for 20 frames:**
 - Quality 60: ~30-50KB per image = 0.6-1MB total
 - Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
 - Quality 90: ~80-120KB per image = 1.6-2.4MB total
 - Original: ~200KB per image = 4MB total
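If the per-image figures above are JPEG file sizes (an assumption), the embedded transcript will be somewhat larger, because base64 emits 4 characters for every 3 input bytes. A small sketch of that arithmetic; `estimate_transcript_kb` is a hypothetical helper, not part of the tool.

```python
import math


def base64_length(raw_bytes: int) -> int:
    """Length of standard padded base64 output for raw_bytes input bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def estimate_transcript_kb(frame_count: int, per_image_kb: float) -> float:
    """Estimate the embedded-image payload in KB, including base64 overhead."""
    per_image_bytes = int(per_image_kb * 1024)
    return frame_count * base64_length(per_image_bytes) / 1024


# 20 frames at 80KB each: 1600KB of JPEG data, about 2133KB once base64-encoded
print(round(estimate_transcript_kb(20, 80)))
```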
 ## Benefits
 ✓ **No hallucination**: LLM sees actual pixels
 ✓ **Perfect formatting**: Code structure preserved visually
 ✓ **Full context**: Audio transcript + visual frame together
 ✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
 ✓ **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller files than the ~200KB originals
 ✓ **Simple workflow**: One file contains everything
 ## Use Cases
 **Code walkthroughs:** LLM can see actual code structure and indentation
 **Dashboard analysis:** Charts, graphs, metrics visible to LLM
 **Terminal sessions:** Commands and output in proper context
 **UI reviews:** Actual interface visible with audio commentary
 ## Files Modified
 - `meetus/transcript_merger.py` - Image encoding and embedding
 - `meetus/workflow.py` - Wire through config
 - `process_meeting.py` - CLI flags
 - `meetus/output_manager.py` - Cleaner directory naming (date + increment)
 ## Output Directory Naming
 Also changed output directory format for clarity:
 - Old: `20251028_054553-video` (confusing timestamps)
 - New: `20251028-001-video` (clear date + run number)
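A minimal sketch of the date + increment scheme, assuming the run number is shared by all videos processed on the same date. The helper `next_run_dir_name` is hypothetical and is not the code in `meetus/output_manager.py`.

```python
import re
from typing import Iterable


def next_run_dir_name(existing: Iterable[str], date: str, video_stem: str) -> str:
    """Return 'YYYYMMDD-NNN-<stem>' using the next free run number for that date."""
    pattern = re.compile(rf"^{re.escape(date)}-(\d{{3}})-")
    # Collect run numbers already used on this date; other dates are ignored
    used = [int(m.group(1)) for name in existing if (m := pattern.match(name))]
    return f"{date}-{max(used, default=0) + 1:03d}-{video_stem}"
```

For example, with no existing directories for `20251028`, the first run of `video.mkv` would be named `20251028-001-video`.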
--- a/def/04-fix-whisper-cache-loading.md
+++ b/def/04-fix-whisper-cache-loading.md
@@ -0,0 +1,78 @@
 # 04 - Fix Whisper Cache Loading
 ## Date
 2025-10-28
 ## Problem
 The enhanced transcript did not include audio segments from cached Whisper transcripts when the tool was run without the `--run-whisper` flag.
 Example command that failed:
 ```bash
 python process_meeting.py samples/zaca-run-scrapers.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
 ```
 Result: Enhanced transcript only contained embedded images, no audio segments (0 "SPEAKER" entries).
 ## Root Cause
 In `workflow.py`, the `_run_whisper()` method was checking the `run_whisper` flag **before** checking the cache:
 ```python
 def _run_whisper(self) -> Optional[str]:
    if not self.config.run_whisper:
        return self.config.transcript_path  # Returns None if --transcript not specified
    # Cache check NEVER REACHED if run_whisper is False
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)
 ```
 This meant:
 - User runs command without `--run-whisper`
 - Method returns None immediately
 - Cached whisper transcript is never discovered
 - No audio segments in enhanced output
 ## Solution
 Reorder the logic to check cache **first**, regardless of flags:
 ```python
 def _run_whisper(self) -> Optional[str]:
    """Run Whisper transcription if requested, or use cached/provided transcript."""
    # First, check cache (regardless of run_whisper flag)
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)
    # If no cache and not running whisper, use provided transcript path (if any)
    if not self.config.run_whisper:
        return self.config.transcript_path
    # If no cache and run_whisper is True, run whisper transcription
    # ... rest of whisper code
 ```
 ## New Behavior
 1. Cache is checked first (regardless of `--run-whisper` flag)
 2. If cached whisper exists, use it
 3. If no cache and `--run-whisper` not specified, use `--transcript` path (or None)
 4. If no cache and `--run-whisper` specified, run whisper
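The four steps above can be sketched as a pure function, which makes the ordering easy to unit-test without touching the file system. `resolve_transcript` and its return labels are hypothetical names for illustration; the real logic lives in `_run_whisper()`.

```python
from typing import Optional, Tuple


def resolve_transcript(cached: Optional[str], run_whisper: bool,
                       transcript_path: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (source, path) following the cache-first order described above."""
    if cached:                      # steps 1-2: the cache wins regardless of flags
        return ("cache", cached)
    if not run_whisper:             # step 3: fall back to --transcript (may be None)
        return ("provided", transcript_path)
    return ("run_whisper", None)    # step 4: nothing cached, transcription requested
```

With the old ordering, the first case below would have returned `("provided", None)`, which is the bug this change fixes.

```python
print(resolve_transcript("out/video.json", False, None))  # cache is used
print(resolve_transcript(None, False, None))              # no transcript available
print(resolve_transcript(None, True, None))               # whisper must run
```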
 ## Benefits
 ✓ Cached whisper transcripts are always discovered and used
 ✓ User can iterate on frame extraction/analysis without re-running whisper
 ✓ Enhanced transcripts now properly include both audio + visual content
 ✓ Granular cache flags (`--skip-cache-frames`, `--skip-cache-whisper`) work as expected
 ## Use Case
 ```bash
 # First run: Generate whisper transcript + extract frames
 python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
 # Second run: Iterate on scene threshold without re-running whisper
 python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
 # Now correctly includes cached whisper transcript in enhanced output!
 ```
 ## Files Modified
 - `meetus/workflow.py` - Reordered logic in `_run_whisper()` method (lines 172-181)
--- a/def/05-reference-frames-instead-of-embedding.md
+++ b/def/05-reference-frames-instead-of-embedding.md
@@ -0,0 +1,124 @@
 # 05 - Reference Frame Files Instead of Embedding
 ## Date
 2025-10-28
 ## Context
 Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.
 ## Problem
 - Enhanced transcript with embedded base64 images was 3.7MB
 - Large file size makes it slow to read/process
 - Difficult to inspect individual frames
 - Harder to share and version control
 ## Solution: Reference Frame Paths
 Instead of embedding base64 image data, reference the frame files by their relative paths.
 ### Before (Embedded):
 ```
 [00:08] SCREEN CONTENT:
  IMAGE (base64, 85KB):
  <image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
 ```
 File size: 3.7MB
 ### After (Referenced):
 ```
 [00:08] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00257.jpg
 ```
 File size: ~50KB
 ## Implementation
 **Directory Structure:**
 ```
 output/20251028-003-zaca-run-scrapers/
 ├── frames/
 │   ├── zaca-run-scrapers_00257.jpg
 │   ├── zaca-run-scrapers_00487.jpg
 │   └── ...
 ├── zaca-run-scrapers.json (whisper transcript)
 └── zaca-run-scrapers_enhanced.txt (references frames/ directory)
 ```
 **Enhanced Transcript Format:**
 ```
 ================================================================================
 ENHANCED MEETING TRANSCRIPT
 Audio transcript + Screen frames
 ================================================================================
 [00:30] SPEAKER:
  Bueno, te dio un tour para el proyecto...
 [00:08] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00257.jpg
 [01:00] SPEAKER:
  Mayormente en Scrapping lo que tenemos...
 [01:15] SCREEN CONTENT:
  Frame: frames/zaca-run-scrapers_00487.jpg
  TEXT:
  | Code snippet from screen (if OCR was used)
 ```
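A minimal sketch of how audio segments and frame references could be interleaved into the body of this format. It assumes entries are ordered by timestamp, that frame dicts carry the `timestamp`, `frame_path`, and optional `text` keys produced by the processors, and that `frame_path` is already relative to the output directory. `format_timeline` is a hypothetical helper, not the actual `_format_detailed()` implementation.

```python
from typing import Dict, List, Tuple


def _mmss(seconds: float) -> str:
    """Render a timestamp in seconds as [MM:SS]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def format_timeline(segments: List[Tuple[float, str]], frames: List[Dict]) -> str:
    """Interleave (start, text) audio segments with frame references by timestamp."""
    entries = []
    for start, text in segments:
        entries.append((start, 0, f"{_mmss(start)} SPEAKER:\n {text}"))
    for frame in frames:
        block = f"{_mmss(frame['timestamp'])} SCREEN CONTENT:\n Frame: {frame['frame_path']}"
        if frame.get("text"):
            # OCR/vision text, when present, follows the frame reference
            body = "\n".join(f" | {line}" for line in frame["text"].splitlines())
            block += f"\n TEXT:\n{body}"
        entries.append((frame["timestamp"], 1, block))
    # Sort by time; on ties, audio comes before the screen entry
    entries.sort(key=lambda entry: (entry[0], entry[1]))
    return "\n".join(block for _, _, block in entries)
```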
 ## Benefits
 ✓ **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
 ✓ **Easier to inspect**: Can view individual frames directly
 ✓ **LLM can access images**: Frame paths allow LLM to load images on demand
 ✓ **Better version control**: Text files are small and diffable
 ✓ **Cleaner structure**: Frames organized in dedicated directory
 ✓ **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)
 ## Flags
 **`--embed-images`**: Skip OCR/vision analysis, just reference frame files
 - Faster (no analysis needed)
 - Lets LLM analyze raw images
 - Enhanced transcript only contains frame references
 **Without `--embed-images`**: Run OCR/vision analysis
 - Extracts text from frames
 - Enhanced transcript includes both frame reference AND extracted text
 - Useful for code/dashboard analysis
 ## Usage
 ```bash
 # Reference frames only (no OCR, faster)
 python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
 # Reference frames + OCR text extraction
 python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v
 # Adjust frame quality (smaller files)
 python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
 ```
 ## Files Modified
 - `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
 - `process_meeting.py` - Updated help text and examples to reflect frame referencing
 - All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)
 ## Workflow Example
 ```bash
 # First run: Generate everything
 python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v
 # Result:
 # - output/20251028-004-meeting/
 #   - frames/ (40 frames, ~80KB each)
 #   - meeting.json (whisper transcript)
 #   - meeting_enhanced.txt (~50KB, references frames/)
 # LLM can now:
 # 1. Read enhanced transcript
 # 2. See timeline of audio + screen changes
 # 3. Load individual frames as needed from frames/ directory
 ```
--- a/meetus/cache_manager.py
+++ b/meetus/cache_manager.py
@@ -0,0 +1,162 @@
 """
 Manage caching for frames, transcripts, and analysis results.
 """
 from pathlib import Path
 import json
 import logging
 from typing import List, Tuple, Dict, Optional
 logger = logging.getLogger(__name__)
 class CacheManager:
    """Manage caching of intermediate processing results."""
    def __init__(self, output_dir: Path, frames_dir: Path, video_name: str, use_cache: bool = True,
                 skip_cache_frames: bool = False, skip_cache_whisper: bool = False,
                 skip_cache_analysis: bool = False):
        """
        Initialize cache manager.
        Args:
            output_dir: Output directory for cached files
            frames_dir: Directory for cached frames
            video_name: Name of the video (stem)
            use_cache: Whether to use caching globally
            skip_cache_frames: Skip cached frames specifically
            skip_cache_whisper: Skip cached whisper specifically
            skip_cache_analysis: Skip cached analysis specifically
        """
        self.output_dir = output_dir
        self.frames_dir = frames_dir
        self.video_name = video_name
        self.use_cache = use_cache
        self.skip_cache_frames = skip_cache_frames
        self.skip_cache_whisper = skip_cache_whisper
        self.skip_cache_analysis = skip_cache_analysis
    def get_whisper_cache(self) -> Optional[Path]:
        """
        Check for cached Whisper transcript.
        Returns:
            Path to cached transcript or None
        """
        if not self.use_cache or self.skip_cache_whisper:
            return None
        cache_path = self.output_dir / f"{self.video_name}.json"
        if cache_path.exists():
            logger.info(f"✓ Found cached Whisper transcript: {cache_path.name}")
            # Debug: Show cached transcript info
            try:
                with open(cache_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                if 'segments' in data:
                    logger.debug(f"Cached transcript has {len(data['segments'])} segments")
            except Exception as e:
                logger.debug(f"Could not parse cached whisper for debug: {e}")
            return cache_path
        return None
    def get_frames_cache(self) -> Optional[List[Tuple[str, float]]]:
        """
        Check for cached frames.
        Returns:
            List of (frame_path, timestamp) tuples or None
        """
        if not self.use_cache or self.skip_cache_frames or not self.frames_dir.exists():
            return None
        existing_frames = list(self.frames_dir.glob("*.jpg"))
        if not existing_frames:
            return None
        logger.info(f"✓ Found {len(existing_frames)} cached frames in {self.frames_dir.name}/")
        logger.debug(f"Frame filenames: {[f.name for f in sorted(existing_frames)[:3]]}...")
        # Build frames_info from existing files
        frames_info = []
        for frame_path in sorted(existing_frames):
            # Try to extract timestamp from filename (e.g., frame_00001_12.34s.jpg)
            try:
                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
                timestamp = float(timestamp_str)
            except ValueError:
                timestamp = 0.0
            frames_info.append((str(frame_path), timestamp))
        return frames_info
    def get_analysis_cache(self, analysis_type: str) -> Optional[List[Dict]]:
        """
        Check for cached analysis results.
        Args:
            analysis_type: 'vision' or 'ocr'
        Returns:
            List of analysis results or None
        """
        if not self.use_cache or self.skip_cache_analysis:
            return None
        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"
        if cache_path.exists():
            logger.info(f"✓ Found cached {analysis_type} analysis: {cache_path.name}")
            with open(cache_path, 'r', encoding='utf-8') as f:
                results = json.load(f)
            logger.info(f"✓ Loaded {len(results)} analyzed frames from cache")
            # Debug: Show first cached result
            if results:
                logger.debug(f"First cached result: timestamp={results[0].get('timestamp')}, text_length={len(results[0].get('text', ''))}")
            return results
        return None
    def save_analysis(self, analysis_type: str, results: List[Dict]):
        """
        Save analysis results to cache.
        Args:
            analysis_type: 'vision' or 'ocr'
            results: Analysis results to save
        """
        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"
        with open(cache_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        logger.info(f"✓ Saved {analysis_type} analysis to: {cache_path.name}")
    def cache_exists(self, analysis_type: Optional[str] = None) -> Dict[str, bool]:
        """
        Check what caches exist.
        Args:
            analysis_type: Optional specific analysis type to check
        Returns:
            Dictionary of cache status
        """
        status = {
            "whisper": (self.output_dir / f"{self.video_name}.json").exists(),
            # Match any .jpg, as get_frames_cache() does; scene-detection frames are named <video>_NNNNN.jpg
            "frames": self.frames_dir.exists() and any(self.frames_dir.glob("*.jpg")),
        }
        if analysis_type:
            status[analysis_type] = (self.output_dir / f"{self.video_name}_{analysis_type}.json").exists()
        else:
            status["vision"] = (self.output_dir / f"{self.video_name}_vision.json").exists()
            status["ocr"] = (self.output_dir / f"{self.video_name}_ocr.json").exists()
        return status
--- a/meetus/frame_extractor.py
+++ b/meetus/frame_extractor.py
@@ -6,9 +6,9 @@ import cv2
 import os
 from pathlib import Path
 from typing import List, Tuple, Optional
 import subprocess
 import json
 import logging
 import re
 logger = logging.getLogger(__name__)
@@ -16,17 +16,19 @@ logger = logging.getLogger(__name__)
 class FrameExtractor:
    """Extract frames from video files."""
-    def __init__(self, video_path: str, output_dir: str = "frames"):
+    def __init__(self, video_path: str, output_dir: str = "frames", quality: int = 75):
        """
        Initialize frame extractor.
        Args:
            video_path: Path to video file
            output_dir: Directory to save extracted frames
            quality: JPEG quality for saved frames (0-100)
        """
        self.video_path = video_path
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.quality = quality
    def extract_by_interval(self, interval_seconds: int = 5) -> List[Tuple[str, float]]:
        """
@@ -56,7 +58,16 @@ class FrameExtractor:
                frame_filename = f"frame_{saved_count:05d}_{timestamp:.2f}s.jpg"
                frame_path = self.output_dir / frame_filename
-                cv2.imwrite(str(frame_path), frame)
+                # Downscale to 1600px width for smaller file size (but still readable)
                height, width = frame.shape[:2]
                if width > 1600:
                    ratio = 1600 / width
                    new_width = 1600
                    new_height = int(height * ratio)
                    frame = cv2.resize(frame, (new_width, new_height), interpolation=cv2.INTER_LANCZOS4)
                # Save with configured quality (matches embed quality)
                cv2.imwrite(str(frame_path), frame, [cv2.IMWRITE_JPEG_QUALITY, self.quality])
                frames_info.append((str(frame_path), timestamp))
                saved_count += 1
@@ -66,48 +77,80 @@ class FrameExtractor:
        logger.info(f"Extracted {saved_count} frames at {interval_seconds}s intervals")
        return frames_info
-    def extract_scene_changes(self, threshold: float = 30.0) -> List[Tuple[str, float]]:
+    def extract_scene_changes(self, threshold: float = 15.0) -> List[Tuple[str, float]]:
        """
        Extract frames only on scene changes using FFmpeg.
        More efficient than interval-based extraction.
        Args:
            threshold: Scene change detection threshold (0-100, lower = more sensitive)
                      Default: 15.0 (good for clean UIs like Zed)
                      Higher values (20-30) for busy UIs like VS Code
                      Lower values (5-10) for very subtle changes
        Returns:
            List of (frame_path, timestamp) tuples
        """
        try:
            import ffmpeg
        except ImportError:
            raise ImportError("ffmpeg-python not installed. Run: pip install ffmpeg-python")
        video_name = Path(self.video_path).stem
        output_pattern = self.output_dir / f"{video_name}_%05d.jpg"
        # Use FFmpeg's scene detection filter
        cmd = [
            'ffmpeg',
            '-i', self.video_path,
            '-vf', f'select=gt(scene\\,{threshold/100}),showinfo',
            '-vsync', 'vfr',
            '-frame_pts', '1',
            str(output_pattern),
            '-loglevel', 'info'
        ]
        try:
-            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
+            # Use FFmpeg's scene detection filter with downscaling
            stream = ffmpeg.input(self.video_path)
            stream = ffmpeg.filter(stream, 'select', f'gt(scene,{threshold/100})')
            stream = ffmpeg.filter(stream, 'showinfo')
            # Scale to 1600px width (maintains aspect ratio, still readable)
            # Use simple conditional: if width > 1600, scale to 1600, else keep original
            stream = ffmpeg.filter(stream, 'scale', w='min(1600,iw)', h=-1)
-            # Parse output to get frame timestamps
+            # Convert JPEG quality (0-100) to FFmpeg qscale (2-31, lower=better)
            # Rough mapping: qscale ≈ (100 - quality) / 10 + 2, clamped to 2-31
            qscale = max(2, min(31, int((100 - self.quality) / 10 + 2)))
            stream = ffmpeg.output(
                stream,
                str(output_pattern),
                vsync='vfr',
                frame_pts=1,
                **{'q:v': str(qscale)}  # Matches configured quality
            )
            # Run with stderr capture to get showinfo output
            _, stderr = ffmpeg.run(stream, capture_stderr=True, overwrite_output=True)
            stderr = stderr.decode('utf-8')
            # Parse FFmpeg output to get frame timestamps from showinfo filter
            frames_info = []
-            for img in sorted(self.output_dir.glob(f"{video_name}_*.jpg")):
+
-                # Extract timestamp from filename or use FFprobe
+            # Extract timestamps from stderr (showinfo outputs there)
-                frames_info.append((str(img), 0.0))  # Timestamp extraction can be enhanced
+            timestamp_pattern = r'pts_time:([\d.]+)'
            timestamps = re.findall(timestamp_pattern, stderr)
            # Match frames to timestamps
            frame_files = sorted(self.output_dir.glob(f"{video_name}_*.jpg"))
            for idx, img in enumerate(frame_files):
                # Use extracted timestamp or fallback to index-based estimate
                timestamp = float(timestamps[idx]) if idx < len(timestamps) else idx * 5.0
                frames_info.append((str(img), timestamp))
            logger.info(f"Extracted {len(frames_info)} frames at scene changes")
            return frames_info
-        except subprocess.CalledProcessError as e:
+        except ffmpeg.Error as e:
-            logger.error(f"FFmpeg error: {e.stderr}")
+            logger.error(f"FFmpeg error: {e.stderr.decode() if e.stderr else str(e)}")
            # Fallback to interval extraction
            logger.warning("Falling back to interval extraction...")
            return self.extract_by_interval()
        except Exception as e:
            logger.error(f"Unexpected error during scene extraction: {e}")
            logger.warning("Falling back to interval extraction...")
            return self.extract_by_interval()
    def get_video_duration(self) -> float:
        """Get video duration in seconds."""
--- a/meetus/hybrid_processor.py
+++ b/meetus/hybrid_processor.py
@@ -0,0 +1,355 @@
 """
 Hybrid frame analysis: OpenCV text detection + OCR for accurate extraction.
 Better than pure vision models which tend to hallucinate text content.
 """
 from typing import List, Tuple, Dict, Optional
 from pathlib import Path
 import logging
 import cv2
 import numpy as np
 from difflib import SequenceMatcher
 logger = logging.getLogger(__name__)
 class HybridProcessor:
    """Combine OpenCV text detection with OCR for accurate text extraction."""
    def __init__(self, ocr_engine: str = "tesseract", min_confidence: float = 0.5,
                 use_llm_cleanup: bool = False, llm_model: Optional[str] = None):
        """
        Initialize hybrid processor.
        Args:
            ocr_engine: OCR engine to use ('tesseract', 'easyocr', 'paddleocr')
            min_confidence: Minimum confidence for text detection (0-1)
            use_llm_cleanup: Use LLM to clean up OCR output and preserve formatting
            llm_model: Ollama model for cleanup (default: llama3.2:3b for speed)
        """
        from .ocr_processor import OCRProcessor
        self.ocr = OCRProcessor(engine=ocr_engine)
        self.min_confidence = min_confidence
        self.use_llm_cleanup = use_llm_cleanup
        self.llm_model = llm_model or "llama3.2:3b"
        self._llm_client = None
        if use_llm_cleanup:
            self._init_llm()
    def _init_llm(self):
        """Initialize Ollama client for LLM cleanup."""
        try:
            import ollama
            self._llm_client = ollama
            logger.info(f"LLM cleanup enabled using {self.llm_model}")
        except ImportError:
            logger.warning("ollama package not installed. LLM cleanup disabled.")
            self.use_llm_cleanup = False
    def _cleanup_with_llm(self, raw_text: str) -> str:
        """
        Use LLM to clean up OCR output and preserve code formatting.
        Args:
            raw_text: Raw OCR output
        Returns:
            Cleaned up text with proper formatting
        """
        if not self.use_llm_cleanup or not self._llm_client:
            return raw_text
        prompt = """You are cleaning up OCR output from a code editor screenshot.
 Your task:
 1. Fix any obvious OCR errors (l→1, O→0, etc.)
 2. Preserve or restore code indentation and structure
 3. Keep the exact text content - don't add explanations or comments
 4. If it's code, maintain proper spacing and formatting
 5. Return ONLY the cleaned text, nothing else
 OCR Text:
 """
        try:
            response = self._llm_client.generate(
                model=self.llm_model,
                prompt=prompt + raw_text,
                options={"temperature": 0.1}  # Low temperature for accuracy
            )
            cleaned = response['response'].strip()
            logger.debug(f"LLM cleanup: {len(raw_text)} → {len(cleaned)} chars")
            return cleaned
        except Exception as e:
            logger.warning(f"LLM cleanup failed: {e}, using raw OCR output")
            return raw_text
    def detect_text_regions(self, image_path: str, min_area: int = 100) -> List[Tuple[int, int, int, int]]:
        """
        Detect text regions in image using OpenCV.
        Args:
            image_path: Path to image file
            min_area: Minimum area for text region (pixels)
        Returns:
            List of bounding boxes (x, y, w, h)
        """
        # Read image
        img = cv2.imread(image_path)
        if img is None:
            logger.warning(f"Could not read image: {image_path}")
            return []
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # Method 1: Morphological operations to find text regions
        # Works well for solid text blocks
        regions = self._detect_by_morphology(gray, min_area)
        if not regions:
            logger.debug(f"No text regions detected in {Path(image_path).name}")
        return regions
    def _detect_by_morphology(self, gray: np.ndarray, min_area: int) -> List[Tuple[int, int, int, int]]:
        """
        Detect text regions using morphological operations.
        Fast and works well for solid text blocks (code editors, terminals).
        Args:
            gray: Grayscale image
            min_area: Minimum area for region
        Returns:
            List of bounding boxes (x, y, w, h)
        """
        # Apply adaptive threshold to handle varying lighting
        binary = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY_INV, 11, 2
        )
        # Morphological operations to connect text regions
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))  # Horizontal kernel for text lines
        dilated = cv2.dilate(binary, kernel, iterations=2)
        # Find contours
        contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        # Filter and extract bounding boxes
        regions = []
        for contour in contours:
            x, y, w, h = cv2.boundingRect(contour)
            area = w * h
            # Filter by area and aspect ratio
            if area > min_area and w > 20 and h > 10:  # Reasonable text dimensions
                regions.append((x, y, w, h))
        # Merge overlapping regions
        regions = self._merge_overlapping_regions(regions)
        logger.debug(f"Detected {len(regions)} text regions using morphology")
        return regions
    def _merge_overlapping_regions(
        self, regions: List[Tuple[int, int, int, int]],
        overlap_threshold: float = 0.3
    ) -> List[Tuple[int, int, int, int]]:
        """
        Merge overlapping bounding boxes.
        Args:
            regions: List of (x, y, w, h) tuples
            overlap_threshold: Minimum overlap ratio to merge
        Returns:
            Merged regions
        """
        if not regions:
            return []
        # Sort by y-coordinate (top to bottom)
        regions = sorted(regions, key=lambda r: r[1])
        merged = []
        current = list(regions[0])
        for region in regions[1:]:
            x, y, w, h = region
            cx, cy, cw, ch = current
            # Check for overlap
            x_overlap = max(0, min(cx + cw, x + w) - max(cx, x))
            y_overlap = max(0, min(cy + ch, y + h) - max(cy, y))
            overlap_area = x_overlap * y_overlap
            current_area = cw * ch
            region_area = w * h
            min_area = min(current_area, region_area)
            if overlap_area / min_area > overlap_threshold:
                # Merge regions
                new_x = min(cx, x)
                new_y = min(cy, y)
                new_x2 = max(cx + cw, x + w)
                new_y2 = max(cy + ch, y + h)
                current = [new_x, new_y, new_x2 - new_x, new_y2 - new_y]
            else:
                merged.append(tuple(current))
                current = list(region)
        merged.append(tuple(current))
        return merged
    def extract_text_from_region(self, image_path: str, region: Tuple[int, int, int, int]) -> str:
        """
        Extract text from a specific region using OCR.
        Args:
            image_path: Path to image file
            region: Bounding box (x, y, w, h)
        Returns:
            Extracted text
        """
        from PIL import Image
        # Load image and crop region
        img = Image.open(image_path)
        x, y, w, h = region
        cropped = img.crop((x, y, x + w, y + h))
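        # Save the crop to a temp file for OCR; close the handle first so the
        # OCR engine can reopen the path on every platform
        import tempfile
        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
            tmp_path = Path(tmp.name)
        try:
            cropped.save(tmp_path)
            text = self.ocr.extract_text(str(tmp_path))
        finally:
            # Clean up the temp file even if OCR raises
            tmp_path.unlink()
        return text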
    def analyze_frame(self, image_path: str) -> str:
        """
        Analyze a frame: detect text regions and OCR them.
        Args:
            image_path: Path to image file
        Returns:
            Combined text from all detected regions
        """
        # Detect text regions
        regions = self.detect_text_regions(image_path)
        if not regions:
            # Fallback to full-frame OCR if no regions detected
            logger.debug(f"No regions detected, using full-frame OCR for {Path(image_path).name}")
            raw_text = self.ocr.extract_text(image_path)
            return self._cleanup_with_llm(raw_text) if self.use_llm_cleanup else raw_text
        # Sort regions by reading order (top-to-bottom, left-to-right)
        regions = self._sort_regions_by_reading_order(regions)
        # Extract text from each region
        texts = []
        for idx, region in enumerate(regions):
            x, y, w, h = region
            text = self.extract_text_from_region(image_path, region)
            if text.strip():
                # Add visual separator with region info
                section_header = f"[Region {idx+1} at y={y}]"
                texts.append(f"{section_header}\n{text.strip()}")
                logger.debug(f"Region {idx+1}/{len(regions)} (y={y}): Extracted {len(text)} chars")
        combined = ("\n\n" + "="*60 + "\n\n").join(texts)
        logger.debug(f"Total extracted from {len(regions)} regions: {len(combined)} chars")
        # Apply LLM cleanup if enabled
        if self.use_llm_cleanup:
            combined = self._cleanup_with_llm(combined)
        return combined
    def _sort_regions_by_reading_order(self, regions: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int, int, int]]:
        """
        Sort regions in reading order (top-to-bottom, left-to-right).
        Args:
            regions: List of (x, y, w, h) tuples
        Returns:
            Sorted regions
        """
        # Sort primarily by y (top to bottom), secondarily by x (left to right)
        # Group regions that are on roughly the same line (within 20px)
        sorted_regions = sorted(regions, key=lambda r: (r[1] // 20, r[0]))
        return sorted_regions
    def process_frames(
        self,
        frames_info: List[Tuple[str, float]],
        deduplicate: bool = True,
        similarity_threshold: float = 0.85
    ) -> List[Dict]:
        """
        Process multiple frames with hybrid analysis.
        Args:
            frames_info: List of (frame_path, timestamp) tuples
            deduplicate: Whether to remove similar consecutive analyses
            similarity_threshold: Threshold for considering analyses as duplicates (0-1)
        Returns:
            List of dicts with 'timestamp', 'text', and 'frame_path'
        """
        results = []
        prev_text = ""
        total = len(frames_info)
        logger.info(f"Starting hybrid analysis of {total} frames...")
        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
            logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
            text = self.analyze_frame(frame_path)
            if not text:
                logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
                continue
            # Debug: Show what was extracted
            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
            # Deduplicate similar consecutive frames
            if deduplicate and prev_text:
                similarity = self._text_similarity(prev_text, text)
                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
                if similarity > similarity_threshold:
                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue
            results.append({
                'timestamp': timestamp,
                'text': text,
                'frame_path': frame_path
            })
            prev_text = text
        logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
        return results
    def _text_similarity(self, text1: str, text2: str) -> float:
        """
        Calculate similarity between two texts.
        Returns:
            Similarity score between 0 and 1
        """
        return SequenceMatcher(None, text1, text2).ratio()
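Review note on `_sort_regions_by_reading_order`: the sort key buckets regions into fixed 20px rows. The sketch below is a standalone illustration of that key, not the project's code; the function name `sort_reading_order` and the `line_height` parameter are hypothetical. It shows the expected ordering, plus one edge case that may be worth a test: two boxes only 2px apart can land in different buckets.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x, y, w, h), as in the diff above


def sort_reading_order(regions: List[Region], line_height: int = 20) -> List[Region]:
    """Sort by fixed-height row bucket first, then by x within the bucket."""
    return sorted(regions, key=lambda r: (r[1] // line_height, r[0]))


# y=5 and y=12 share bucket 0, so those boxes are ordered left to right
# regardless of their exact y; y=45 falls in a later bucket.
same_row = sort_reading_order([(300, 5, 80, 16), (10, 12, 80, 16), (10, 45, 80, 16)])
print(same_row)

# Caveat: y=19 and y=21 are 2px apart but land in buckets 0 and 1,
# so the right-hand box is emitted before the left-hand one.
split_row = sort_reading_order([(10, 21, 80, 16), (300, 19, 80, 16)])
print(split_row)
```

If this edge case matters in practice, clustering on the gap between neighbouring y values would avoid the fixed bucket boundaries. That would be a design change, not something this diff does.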
--- a/meetus/ocr_processor.py
+++ b/meetus/ocr_processor.py
@@ -53,20 +53,25 @@ class OCRProcessor:
        else:
            raise ValueError(f"Unknown OCR engine: {self.engine}")
-    def extract_text(self, image_path: str) -> str:
+    def extract_text(self, image_path: str, preserve_layout: bool = True) -> str:
        """
        Extract text from a single image.
        Args:
            image_path: Path to image file
            preserve_layout: Try to preserve whitespace and layout
        Returns:
            Extracted text
        """
        if self.engine == "tesseract":
            from PIL import Image
            import pytesseract
            image = Image.open(image_path)
-            text = self._ocr_engine.image_to_string(image)
+
            # Use PSM 6 (uniform block of text) to preserve layout better
            config = '--psm 6' if preserve_layout else ''
            text = pytesseract.image_to_string(image, config=config)
        elif self.engine == "easyocr":
            result = self._ocr_engine.readtext(image_path, detail=0)
@@ -81,12 +86,31 @@ class OCRProcessor:
        return self._clean_text(text)
-    def _clean_text(self, text: str) -> str:
-        """Clean up OCR output."""
-        # Remove excessive whitespace
-        text = re.sub(r'\n\s*\n', '\n', text)
-        text = re.sub(r' +', ' ', text)
-        return text.strip()
+    def _clean_text(self, text: str, preserve_indentation: bool = True) -> str:
+        """
+        Clean up OCR output.
+
+        Args:
+            text: Raw OCR text
            preserve_indentation: Keep leading whitespace on lines
        Returns:
            Cleaned text
        """
        if preserve_indentation:
            # Remove excessive blank lines but preserve indentation
            lines = text.split('\n')
            cleaned_lines = []
            for line in lines:
                # Keep line if it has content or is single empty line
                if line.strip() or (cleaned_lines and cleaned_lines[-1].strip()):
                    cleaned_lines.append(line)
            return '\n'.join(cleaned_lines).strip()
        else:
            # Original aggressive cleaning
            text = re.sub(r'\n\s*\n', '\n', text)
            text = re.sub(r' +', ' ', text)
            return text.strip()
    def process_frames(
        self,
@@ -108,18 +132,24 @@ class OCRProcessor:
        results = []
        prev_text = ""
-        for frame_path, timestamp in frames_info:
+        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
-            logger.debug(f"Processing frame at {timestamp:.2f}s...")
+            logger.debug(f"Processing frame {idx}/{len(frames_info)} at {timestamp:.2f}s...")
            text = self.extract_text(frame_path)
            if not text:
                logger.debug(f"No text extracted from frame at {timestamp:.2f}s")
                continue
            # Debug: Show what was extracted
            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
            # Deduplicate similar consecutive frames
-            if deduplicate:
+            if deduplicate and prev_text:
                similarity = self._text_similarity(prev_text, text)
                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
                if similarity > similarity_threshold:
-                    logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
+                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue
            results.append({
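Review note on `_clean_text`: a standalone copy of the new cleanup logic makes its behaviour easy to check. The sketch below assumes the method body shown in the diff above and rewrites it as a free function; the name `clean_text` is hypothetical. It also shows one detail that may be unintended: the final `strip()` removes the leading indentation of the first line.

```python
import re


def clean_text(text: str, preserve_indentation: bool = True) -> str:
    """Standalone copy of the cleanup logic shown in the diff above."""
    if preserve_indentation:
        cleaned = []
        for line in text.split("\n"):
            # Keep content lines, plus at most one blank line after content.
            if line.strip() or (cleaned and cleaned[-1].strip()):
                cleaned.append(line)
        return "\n".join(cleaned).strip()
    # Aggressive mode: drop blank lines and collapse runs of spaces.
    text = re.sub(r"\n\s*\n", "\n", text)
    text = re.sub(r" +", " ", text)
    return text.strip()


raw = "def f():\n\n\n    return 1\n\n"
kept = clean_text(raw)                                  # indentation survives
flattened = clean_text(raw, preserve_indentation=False)  # indentation collapses
first_line = clean_text("    x = 1\n    y = 2")          # first line loses its indent
print(repr(kept), repr(flattened), repr(first_line))
```

If the indentation of the first line should survive as well, stripping only newlines (for example `strip("\n")`) is one option.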
--- a/meetus/output_manager.py
+++ b/meetus/output_manager.py
@@ -0,0 +1,155 @@
 """
 Manage output directories and manifest files.
 Creates dated, run-numbered folders (YYYYMMDD-NNN-videoname) for each video and tracks processing options.
 """
 from pathlib import Path
 from datetime import datetime
 import json
 import logging
 from typing import Dict, Any, Optional
 logger = logging.getLogger(__name__)
 class OutputManager:
    """Manage output directories and manifest files for video processing."""
    def __init__(self, video_path: Path, base_output_dir: str = "output", use_cache: bool = True):
        """
        Initialize output manager.
        Args:
            video_path: Path to the video file being processed
            base_output_dir: Base directory for all outputs
            use_cache: Whether to use existing directories if found
        """
        self.video_path = video_path
        self.base_output_dir = Path(base_output_dir)
        self.use_cache = use_cache
        # Find or create output directory
        self.output_dir = self._get_or_create_output_dir()
        self.frames_dir = self.output_dir / "frames"
        self.frames_dir.mkdir(exist_ok=True)
        logger.info(f"Output directory: {self.output_dir}")
    def _get_or_create_output_dir(self) -> Path:
        """
        Get existing output directory or create a new one with incremental number.
        Returns:
            Path to output directory
        """
        video_name = self.video_path.stem
        # Look for existing directories if caching is enabled
        if self.use_cache and self.base_output_dir.exists():
            existing_dirs = sorted([
                d for d in self.base_output_dir.iterdir()
                if d.is_dir() and d.name.endswith(f"-{video_name}")
            ], reverse=True)  # Most recent first
            if existing_dirs:
                logger.info(f"Found existing output: {existing_dirs[0].name}")
                return existing_dirs[0]
        # Create new directory with date + incremental number
        date_str = datetime.now().strftime("%Y%m%d")
        # Find existing runs for today
        if self.base_output_dir.exists():
            existing_today = [
                d for d in self.base_output_dir.iterdir()
                if d.is_dir() and d.name.startswith(date_str) and d.name.endswith(f"-{video_name}")
            ]
            # Extract run numbers and find max
            run_numbers = []
            for d in existing_today:
                # Format: YYYYMMDD-NNN-videoname
                parts = d.name.split('-')
                if len(parts) >= 2 and parts[1].isdigit():
                    run_numbers.append(int(parts[1]))
            next_run = max(run_numbers) + 1 if run_numbers else 1
        else:
            next_run = 1
        dir_name = f"{date_str}-{next_run:03d}-{video_name}"
        output_dir = self.base_output_dir / dir_name
        output_dir.mkdir(parents=True, exist_ok=True)
        logger.info(f"Created new output directory: {dir_name}")
        return output_dir
    def get_path(self, filename: str) -> Path:
        """Get full path for a file in the output directory."""
        return self.output_dir / filename
    def get_frames_path(self, filename: str) -> Path:
        """Get full path for a file in the frames directory."""
        return self.frames_dir / filename
    def save_manifest(self, config: Dict[str, Any]):
        """
        Save processing configuration to manifest.json.
        Args:
            config: Dictionary of processing options
        """
        manifest_path = self.output_dir / "manifest.json"
        manifest = {
            "video": {
                "name": self.video_path.name,
                "path": str(self.video_path.absolute()),
            },
            "processed_at": datetime.now().isoformat(),
            "configuration": config,
            "outputs": {
                "frames": str(self.frames_dir.relative_to(self.output_dir)),
                "enhanced_transcript": f"{self.video_path.stem}_enhanced.txt",
                "whisper_transcript": f"{self.video_path.stem}.json" if config.get("run_whisper") else None,
                "analysis": f"{self.video_path.stem}_{'vision' if config.get('use_vision') else 'ocr'}.json"
            }
        }
        with open(manifest_path, 'w', encoding='utf-8') as f:
            json.dump(manifest, f, indent=2, ensure_ascii=False)
        logger.info(f"Saved manifest: {manifest_path}")
    def load_manifest(self) -> Optional[Dict[str, Any]]:
        """
        Load existing manifest if it exists.
        Returns:
            Manifest dictionary or None
        """
        manifest_path = self.output_dir / "manifest.json"
        if manifest_path.exists():
            with open(manifest_path, 'r', encoding='utf-8') as f:
                return json.load(f)
        return None
    def list_outputs(self) -> Dict[str, Any]:
        """
        List all output files in the directory.
        Returns:
            Dictionary of output files and their status
        """
        video_name = self.video_path.stem
        return {
            "output_dir": str(self.output_dir),
            "manifest": (self.output_dir / "manifest.json").exists(),
            "enhanced_transcript": (self.output_dir / f"{video_name}_enhanced.txt").exists(),
            "whisper_transcript": (self.output_dir / f"{video_name}.json").exists(),
            "vision_analysis": (self.output_dir / f"{video_name}_vision.json").exists(),
            "ocr_analysis": (self.output_dir / f"{video_name}_ocr.json").exists(),
            "frames": len(list(self.frames_dir.glob("*.jpg"))) if self.frames_dir.exists() else 0
        }
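Review note on `_get_or_create_output_dir`: the naming scheme is `YYYYMMDD-NNN-videoname`, where `NNN` is one past the highest run found for the same date and video. The sketch below reproduces only the name computation over a list of directory names, so it can be checked without touching the filesystem. The function `next_run_dir_name` and the sample names are hypothetical.

```python
from typing import Iterable


def next_run_dir_name(existing: Iterable[str], date_str: str, video_name: str) -> str:
    """Return 'YYYYMMDD-NNN-videoname' with NNN one past today's highest run."""
    suffix = f"-{video_name}"
    run_numbers = []
    for name in existing:
        if not (name.startswith(date_str) and name.endswith(suffix)):
            continue
        parts = name.split("-")  # format: YYYYMMDD-NNN-videoname
        if len(parts) >= 2 and parts[1].isdigit():
            run_numbers.append(int(parts[1]))
    next_run = max(run_numbers) + 1 if run_numbers else 1
    return f"{date_str}-{next_run:03d}-{video_name}"


existing = ["20251204-001-demo", "20251204-002-demo", "20251203-007-demo"]
name_today = next_run_dir_name(existing, "20251204", "demo")
name_fresh = next_run_dir_name([], "20251204", "demo")
# Suffix matching also picks up other videos whose names end in "-demo".
name_shadow = next_run_dir_name(["20251204-005-team-demo"], "20251204", "demo")
print(name_today, name_fresh, name_shadow)
```

The last case suggests that the `endswith(f"-{video_name}")` check can match directories that belong to a different video. The same check drives the cache lookup, so an exact comparison against the part after the run number may be safer.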
--- a/meetus/prompts/code.txt
+++ b/meetus/prompts/code.txt
@@ -0,0 +1,5 @@
 You are analyzing a code screenshot from a meeting recording.
 Provide a brief description of what's being shown (1-2 sentences about the context), then extract the visible code exactly as it appears, preserving all formatting, indentation, and structure.
 If there's no code visible, just describe what you see on screen.
--- a/meetus/prompts/console.txt
+++ b/meetus/prompts/console.txt
@@ -0,0 +1,5 @@
 You are analyzing console/terminal output from a meeting recording.
 Provide a brief description of what's happening (1-2 sentences), then extract the visible commands and output exactly as shown, preserving formatting.
 Include any error messages, warnings, or important status information.
--- a/meetus/prompts/dashboard.txt
+++ b/meetus/prompts/dashboard.txt
@@ -0,0 +1,5 @@
 You are analyzing a dashboard/monitoring panel from a meeting recording.
 Provide a brief description of what's being monitored (1-2 sentences), then list the key panels, metrics, and their current values. Include any alerts, warnings, or notable trends.
 Keep it concise and focused on the important information.
--- a/meetus/prompts/meeting.txt
+++ b/meetus/prompts/meeting.txt
@@ -0,0 +1,10 @@
 You are analyzing a screen capture from a meeting recording.
 Provide a brief description of what's being shown (1-2 sentences about the context). Then extract the key information:
 - Any visible text, titles, or headings
 - Code (preserve exact formatting if present)
 - Metrics, data points, or dashboard information
 - Terminal/console commands and output
 - Application or UI elements
 Be concise but capture all important details that help understand what was being discussed.
--- a/meetus/transcript_merger.py
+++ b/meetus/transcript_merger.py
@@ -6,6 +6,8 @@ from typing import List, Dict, Optional
 import json
 from pathlib import Path
 import logging
 import base64
 from io import BytesIO
 logger = logging.getLogger(__name__)
@@ -13,11 +15,18 @@ logger = logging.getLogger(__name__)
 class TranscriptMerger:
    """Merge audio transcripts with screen OCR text."""
-    def __init__(self):
-        """Initialize transcript merger."""
-        pass
-    def load_whisper_transcript(self, transcript_path: str) -> List[Dict]:
+    def __init__(self, embed_images: bool = False, embed_quality: int = 80):
+        """
+        Initialize transcript merger.
+        Args:
            embed_images: Whether to embed frame images as base64
            embed_quality: JPEG quality for embedded images (0-100)
        """
        self.embed_images = embed_images
        self.embed_quality = embed_quality
    def load_whisper_transcript(self, transcript_path: str, group_interval: Optional[int] = None) -> List[Dict]:
        """
        Load Whisper transcript from file.
@@ -25,6 +34,7 @@ class TranscriptMerger:
        Args:
            transcript_path: Path to transcript file
            group_interval: If specified, group audio segments into intervals (in seconds)
        Returns:
            List of dicts with 'timestamp' (optional) and 'text'
@@ -35,28 +45,39 @@ class TranscriptMerger:
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)
-            # Handle different Whisper output formats
+            # Handle different Whisper/WhisperX output formats
            segments = []
            if isinstance(data, dict) and 'segments' in data:
-                # Standard Whisper JSON format
+                # Standard Whisper/WhisperX JSON format
-                return [
+                segments = [
                    {
                        'timestamp': seg.get('start', 0),
                        'text': seg['text'].strip(),
                        'speaker': seg.get('speaker'),  # WhisperX diarization
                        'type': 'audio'
                    }
                    for seg in data['segments']
                ]
            elif isinstance(data, list):
                # List of segments
-                return [
+                segments = [
                    {
                        'timestamp': seg.get('start', seg.get('timestamp', 0)),
                        'text': seg['text'].strip(),
                        'speaker': seg.get('speaker'),  # WhisperX diarization
                        'type': 'audio'
                    }
                    for seg in data
                ]
            # Group by interval if requested, but skip if we have speaker diarization
            # (merge_transcripts will group by speaker instead)
            has_speakers = any(seg.get('speaker') for seg in segments)
            if group_interval and segments and not has_speakers:
                segments = self.group_audio_by_intervals(segments, group_interval)
            return segments
        else:
            # Plain text file - no timestamps
            with open(path, 'r', encoding='utf-8') as f:
@@ -68,6 +89,76 @@ class TranscriptMerger:
                'type': 'audio'
            }]
    def group_audio_by_intervals(self, segments: List[Dict], interval_seconds: int = 30) -> List[Dict]:
        """
        Group audio segments into regular time intervals.
        Instead of one entry per transcript segment, this creates fixed intervals (e.g., every 30 seconds)
        with the text of every segment that starts in that interval concatenated together.
        Args:
            segments: List of audio segments with timestamps
            interval_seconds: Duration of each interval in seconds
        Returns:
            List of grouped segments with interval timestamps
        """
        if not segments:
            return []
        # Find the max timestamp to determine how many intervals we need
        max_timestamp = max(seg['timestamp'] for seg in segments)
        num_intervals = int(max_timestamp / interval_seconds) + 1
        # Create interval buckets
        intervals = []
        for i in range(num_intervals):
            interval_start = i * interval_seconds
            interval_end = (i + 1) * interval_seconds
            # Collect all text in this interval
            texts = []
            for seg in segments:
                if interval_start <= seg['timestamp'] < interval_end:
                    texts.append(seg['text'])
            # Only create interval if there's text
            if texts:
                intervals.append({
                    'timestamp': interval_start,
                    'text': ' '.join(texts),
                    'type': 'audio'
                })
        logger.info(f"Grouped {len(segments)} segments into {len(intervals)} intervals of {interval_seconds}s")
        return intervals
    def _encode_image_base64(self, image_path: str) -> tuple[str, int]:
        """
        Encode image as base64 (image already at target quality/size).
        Args:
            image_path: Path to image file
        Returns:
            Tuple of (base64_string, size_in_bytes)
        """
        try:
            # Read file directly (already at target quality/resolution)
            with open(image_path, 'rb') as f:
                img_bytes = f.read()
            # Encode to base64
            b64_string = base64.b64encode(img_bytes).decode('utf-8')
            logger.debug(f"Encoded {Path(image_path).name}: {len(img_bytes)} bytes")
            return b64_string, len(img_bytes)
        except Exception as e:
            logger.error(f"Failed to encode image {image_path}: {e}")
            return "", 0
    def merge_transcripts(
        self,
        audio_segments: List[Dict],
@@ -75,13 +166,14 @@ class TranscriptMerger:
    ) -> List[Dict]:
        """
        Merge audio and screen transcripts by timestamp.
        Groups consecutive audio from same speaker until a screen frame interrupts.
        Args:
            audio_segments: List of audio transcript segments
            screen_segments: List of screen OCR segments
        Returns:
-            Merged list sorted by timestamp
+            Merged list sorted by timestamp, with audio grouped by speaker
        """
        # Mark segment types
        for seg in audio_segments:
@@ -93,7 +185,46 @@ class TranscriptMerger:
        all_segments = audio_segments + screen_segments
        all_segments.sort(key=lambda x: x['timestamp'])
-        return all_segments
+        # Group consecutive audio segments by speaker (screen frames break groups)
        grouped = []
        current_group = None
        for seg in all_segments:
            if seg['type'] == 'screen':
                # Screen frame: flush current group and add frame
                if current_group:
                    grouped.append(current_group)
                    current_group = None
                grouped.append(seg)
            else:
                # Audio segment
                speaker = seg.get('speaker')
                if current_group is None:
                    # Start new group
                    current_group = {
                        'timestamp': seg['timestamp'],
                        'text': seg['text'],
                        'speaker': speaker,
                        'type': 'audio'
                    }
                elif speaker == current_group.get('speaker'):
                    # Same speaker, append text
                    current_group['text'] += ' ' + seg['text']
                else:
                    # Speaker changed, flush and start new group
                    grouped.append(current_group)
                    current_group = {
                        'timestamp': seg['timestamp'],
                        'text': seg['text'],
                        'speaker': speaker,
                        'type': 'audio'
                    }
        # Don't forget last group
        if current_group:
            grouped.append(current_group)
        return grouped
    def format_for_claude(
        self,
@@ -120,7 +251,7 @@ class TranscriptMerger:
        lines = []
        lines.append("=" * 80)
        lines.append("ENHANCED MEETING TRANSCRIPT")
-        lines.append("Audio transcript + Screen content")
+        lines.append("Audio transcript + Screen frames")
        lines.append("=" * 80)
        lines.append("")
@@ -128,15 +259,27 @@ class TranscriptMerger:
            timestamp = self._format_timestamp(seg['timestamp'])
            if seg['type'] == 'audio':
-                lines.append(f"[{timestamp}] SPEAKER:")
+                speaker = seg.get('speaker') or 'SPEAKER'  # key may be present with value None
                lines.append(f"[{timestamp}] {speaker}:")
                lines.append(f"  {seg['text']}")
                lines.append("")
            else:  # screen
                lines.append(f"[{timestamp}] SCREEN CONTENT:")
-                # Indent screen text for visibility
-                screen_text = seg['text'].replace('\n', '\n  | ')
-                lines.append(f"  | {screen_text}")
+
+                # Show frame path if available
+                if 'frame_path' in seg:
                    # Get just the filename relative to the enhanced transcript
                    frame_path = Path(seg['frame_path'])
                    relative_path = f"frames/{frame_path.name}"
                    lines.append(f"  Frame: {relative_path}")
                # Include text content if available (fallback or additional context)
                if 'text' in seg and seg['text'].strip():
                    screen_text = seg['text'].replace('\n', '\n  | ')
                    lines.append(f"  TEXT:")
                    lines.append(f"  | {screen_text}")
                lines.append("")
        return "\n".join(lines)
@@ -147,7 +290,10 @@ class TranscriptMerger:
        for seg in segments:
            timestamp = self._format_timestamp(seg['timestamp'])
-            prefix = "SPEAKER" if seg['type'] == 'audio' else "SCREEN"
+            if seg['type'] == 'audio':
                prefix = seg.get('speaker') or 'SPEAKER'  # key may be present with value None
            else:
                prefix = "SCREEN"
            text = seg['text'].replace('\n', ' ')[:200]  # Truncate long screen text
            lines.append(f"[{timestamp}] {prefix}: {text}")
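Review note on `merge_transcripts`: the new grouping rule is that consecutive audio from the same speaker is concatenated, and a screen frame always closes the current group. The sketch below is a condensed standalone version of that rule, not the method itself. The function name `group_by_speaker`, the speaker labels, and the frame path are illustrative assumptions.

```python
from typing import Dict, List, Optional


def group_by_speaker(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive same-speaker audio; screen frames break groups."""
    grouped: List[Dict] = []
    current: Optional[Dict] = None
    for seg in sorted(segments, key=lambda s: s["timestamp"]):
        if seg["type"] == "screen":
            # Flush the open audio group, then emit the frame as-is.
            if current:
                grouped.append(current)
                current = None
            grouped.append(seg)
        elif current is not None and seg.get("speaker") == current["speaker"]:
            current["text"] += " " + seg["text"]
        else:
            # First audio segment, or the speaker changed.
            if current:
                grouped.append(current)
            current = {
                "timestamp": seg["timestamp"],
                "text": seg["text"],
                "speaker": seg.get("speaker"),
                "type": "audio",
            }
    if current:
        grouped.append(current)
    return grouped


segments = [
    {"timestamp": 0.0, "text": "Let's look at the dashboard.", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 4.0, "text": "Latency is up.", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 6.0, "text": "", "frame_path": "frames/frame_0001.jpg", "type": "screen"},
    {"timestamp": 7.0, "text": "Since when?", "speaker": "SPEAKER_01", "type": "audio"},
    {"timestamp": 9.0, "text": "Since the deploy.", "speaker": "SPEAKER_00", "type": "audio"},
]
merged = group_by_speaker(segments)
for item in merged:
    print(item["timestamp"], item["type"], item.get("speaker"), item["text"])
```

Transcripts without diarization carry `speaker=None` on every segment, so under this rule all audio between two frames collapses into a single block.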
--- a/meetus/vision_processor.py
+++ b/meetus/vision_processor.py
@@ -6,6 +6,7 @@ from typing import List, Tuple, Dict, Optional
 from pathlib import Path
 import logging
 from difflib import SequenceMatcher
 import os
 logger = logging.getLogger(__name__)
@@ -13,15 +14,24 @@ logger = logging.getLogger(__name__)
 class VisionProcessor:
    """Process frames using local vision models via Ollama."""
-    def __init__(self, model: str = "llava:13b"):
+    def __init__(self, model: str = "llava:13b", prompts_dir: Optional[str] = None):
        """
        Initialize vision processor.
        Args:
            model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
            prompts_dir: Directory containing prompt files (default: meetus/prompts/)
        """
        self.model = model
        self._client = None
        # Set prompts directory
        if prompts_dir:
            self.prompts_dir = Path(prompts_dir)
        else:
            # Default to meetus/prompts/ relative to this file
            self.prompts_dir = Path(__file__).parent / "prompts"
        self._init_client()
    def _init_client(self):
@@ -53,61 +63,44 @@ class VisionProcessor:
                "Also install Ollama: https://ollama.ai/download"
            )
-    def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
+    def _load_prompt(self, context: str) -> str:
        """
        Load prompt from file.
        Args:
            context: Context name (meeting, dashboard, code, console)
        Returns:
            Prompt text
        """
        prompt_file = self.prompts_dir / f"{context}.txt"
        if prompt_file.exists():
            with open(prompt_file, 'r', encoding='utf-8') as f:
                return f.read().strip()
        else:
            # Fallback to default prompt
            logger.warning(f"Prompt file not found: {prompt_file}, using default")
            return "Analyze this image and describe what you see in detail."
    def analyze_frame(self, image_path: str, context: str = "meeting", audio_context: str = "") -> str:
        """
        Analyze a single frame using local vision model.
        Args:
            image_path: Path to image file
            context: Context hint for analysis (meeting, dashboard, code, console)
            audio_context: Optional audio transcript around this timestamp for context
        Returns:
            Analyzed content description
        """
-        # Context-specific prompts
-        prompts = {
-            "meeting": """Analyze this screen capture from a meeting recording. Extract:
-1. Any visible text (titles, labels, headings)
-2. Key metrics, numbers, or data points shown
-3. Dashboard panels or visualizations (describe what they show)
-4. Code snippets (preserve formatting and context)
-5. Console/terminal output (commands and results)
-6. Application names or UI elements
-Focus on information that would help someone understand what was being discussed.
-Be concise but include all important details. If there's code, preserve it exactly.""",
-
-            "dashboard": """Analyze this dashboard/monitoring panel. Extract:
-1. Panel titles and metrics names
-2. Current values and units
-3. Trends (up/down/stable)
-4. Alerts or warnings
-5. Time ranges shown
-6. Any anomalies or notable patterns
-Format as structured data.""",
-            "code": """Analyze this code screenshot. Extract:
-1. Programming language
-2. File name or path (if visible)
-3. Code content (preserve exact formatting)
-4. Comments
-5. Function/class names
-6. Any error messages or warnings
-Preserve code exactly as shown.""",
-            "console": """Analyze this console/terminal output. Extract:
-1. Commands executed
-2. Output/results
-3. Error messages
-4. Warnings or status messages
-5. File paths or URLs
-Preserve formatting and structure."""
-        }
-        prompt = prompts.get(context, prompts["meeting"])
+        # Load prompt from file
+        prompt = self._load_prompt(context)
+        # Add audio context if available
+        if audio_context:
+            prompt = f"Audio context (what's being discussed around this time):\n{audio_context}\n\n{prompt}"
        try:
            # Use Ollama's chat API with vision
@@ -135,7 +128,8 @@ Preserve formatting and structure."""
        frames_info: List[Tuple[str, float]],
        context: str = "meeting",
        deduplicate: bool = True,
-        similarity_threshold: float = 0.85
+        similarity_threshold: float = 0.85,
        audio_segments: Optional[List[Dict]] = None
    ) -> List[Dict]:
        """
        Process multiple frames with vision analysis.
@@ -158,17 +152,25 @@ Preserve formatting and structure."""
        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
            logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
-            text = self.analyze_frame(frame_path, context)
+            # Get audio context around this timestamp (±30 seconds)
            audio_context = self._get_audio_context(timestamp, audio_segments, window=30)
            text = self.analyze_frame(frame_path, context, audio_context)
            if not text:
                logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
                continue
            # Debug: Show what was extracted
            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
            # Deduplicate similar consecutive frames
-            if deduplicate:
+            if deduplicate and prev_text:
                similarity = self._text_similarity(prev_text, text)
                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
                if similarity > similarity_threshold:
-                    logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
+                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue
            results.append({
@@ -182,6 +184,29 @@ Preserve formatting and structure."""
        logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
        return results
    def _get_audio_context(self, timestamp: float, audio_segments: Optional[List[Dict]], window: int = 30) -> str:
        """
        Get audio transcript around a given timestamp.
        Args:
            timestamp: Target timestamp in seconds
            audio_segments: List of audio segments with 'timestamp' and 'text' keys
            window: Time window in seconds (±window around timestamp)
        Returns:
            Concatenated audio text from the time window
        """
        if not audio_segments:
            return ""
        relevant = [seg for seg in audio_segments
                    if abs(seg.get('timestamp', 0) - timestamp) <= window]
        if not relevant:
            return ""
        return " ".join([seg['text'] for seg in relevant])
    def _text_similarity(self, text1: str, text2: str) -> float:
        """
        Calculate similarity between two texts.
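Review note on the audio-context change: each frame's prompt is now prefixed with the transcript text that falls within ±`window` seconds of the frame. The sketch below is a minimal standalone version of that flow, assuming the logic shown in `_get_audio_context` and `analyze_frame` above. The names `get_audio_context` and `build_prompt`, and the sample transcript, are hypothetical.

```python
from typing import Dict, List, Optional


def get_audio_context(timestamp: float, audio_segments: Optional[List[Dict]], window: int = 30) -> str:
    """Join the text of all segments whose timestamp is within ±window seconds."""
    if not audio_segments:
        return ""
    relevant = [
        seg for seg in audio_segments
        if abs(seg.get("timestamp", 0) - timestamp) <= window
    ]
    return " ".join(seg["text"] for seg in relevant)


def build_prompt(base_prompt: str, audio_context: str) -> str:
    """Prefix the base prompt with audio context when there is any."""
    if audio_context:
        return (
            "Audio context (what's being discussed around this time):\n"
            f"{audio_context}\n\n{base_prompt}"
        )
    return base_prompt


audio = [
    {"timestamp": 10.0, "text": "Opening the error log."},
    {"timestamp": 55.0, "text": "Here is the stack trace."},
    {"timestamp": 120.0, "text": "Moving on to the roadmap."},
]
# A frame at 40s picks up 10s (exactly 30s away, inclusive) and 55s, not 120s.
context = get_audio_context(40.0, audio)
prompt = build_prompt("Describe the screen.", context)
print(prompt)
```

Because the window is symmetric, the model also sees what is said up to 30 seconds after the frame. Since segments are matched by start time only, a long segment that began more than `window` seconds earlier is skipped even if it is still being spoken.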
--- a/meetus/workflow.py
+++ b/meetus/workflow.py
@@ -0,0 +1,523 @@
 """
 Orchestrate the video processing workflow.
 Coordinates frame extraction, analysis, and transcript merging.
 """
 from pathlib import Path
 import logging
 import os
 import subprocess
 import shutil
 from typing import Dict, Any, Optional
 from .output_manager import OutputManager
 from .cache_manager import CacheManager
 from .frame_extractor import FrameExtractor
 from .ocr_processor import OCRProcessor
 from .vision_processor import VisionProcessor
 from .transcript_merger import TranscriptMerger
 logger = logging.getLogger(__name__)
 class WorkflowConfig:
    """Configuration for the processing workflow."""
    def __init__(self, **kwargs):
        """Initialize configuration from keyword arguments."""
        # Video and paths
        self.video_path = Path(kwargs['video'])
        self.transcript_path = kwargs.get('transcript')
        self.output_dir = kwargs.get('output_dir', 'output')
        self.custom_output = kwargs.get('output')
        # Whisper options
        self.run_whisper = kwargs.get('run_whisper', False)
        self.whisper_model = kwargs.get('whisper_model', 'medium')
        self.diarize = kwargs.get('diarize', False)
        # Frame extraction
        self.scene_detection = kwargs.get('scene_detection', False)
        self.scene_threshold = kwargs.get('scene_threshold', 15.0)
        self.interval = kwargs.get('interval', 5)
        # Analysis options
        self.use_vision = kwargs.get('use_vision', False)
        self.use_hybrid = kwargs.get('use_hybrid', False)
        self.hybrid_llm_cleanup = kwargs.get('hybrid_llm_cleanup', False)
        self.hybrid_llm_model = kwargs.get('hybrid_llm_model', 'llama3.2:3b')
        self.vision_model = kwargs.get('vision_model', 'llava:13b')
        self.vision_context = kwargs.get('vision_context', 'meeting')
        self.ocr_engine = kwargs.get('ocr_engine', 'tesseract')
        # Validation: can't use both vision and hybrid
        if self.use_vision and self.use_hybrid:
            raise ValueError("Cannot use both --use-vision and --use-hybrid. Choose one.")
        # Validation: LLM cleanup requires hybrid mode
        if self.hybrid_llm_cleanup and not self.use_hybrid:
            raise ValueError("--hybrid-llm-cleanup requires --use-hybrid")
        # Processing options
        self.no_deduplicate = kwargs.get('no_deduplicate', False)
        self.no_cache = kwargs.get('no_cache', False)
        self.skip_cache_frames = kwargs.get('skip_cache_frames', False)
        self.skip_cache_whisper = kwargs.get('skip_cache_whisper', False)
        self.skip_cache_analysis = kwargs.get('skip_cache_analysis', False)
        self.extract_only = kwargs.get('extract_only', False)
        self.format = kwargs.get('format', 'detailed')
        self.embed_images = kwargs.get('embed_images', False)
        self.embed_quality = kwargs.get('embed_quality', 80)
    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary for manifest."""
        return {
            "whisper": {
                "enabled": self.run_whisper,
                "model": self.whisper_model
            },
            "frame_extraction": {
                "method": "scene_detection" if self.scene_detection else "interval",
                "interval_seconds": self.interval if not self.scene_detection else None,
                "scene_threshold": self.scene_threshold if self.scene_detection else None
            },
            "analysis": {
                "method": "vision" if self.use_vision else ("hybrid" if self.use_hybrid else "ocr"),
                "vision_model": self.vision_model if self.use_vision else None,
                "vision_context": self.vision_context if self.use_vision else None,
                "ocr_engine": self.ocr_engine if (not self.use_vision) else None,
                "deduplication": not self.no_deduplicate
            },
            "output_format": self.format
        }
 class ProcessingWorkflow:
    """Orchestrate the complete video processing workflow."""
    def __init__(self, config: WorkflowConfig):
        """
        Initialize workflow.
        Args:
            config: Workflow configuration
        """
        self.config = config
        self.output_mgr = OutputManager(
            config.video_path,
            config.output_dir,
            use_cache=not config.no_cache
        )
        self.cache_mgr = CacheManager(
            self.output_mgr.output_dir,
            self.output_mgr.frames_dir,
            config.video_path.stem,
            use_cache=not config.no_cache,
            skip_cache_frames=config.skip_cache_frames,
            skip_cache_whisper=config.skip_cache_whisper,
            skip_cache_analysis=config.skip_cache_analysis
        )
    def run(self) -> Dict[str, Any]:
        """
        Run the complete processing workflow.
        Returns:
            Dictionary with output paths and status
        """
        logger.info("=" * 80)
        logger.info("MEETING PROCESSOR")
        logger.info("=" * 80)
        logger.info(f"Video: {self.config.video_path.name}")
        # Determine analysis method
        if self.config.use_vision:
            analysis_method = f"Vision Model ({self.config.vision_model})"
            logger.info(f"Analysis: {analysis_method}")
            logger.info(f"Context: {self.config.vision_context}")
        elif self.config.use_hybrid:
            analysis_method = f"Hybrid (OpenCV + {self.config.ocr_engine})"
            logger.info(f"Analysis: {analysis_method}")
        else:
            analysis_method = f"OCR ({self.config.ocr_engine})"
            logger.info(f"Analysis: {analysis_method}")
        logger.info(f"Frame extraction: {'Scene detection' if self.config.scene_detection else f'Every {self.config.interval}s'}")
        logger.info(f"Caching: {'Disabled' if self.config.no_cache else 'Enabled'}")
        logger.info("=" * 80)
        # Step 0: Whisper transcription
        transcript_path = self._run_whisper()
        # Step 1: Extract frames
        frames_info = self._extract_frames()
        if not frames_info:
            logger.error("No frames extracted")
            raise RuntimeError("Frame extraction failed")
        # Step 2: Analyze frames
        screen_segments = self._analyze_frames(frames_info)
        if self.config.extract_only:
            logger.info("Done! (extract-only mode)")
            return self._build_result(transcript_path, screen_segments)
        # Step 3: Merge with transcript
        enhanced_transcript = self._merge_transcripts(transcript_path, screen_segments)
        # Save manifest
        self.output_mgr.save_manifest(self.config.to_dict())
        # Build final result
        return self._build_result(transcript_path, screen_segments, enhanced_transcript)
    def _run_whisper(self) -> Optional[str]:
        """Run Whisper transcription if requested, or use cached/provided transcript."""
        # First, check cache (regardless of run_whisper flag)
        cached = self.cache_mgr.get_whisper_cache()
        if cached:
            return str(cached)
        # If no cache and not running whisper/diarize, use provided transcript path (if any)
        if not self.config.run_whisper and not self.config.diarize:
            return self.config.transcript_path
        logger.info("=" * 80)
        logger.info("STEP 0: Running Whisper Transcription")
        logger.info("=" * 80)
        # Determine which transcription tool to use
        use_diarize = self.config.diarize
        if use_diarize:
            if not shutil.which("whisperx"):
                logger.error("WhisperX is not installed. Install it with: pip install whisperx")
                raise RuntimeError("WhisperX not installed (required for --diarize)")
            transcribe_cmd = "whisperx"
        else:
            if not shutil.which("whisper"):
                logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
                raise RuntimeError("Whisper not installed")
            transcribe_cmd = "whisper"
        # Unload Ollama model to free GPU memory for Whisper (if using vision)
        if self.config.use_vision:
            logger.info("Freeing GPU memory for Whisper...")
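            try:
                result = subprocess.run(["ollama", "stop", self.config.vision_model],
                                        capture_output=True, check=False)
                # check=False never raises on failure, so inspect the exit code before reporting success
                if result.returncode == 0:
                    logger.info("✓ Ollama model unloaded")
                else:
                    logger.warning(f"Could not unload Ollama model (exit code {result.returncode})")
            except Exception as e:
                logger.warning(f"Could not unload Ollama model: {e}")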
        if use_diarize:
            logger.info(f"Running WhisperX transcription with diarization (model: {self.config.whisper_model})...")
        else:
            logger.info(f"Running Whisper transcription (model: {self.config.whisper_model})...")
        logger.info("This may take a few minutes depending on video length...")
        # Build command
        cmd = [
            transcribe_cmd,
            str(self.config.video_path),
            "--model", self.config.whisper_model,
            "--output_format", "json",
            "--output_dir", str(self.output_mgr.output_dir),
        ]
        if use_diarize:
            cmd.append("--diarize")
        try:
            # Set up environment with cuDNN library path for whisperx
            env = os.environ.copy()
            if use_diarize:
                import site
                site_packages = site.getsitepackages()[0]
                cudnn_path = Path(site_packages) / "nvidia" / "cudnn" / "lib"
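                if cudnn_path.exists():
                    # Prepend cuDNN; avoid a trailing separator (an empty entry means the current directory)
                    existing = env.get("LD_LIBRARY_PATH")
                    env["LD_LIBRARY_PATH"] = f"{cudnn_path}{os.pathsep}{existing}" if existing else str(cudnn_path)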
            subprocess.run(cmd, check=True, capture_output=True, text=True, env=env)
            transcript_path = self.output_mgr.get_path(f"{self.config.video_path.stem}.json")
            if transcript_path.exists():
                logger.info(f"✓ Whisper transcription completed: {transcript_path.name}")
                # Debug: Show transcript preview
                try:
                    import json
                    with open(transcript_path, 'r', encoding='utf-8') as f:
                        whisper_data = json.load(f)
                    if 'segments' in whisper_data:
                        logger.debug(f"Whisper produced {len(whisper_data['segments'])} segments")
                        if whisper_data['segments']:
                            logger.debug(f"First segment: {whisper_data['segments'][0]}")
                            logger.debug(f"Last segment: {whisper_data['segments'][-1]}")
                    if 'text' in whisper_data:
                        text = whisper_data['text']
                        text_preview = (text[:200] + "...") if len(text) > 200 else text
                        logger.debug(f"Transcript preview: {text_preview}")
                except Exception as e:
                    logger.debug(f"Could not parse whisper output for debug: {e}")
                logger.info("")
                return str(transcript_path)
            else:
                logger.error("Whisper completed but transcript file not found")
                raise RuntimeError("Whisper output missing")
        except subprocess.CalledProcessError as e:
            logger.error(f"{transcribe_cmd} failed: {e.stderr}")
            raise
    def _extract_frames(self):
        """Extract frames from video."""
        logger.info("Step 1: Extracting frames from video...")
        # Check cache
        cached_frames = self.cache_mgr.get_frames_cache()
        if cached_frames:
            return cached_frames
        # Clean up old frames if regenerating
        if self.config.skip_cache_frames and self.output_mgr.frames_dir.exists():
            old_frames = list(self.output_mgr.frames_dir.glob("*.jpg"))
            if old_frames:
                logger.info(f"Cleaning up {len(old_frames)} old frames...")
                for old_frame in old_frames:
                    old_frame.unlink()
                logger.info("✓ Cleanup complete")
        # Extract frames (use embed quality so saved files match embedded images)
        if self.config.scene_detection:
            logger.info(f"Extracting frames with scene detection (threshold={self.config.scene_threshold})...")
        else:
            logger.info(f"Extracting frames every {self.config.interval}s...")
        extractor = FrameExtractor(
            str(self.config.video_path),
            str(self.output_mgr.frames_dir),
            quality=self.config.embed_quality
        )
        if self.config.scene_detection:
            frames_info = extractor.extract_scene_changes(threshold=self.config.scene_threshold)
        else:
            frames_info = extractor.extract_by_interval(self.config.interval)
        logger.info(f"✓ Extracted {len(frames_info)} frames")
        return frames_info
    def _analyze_frames(self, frames_info):
        """Analyze frames with vision, hybrid, or OCR."""
        # Skip analysis if just embedding images
        if self.config.embed_images:
            logger.info("Step 2: Skipping analysis (images will be embedded)")
            # Create minimal segments with just frame paths and timestamps
            screen_segments = [
                {
                    'timestamp': timestamp,
                    'text': '',  # No text extraction needed
                    'frame_path': frame_path
                }
                for frame_path, timestamp in frames_info
            ]
            logger.info(f"✓ Prepared {len(screen_segments)} frames for embedding")
            return screen_segments
        # Determine analysis type
        if self.config.use_vision:
            analysis_type = 'vision'
        elif self.config.use_hybrid:
            analysis_type = 'hybrid'
        else:
            analysis_type = 'ocr'
        # Check cache
        cached_analysis = self.cache_mgr.get_analysis_cache(analysis_type)
        if cached_analysis:
            return cached_analysis
        if self.config.use_vision:
            return self._run_vision_analysis(frames_info)
        elif self.config.use_hybrid:
            return self._run_hybrid_analysis(frames_info)
        else:
            return self._run_ocr_analysis(frames_info)
    def _run_vision_analysis(self, frames_info):
        """Run vision analysis on frames."""
        logger.info("Step 2: Running vision analysis on extracted frames...")
        logger.info(f"Loading vision model {self.config.vision_model} to GPU...")
        # Load audio segments for context if transcript exists
        audio_segments = []
        transcript_path = self.config.transcript_path or self._get_cached_transcript()
        if transcript_path:
            transcript_file = Path(transcript_path)
            if transcript_file.exists():
                logger.info("Loading audio transcript for context...")
                merger = TranscriptMerger()
                audio_segments = merger.load_whisper_transcript(str(transcript_file))
                logger.info(f"✓ Loaded {len(audio_segments)} audio segments for context")
        try:
            vision = VisionProcessor(model=self.config.vision_model)
            screen_segments = vision.process_frames(
                frames_info,
                context=self.config.vision_context,
                deduplicate=not self.config.no_deduplicate,
                audio_segments=audio_segments
            )
            logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
            # Debug: Show sample analysis results
            if screen_segments:
                logger.debug(f"First analysis result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
                logger.debug(f"First analysis text preview: {screen_segments[0].get('text', '')[:200]}...")
                if len(screen_segments) > 1:
                    logger.debug(f"Last analysis result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
            # Cache results
            self.cache_mgr.save_analysis('vision', screen_segments)
            return screen_segments
        except ImportError as e:
            logger.error(f"Vision analysis unavailable: {e}")
            raise
    def _get_cached_transcript(self) -> Optional[str]:
        """Get cached Whisper transcript if available."""
        cached = self.cache_mgr.get_whisper_cache()
        return str(cached) if cached else None
    def _run_hybrid_analysis(self, frames_info):
        """Run hybrid analysis on frames (OpenCV + OCR)."""
        if self.config.hybrid_llm_cleanup:
            logger.info("Step 2: Running hybrid analysis (OpenCV + OCR + LLM cleanup)...")
        else:
            logger.info("Step 2: Running hybrid analysis (OpenCV text detection + OCR)...")
        try:
            from .hybrid_processor import HybridProcessor
            hybrid = HybridProcessor(
                ocr_engine=self.config.ocr_engine,
                use_llm_cleanup=self.config.hybrid_llm_cleanup,
                llm_model=self.config.hybrid_llm_model
            )
            screen_segments = hybrid.process_frames(
                frames_info,
                deduplicate=not self.config.no_deduplicate
            )
            logger.info(f"✓ Processed {len(screen_segments)} frames with hybrid analysis")
            # Debug: Show sample hybrid results
            if screen_segments:
                logger.debug(f"First hybrid result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
                logger.debug(f"First hybrid text preview: {screen_segments[0].get('text', '')[:200]}...")
                if len(screen_segments) > 1:
                    logger.debug(f"Last hybrid result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
            # Cache results
            self.cache_mgr.save_analysis('hybrid', screen_segments)
            return screen_segments
        except ImportError as e:
            logger.error(f"Hybrid analysis unavailable: {e}")
            raise
    def _run_ocr_analysis(self, frames_info):
        """Run OCR analysis on frames."""
        logger.info("Step 2: Running OCR on extracted frames...")
        try:
            ocr = OCRProcessor(engine=self.config.ocr_engine)
            screen_segments = ocr.process_frames(
                frames_info,
                deduplicate=not self.config.no_deduplicate
            )
            logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
            # Debug: Show sample OCR results
            if screen_segments:
                logger.debug(f"First OCR result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
                logger.debug(f"First OCR text preview: {screen_segments[0].get('text', '')[:200]}...")
                if len(screen_segments) > 1:
                    logger.debug(f"Last OCR result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
            # Cache results
            self.cache_mgr.save_analysis('ocr', screen_segments)
            return screen_segments
        except ImportError as e:
            logger.error(f"{e}")
            logger.error(f"To install {self.config.ocr_engine}:")
            logger.error(f"  pip install {self.config.ocr_engine}")
            raise
    def _merge_transcripts(self, transcript_path, screen_segments):
        """Merge audio and screen transcripts."""
        merger = TranscriptMerger(
            embed_images=self.config.embed_images,
            embed_quality=self.config.embed_quality
        )
        # Load audio transcript if available
        audio_segments = []
        if transcript_path:
            logger.info("Step 3: Merging with Whisper transcript...")
            transcript_file = Path(transcript_path)
            if not transcript_file.exists():
                logger.warning(f"Transcript not found: {transcript_path}")
                logger.info("Proceeding with screen content only...")
            else:
                # Group audio into 30-second intervals for cleaner reference timestamps
                audio_segments = merger.load_whisper_transcript(str(transcript_file), group_interval=30)
                logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
        else:
            logger.info("No transcript provided, using screen content only...")
        # Merge and format
        merged = merger.merge_transcripts(audio_segments, screen_segments)
        formatted = merger.format_for_claude(merged, format_style=self.config.format)
        # Save output
        if self.config.custom_output:
            output_path = self.config.custom_output
        else:
            output_path = self.output_mgr.get_path(f"{self.config.video_path.stem}_enhanced.txt")
        merger.save_transcript(formatted, str(output_path))
        logger.info("=" * 80)
        logger.info("✓ PROCESSING COMPLETE!")
        logger.info("=" * 80)
        logger.info(f"Output directory: {self.output_mgr.output_dir}")
        logger.info(f"Enhanced transcript: {Path(output_path).name}")
        logger.info("")
        return output_path
    def _build_result(self, transcript_path=None, screen_segments=None, enhanced_transcript=None):
        """Build result dictionary."""
        # Determine analysis filename
        if self.config.use_vision:
            analysis_type = 'vision'
        elif self.config.use_hybrid:
            analysis_type = 'hybrid'
        else:
            analysis_type = 'ocr'
        return {
            "output_dir": str(self.output_mgr.output_dir),
            "transcript": transcript_path,
            "analysis": f"{self.config.video_path.stem}_{analysis_type}.json",
            "frames_count": len(screen_segments) if screen_segments else 0,
            "enhanced_transcript": enhanced_transcript,
            "manifest": str(self.output_mgr.get_path("manifest.json"))
        }
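
Note on `meetus/workflow.py`: the vision / hybrid / OCR decision is repeated in `to_dict`, `_analyze_frames`, and `_build_result`. The sketch below shows one way the flag validation and that decision could live in a single place. `AnalysisFlags` and `analysis_type` are hypothetical names used only for illustration; they are not part of this commit, and the class models only the three flags involved rather than the full `WorkflowConfig`.

```python
from dataclasses import dataclass


@dataclass
class AnalysisFlags:
    """Hypothetical stand-in for the analysis-related fields of WorkflowConfig."""
    use_vision: bool = False
    use_hybrid: bool = False
    hybrid_llm_cleanup: bool = False

    def __post_init__(self):
        # Same two checks that WorkflowConfig.__init__ performs
        if self.use_vision and self.use_hybrid:
            raise ValueError("Cannot use both --use-vision and --use-hybrid. Choose one.")
        if self.hybrid_llm_cleanup and not self.use_hybrid:
            raise ValueError("--hybrid-llm-cleanup requires --use-hybrid")

    @property
    def analysis_type(self) -> str:
        # Single place for the vision / hybrid / ocr decision
        if self.use_vision:
            return "vision"
        if self.use_hybrid:
            return "hybrid"
        return "ocr"


if __name__ == "__main__":
    print(AnalysisFlags(use_hybrid=True, hybrid_llm_cleanup=True).analysis_type)
```

With a property like this, the cache key in `_analyze_frames` and the analysis filename in `_build_result` could both be derived from the same value, so the two cannot drift apart.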
--- a/process_meeting.py
+++ b/process_meeting.py
@@ -1,34 +1,19 @@
 #!/usr/bin/env python3
 """
 Process meeting recordings to extract audio + screen content.
-Combines Whisper transcripts with OCR from screen shares.
+Combines Whisper transcripts with vision analysis or OCR from screen shares.
 """
 import argparse
 from pathlib import Path
 import sys
 import json
 import logging
 import subprocess
 import shutil
-from meetus.frame_extractor import FrameExtractor
+from meetus.workflow import WorkflowConfig, ProcessingWorkflow
 from meetus.ocr_processor import OCRProcessor
 from meetus.vision_processor import VisionProcessor
 from meetus.transcript_merger import TranscriptMerger
 logger = logging.getLogger(__name__)
 def setup_logging(verbose: bool = False):
-    """
+    """Configure logging for the application."""
    Configure logging for the application.
    Args:
        verbose: If True, set DEBUG level, otherwise INFO
    """
    level = logging.DEBUG if verbose else logging.INFO
    # Configure root logger
    logging.basicConfig(
        level=level,
        format='%(asctime)s - %(levelname)s - %(message)s',
@@ -41,158 +26,121 @@ def setup_logging(verbose: bool = False):
    logging.getLogger('paddleocr').setLevel(logging.WARNING)
 def run_whisper(video_path: Path, model: str = "base", output_dir: str = "output") -> Path:
    """
    Run Whisper transcription on video file.
    Args:
        video_path: Path to video file
        model: Whisper model to use (tiny, base, small, medium, large)
        output_dir: Directory to save output
    Returns:
        Path to generated JSON transcript
    """
    # Check if whisper is installed
    if not shutil.which("whisper"):
        logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
        sys.exit(1)
    logger.info(f"Running Whisper transcription (model: {model})...")
    logger.info("This may take a few minutes depending on video length...")
    # Run whisper command
    cmd = [
        "whisper",
        str(video_path),
        "--model", model,
        "--output_format", "json",
        "--output_dir", output_dir
    ]
    try:
        result = subprocess.run(
            cmd,
            check=True,
            capture_output=True,
            text=True
        )
        # Whisper outputs to <output_dir>/<video_stem>.json
        transcript_path = Path(output_dir) / f"{video_path.stem}.json"
        if transcript_path.exists():
            logger.info(f"✓ Whisper transcription completed: {transcript_path}")
            return transcript_path
        else:
            logger.error("Whisper completed but transcript file not found")
            sys.exit(1)
    except subprocess.CalledProcessError as e:
        logger.error(f"Whisper failed: {e.stderr}")
        sys.exit(1)
 def main():
    parser = argparse.ArgumentParser(
        description="Extract screen content from meeting recordings and merge with transcripts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
-  # Run Whisper + vision analysis (recommended for code/dashboards)
+  # Reference frames for LLM analysis (recommended - transcript includes frame paths)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+  python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
-  # Use vision with specific context hint
+  # Adjust frame extraction quality (lower = smaller files)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+  python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection
-  # Traditional OCR approach
+  # Hybrid approach: OpenCV + OCR (extracts text from frames)
-  python process_meeting.py samples/meeting.mkv --run-whisper
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --scene-detection
-  # Re-run analysis using cached frames and transcript
+  # Hybrid + LLM cleanup (best for code formatting)
-  python process_meeting.py samples/meeting.mkv --use-vision
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --hybrid-llm-cleanup --scene-detection
-  # Force reprocessing (ignore cache)
+  # Iterate on scene threshold (reuse whisper transcript)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
+  python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
  # Use scene detection for fewer frames
  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
        """
    )
    # Required arguments
    parser.add_argument(
        'video',
        help='Path to video file'
    )
    # Whisper options
    parser.add_argument(
        '--transcript', '-t',
        help='Path to Whisper transcript (JSON or TXT)',
        default=None
    )
    parser.add_argument(
        '--run-whisper',
        action='store_true',
        help='Run Whisper transcription before processing'
    )
    parser.add_argument(
        '--whisper-model',
        choices=['tiny', 'base', 'small', 'medium', 'large'],
-        help='Whisper model to use (default: base)',
+        help='Whisper model to use (default: medium)',
-        default='base'
+        default='medium'
    )
    parser.add_argument(
        '--diarize',
        action='store_true',
        help='Use WhisperX with speaker diarization (requires whisperx and HuggingFace token)'
    )
    # Output options
    parser.add_argument(
        '--output', '-o',
-        help='Output file for enhanced transcript (default: output/<video>_enhanced.txt)',
+        help='Output file for enhanced transcript (default: auto-generated in output directory)',
        default=None
    )
    parser.add_argument(
        '--output-dir',
-        help='Directory for output files (default: output/)',
+        help='Base directory for outputs (default: output/)',
        default='output'
    )
-    parser.add_argument(
+    # Frame extraction options
        '--frames-dir',
        help='Directory to save extracted frames (default: frames/)',
        default='frames'
    )
    parser.add_argument(
        '--interval',
        type=int,
        help='Extract frame every N seconds (default: 5)',
        default=5
    )
    parser.add_argument(
        '--scene-detection',
        action='store_true',
        help='Use scene detection instead of interval extraction'
    )
    parser.add_argument(
        '--scene-threshold',
        type=float,
        help='Scene detection threshold (0-100, lower=more sensitive, default: 15)',
        default=15.0
    )
    # Analysis options
    parser.add_argument(
        '--ocr-engine',
        choices=['tesseract', 'easyocr', 'paddleocr'],
        help='OCR engine to use (default: tesseract)',
        default='tesseract'
    )
    parser.add_argument(
        '--use-vision',
        action='store_true',
        help='Use local vision model (Ollama) instead of OCR for better context understanding'
    )
-
+    parser.add_argument(
        '--use-hybrid',
        action='store_true',
        help='Use hybrid approach: OpenCV text detection + OCR (more accurate than vision models)'
    )
    parser.add_argument(
        '--hybrid-llm-cleanup',
        action='store_true',
        help='Use LLM to clean up OCR output and preserve code formatting (requires --use-hybrid)'
    )
    parser.add_argument(
        '--hybrid-llm-model',
        help='LLM model for cleanup (default: llama3.2:3b)',
        default='llama3.2:3b'
    )
    parser.add_argument(
        '--vision-model',
        help='Vision model to use with Ollama (default: llava:13b)',
        default='llava:13b'
    )
    parser.add_argument(
        '--vision-context',
        choices=['meeting', 'dashboard', 'code', 'console'],
@@ -200,31 +148,56 @@ Examples:
        default='meeting'
    )
    # Processing options
    parser.add_argument(
        '--no-cache',
        action='store_true',
        help='Disable caching - reprocess everything even if outputs exist'
    )
-
+    parser.add_argument(
        '--skip-cache-frames',
        action='store_true',
        help='Skip cached frames, re-extract from video (but keep whisper/analysis cache)'
    )
    parser.add_argument(
        '--skip-cache-whisper',
        action='store_true',
        help='Skip cached whisper transcript, re-run transcription (but keep frames/analysis cache)'
    )
    parser.add_argument(
        '--skip-cache-analysis',
        action='store_true',
        help='Skip cached analysis, re-run OCR/vision (but keep frames/whisper cache)'
    )
    parser.add_argument(
        '--no-deduplicate',
        action='store_true',
        help='Disable text deduplication'
    )
    parser.add_argument(
        '--extract-only',
        action='store_true',
-        help='Only extract frames and OCR, skip transcript merging'
+        help='Only extract frames and analyze, skip transcript merging'
    )
    parser.add_argument(
        '--format',
        choices=['detailed', 'compact'],
        help='Output format style (default: detailed)',
        default='detailed'
    )
    parser.add_argument(
        '--embed-images',
        action='store_true',
        help='Skip OCR/vision analysis and reference frame files directly (faster, lets LLM analyze images)'
    )
    parser.add_argument(
        '--embed-quality',
        type=int,
        help='JPEG quality for extracted frames (default: 80, lower = smaller files)',
        default=80
    )
    # Logging
    parser.add_argument(
        '--verbose', '-v',
        action='store_true',
@@ -236,166 +209,38 @@ Examples:
    # Setup logging
    setup_logging(args.verbose)
-    # Validate video path
+    try:
-    video_path = Path(args.video)
+        # Create workflow configuration
-    if not video_path.exists():
+        config = WorkflowConfig(**vars(args))
        logger.error(f"Video file not found: {args.video}")
        sys.exit(1)
-    # Create output directory
+        # Run processing workflow
-    output_dir = Path(args.output_dir)
+        workflow = ProcessingWorkflow(config)
-    output_dir.mkdir(parents=True, exist_ok=True)
+        result = workflow.run()
-    # Set default output path
+        # Print final summary
-    if args.output is None:
+        print("\n" + "=" * 80)
-        args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
+        print("✓ SUCCESS!")
        print("=" * 80)
        print(f"Output directory: {result['output_dir']}")
        if result.get('enhanced_transcript'):
            print("Enhanced transcript ready for AI summarization!")
        print("=" * 80)
-    # Define cache paths
+        return 0
    whisper_cache = output_dir / f"{video_path.stem}.json"
    analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
    frames_cache_dir = Path(args.frames_dir)
-    # Check for cached Whisper transcript
+    except FileNotFoundError as e:
-    if args.run_whisper:
+        logging.error(f"File not found: {e}")
-        if not args.no_cache and whisper_cache.exists():
+        return 1
-            logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
+    except RuntimeError as e:
-            args.transcript = str(whisper_cache)
+        logging.error(f"Processing failed: {e}")
-        else:
+        return 1
-            logger.info("=" * 80)
+    except KeyboardInterrupt:
-            logger.info("STEP 0: Running Whisper Transcription")
+        logging.warning("\nProcessing interrupted by user")
-            logger.info("=" * 80)
+        return 130
-            transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
+    except Exception as e:
-            args.transcript = str(transcript_path)
+        logging.exception(f"Unexpected error: {e}")
-            logger.info("")
+        return 1
-    logger.info("=" * 80)
-    logger.info("MEETING PROCESSOR")
-    logger.info("=" * 80)
-    logger.info(f"Video: {video_path.name}")
-    logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
-    if args.use_vision:
-        logger.info(f"Vision Model: {args.vision_model}")
-        logger.info(f"Context: {args.vision_context}")
-    logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
-    if args.transcript:
-        logger.info(f"Transcript: {args.transcript}")
-    logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
-    logger.info("=" * 80)
-    # Step 1: Extract frames (with caching)
-    logger.info("Step 1: Extracting frames from video...")
-    # Check if frames already exist
-    existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
-    if not args.no_cache and existing_frames and len(existing_frames) > 0:
-        logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
-        # Build frames_info from existing files
-        frames_info = []
-        for frame_path in sorted(existing_frames):
-            # Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
-            try:
-                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
-                timestamp = float(timestamp_str)
-            except:
-                timestamp = 0.0
-            frames_info.append((str(frame_path), timestamp))
-    else:
-        extractor = FrameExtractor(str(video_path), args.frames_dir)
-        if args.scene_detection:
-            frames_info = extractor.extract_scene_changes()
-        else:
-            frames_info = extractor.extract_by_interval(args.interval)
-        if not frames_info:
-            logger.error("No frames extracted")
-            sys.exit(1)
-        logger.info(f"✓ Extracted {len(frames_info)} frames")
-    # Step 2: Run analysis on frames (with caching)
-    if not args.no_cache and analysis_cache.exists():
-        logger.info(f"✓ Found cached analysis results: {analysis_cache}")
-        with open(analysis_cache, 'r', encoding='utf-8') as f:
-            screen_segments = json.load(f)
-        logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
-    else:
-        if args.use_vision:
-            # Use vision model
-            logger.info("Step 2: Running vision analysis on extracted frames...")
-            try:
-                vision = VisionProcessor(model=args.vision_model)
-                screen_segments = vision.process_frames(
-                    frames_info,
-                    context=args.vision_context,
-                    deduplicate=not args.no_deduplicate
-                )
-                logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
-            except ImportError as e:
-                logger.error(f"{e}")
-                sys.exit(1)
-        else:
-            # Use OCR
-            logger.info("Step 2: Running OCR on extracted frames...")
-            try:
-                ocr = OCRProcessor(engine=args.ocr_engine)
-                screen_segments = ocr.process_frames(
-                    frames_info,
-                    deduplicate=not args.no_deduplicate
-                )
-                logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
-            except ImportError as e:
-                logger.error(f"{e}")
-                logger.error(f"To install {args.ocr_engine}:")
-                logger.error(f"  pip install {args.ocr_engine}")
-                sys.exit(1)
-        # Save analysis results as JSON
-        with open(analysis_cache, 'w', encoding='utf-8') as f:
-            json.dump(screen_segments, f, indent=2, ensure_ascii=False)
-        logger.info(f"✓ Saved analysis results to: {analysis_cache}")
-    if args.extract_only:
-        logger.info("Done! (extract-only mode)")
-        return
-    # Step 3: Merge with transcript (if provided)
-    merger = TranscriptMerger()
-    if args.transcript:
-        logger.info("Step 3: Merging with Whisper transcript...")
-        transcript_path = Path(args.transcript)
-        if not transcript_path.exists():
-            logger.warning(f"Transcript not found: {args.transcript}")
-            logger.info("Proceeding with screen content only...")
-            audio_segments = []
-        else:
-            audio_segments = merger.load_whisper_transcript(str(transcript_path))
-            logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
-    else:
-        logger.info("No transcript provided, using screen content only...")
-        audio_segments = []
-    # Merge and format
-    merged = merger.merge_transcripts(audio_segments, screen_segments)
-    formatted = merger.format_for_claude(merged, format_style=args.format)
-    # Save output
-    merger.save_transcript(formatted, args.output)
-    logger.info("=" * 80)
-    logger.info("✓ PROCESSING COMPLETE!")
-    logger.info("=" * 80)
-    logger.info(f"Enhanced transcript: {args.output}")
-    logger.info(f"OCR data: {ocr_output}")
-    logger.info(f"Frames: {args.frames_dir}/")
-    logger.info("")
-    logger.info("You can now use the enhanced transcript with Claude for summarization!")
 if __name__ == '__main__':
-    main()
+    sys.exit(main())
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,6 +1,7 @@
 # Core dependencies
 opencv-python>=4.8.0
 Pillow>=10.0.0
 ffmpeg-python>=0.2.0
 # Vision analysis (recommended for better results)
 # Requires Ollama to be installed: https://ollama.ai/download
Author	SHA1	Message	Date
Mariano Gabriel	eb8b1f4f11	updated readme	2025-12-04 20:24:52 -03:00
Mariano Gabriel	331cccb15f	updated readme	2025-12-04 20:15:16 -03:00
Mariano Gabriel	7d7ec15ff7	add whisperx support	2025-12-03 06:48:45 -03:00
Mariano Gabriel	7b919beda6	add whisperx support	2025-12-02 02:33:39 -03:00
Mariano Gabriel	118ef04223	embed images	2025-10-28 08:02:45 -03:00
Mariano Gabriel	b1e1daf278	scene detection quality and caching	2025-10-28 05:52:31 -03:00
Mariano Gabriel	c871af2def	group text	2025-10-23 14:49:14 -03:00
Mariano Gabriel	cdf7ad1199	update prompts	2025-10-20 17:36:31 -03:00
Mariano Gabriel	b9c3cbfbab	take turns using the GPU	2025-10-20 01:12:13 -03:00
Mariano Gabriel	cd7b0aed07	refactor	2025-10-20 00:03:41 -03:00