add vision processor
README.md (241 changed lines)
@@ -1,12 +1,17 @@
 # Meeting Processor
 
-Extract screen content from meeting recordings and merge with Whisper transcripts for better Claude summarization.
+Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
 
 ## Overview
 
 This tool enhances meeting transcripts by combining:
 - **Audio transcription** (from Whisper)
-- **Screen content** (OCR from screen shares)
+- **Screen content analysis** (Vision models or OCR)
+
+### Vision Analysis vs OCR
+
+- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
+- **OCR**: Traditional text extraction - faster but less context-aware
 
 The result is a rich, timestamped transcript that provides full context for AI summarization.
 
@@ -14,16 +19,13 @@ The result is a rich, timestamped transcript that provides full context for AI s
 
 ### 1. System Dependencies
 
-**Tesseract OCR** (recommended):
+**Ollama** (required for vision analysis):
 ```bash
-# Ubuntu/Debian
-sudo apt-get install tesseract-ocr
-
-# macOS
-brew install tesseract
-
-# Arch Linux
-sudo pacman -S tesseract
+# Install from https://ollama.ai/download
+# Then pull a vision model:
+ollama pull llava:13b
+# or for lighter model:
+ollama pull llava:7b
 ```
 
 **FFmpeg** (for scene detection):
@@ -35,6 +37,18 @@ sudo apt-get install ffmpeg
 brew install ffmpeg
 ```
 
+**Tesseract OCR** (optional, if not using vision):
+```bash
+# Ubuntu/Debian
+sudo apt-get install tesseract-ocr
+
+# macOS
+brew install tesseract
+
+# Arch Linux
+sudo pacman -S tesseract
+```
+
 ### 2. Python Dependencies
 
 ```bash
@@ -49,6 +63,7 @@ pip install openai-whisper
 
 ### 4. Optional: Install Alternative OCR Engines
 
+If you prefer OCR over vision analysis:
 ```bash
 # EasyOCR (better for rotated/handwritten text)
 pip install easyocr
@@ -59,118 +74,173 @@ pip install paddleocr
 
 ## Quick Start
 
-### Recommended: Run Everything in One Command
+### Recommended: Vision Analysis (Best for Code/Dashboards)
 
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
 ```
 
 This will:
 1. Run Whisper transcription (audio → text)
 2. Extract frames every 5 seconds
-3. Run OCR to extract screen text
+3. Use LLaVA vision model to analyze frames with context
 4. Merge audio + screen content
 5. Save everything to `output/` folder
 
-### Alternative: Use Existing Whisper Transcript
+### Re-run with Cached Results
 
-If you already have a Whisper transcript:
+Already ran it once? Re-run instantly using cached results:
 ```bash
-python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
+# Uses cached transcript, frames, and analysis
+python process_meeting.py samples/meeting.mkv --use-vision
+
+# Force reprocessing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 ```
 
-### Screen Content Only (No Audio)
+### Traditional OCR (Faster, Less Context-Aware)
 
 ```bash
-python process_meeting.py samples/meeting.mkv
+python process_meeting.py samples/meeting.mkv --run-whisper
 ```
 
 ## Usage Examples
 
-### Run with different Whisper models
+### Vision Analysis with Context Hints
 ```bash
-# Tiny model (fastest, less accurate)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model tiny
+# For code-heavy meetings
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
 
-# Small model (balanced)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small
+# For dashboard/monitoring meetings (Grafana, GCP, etc.)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
 
-# Large model (slowest, most accurate)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model large
+# For console/terminal sessions
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
+```
+
+### Different Vision Models
+```bash
+# Lighter/faster model (7B parameters)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
+
+# Default model (13B parameters, better quality)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
+
+# Alternative models
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
 ```
 
 ### Extract frames at different intervals
 ```bash
-# Every 10 seconds (with Whisper)
-python process_meeting.py samples/meeting.mkv --run-whisper --interval 10
+# Every 10 seconds
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
 
 # Every 3 seconds (more detailed)
-python process_meeting.py samples/meeting.mkv --run-whisper --interval 3
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
 ```
 
 ### Use scene detection (smarter, fewer frames)
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --scene-detection
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
 ```
 
-### Use different OCR engines
+### Traditional OCR (if you prefer)
 ```bash
-# EasyOCR (good for varied layouts)
+# Tesseract (default)
+python process_meeting.py samples/meeting.mkv --run-whisper
+
+# EasyOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
 
-# PaddleOCR (good for code/terminal)
+# PaddleOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
 ```
 
-### Extract frames only (no merging)
+### Caching Examples
 ```bash
-python process_meeting.py samples/meeting.mkv --extract-only
+# First run - processes everything
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Second run - uses cached transcript and frames, only re-merges
+python process_meeting.py samples/meeting.mkv
+
+# Switch from OCR to vision using existing frames
+python process_meeting.py samples/meeting.mkv --use-vision
+
+# Force complete reprocessing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 ```
 
 ### Custom output location
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --output-dir my_outputs/
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
 ```
 
 ### Enable verbose logging
 ```bash
 # Show detailed debug information
-python process_meeting.py samples/meeting.mkv --run-whisper --verbose
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
 ```
 
 ## Output Files
 
 All output files are saved to the `output/` directory by default:
 
-- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for Claude
+- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
 - **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
-- **`output/<video>_ocr.json`** - Raw OCR data with timestamps
+- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
+- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
 - **`frames/`** - Extracted video frames (JPG files)
 
+### Caching Behavior
+
+The tool automatically caches intermediate results to speed up re-runs:
+- **Whisper transcript**: Cached as `output/<video>.json`
+- **Extracted frames**: Cached in `frames/<video>_*.jpg`
+- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
+
+Re-running with the same video will use cached results unless `--no-cache` is specified.
+
 ## Workflow for Meeting Analysis
 
 ### Complete Workflow (One Command!)
 
 ```bash
-# Process everything in one step
-python process_meeting.py samples/alo-intro1.mkv --run-whisper --scene-detection
+# Process everything in one step with vision analysis
+python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
 
 # Output will be in output/alo-intro1_enhanced.txt
 ```
 
+### Typical Iterative Workflow
+
+```bash
+# First run - full processing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Review results, then re-run with different context if needed
+python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
+
+# Or switch to a different vision model
+python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
+
+# All use cached frames and transcript!
+```
+
 ### Traditional Workflow (Separate Steps)
 
 ```bash
 # 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
 whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
 
-# 2. Process video to extract screen content
+# 2. Process video to extract screen content with vision
 python process_meeting.py samples/alo-intro1.mkv \
     --transcript output/alo-intro1.json \
+    --use-vision \
     --scene-detection
 
-# 3. Use the enhanced transcript with Claude
-# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude
+# 3. Use the enhanced transcript with AI
+# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
 ```
 
 ### Example Prompt for Claude
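The enhanced transcript produced above interleaves Whisper audio segments with screen-content segments by timestamp. The merge itself happens in `TranscriptMerger`, whose output format is not shown in this diff; the sketch below is a hypothetical simplification (the `merge_segments` helper and the `[AUDIO]`/`[SCREEN]` tags are illustrative, not the project's actual format).

```python
def merge_segments(audio, screen):
    """Interleave audio and screen segments by timestamp.

    `audio` and `screen` are lists of (timestamp_seconds, text) pairs.
    Returns one chronologically sorted, timestamped line per segment.
    Hypothetical sketch - not the actual TranscriptMerger output format.
    """
    tagged = [(t, f"[AUDIO] {x}") for t, x in audio] + \
             [(t, f"[SCREEN] {x}") for t, x in screen]
    # Sort by timestamp, then render as mm:ss.s lines
    return [f"{int(ts // 60):02d}:{ts % 60:04.1f} {text}"
            for ts, text in sorted(tagged)]
```

Whatever the exact format, sorting both streams into one timeline is what lets the summarizing model see what was on screen while each sentence was spoken.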
@@ -217,42 +287,99 @@ Options:
 
 ## Tips for Best Results
 
+### Vision vs OCR: When to Use Each
+
+**Use Vision Models (`--use-vision`) when:**
+- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
+- ✅ Code walkthroughs or debugging sessions
+- ✅ Complex layouts with mixed content
+- ✅ Need contextual understanding, not just text extraction
+- ✅ Working with charts, graphs, or visualizations
+- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
+
+**Use OCR when:**
+- ✅ Simple text extraction from slides or documents
+- ✅ Need maximum speed
+- ✅ Limited computational resources
+- ✅ Presentations with mostly text
+- ⚠️ Trade-off: Less context-aware, may miss visual relationships
+
+### Context Hints for Vision Analysis
+- **`--vision-context meeting`**: General purpose (default)
+- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
+- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
+- **`--vision-context console`**: Captures commands, output, error messages
+
 ### Scene Detection vs Interval
 - **Scene detection**: Better for presentations with distinct slides. More efficient.
 - **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
 
-### OCR Engine Selection
-- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
-- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
-- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.
+### Vision Model Selection
+- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
+- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
+- **`bakllava`**: Alternative with different strengths
 
 ### Deduplication
 - Enabled by default - removes similar consecutive frames
-- Disable with `--no-deduplicate` if slides change subtly
+- Disable with `--no-deduplicate` if slides/screens change subtly
 
 ## Troubleshooting
 
-### "pytesseract not installed"
+### Vision Model Issues
+
+**"ollama package not installed"**
+```bash
+pip install ollama
+```
+
+**"Ollama not found" or connection errors**
+```bash
+# Install Ollama first: https://ollama.ai/download
+# Then pull a vision model:
+ollama pull llava:13b
+```
+
+**Vision analysis is slow**
+- Use lighter model: `--vision-model llava:7b`
+- Reduce frame count: `--scene-detection` or `--interval 10`
+- Check if Ollama is using GPU (much faster)
+
+**Poor vision analysis results**
+- Try different context hint: `--vision-context code` or `--vision-context dashboard`
+- Use larger model: `--vision-model llava:13b`
+- Ensure frames are clear (check video resolution)
+
+### OCR Issues
+
+**"pytesseract not installed"**
 ```bash
 pip install pytesseract
 sudo apt-get install tesseract-ocr  # Don't forget system package!
 ```
 
-### "No frames extracted"
+**Poor OCR quality**
+- **Solution**: Switch to vision analysis with `--use-vision`
+- Or try different OCR engine: `--ocr-engine easyocr`
+- Check if video resolution is sufficient
+- Use `--no-deduplicate` to keep more frames
+
+### General Issues
+
+**"No frames extracted"**
 - Check video file is valid: `ffmpeg -i video.mkv`
 - Try lower interval: `--interval 3`
 - Check disk space in frames directory
 
-### Poor OCR quality
-- Try different OCR engine
-- Check if video resolution is sufficient
-- Use `--no-deduplicate` to keep more frames
-
-### Scene detection not working
+**Scene detection not working**
 - Fallback to interval extraction automatically
 - Ensure FFmpeg is installed
 - Try manual interval: `--interval 5`
 
+**Cache not being used**
+- Ensure you're using the same video filename
+- Check that output directory contains cached files
+- Use `--verbose` to see what's being cached/loaded
+
 ## Project Structure
 
 ```
meetus/vision_processor.py (new file, 192 lines)
@@ -0,0 +1,192 @@
"""
|
||||||
|
Vision-based frame analysis using local vision-language models via Ollama.
|
||||||
|
Better than OCR for understanding dashboards, code, and console output.
|
||||||
|
"""
|
||||||
|
from typing import List, Tuple, Dict, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
import logging
|
||||||
|
from difflib import SequenceMatcher
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class VisionProcessor:
|
||||||
|
"""Process frames using local vision models via Ollama."""
|
||||||
|
|
||||||
|
def __init__(self, model: str = "llava:13b"):
|
||||||
|
"""
|
||||||
|
Initialize vision processor.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
|
||||||
|
"""
|
||||||
|
self.model = model
|
||||||
|
self._client = None
|
||||||
|
self._init_client()
|
||||||
|
|
||||||
|
def _init_client(self):
|
||||||
|
"""Initialize Ollama client."""
|
||||||
|
try:
|
||||||
|
import ollama
|
||||||
|
self._client = ollama
|
||||||
|
|
||||||
|
# Check if model is available
|
||||||
|
try:
|
||||||
|
models = self._client.list()
|
||||||
|
available_models = [m['name'] for m in models.get('models', [])]
|
||||||
|
|
||||||
|
if self.model not in available_models:
|
||||||
|
logger.warning(f"Model {self.model} not found locally.")
|
||||||
|
logger.info(f"Pulling {self.model}... (this may take a few minutes)")
|
||||||
|
self._client.pull(self.model)
|
||||||
|
logger.info(f"✓ Model {self.model} downloaded")
|
||||||
|
else:
|
||||||
|
logger.info(f"Using local vision model: {self.model}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Could not verify model availability: {e}")
|
||||||
|
logger.info("Attempting to use model anyway...")
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError(
|
||||||
|
"ollama package not installed. Run: pip install ollama\n"
|
||||||
|
"Also install Ollama: https://ollama.ai/download"
|
||||||
|
)
|
||||||
|
|
||||||
|
def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
|
||||||
|
"""
|
||||||
|
Analyze a single frame using local vision model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to image file
|
||||||
|
context: Context hint for analysis (meeting, dashboard, code, console)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Analyzed content description
|
||||||
|
"""
|
||||||
|
# Context-specific prompts
|
||||||
|
prompts = {
|
||||||
|
"meeting": """Analyze this screen capture from a meeting recording. Extract:
|
||||||
|
1. Any visible text (titles, labels, headings)
|
||||||
|
2. Key metrics, numbers, or data points shown
|
||||||
|
3. Dashboard panels or visualizations (describe what they show)
|
||||||
|
4. Code snippets (preserve formatting and context)
|
||||||
|
5. Console/terminal output (commands and results)
|
||||||
|
6. Application names or UI elements
|
||||||
|
|
||||||
|
Focus on information that would help someone understand what was being discussed.
|
||||||
|
Be concise but include all important details. If there's code, preserve it exactly.""",
|
||||||
|
|
||||||
|
"dashboard": """Analyze this dashboard/monitoring panel. Extract:
|
||||||
|
1. Panel titles and metrics names
|
||||||
|
2. Current values and units
|
||||||
|
3. Trends (up/down/stable)
|
||||||
|
4. Alerts or warnings
|
||||||
|
5. Time ranges shown
|
||||||
|
6. Any anomalies or notable patterns
|
||||||
|
|
||||||
|
Format as structured data.""",
|
||||||
|
|
||||||
|
"code": """Analyze this code screenshot. Extract:
|
||||||
|
1. Programming language
|
||||||
|
2. File name or path (if visible)
|
||||||
|
3. Code content (preserve exact formatting)
|
||||||
|
4. Comments
|
||||||
|
5. Function/class names
|
||||||
|
6. Any error messages or warnings
|
||||||
|
|
||||||
|
Preserve code exactly as shown.""",
|
||||||
|
|
||||||
|
"console": """Analyze this console/terminal output. Extract:
|
||||||
|
1. Commands executed
|
||||||
|
2. Output/results
|
||||||
|
3. Error messages
|
||||||
|
4. Warnings or status messages
|
||||||
|
5. File paths or URLs
|
||||||
|
|
||||||
|
Preserve formatting and structure."""
|
||||||
|
}
|
||||||
|
|
||||||
|
prompt = prompts.get(context, prompts["meeting"])
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Use Ollama's chat API with vision
|
||||||
|
response = self._client.chat(
|
||||||
|
model=self.model,
|
||||||
|
messages=[
|
||||||
|
{
|
||||||
|
'role': 'user',
|
||||||
|
'content': prompt,
|
||||||
|
'images': [image_path]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Extract text from response
|
||||||
|
text = response['message']['content']
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Vision model error for {image_path}: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def process_frames(
|
||||||
|
self,
|
||||||
|
frames_info: List[Tuple[str, float]],
|
||||||
|
context: str = "meeting",
|
||||||
|
deduplicate: bool = True,
|
||||||
|
similarity_threshold: float = 0.85
|
||||||
|
) -> List[Dict]:
|
||||||
|
"""
|
||||||
|
Process multiple frames with vision analysis.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
frames_info: List of (frame_path, timestamp) tuples
|
||||||
|
context: Context hint for analysis
|
||||||
|
deduplicate: Whether to remove similar consecutive analyses
|
||||||
|
similarity_threshold: Threshold for considering analyses as duplicates (0-1)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of dicts with 'timestamp', 'text', and 'frame_path'
|
||||||
|
"""
|
||||||
|
results = []
|
||||||
|
prev_text = ""
|
||||||
|
|
||||||
|
total = len(frames_info)
|
||||||
|
logger.info(f"Starting vision analysis of {total} frames...")
|
||||||
|
|
||||||
|
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
|
||||||
|
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
|
||||||
|
|
||||||
|
text = self.analyze_frame(frame_path, context)
|
||||||
|
|
||||||
|
if not text:
|
||||||
|
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Deduplicate similar consecutive frames
|
||||||
|
if deduplicate:
|
||||||
|
similarity = self._text_similarity(prev_text, text)
|
||||||
|
if similarity > similarity_threshold:
|
||||||
|
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
||||||
|
continue
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
'timestamp': timestamp,
|
||||||
|
'text': text,
|
||||||
|
'frame_path': frame_path
|
||||||
|
})
|
||||||
|
|
||||||
|
prev_text = text
|
||||||
|
|
||||||
|
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
|
||||||
|
return results
|
||||||
|
|
||||||
|
def _text_similarity(self, text1: str, text2: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate similarity between two texts.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Similarity score between 0 and 1
|
||||||
|
"""
|
||||||
|
return SequenceMatcher(None, text1, text2).ratio()
|
||||||
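The deduplication check in `process_frames` above compares each frame's analysis text against the previous kept frame with `difflib.SequenceMatcher`. Isolated as a standalone sketch (the `keep_frame` helper name is illustrative, not part of the commit):

```python
from difflib import SequenceMatcher

def keep_frame(prev_text: str, text: str, threshold: float = 0.85) -> bool:
    """Mirror the check in VisionProcessor.process_frames: a frame is
    dropped when its analysis is too similar to the previous kept one."""
    return SequenceMatcher(None, prev_text, text).ratio() <= threshold
```

Note that the comparison is only against the immediately preceding kept frame, so content that alternates between two screens is retained on every switch.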
@@ -13,6 +13,7 @@ import shutil
 
 from meetus.frame_extractor import FrameExtractor
 from meetus.ocr_processor import OCRProcessor
+from meetus.vision_processor import VisionProcessor
 from meetus.transcript_merger import TranscriptMerger
 
 logger = logging.getLogger(__name__)
@@ -98,20 +99,23 @@ def main():
     formatter_class=argparse.RawDescriptionHelpFormatter,
     epilog="""
 Examples:
-  # Run Whisper + full processing in one command
+  # Run Whisper + vision analysis (recommended for code/dashboards)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+  # Use vision with specific context hint
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+
+  # Traditional OCR approach
   python process_meeting.py samples/meeting.mkv --run-whisper
 
-  # Process video with existing Whisper transcript
-  python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
+  # Re-run analysis using cached frames and transcript
+  python process_meeting.py samples/meeting.mkv --use-vision
 
-  # Use scene detection instead of interval
-  python process_meeting.py samples/meeting.mkv --run-whisper --scene-detection
+  # Force reprocessing (ignore cache)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 
-  # Use different Whisper model and OCR engine
-  python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small --ocr-engine easyocr
+  # Use scene detection for fewer frames
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
 
-  # Extract frames only (no transcript)
-  python process_meeting.py samples/meeting.mkv --extract-only
 """
 )
@@ -177,6 +181,31 @@ Examples:
         default='tesseract'
     )
+
+    parser.add_argument(
+        '--use-vision',
+        action='store_true',
+        help='Use local vision model (Ollama) instead of OCR for better context understanding'
+    )
+
+    parser.add_argument(
+        '--vision-model',
+        help='Vision model to use with Ollama (default: llava:13b)',
+        default='llava:13b'
+    )
+
+    parser.add_argument(
+        '--vision-context',
+        choices=['meeting', 'dashboard', 'code', 'console'],
+        help='Context hint for vision analysis (default: meeting)',
+        default='meeting'
+    )
+
+    parser.add_argument(
+        '--no-cache',
+        action='store_true',
+        help='Disable caching - reprocess everything even if outputs exist'
+    )
+
     parser.add_argument(
         '--no-deduplicate',
         action='store_true',
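The four flags added in this hunk compose with the existing CLI as ordinary argparse options. A minimal parser containing just the new flags behaves like this (standalone sketch; the real parser has many more options):

```python
import argparse

# Reduced parser with only the options added in this commit
parser = argparse.ArgumentParser()
parser.add_argument('video')
parser.add_argument('--use-vision', action='store_true')
parser.add_argument('--vision-model', default='llava:13b')
parser.add_argument('--vision-context',
                    choices=['meeting', 'dashboard', 'code', 'console'],
                    default='meeting')
parser.add_argument('--no-cache', action='store_true')

args = parser.parse_args(['meeting.mkv', '--use-vision', '--vision-context', 'code'])
# args.use_vision is True, args.vision_context is 'code',
# args.vision_model keeps its default, args.no_cache is False
```

Because `--vision-context` uses `choices`, an unknown hint fails fast at parse time instead of silently falling back inside the prompt lookup.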
@@ -221,8 +250,17 @@ Examples:
     if args.output is None:
         args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
 
-    # Run Whisper if requested
+    # Define cache paths
+    whisper_cache = output_dir / f"{video_path.stem}.json"
+    analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
+    frames_cache_dir = Path(args.frames_dir)
+
+    # Check for cached Whisper transcript
     if args.run_whisper:
+        if not args.no_cache and whisper_cache.exists():
+            logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
+            args.transcript = str(whisper_cache)
+        else:
             logger.info("=" * 80)
             logger.info("STEP 0: Running Whisper Transcription")
             logger.info("=" * 80)
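The cache path defined above keys the analysis file on both the video stem and the analysis backend, so switching between `--use-vision` and OCR never loads the wrong cache. The naming rule in isolation (the `analysis_cache_path` helper is illustrative, not a function in the commit):

```python
from pathlib import Path

def analysis_cache_path(output_dir: Path, stem: str, use_vision: bool) -> Path:
    """Reproduce the cache-file naming used in the hunk above:
    <stem>_vision.json for vision analysis, <stem>_ocr.json for OCR."""
    return output_dir / f"{stem}_{'vision' if use_vision else 'ocr'}.json"
```

Keeping the backend in the filename is what makes the "switch from OCR to vision using existing frames" workflow safe: only the frames and transcript are shared between backends.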
```diff
@@ -234,14 +272,35 @@ Examples:
     logger.info("MEETING PROCESSOR")
     logger.info("=" * 80)
     logger.info(f"Video: {video_path.name}")
-    logger.info(f"OCR Engine: {args.ocr_engine}")
+    logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
+    if args.use_vision:
+        logger.info(f"Vision Model: {args.vision_model}")
+        logger.info(f"Context: {args.vision_context}")
     logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
     if args.transcript:
         logger.info(f"Transcript: {args.transcript}")
+    logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
     logger.info("=" * 80)

-    # Step 1: Extract frames
+    # Step 1: Extract frames (with caching)
     logger.info("Step 1: Extracting frames from video...")
+
+    # Check if frames already exist
+    existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
+
+    if not args.no_cache and existing_frames and len(existing_frames) > 0:
+        logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
+        # Build frames_info from existing files
+        frames_info = []
+        for frame_path in sorted(existing_frames):
+            # Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
+            try:
+                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
+                timestamp = float(timestamp_str)
+            except:
+                timestamp = 0.0
+            frames_info.append((str(frame_path), timestamp))
+    else:
         extractor = FrameExtractor(str(video_path), args.frames_dir)

         if args.scene_detection:
@@ -255,7 +314,30 @@ Examples:

     logger.info(f"✓ Extracted {len(frames_info)} frames")

-    # Step 2: Run OCR on frames
+    # Step 2: Run analysis on frames (with caching)
+    if not args.no_cache and analysis_cache.exists():
+        logger.info(f"✓ Found cached analysis results: {analysis_cache}")
+        with open(analysis_cache, 'r', encoding='utf-8') as f:
+            screen_segments = json.load(f)
+        logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
+    else:
+        if args.use_vision:
+            # Use vision model
+            logger.info("Step 2: Running vision analysis on extracted frames...")
+            try:
+                vision = VisionProcessor(model=args.vision_model)
+                screen_segments = vision.process_frames(
+                    frames_info,
+                    context=args.vision_context,
+                    deduplicate=not args.no_deduplicate
+                )
+                logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
+            except ImportError as e:
+                logger.error(f"{e}")
+                sys.exit(1)
+        else:
+            # Use OCR
             logger.info("Step 2: Running OCR on extracted frames...")
             try:
                 ocr = OCRProcessor(engine=args.ocr_engine)
@@ -263,7 +345,7 @@ Examples:
                     frames_info,
                     deduplicate=not args.no_deduplicate
                 )
-                logger.info(f"✓ Processed {len(screen_segments)} frames with text content")
+                logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")

             except ImportError as e:
                 logger.error(f"{e}")
@@ -271,11 +353,10 @@ Examples:
                 logger.error(f"  pip install {args.ocr_engine}")
                 sys.exit(1)

-    # Save OCR results as JSON
-    ocr_output = output_dir / f"{video_path.stem}_ocr.json"
-    with open(ocr_output, 'w', encoding='utf-8') as f:
+    # Save analysis results as JSON
+    with open(analysis_cache, 'w', encoding='utf-8') as f:
         json.dump(screen_segments, f, indent=2, ensure_ascii=False)
-    logger.info(f"✓ Saved OCR results to: {ocr_output}")
+    logger.info(f"✓ Saved analysis results to: {analysis_cache}")

     if args.extract_only:
         logger.info("Done! (extract-only mode)")
```
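The cached-frame branch above rebuilds `frames_info` by parsing timestamps out of the frame filenames rather than re-extracting frames. A self-contained sketch of that parsing (filenames are made up for illustration; the commit uses a bare `except`, tightened here to `ValueError`):

```python
from pathlib import Path

def timestamp_from_frame(frame_path: Path) -> float:
    """Parse the trailing '<seconds>s' token from a cached frame filename,
    e.g. 'video_00001_12.34s.jpg' -> 12.34; fall back to 0.0 on mismatch."""
    try:
        return float(frame_path.stem.split('_')[-1].rstrip('s'))
    except ValueError:
        return 0.0

# Rebuild (path, timestamp) pairs sorted by filename, as the cached branch does
frames = [Path("meeting_00002_30.00s.jpg"), Path("meeting_00001_12.34s.jpg")]
frames_info = [(str(p), timestamp_from_frame(p)) for p in sorted(frames)]
```

Note the fallback to `0.0` silently mis-orders frames whose names don't match the pattern, which is why the cache key (`{stem}_*.jpg`) restricts the glob to frames produced for this specific video.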
```diff
@@ -2,13 +2,17 @@
 opencv-python>=4.8.0
 Pillow>=10.0.0

-# OCR engines (install at least one)
-# Tesseract (recommended, lightweight)
+# Vision analysis (recommended for better results)
+# Requires Ollama to be installed: https://ollama.ai/download
+ollama>=0.1.0
+
+# OCR engines (alternative to vision analysis)
+# Tesseract (lightweight, basic text extraction)
 pytesseract>=0.3.10

-# Alternative OCR engines (optional, install as needed)
+# Alternative OCR engines (optional)
 # easyocr>=1.7.0
 # paddleocr>=2.7.0

-# For Whisper transcription (if not already installed)
+# For Whisper transcription (recommended)
 # openai-whisper>=20230918
```
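The newly added `ollama` dependency is the Python client that `VisionProcessor` would use to send a frame plus a prompt to a local LLaVA model. The commit doesn't show `VisionProcessor` internals, so this is only a sketch of how the ollama chat API is typically invoked with images; the prompt wording, frame path, and helper name are illustrative:

```python
def build_vision_request(frame_path: str, context: str, model: str = "llava:13b") -> dict:
    """Assemble the chat request the ollama Python client expects for one
    frame; the 'images' field takes a list of image file paths (or bytes)."""
    prompt = (
        f"This frame is from a meeting about: {context}. "
        "Describe any visible screen content (code, dashboards, slides) concisely."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [frame_path]}],
    }

request = build_vision_request("frames/meeting_00001_12.34s.jpg", "quarterly infra review")
# With a local Ollama server running and the model pulled, the call would be:
#   import ollama
#   reply = ollama.chat(**request)
#   description = reply["message"]["content"]
```

Passing meeting context in the prompt is what lets the vision path outperform OCR on dashboards and consoles: the model can say what a chart shows, not just transcribe its labels.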