add vision processor

2025-10-19 22:58:28 -03:00
parent ae89564373
commit a999bc9093
4 changed files with 511 additions and 107 deletions
--- a/README.md
+++ b/README.md
@@ -1,12 +1,17 @@
 # Meeting Processor

-Extract screen content from meeting recordings and merge with Whisper transcripts for better Claude summarization.
+Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.

 ## Overview

 This tool enhances meeting transcripts by combining:
 - **Audio transcription** (from Whisper)
- **Screen content** (OCR from screen shares)
+- **Screen content analysis** (Vision models or OCR)
+
+### Vision Analysis vs OCR
+
+- **Vision Models** (recommended): Use a local LLaVA model via Ollama to understand context; well suited to dashboards, code, and consoles
+- **OCR**: Traditional text extraction - faster but less context-aware

 The result is a rich, timestamped transcript that provides full context for AI summarization.

@@ -14,16 +19,13 @@ The result is a rich, timestamped transcript that provides full context for AI s

 ### 1. System Dependencies

-**Tesseract OCR** (recommended):
+**Ollama** (required for vision analysis):
 ```bash
-# Ubuntu/Debian
-sudo apt-get install tesseract-ocr
-
-# macOS
-brew install tesseract
-
-# Arch Linux
-sudo pacman -S tesseract
+# Install from https://ollama.ai/download
+# Then pull a vision model:
+ollama pull llava:13b
+# or, for a lighter model:
+ollama pull llava:7b
 ```

 **FFmpeg** (for scene detection):
@@ -35,6 +37,18 @@ sudo apt-get install ffmpeg
 brew install ffmpeg
 ```

+**Tesseract OCR** (optional, if not using vision):
+```bash
+# Ubuntu/Debian
+sudo apt-get install tesseract-ocr
+
+# macOS
+brew install tesseract
+
+# Arch Linux
+sudo pacman -S tesseract
+```
+
 ### 2. Python Dependencies

 ```bash
@@ -49,6 +63,7 @@ pip install openai-whisper

 ### 4. Optional: Install Alternative OCR Engines

+If you prefer OCR over vision analysis:
 ```bash
 # EasyOCR (better for rotated/handwritten text)
 pip install easyocr
@@ -59,118 +74,173 @@ pip install paddleocr

 ## Quick Start

-### Recommended: Run Everything in One Command
+### Recommended: Vision Analysis (Best for Code/Dashboards)

 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
 ```

 This will:
 1. Run Whisper transcription (audio → text)
 2. Extract frames every 5 seconds
-3. Run OCR to extract screen text
+3. Use LLaVA vision model to analyze frames with context
 4. Merge audio + screen content
 5. Save everything to `output/` folder

-### Alternative: Use Existing Whisper Transcript
+### Re-run with Cached Results

-If you already have a Whisper transcript:
+Already ran it once? Re-run instantly using cached results:
 ```bash
-python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
+# Uses cached transcript, frames, and analysis (--run-whisper is what picks up the cached transcript)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Force reprocessing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 ```

-### Screen Content Only (No Audio)
+### Traditional OCR (Faster, Less Context-Aware)

 ```bash
-python process_meeting.py samples/meeting.mkv
+python process_meeting.py samples/meeting.mkv --run-whisper
 ```

 ## Usage Examples

-### Run with different Whisper models
+### Vision Analysis with Context Hints
 ```bash
-# Tiny model (fastest, less accurate)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model tiny
+# For code-heavy meetings
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code

-# Small model (balanced)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small
+# For dashboard/monitoring meetings (Grafana, GCP, etc.)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard

-# Large model (slowest, most accurate)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model large
+# For console/terminal sessions
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
+```
+
+### Different Vision Models
+```bash
+# Lighter/faster model (7B parameters)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
+
+# Default model (13B parameters, better quality)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
+
+# Alternative models
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
 ```

 ### Extract frames at different intervals
 ```bash
-# Every 10 seconds (with Whisper)
-python process_meeting.py samples/meeting.mkv --run-whisper --interval 10
+# Every 10 seconds
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10

 # Every 3 seconds (more detailed)
-python process_meeting.py samples/meeting.mkv --run-whisper --interval 3
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
 ```

 ### Use scene detection (smarter, fewer frames)
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --scene-detection
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
 ```

-### Use different OCR engines
+### Traditional OCR (if you prefer)
 ```bash
-# EasyOCR (good for varied layouts)
+# Tesseract (default)
+python process_meeting.py samples/meeting.mkv --run-whisper
+
+# EasyOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr

-# PaddleOCR (good for code/terminal)
+# PaddleOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
 ```

-### Extract frames only (no merging)
+### Caching Examples
 ```bash
-python process_meeting.py samples/meeting.mkv --extract-only
+# First run - processes everything
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Second run - same flags; reuses the cached transcript, frames, and analysis, then only re-merges
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Switch from OCR to vision using existing frames
+python process_meeting.py samples/meeting.mkv --use-vision
+
+# Force complete reprocessing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 ```

 ### Custom output location
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --output-dir my_outputs/
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
 ```

 ### Enable verbose logging
 ```bash
 # Show detailed debug information
-python process_meeting.py samples/meeting.mkv --run-whisper --verbose
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
 ```

 ## Output Files

 All output files are saved to the `output/` directory by default:

- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for Claude
+- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
 - **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_ocr.json`** - Raw OCR data with timestamps
+- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
+- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
 - **`frames/`** - Extracted video frames (JPG files)

+### Caching Behavior
+
+The tool automatically caches intermediate results to speed up re-runs:
+- **Whisper transcript**: Cached as `output/<video>.json` (picked up when `--run-whisper` is passed)
+- **Extracted frames**: Cached in `frames/<video>_*.jpg`
+- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json` (keyed by analysis type only, not by model or context hint)
+
+Re-running with the same video will use cached results unless `--no-cache` is specified.
+
 ## Workflow for Meeting Analysis

 ### Complete Workflow (One Command!)

 ```bash
-# Process everything in one step
-python process_meeting.py samples/alo-intro1.mkv --run-whisper --scene-detection
+# Process everything in one step with vision analysis
+python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection

 # Output will be in output/alo-intro1_enhanced.txt
 ```

+### Typical Iterative Workflow
+
+```bash
+# First run - full processing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Review results, then re-run with a different context if needed
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+
+# Or switch to a different vision model
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
+
+# Frames and transcript come from cache; delete output/meeting_vision.json first so the analysis is redone
+```
+
 ### Traditional Workflow (Separate Steps)

 ```bash
 # 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
 whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output

-# 2. Process video to extract screen content
+# 2. Process video to extract screen content with vision
 python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
+    --use-vision \
    --scene-detection

-# 3. Use the enhanced transcript with Claude
-# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude
+# 3. Use the enhanced transcript with AI
+# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
 ```

 ### Example Prompt for Claude
@@ -217,42 +287,99 @@ Options:

 ## Tips for Best Results

+### Vision vs OCR: When to Use Each
+
+**Use Vision Models (`--use-vision`) when:**
+- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
+- ✅ Code walkthroughs or debugging sessions
+- ✅ Complex layouts with mixed content
+- ✅ Need contextual understanding, not just text extraction
+- ✅ Working with charts, graphs, or visualizations
+- ⚠️ Trade-off: Slower (inference runs on a local model; a GPU speeds it up considerably)
+
+**Use OCR when:**
+- ✅ Simple text extraction from slides or documents
+- ✅ Need maximum speed
+- ✅ Limited computational resources
+- ✅ Presentations with mostly text
+- ⚠️ Trade-off: Less context-aware, may miss visual relationships
+
+### Context Hints for Vision Analysis
+- **`--vision-context meeting`**: General purpose (default)
+- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
+- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
+- **`--vision-context console`**: Captures commands, output, error messages
+
 ### Scene Detection vs Interval
 - **Scene detection**: Better for presentations with distinct slides. More efficient.
 - **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.

-### OCR Engine Selection
- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.
+### Vision Model Selection
+- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
+- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
+- **`bakllava`**: Alternative with different strengths

 ### Deduplication
 - Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides change subtly
+- Disable with `--no-deduplicate` if slides/screens change subtly

 ## Troubleshooting

-### "pytesseract not installed"
+### Vision Model Issues
+
+**"ollama package not installed"**
+```bash
+pip install ollama
+```
+
+**"Ollama not found" or connection errors**
+```bash
+# Install Ollama first: https://ollama.ai/download
+# Then pull a vision model:
+ollama pull llava:13b
+```
+
+**Vision analysis is slow**
+- Use lighter model: `--vision-model llava:7b`
+- Reduce frame count: `--scene-detection` or `--interval 10`
+- Check if Ollama is using GPU (much faster)
+
+**Poor vision analysis results**
+- Try different context hint: `--vision-context code` or `--vision-context dashboard`
+- Use larger model: `--vision-model llava:13b`
+- Ensure frames are clear (check video resolution)
+
+### OCR Issues
+
+**"pytesseract not installed"**
 ```bash
 pip install pytesseract
 sudo apt-get install tesseract-ocr  # Don't forget system package!
 ```

-### "No frames extracted"
+**Poor OCR quality**
+- **Solution**: Switch to vision analysis with `--use-vision`
+- Or try different OCR engine: `--ocr-engine easyocr`
+- Check if video resolution is sufficient
+- Use `--no-deduplicate` to keep more frames
+
+### General Issues
+
+**"No frames extracted"**
 - Check video file is valid: `ffmpeg -i video.mkv`
 - Try lower interval: `--interval 3`
 - Check disk space in frames directory

-### Poor OCR quality
- Try different OCR engine
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames
-
-### Scene detection not working
+**Scene detection not working**
 - Fallback to interval extraction automatically
 - Ensure FFmpeg is installed
 - Try manual interval: `--interval 5`

+**Cache not being used**
+- Ensure you're using the same video filename
+- Check that output directory contains cached files
+- Use `--verbose` to see what's being cached/loaded
+
 ## Project Structure

 ```
--- a/meetus/vision_processor.py
+++ b/meetus/vision_processor.py
@@ -0,0 +1,192 @@
+"""
+Vision-based frame analysis using local vision-language models via Ollama.
+Better than OCR for understanding dashboards, code, and console output.
+"""
+from typing import List, Tuple, Dict
+from pathlib import Path
+import logging
+from difflib import SequenceMatcher
+
+logger = logging.getLogger(__name__)
+
+
+class VisionProcessor:
+    """Process frames using local vision models via Ollama."""
+
+    def __init__(self, model: str = "llava:13b"):
+        """
+        Initialize vision processor.
+
+        Args:
+            model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
+        """
+        self.model = model
+        self._client = None
+        self._init_client()
+
+    def _init_client(self):
+        """Initialize Ollama client."""
+        try:
+            import ollama
+            self._client = ollama
+
+            # Check if model is available
+            try:
+                models = self._client.list()
+                available_models = [m['name'] for m in models.get('models', [])]
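+                wanted = self.model if ':' in self.model else f"{self.model}:latest"  # untagged names are listed as "<name>:latest"
+                if wanted not in available_models: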
+                    logger.warning(f"Model {self.model} not found locally.")
+                    logger.info(f"Pulling {self.model}... (this may take a few minutes)")
+                    self._client.pull(self.model)
+                    logger.info(f"✓ Model {self.model} downloaded")
+                else:
+                    logger.info(f"Using local vision model: {self.model}")
+
+            except Exception as e:
+                logger.warning(f"Could not verify model availability: {e}")
+                logger.info("Attempting to use model anyway...")
+
+        except ImportError:
+            raise ImportError(
+                "ollama package not installed. Run: pip install ollama\n"
+                "Also install Ollama: https://ollama.ai/download"
+            )
+
+    def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
+        """
+        Analyze a single frame using local vision model.
+
+        Args:
+            image_path: Path to image file
+            context: Context hint for analysis (meeting, dashboard, code, console)
+
+        Returns:
+            Analyzed content description
+        """
+        # Context-specific prompts
+        prompts = {
+            "meeting": """Analyze this screen capture from a meeting recording. Extract:
+1. Any visible text (titles, labels, headings)
+2. Key metrics, numbers, or data points shown
+3. Dashboard panels or visualizations (describe what they show)
+4. Code snippets (preserve formatting and context)
+5. Console/terminal output (commands and results)
+6. Application names or UI elements
+
+Focus on information that would help someone understand what was being discussed.
+Be concise but include all important details. If there's code, preserve it exactly.""",
+
+            "dashboard": """Analyze this dashboard/monitoring panel. Extract:
+1. Panel titles and metrics names
+2. Current values and units
+3. Trends (up/down/stable)
+4. Alerts or warnings
+5. Time ranges shown
+6. Any anomalies or notable patterns
+
+Format as structured data.""",
+
+            "code": """Analyze this code screenshot. Extract:
+1. Programming language
+2. File name or path (if visible)
+3. Code content (preserve exact formatting)
+4. Comments
+5. Function/class names
+6. Any error messages or warnings
+
+Preserve code exactly as shown.""",
+
+            "console": """Analyze this console/terminal output. Extract:
+1. Commands executed
+2. Output/results
+3. Error messages
+4. Warnings or status messages
+5. File paths or URLs
+
+Preserve formatting and structure."""
+        }
+
+        prompt = prompts.get(context, prompts["meeting"])
+
+        try:
+            # Use Ollama's chat API with vision
+            response = self._client.chat(
+                model=self.model,
+                messages=[
+                    {
+                        'role': 'user',
+                        'content': prompt,
+                        'images': [image_path]
+                    }
+                ]
+            )
+
+            # Extract text from response
+            text = response['message']['content']
+            return text.strip()
+
+        except Exception as e:
+            logger.error(f"Vision model error for {image_path}: {e}")
+            return ""
+
+    def process_frames(
+        self,
+        frames_info: List[Tuple[str, float]],
+        context: str = "meeting",
+        deduplicate: bool = True,
+        similarity_threshold: float = 0.85
+    ) -> List[Dict]:
+        """
+        Process multiple frames with vision analysis.
+
+        Args:
+            frames_info: List of (frame_path, timestamp) tuples
+            context: Context hint for analysis
+            deduplicate: Whether to remove similar consecutive analyses
+            similarity_threshold: Threshold for considering analyses as duplicates (0-1)
+
+        Returns:
+            List of dicts with 'timestamp', 'text', and 'frame_path'
+        """
+        results = []
+        prev_text = ""
+
+        total = len(frames_info)
+        logger.info(f"Starting vision analysis of {total} frames...")
+
+        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
+            logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
+
+            text = self.analyze_frame(frame_path, context)
+
+            if not text:
+                logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
+                continue
+
+            # Deduplicate similar consecutive frames
+            if deduplicate:
+                similarity = self._text_similarity(prev_text, text)
+                if similarity > similarity_threshold:
+                    logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
+                    continue
+
+            results.append({
+                'timestamp': timestamp,
+                'text': text,
+                'frame_path': frame_path
+            })
+
+            prev_text = text
+
+        logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
+        return results
+
+    def _text_similarity(self, text1: str, text2: str) -> float:
+        """
+        Calculate similarity between two texts.
+
+        Returns:
+            Similarity score between 0 and 1
+        """
+        return SequenceMatcher(None, text1, text2).ratio()
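
Review note: the deduplication in `process_frames` compares each analysis with the last *kept* one, not with the immediately preceding frame. The standalone sketch below reproduces that behaviour outside the class so it can be tried in isolation. The function name `dedupe_consecutive` and the `(timestamp, text)` tuple shape are assumptions made for illustration; only `SequenceMatcher` and the 0.85 default come from the diff above.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def dedupe_consecutive(
    items: List[Tuple[float, str]], threshold: float = 0.85
) -> List[Tuple[float, str]]:
    """Keep an item only if its text differs enough from the last kept one.

    Hypothetical helper mirroring the loop in VisionProcessor.process_frames.
    """
    kept: List[Tuple[float, str]] = []
    prev_text = ""
    for timestamp, text in items:
        if not text:
            # Mirrors the "No content extracted" branch: empty results are dropped.
            continue
        if SequenceMatcher(None, prev_text, text).ratio() > threshold:
            # Near-duplicate of the last kept analysis; prev_text is left unchanged.
            continue
        kept.append((timestamp, text))
        prev_text = text
    return kept


if __name__ == "__main__":
    sample = [
        (0.0, "CPU usage 40%"),
        (5.0, "CPU usage 40%"),
        (10.0, "def main(): pass"),
    ]
    for timestamp, text in dedupe_consecutive(sample):
        print(f"{timestamp:6.2f}s  {text}")
```

Because vision-model output for the same screen can be reworded from frame to frame, a threshold tuned for OCR text may be too strict or too loose here; that is worth checking against real output.
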
--- a/process_meeting.py
+++ b/process_meeting.py
@@ -13,6 +13,7 @@ import shutil

 from meetus.frame_extractor import FrameExtractor
 from meetus.ocr_processor import OCRProcessor
+from meetus.vision_processor import VisionProcessor
 from meetus.transcript_merger import TranscriptMerger

 logger = logging.getLogger(__name__)
@@ -98,20 +99,23 @@ def main():
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
-  # Run Whisper + full processing in one command
+  # Run Whisper + vision analysis (recommended for code/dashboards)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+  # Use vision with specific context hint
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+
+  # Traditional OCR approach
  python process_meeting.py samples/meeting.mkv --run-whisper

-  # Process video with existing Whisper transcript
-  python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
+  # Re-run with the same flags; cached transcript, frames, and analysis are reused
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

-  # Use scene detection instead of interval
-  python process_meeting.py samples/meeting.mkv --run-whisper --scene-detection
+  # Force reprocessing (ignore cache)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache

-  # Use different Whisper model and OCR engine
-  python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small --ocr-engine easyocr
-
-  # Extract frames only (no transcript)
-  python process_meeting.py samples/meeting.mkv --extract-only
+  # Use scene detection for fewer frames
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
        """
    )

@@ -177,6 +181,31 @@ Examples:
        default='tesseract'
    )

+    parser.add_argument(
+        '--use-vision',
+        action='store_true',
+        help='Use local vision model (Ollama) instead of OCR for better context understanding'
+    )
+
+    parser.add_argument(
+        '--vision-model',
+        help='Vision model to use with Ollama (default: llava:13b)',
+        default='llava:13b'
+    )
+
+    parser.add_argument(
+        '--vision-context',
+        choices=['meeting', 'dashboard', 'code', 'console'],
+        help='Context hint for vision analysis (default: meeting)',
+        default='meeting'
+    )
+
+    parser.add_argument(
+        '--no-cache',
+        action='store_true',
+        help='Disable caching - reprocess everything even if outputs exist'
+    )
+
    parser.add_argument(
        '--no-deduplicate',
        action='store_true',
@@ -221,61 +250,113 @@ Examples:
    if args.output is None:
        args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")

-    # Run Whisper if requested
+    # Define cache paths
+    whisper_cache = output_dir / f"{video_path.stem}.json"
+    analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
+    frames_cache_dir = Path(args.frames_dir)
+
+    # Check for cached Whisper transcript
    if args.run_whisper:
-        logger.info("=" * 80)
-        logger.info("STEP 0: Running Whisper Transcription")
-        logger.info("=" * 80)
-        transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
-        args.transcript = str(transcript_path)
-        logger.info("")
+        if not args.no_cache and whisper_cache.exists():
+            logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
+            args.transcript = str(whisper_cache)
+        else:
+            logger.info("=" * 80)
+            logger.info("STEP 0: Running Whisper Transcription")
+            logger.info("=" * 80)
+            transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
+            args.transcript = str(transcript_path)
+            logger.info("")

    logger.info("=" * 80)
    logger.info("MEETING PROCESSOR")
    logger.info("=" * 80)
    logger.info(f"Video: {video_path.name}")
-    logger.info(f"OCR Engine: {args.ocr_engine}")
+    logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
+    if args.use_vision:
+        logger.info(f"Vision Model: {args.vision_model}")
+        logger.info(f"Context: {args.vision_context}")
    logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
    if args.transcript:
        logger.info(f"Transcript: {args.transcript}")
+    logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
    logger.info("=" * 80)

-    # Step 1: Extract frames
+    # Step 1: Extract frames (with caching)
    logger.info("Step 1: Extracting frames from video...")
-    extractor = FrameExtractor(str(video_path), args.frames_dir)

-    if args.scene_detection:
-        frames_info = extractor.extract_scene_changes()
+    # Check if frames already exist
+    existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
+
+    if not args.no_cache and existing_frames:
+        logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
+        # Build frames_info from existing files
+        frames_info = []
+        for frame_path in sorted(existing_frames):
+            # Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
+            try:
+                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
+                timestamp = float(timestamp_str)
+            except ValueError:
+                timestamp = 0.0
+            frames_info.append((str(frame_path), timestamp))
    else:
-        frames_info = extractor.extract_by_interval(args.interval)
+        extractor = FrameExtractor(str(video_path), args.frames_dir)

-    if not frames_info:
-        logger.error("No frames extracted")
-        sys.exit(1)
+        if args.scene_detection:
+            frames_info = extractor.extract_scene_changes()
+        else:
+            frames_info = extractor.extract_by_interval(args.interval)

-    logger.info(f"✓ Extracted {len(frames_info)} frames")
+        if not frames_info:
+            logger.error("No frames extracted")
+            sys.exit(1)

-    # Step 2: Run OCR on frames
-    logger.info("Step 2: Running OCR on extracted frames...")
-    try:
-        ocr = OCRProcessor(engine=args.ocr_engine)
-        screen_segments = ocr.process_frames(
-            frames_info,
-            deduplicate=not args.no_deduplicate
-        )
-        logger.info(f"✓ Processed {len(screen_segments)} frames with text content")
+        logger.info(f"✓ Extracted {len(frames_info)} frames")

-    except ImportError as e:
-        logger.error(f"{e}")
-        logger.error(f"To install {args.ocr_engine}:")
-        logger.error(f"  pip install {args.ocr_engine}")
-        sys.exit(1)
+    # Step 2: Run analysis on frames (with caching)
+    if not args.no_cache and analysis_cache.exists():
+        logger.info(f"✓ Found cached analysis results: {analysis_cache}")
+        with open(analysis_cache, 'r', encoding='utf-8') as f:
+            screen_segments = json.load(f)
+        logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
+    else:
+        if args.use_vision:
+            # Use vision model
+            logger.info("Step 2: Running vision analysis on extracted frames...")
+            try:
+                vision = VisionProcessor(model=args.vision_model)
+                screen_segments = vision.process_frames(
+                    frames_info,
+                    context=args.vision_context,
+                    deduplicate=not args.no_deduplicate
+                )
+                logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")

-    # Save OCR results as JSON
-    ocr_output = output_dir / f"{video_path.stem}_ocr.json"
-    with open(ocr_output, 'w', encoding='utf-8') as f:
-        json.dump(screen_segments, f, indent=2, ensure_ascii=False)
-    logger.info(f"✓ Saved OCR results to: {ocr_output}")
+            except ImportError as e:
+                logger.error(f"{e}")
+                sys.exit(1)
+        else:
+            # Use OCR
+            logger.info("Step 2: Running OCR on extracted frames...")
+            try:
+                ocr = OCRProcessor(engine=args.ocr_engine)
+                screen_segments = ocr.process_frames(
+                    frames_info,
+                    deduplicate=not args.no_deduplicate
+                )
+                logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
+
+            except ImportError as e:
+                logger.error(f"{e}")
+                logger.error(f"To install {args.ocr_engine}:")
+                logger.error(f"  pip install {args.ocr_engine}")
+                sys.exit(1)
+
+        # Save analysis results as JSON
+        with open(analysis_cache, 'w', encoding='utf-8') as f:
+            json.dump(screen_segments, f, indent=2, ensure_ascii=False)
+        logger.info(f"✓ Saved analysis results to: {analysis_cache}")

    if args.extract_only:
        logger.info("Done! (extract-only mode)")
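
Review note: the cached-frame branch rebuilds timestamps from file names. A minimal sketch of that parsing as a standalone function, assuming the `video_00001_12.34s.jpg` naming pattern mentioned in the diff comment; the name `timestamp_from_frame_name` and its `default` parameter are hypothetical and not part of the commit.

```python
from pathlib import Path


def timestamp_from_frame_name(frame_path: str, default: float = 0.0) -> float:
    """Parse the trailing '<seconds>s' token of a frame file name.

    Assumes names shaped like 'video_00001_12.34s.jpg'; returns `default`
    when the last underscore-separated token is not a number.
    """
    token = Path(frame_path).stem.split("_")[-1]
    try:
        return float(token.rstrip("s"))
    except ValueError:
        return default


if __name__ == "__main__":
    for name in ("frames/video_00001_12.34s.jpg", "frames/video_cover.jpg"):
        print(name, timestamp_from_frame_name(name))
```

Catching only `ValueError` keeps unrelated failures (for example a `KeyboardInterrupt`) from being silently turned into a `0.0` timestamp.
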
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,13 +2,17 @@
 opencv-python>=4.8.0
 Pillow>=10.0.0

-# OCR engines (install at least one)
-# Tesseract (recommended, lightweight)
+# Vision analysis (recommended for better results)
+# Requires Ollama to be installed: https://ollama.ai/download
+ollama>=0.1.0
+
+# OCR engines (alternative to vision analysis)
+# Tesseract (lightweight, basic text extraction)
 pytesseract>=0.3.10

-# Alternative OCR engines (optional, install as needed)
+# Alternative OCR engines (optional)
 # easyocr>=1.7.0
 # paddleocr>=2.7.0

-# For Whisper transcription (if not already installed)
+# For Whisper transcription (recommended)
 # openai-whisper>=20230918