add vision processor

This commit is contained in:
Mariano Gabriel
2025-10-19 22:58:28 -03:00
parent ae89564373
commit a999bc9093
4 changed files with 511 additions and 107 deletions

README.md

@@ -1,12 +1,17 @@
# Meeting Processor
Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
## Overview
This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)
### Vision Analysis vs OCR
- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
- **OCR**: Traditional text extraction - faster but less context-aware
The result is a rich, timestamped transcript that provides full context for AI summarization.
@@ -14,16 +19,13 @@ The result is a rich, timestamped transcript that provides full context for AI s
### 1. System Dependencies
**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for lighter model:
ollama pull llava:7b
```
**FFmpeg** (for scene detection):
@@ -35,6 +37,18 @@ sudo apt-get install ffmpeg
brew install ffmpeg
```
**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Arch Linux
sudo pacman -S tesseract
```
### 2. Python Dependencies
```bash
@@ -49,6 +63,7 @@ pip install openai-whisper
### 4. Optional: Install Alternative OCR Engines
If you prefer OCR over vision analysis:
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr
@@ -59,118 +74,173 @@ pip install paddleocr
## Quick Start
### Recommended: Vision Analysis (Best for Code/Dashboards)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
```
This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder
### Re-run with Cached Results
Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision
# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
### Traditional OCR (Faster, Less Context-Aware)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper
```
## Usage Examples
### Vision Analysis with Context Hints
```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
```
### Different Vision Models
```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```
### Extract frames at different intervals
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```
### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```
### Traditional OCR (if you prefer)
```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper
# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
```
### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv
# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
```
### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
```
## Output Files
All output files are saved to the `output/` directory by default:
- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
- **`frames/`** - Extracted video frames (JPG files)
### Caching Behavior
The tool automatically caches intermediate results to speed up re-runs:
- **Whisper transcript**: Cached as `output/<video>.json`
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
Re-running with the same video will use cached results unless `--no-cache` is specified.
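
The cache lookup described above can be sketched with `pathlib`; the file naming mirrors the defaults in this section (paths are illustrative, not the script's exact implementation):

```python
from pathlib import Path

def cache_paths(video: str, output_dir: str = "output", use_vision: bool = True):
    """Derive the cache locations checked on re-runs (sketch)."""
    stem = Path(video).stem
    out = Path(output_dir)
    whisper_cache = out / f"{stem}.json"            # Whisper transcript
    kind = "vision" if use_vision else "ocr"
    analysis_cache = out / f"{stem}_{kind}.json"    # analysis results
    frames_glob = f"frames/{stem}_*.jpg"            # extracted frames
    return whisper_cache, analysis_cache, frames_glob
```

For `samples/meeting.mkv` this resolves to `output/meeting.json`, `output/meeting_vision.json` (or `output/meeting_ocr.json`), and `frames/meeting_*.jpg`.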
## Workflow for Meeting Analysis
### Complete Workflow (One Command!)
```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
# Output will be in output/alo-intro1_enhanced.txt
```
### Typical Iterative Workflow
```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
# All use cached frames and transcript!
```
### Traditional Workflow (Separate Steps)
```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
--transcript output/alo-intro1.json \
--use-vision \
--scene-detection
# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
```
### Example Prompt for Claude
@@ -217,42 +287,99 @@ Options:
## Tips for Best Results
### Vision vs OCR: When to Use Each
**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
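
Internally, each hint selects a prompt template, and unknown values fall back to the general-purpose `meeting` prompt. A rough sketch (the prompt texts here are abbreviated stand-ins for the real ones in `vision_processor.py`):

```python
# Abbreviated placeholders; the actual prompts are longer and structured.
PROMPTS = {
    "meeting": "Analyze this screen capture from a meeting recording...",
    "dashboard": "Analyze this dashboard/monitoring panel...",
    "code": "Analyze this code screenshot; preserve exact formatting...",
    "console": "Analyze this console/terminal output...",
}

def pick_prompt(context: str) -> str:
    # dict.get with a default gives the fallback behavior for free
    return PROMPTS.get(context, PROMPTS["meeting"])
```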
### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
### OCR Engine Selection
- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.
### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly
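
Deduplication compares each frame's extracted text against the previous kept one using `difflib.SequenceMatcher`, dropping frames whose similarity exceeds the threshold (0.85 by default). A minimal sketch of the idea:

```python
from difflib import SequenceMatcher

def deduplicate(texts, threshold=0.85):
    """Keep only texts that differ enough from the previously kept one."""
    kept, prev = [], ""
    for text in texts:
        if SequenceMatcher(None, prev, text).ratio() > threshold:
            continue  # near-duplicate of the previous frame's content
        kept.append(text)
        prev = text
    return kept

frames = ["Slide 1: Roadmap", "Slide 1: Roadmap", "Slide 2: Metrics"]
print(deduplicate(frames))  # the repeated slide is dropped
```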
## Troubleshooting
### "pytesseract not installed"
### Vision Model Issues
**"ollama package not installed"**
```bash
pip install ollama
```
**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```
**Vision analysis is slow**
- Use lighter model: `--vision-model llava:7b`
- Reduce frame count: `--scene-detection` or `--interval 10`
- Check if Ollama is using GPU (much faster)
**Poor vision analysis results**
- Try different context hint: `--vision-context code` or `--vision-context dashboard`
- Use larger model: `--vision-model llava:13b`
- Ensure frames are clear (check video resolution)
### OCR Issues
**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr # Don't forget system package!
```
### "No frames extracted"
**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try different OCR engine: `--ocr-engine easyocr`
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames
### General Issues
**"No frames extracted"**
- Check video file is valid: `ffmpeg -i video.mkv`
- Try lower interval: `--interval 3`
- Check disk space in frames directory
**Scene detection not working**
- Falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try manual interval: `--interval 5`
**Cache not being used**
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use `--verbose` to see what's being cached/loaded
## Project Structure
```

meetus/vision_processor.py (new file)

@@ -0,0 +1,192 @@
"""
Vision-based frame analysis using local vision-language models via Ollama.
Better than OCR for understanding dashboards, code, and console output.
"""
from typing import List, Tuple, Dict, Optional
from pathlib import Path
import logging
from difflib import SequenceMatcher
logger = logging.getLogger(__name__)
class VisionProcessor:
"""Process frames using local vision models via Ollama."""
def __init__(self, model: str = "llava:13b"):
"""
Initialize vision processor.
Args:
model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
"""
self.model = model
self._client = None
self._init_client()
def _init_client(self):
"""Initialize Ollama client."""
try:
import ollama
self._client = ollama
# Check if model is available
try:
models = self._client.list()
available_models = [m['name'] for m in models.get('models', [])]
if self.model not in available_models:
logger.warning(f"Model {self.model} not found locally.")
logger.info(f"Pulling {self.model}... (this may take a few minutes)")
self._client.pull(self.model)
logger.info(f"✓ Model {self.model} downloaded")
else:
logger.info(f"Using local vision model: {self.model}")
except Exception as e:
logger.warning(f"Could not verify model availability: {e}")
logger.info("Attempting to use model anyway...")
except ImportError:
raise ImportError(
"ollama package not installed. Run: pip install ollama\n"
"Also install Ollama: https://ollama.ai/download"
)
def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
"""
Analyze a single frame using local vision model.
Args:
image_path: Path to image file
context: Context hint for analysis (meeting, dashboard, code, console)
Returns:
Analyzed content description
"""
# Context-specific prompts
prompts = {
"meeting": """Analyze this screen capture from a meeting recording. Extract:
1. Any visible text (titles, labels, headings)
2. Key metrics, numbers, or data points shown
3. Dashboard panels or visualizations (describe what they show)
4. Code snippets (preserve formatting and context)
5. Console/terminal output (commands and results)
6. Application names or UI elements
Focus on information that would help someone understand what was being discussed.
Be concise but include all important details. If there's code, preserve it exactly.""",
"dashboard": """Analyze this dashboard/monitoring panel. Extract:
1. Panel titles and metrics names
2. Current values and units
3. Trends (up/down/stable)
4. Alerts or warnings
5. Time ranges shown
6. Any anomalies or notable patterns
Format as structured data.""",
"code": """Analyze this code screenshot. Extract:
1. Programming language
2. File name or path (if visible)
3. Code content (preserve exact formatting)
4. Comments
5. Function/class names
6. Any error messages or warnings
Preserve code exactly as shown.""",
"console": """Analyze this console/terminal output. Extract:
1. Commands executed
2. Output/results
3. Error messages
4. Warnings or status messages
5. File paths or URLs
Preserve formatting and structure."""
}
prompt = prompts.get(context, prompts["meeting"])
try:
# Use Ollama's chat API with vision
response = self._client.chat(
model=self.model,
messages=[
{
'role': 'user',
'content': prompt,
'images': [image_path]
}
]
)
# Extract text from response
text = response['message']['content']
return text.strip()
except Exception as e:
logger.error(f"Vision model error for {image_path}: {e}")
return ""
def process_frames(
self,
frames_info: List[Tuple[str, float]],
context: str = "meeting",
deduplicate: bool = True,
similarity_threshold: float = 0.85
) -> List[Dict]:
"""
Process multiple frames with vision analysis.
Args:
frames_info: List of (frame_path, timestamp) tuples
context: Context hint for analysis
deduplicate: Whether to remove similar consecutive analyses
similarity_threshold: Threshold for considering analyses as duplicates (0-1)
Returns:
List of dicts with 'timestamp', 'text', and 'frame_path'
"""
results = []
prev_text = ""
total = len(frames_info)
logger.info(f"Starting vision analysis of {total} frames...")
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
text = self.analyze_frame(frame_path, context)
if not text:
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
continue
# Deduplicate similar consecutive frames
if deduplicate:
similarity = self._text_similarity(prev_text, text)
if similarity > similarity_threshold:
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
continue
results.append({
'timestamp': timestamp,
'text': text,
'frame_path': frame_path
})
prev_text = text
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
return results
def _text_similarity(self, text1: str, text2: str) -> float:
"""
Calculate similarity between two texts.
Returns:
Similarity score between 0 and 1
"""
return SequenceMatcher(None, text1, text2).ratio()


@@ -13,6 +13,7 @@ import shutil
from meetus.frame_extractor import FrameExtractor
from meetus.ocr_processor import OCRProcessor
from meetus.vision_processor import VisionProcessor
from meetus.transcript_merger import TranscriptMerger
logger = logging.getLogger(__name__)
@@ -98,20 +99,23 @@ def main():
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run Whisper + vision analysis (recommended for code/dashboards)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Use vision with specific context hint
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# Traditional OCR approach
python process_meeting.py samples/meeting.mkv --run-whisper
# Process video with existing Whisper transcript
python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
# Re-run analysis using cached frames and transcript
python process_meeting.py samples/meeting.mkv --use-vision
# Force reprocessing (ignore cache)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
# Use different Whisper model and OCR engine
python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small --ocr-engine easyocr
# Extract frames only (no transcript)
python process_meeting.py samples/meeting.mkv --extract-only
# Use scene detection for fewer frames
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
"""
)
@@ -177,6 +181,31 @@ Examples:
default='tesseract'
)
parser.add_argument(
'--use-vision',
action='store_true',
help='Use local vision model (Ollama) instead of OCR for better context understanding'
)
parser.add_argument(
'--vision-model',
help='Vision model to use with Ollama (default: llava:13b)',
default='llava:13b'
)
parser.add_argument(
'--vision-context',
choices=['meeting', 'dashboard', 'code', 'console'],
help='Context hint for vision analysis (default: meeting)',
default='meeting'
)
parser.add_argument(
'--no-cache',
action='store_true',
help='Disable caching - reprocess everything even if outputs exist'
)
parser.add_argument(
'--no-deduplicate',
action='store_true',
@@ -221,61 +250,113 @@ Examples:
if args.output is None:
args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
# Define cache paths
whisper_cache = output_dir / f"{video_path.stem}.json"
analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
frames_cache_dir = Path(args.frames_dir)
# Check for cached Whisper transcript
if args.run_whisper:
logger.info("=" * 80)
logger.info("STEP 0: Running Whisper Transcription")
logger.info("=" * 80)
transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
args.transcript = str(transcript_path)
logger.info("")
if not args.no_cache and whisper_cache.exists():
logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
args.transcript = str(whisper_cache)
else:
logger.info("=" * 80)
logger.info("STEP 0: Running Whisper Transcription")
logger.info("=" * 80)
transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
args.transcript = str(transcript_path)
logger.info("")
logger.info("=" * 80)
logger.info("MEETING PROCESSOR")
logger.info("=" * 80)
logger.info(f"Video: {video_path.name}")
logger.info(f"OCR Engine: {args.ocr_engine}")
logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
if args.use_vision:
logger.info(f"Vision Model: {args.vision_model}")
logger.info(f"Context: {args.vision_context}")
logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
if args.transcript:
logger.info(f"Transcript: {args.transcript}")
logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
logger.info("=" * 80)
# Step 1: Extract frames (with caching)
logger.info("Step 1: Extracting frames from video...")
# Check if frames already exist
existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
if not args.no_cache and existing_frames and len(existing_frames) > 0:
logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
# Build frames_info from existing files
frames_info = []
for frame_path in sorted(existing_frames):
# Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
try:
timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
timestamp = float(timestamp_str)
except (ValueError, IndexError):
timestamp = 0.0
frames_info.append((str(frame_path), timestamp))
else:
extractor = FrameExtractor(str(video_path), args.frames_dir)
if args.scene_detection:
frames_info = extractor.extract_scene_changes()
else:
frames_info = extractor.extract_by_interval(args.interval)
logger.info(f"✓ Extracted {len(frames_info)} frames")
if not frames_info:
logger.error("No frames extracted")
sys.exit(1)
# Step 2: Run analysis on frames (with caching)
if not args.no_cache and analysis_cache.exists():
logger.info(f"✓ Found cached analysis results: {analysis_cache}")
with open(analysis_cache, 'r', encoding='utf-8') as f:
screen_segments = json.load(f)
logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
else:
if args.use_vision:
# Use vision model
logger.info("Step 2: Running vision analysis on extracted frames...")
try:
vision = VisionProcessor(model=args.vision_model)
screen_segments = vision.process_frames(
frames_info,
context=args.vision_context,
deduplicate=not args.no_deduplicate
)
logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
except ImportError as e:
logger.error(f"{e}")
sys.exit(1)
else:
# Use OCR
logger.info("Step 2: Running OCR on extracted frames...")
try:
ocr = OCRProcessor(engine=args.ocr_engine)
screen_segments = ocr.process_frames(
frames_info,
deduplicate=not args.no_deduplicate
)
logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
except ImportError as e:
logger.error(f"{e}")
logger.error(f"To install {args.ocr_engine}:")
logger.error(f" pip install {args.ocr_engine}")
sys.exit(1)
# Save analysis results as JSON
with open(analysis_cache, 'w', encoding='utf-8') as f:
json.dump(screen_segments, f, indent=2, ensure_ascii=False)
logger.info(f"✓ Saved analysis results to: {analysis_cache}")
if args.extract_only:
logger.info("Done! (extract-only mode)")

View File

@@ -2,13 +2,17 @@
opencv-python>=4.8.0
Pillow>=10.0.0
# Vision analysis (recommended for better results)
# Requires Ollama to be installed: https://ollama.ai/download
ollama>=0.1.0
# OCR engines (alternative to vision analysis)
# Tesseract (lightweight, basic text extraction)
pytesseract>=0.3.10
# Alternative OCR engines (optional)
# easyocr>=1.7.0
# paddleocr>=2.7.0
# For Whisper transcription (recommended)
# openai-whisper>=20230918