add vision processor
README.md (241 changed lines)
@@ -1,12 +1,17 @@
 # Meeting Processor
 
-Extract screen content from meeting recordings and merge with Whisper transcripts for better Claude summarization.
+Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
 
 ## Overview
 
 This tool enhances meeting transcripts by combining:
 - **Audio transcription** (from Whisper)
-- **Screen content** (OCR from screen shares)
+- **Screen content analysis** (Vision models or OCR)
+
+### Vision Analysis vs OCR
+
+- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
+- **OCR**: Traditional text extraction - faster but less context-aware
 
 The result is a rich, timestamped transcript that provides full context for AI summarization.
 
@@ -14,16 +19,13 @@ The result is a rich, timestamped transcript that provides full context for AI s
 
 ### 1. System Dependencies
 
-**Tesseract OCR** (recommended):
+**Ollama** (required for vision analysis):
 ```bash
-# Ubuntu/Debian
-sudo apt-get install tesseract-ocr
-
-# macOS
-brew install tesseract
-
-# Arch Linux
-sudo pacman -S tesseract
+# Install from https://ollama.ai/download
+# Then pull a vision model:
+ollama pull llava:13b
+# or for lighter model:
+ollama pull llava:7b
 ```
 
 **FFmpeg** (for scene detection):
@@ -35,6 +37,18 @@ sudo apt-get install ffmpeg
 brew install ffmpeg
 ```
 
+**Tesseract OCR** (optional, if not using vision):
+```bash
+# Ubuntu/Debian
+sudo apt-get install tesseract-ocr
+
+# macOS
+brew install tesseract
+
+# Arch Linux
+sudo pacman -S tesseract
+```
+
 ### 2. Python Dependencies
 
 ```bash
@@ -49,6 +63,7 @@ pip install openai-whisper
 
 ### 4. Optional: Install Alternative OCR Engines
 
+If you prefer OCR over vision analysis:
 ```bash
 # EasyOCR (better for rotated/handwritten text)
 pip install easyocr
@@ -59,118 +74,173 @@ pip install paddleocr
 
 ## Quick Start
 
-### Recommended: Run Everything in One Command
+### Recommended: Vision Analysis (Best for Code/Dashboards)
 
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
 ```
 
 This will:
 1. Run Whisper transcription (audio → text)
 2. Extract frames every 5 seconds
-3. Run OCR to extract screen text
+3. Use LLaVA vision model to analyze frames with context
 4. Merge audio + screen content
 5. Save everything to `output/` folder
 
-### Alternative: Use Existing Whisper Transcript
+### Re-run with Cached Results
 
-If you already have a Whisper transcript:
+Already ran it once? Re-run instantly using cached results:
 ```bash
-python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
+# Uses cached transcript, frames, and analysis
+python process_meeting.py samples/meeting.mkv --use-vision
+
+# Force reprocessing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 ```
 
-### Screen Content Only (No Audio)
+### Traditional OCR (Faster, Less Context-Aware)
 
 ```bash
-python process_meeting.py samples/meeting.mkv
+python process_meeting.py samples/meeting.mkv --run-whisper
 ```
 
 ## Usage Examples
 
-### Run with different Whisper models
+### Vision Analysis with Context Hints
 ```bash
-# Tiny model (fastest, less accurate)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model tiny
+# For code-heavy meetings
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
 
-# Small model (balanced)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small
+# For dashboard/monitoring meetings (Grafana, GCP, etc.)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
 
-# Large model (slowest, most accurate)
-python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model large
+# For console/terminal sessions
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
+```
+
+### Different Vision Models
+```bash
+# Lighter/faster model (7B parameters)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
+
+# Default model (13B parameters, better quality)
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
+
+# Alternative models
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
 ```
 
 ### Extract frames at different intervals
 ```bash
-# Every 10 seconds (with Whisper)
-python process_meeting.py samples/meeting.mkv --run-whisper --interval 10
+# Every 10 seconds
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
 
 # Every 3 seconds (more detailed)
-python process_meeting.py samples/meeting.mkv --run-whisper --interval 3
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
 ```
 
 ### Use scene detection (smarter, fewer frames)
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --scene-detection
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
 ```
 
-### Use different OCR engines
+### Traditional OCR (if you prefer)
 ```bash
-# EasyOCR (good for varied layouts)
+# Tesseract (default)
+python process_meeting.py samples/meeting.mkv --run-whisper
+
+# EasyOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
 
-# PaddleOCR (good for code/terminal)
+# PaddleOCR
 python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
 ```
 
-### Extract frames only (no merging)
+### Caching Examples
 ```bash
-python process_meeting.py samples/meeting.mkv --extract-only
+# First run - processes everything
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Second run - uses cached transcript and frames, only re-merges
+python process_meeting.py samples/meeting.mkv
+
+# Switch from OCR to vision using existing frames
+python process_meeting.py samples/meeting.mkv --use-vision
+
+# Force complete reprocessing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 ```
 
 ### Custom output location
 ```bash
-python process_meeting.py samples/meeting.mkv --run-whisper --output-dir my_outputs/
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
 ```
 
 ### Enable verbose logging
 ```bash
 # Show detailed debug information
-python process_meeting.py samples/meeting.mkv --run-whisper --verbose
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
 ```
 
 ## Output Files
 
 All output files are saved to the `output/` directory by default:
 
-- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for Claude
+- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
 - **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
-- **`output/<video>_ocr.json`** - Raw OCR data with timestamps
+- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
+- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
 - **`frames/`** - Extracted video frames (JPG files)
 
+### Caching Behavior
+
+The tool automatically caches intermediate results to speed up re-runs:
+- **Whisper transcript**: Cached as `output/<video>.json`
+- **Extracted frames**: Cached in `frames/<video>_*.jpg`
+- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
+
+Re-running with the same video will use cached results unless `--no-cache` is specified.
+
 ## Workflow for Meeting Analysis
 
 ### Complete Workflow (One Command!)
 
 ```bash
-# Process everything in one step
-python process_meeting.py samples/alo-intro1.mkv --run-whisper --scene-detection
+# Process everything in one step with vision analysis
+python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
 
 # Output will be in output/alo-intro1_enhanced.txt
 ```
 
+### Typical Iterative Workflow
+
+```bash
+# First run - full processing
+python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+# Review results, then re-run with different context if needed
+python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
+
+# Or switch to a different vision model
+python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
+
+# All use cached frames and transcript!
+```
+
 ### Traditional Workflow (Separate Steps)
 
 ```bash
 # 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
 whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
 
-# 2. Process video to extract screen content
+# 2. Process video to extract screen content with vision
 python process_meeting.py samples/alo-intro1.mkv \
     --transcript output/alo-intro1.json \
+    --use-vision \
     --scene-detection
 
-# 3. Use the enhanced transcript with Claude
-# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude
+# 3. Use the enhanced transcript with AI
+# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
 ```
 
 ### Example Prompt for Claude
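The enhanced transcript produced above interleaves Whisper audio segments with screen-content segments by timestamp. The merge itself happens in `TranscriptMerger`, whose output format is not shown in this diff; the sketch below is a hypothetical simplification (the `merge_segments` helper and the `[AUDIO]`/`[SCREEN]` tags are illustrative, not the project's actual format).

```python
def merge_segments(audio, screen):
    """Interleave audio and screen segments by timestamp.

    `audio` and `screen` are lists of (timestamp_seconds, text) pairs.
    Returns one chronologically sorted, timestamped line per segment.
    Hypothetical sketch - not the actual TranscriptMerger output format.
    """
    tagged = [(t, f"[AUDIO] {x}") for t, x in audio] + \
             [(t, f"[SCREEN] {x}") for t, x in screen]
    # Sort by timestamp, then render as mm:ss.s lines
    return [f"{int(ts // 60):02d}:{ts % 60:04.1f} {text}"
            for ts, text in sorted(tagged)]
```

Whatever the exact format, sorting both streams into one timeline is what lets the summarizing model see what was on screen while each sentence was spoken.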
@@ -217,42 +287,99 @@ Options:
 
 ## Tips for Best Results
 
+### Vision vs OCR: When to Use Each
+
+**Use Vision Models (`--use-vision`) when:**
+- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
+- ✅ Code walkthroughs or debugging sessions
+- ✅ Complex layouts with mixed content
+- ✅ Need contextual understanding, not just text extraction
+- ✅ Working with charts, graphs, or visualizations
+- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
+
+**Use OCR when:**
+- ✅ Simple text extraction from slides or documents
+- ✅ Need maximum speed
+- ✅ Limited computational resources
+- ✅ Presentations with mostly text
+- ⚠️ Trade-off: Less context-aware, may miss visual relationships
+
+### Context Hints for Vision Analysis
+- **`--vision-context meeting`**: General purpose (default)
+- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
+- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
+- **`--vision-context console`**: Captures commands, output, error messages
+
 ### Scene Detection vs Interval
 - **Scene detection**: Better for presentations with distinct slides. More efficient.
 - **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
 
-### OCR Engine Selection
-- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
-- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
-- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.
+### Vision Model Selection
+- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
+- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
+- **`bakllava`**: Alternative with different strengths
 
 ### Deduplication
 - Enabled by default - removes similar consecutive frames
-- Disable with `--no-deduplicate` if slides change subtly
+- Disable with `--no-deduplicate` if slides/screens change subtly
 
 ## Troubleshooting
 
-### "pytesseract not installed"
+### Vision Model Issues
+
+**"ollama package not installed"**
+```bash
+pip install ollama
+```
+
+**"Ollama not found" or connection errors**
+```bash
+# Install Ollama first: https://ollama.ai/download
+# Then pull a vision model:
+ollama pull llava:13b
+```
+
+**Vision analysis is slow**
+- Use lighter model: `--vision-model llava:7b`
+- Reduce frame count: `--scene-detection` or `--interval 10`
+- Check if Ollama is using GPU (much faster)
+
+**Poor vision analysis results**
+- Try different context hint: `--vision-context code` or `--vision-context dashboard`
+- Use larger model: `--vision-model llava:13b`
+- Ensure frames are clear (check video resolution)
+
+### OCR Issues
+
+**"pytesseract not installed"**
 ```bash
 pip install pytesseract
 sudo apt-get install tesseract-ocr  # Don't forget system package!
 ```
 
-### "No frames extracted"
+**Poor OCR quality**
+- **Solution**: Switch to vision analysis with `--use-vision`
+- Or try different OCR engine: `--ocr-engine easyocr`
+- Check if video resolution is sufficient
+- Use `--no-deduplicate` to keep more frames
+
+### General Issues
+
+**"No frames extracted"**
 - Check video file is valid: `ffmpeg -i video.mkv`
 - Try lower interval: `--interval 3`
 - Check disk space in frames directory
 
-### Poor OCR quality
-- Try different OCR engine
-- Check if video resolution is sufficient
-- Use `--no-deduplicate` to keep more frames
-
-### Scene detection not working
+**Scene detection not working**
 - Fallback to interval extraction automatically
 - Ensure FFmpeg is installed
 - Try manual interval: `--interval 5`
 
+**Cache not being used**
+- Ensure you're using the same video filename
+- Check that output directory contains cached files
+- Use `--verbose` to see what's being cached/loaded
+
 ## Project Structure
 
 ```
meetus/vision_processor.py (new file, 192 lines)
@@ -0,0 +1,192 @@
"""
|
||||||
|
Vision-based frame analysis using local vision-language models via Ollama.
|
||||||
|
Better than OCR for understanding dashboards, code, and console output.
|
||||||
|
"""
|
||||||
|
from typing import List, Tuple, Dict, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
import logging
|
||||||
|
from difflib import SequenceMatcher
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class VisionProcessor:
|
||||||
|
"""Process frames using local vision models via Ollama."""
|
||||||
|
|
||||||
|
def __init__(self, model: str = "llava:13b"):
|
||||||
|
"""
|
||||||
|
Initialize vision processor.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
|
||||||
|
"""
|
||||||
|
self.model = model
|
||||||
|
self._client = None
|
||||||
|
self._init_client()
|
||||||
|
|
||||||
|
def _init_client(self):
|
||||||
|
"""Initialize Ollama client."""
|
||||||
|
try:
|
||||||
|
import ollama
|
||||||
|
self._client = ollama
|
||||||
|
|
||||||
|
# Check if model is available
|
||||||
|
try:
|
||||||
|
models = self._client.list()
|
||||||
|
available_models = [m['name'] for m in models.get('models', [])]
|
||||||
|
|
||||||
|
if self.model not in available_models:
|
||||||
|
logger.warning(f"Model {self.model} not found locally.")
|
||||||
|
logger.info(f"Pulling {self.model}... (this may take a few minutes)")
|
||||||
|
self._client.pull(self.model)
|
||||||
|
logger.info(f"✓ Model {self.model} downloaded")
|
||||||
|
else:
|
||||||
|
logger.info(f"Using local vision model: {self.model}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Could not verify model availability: {e}")
|
||||||
|
logger.info("Attempting to use model anyway...")
|
||||||
|
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError(
|
||||||
|
"ollama package not installed. Run: pip install ollama\n"
|
||||||
|
"Also install Ollama: https://ollama.ai/download"
|
||||||
|
)
|
||||||
|
|
||||||
|
def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
|
||||||
|
"""
|
||||||
|
Analyze a single frame using local vision model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to image file
|
||||||
|
context: Context hint for analysis (meeting, dashboard, code, console)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Analyzed content description
|
||||||
|
"""
|
||||||
|
# Context-specific prompts
|
||||||
|
prompts = {
|
||||||
|
"meeting": """Analyze this screen capture from a meeting recording. Extract:
|
||||||
|
1. Any visible text (titles, labels, headings)
|
||||||
|
2. Key metrics, numbers, or data points shown
|
||||||
|
3. Dashboard panels or visualizations (describe what they show)
|
||||||
|
4. Code snippets (preserve formatting and context)
|
||||||
|
5. Console/terminal output (commands and results)
|
||||||
|
6. Application names or UI elements
|
||||||
|
|
||||||
|
Focus on information that would help someone understand what was being discussed.
|
||||||
|
Be concise but include all important details. If there's code, preserve it exactly.""",
|
||||||
|
|
||||||
|
"dashboard": """Analyze this dashboard/monitoring panel. Extract:
|
||||||
|
1. Panel titles and metrics names
|
||||||
|
2. Current values and units
|
||||||
|
3. Trends (up/down/stable)
|
||||||
|
4. Alerts or warnings
|
||||||
|
5. Time ranges shown
|
||||||
|
6. Any anomalies or notable patterns
|
||||||
|
|
||||||
|
Format as structured data.""",
|
||||||
|
|
||||||
|
"code": """Analyze this code screenshot. Extract:
|
||||||
|
1. Programming language
|
||||||
|
2. File name or path (if visible)
|
||||||
|
3. Code content (preserve exact formatting)
|
||||||
|
4. Comments
|
||||||
|
5. Function/class names
|
||||||
|
6. Any error messages or warnings
|
||||||
|
|
||||||
|
Preserve code exactly as shown.""",
|
||||||
|
|
||||||
|
"console": """Analyze this console/terminal output. Extract:
|
||||||
|
1. Commands executed
|
||||||
|
2. Output/results
|
||||||
|
3. Error messages
|
||||||
|
4. Warnings or status messages
|
||||||
|
5. File paths or URLs
|
||||||
|
|
||||||
|
Preserve formatting and structure."""
|
||||||
|
}
|
||||||
|
|
||||||
|
prompt = prompts.get(context, prompts["meeting"])
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Use Ollama's chat API with vision
|
||||||
|
response = self._client.chat(
|
||||||
|
model=self.model,
|
||||||
|
messages=[
|
||||||
|
{
|
||||||
|
'role': 'user',
|
||||||
|
'content': prompt,
|
||||||
|
'images': [image_path]
|
||||||
|
}
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Extract text from response
|
||||||
|
text = response['message']['content']
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Vision model error for {image_path}: {e}")
|
||||||
|
return ""
|
||||||
|
|
||||||
|
def process_frames(
|
||||||
|
self,
|
||||||
|
frames_info: List[Tuple[str, float]],
|
||||||
|
context: str = "meeting",
|
||||||
|
deduplicate: bool = True,
|
||||||
|
similarity_threshold: float = 0.85
|
||||||
|
) -> List[Dict]:
|
||||||
|
"""
|
||||||
|
Process multiple frames with vision analysis.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
frames_info: List of (frame_path, timestamp) tuples
|
||||||
|
context: Context hint for analysis
|
||||||
|
deduplicate: Whether to remove similar consecutive analyses
|
||||||
|
similarity_threshold: Threshold for considering analyses as duplicates (0-1)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of dicts with 'timestamp', 'text', and 'frame_path'
|
||||||
|
"""
|
||||||
|
results = []
|
||||||
|
prev_text = ""
|
||||||
|
|
||||||
|
total = len(frames_info)
|
||||||
|
logger.info(f"Starting vision analysis of {total} frames...")
|
||||||
|
|
||||||
|
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
|
||||||
|
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
|
||||||
|
|
||||||
|
text = self.analyze_frame(frame_path, context)
|
||||||
|
|
||||||
|
if not text:
|
||||||
|
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Deduplicate similar consecutive frames
|
||||||
|
if deduplicate:
|
||||||
|
similarity = self._text_similarity(prev_text, text)
|
||||||
|
if similarity > similarity_threshold:
|
||||||
|
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
||||||
|
continue
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
'timestamp': timestamp,
|
||||||
|
'text': text,
|
||||||
|
'frame_path': frame_path
|
||||||
|
})
|
||||||
|
|
||||||
|
prev_text = text
|
||||||
|
|
||||||
|
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
|
||||||
|
return results
|
||||||
|
|
||||||
|
def _text_similarity(self, text1: str, text2: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate similarity between two texts.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Similarity score between 0 and 1
|
||||||
|
"""
|
||||||
|
return SequenceMatcher(None, text1, text2).ratio()
|
||||||
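The deduplication check in `process_frames` above compares each frame's analysis text against the previous kept frame with `difflib.SequenceMatcher`. Isolated as a standalone sketch (the `keep_frame` helper name is illustrative, not part of the commit):

```python
from difflib import SequenceMatcher

def keep_frame(prev_text: str, text: str, threshold: float = 0.85) -> bool:
    """Mirror the check in VisionProcessor.process_frames: a frame is
    dropped when its analysis is too similar to the previous kept one."""
    return SequenceMatcher(None, prev_text, text).ratio() <= threshold
```

Note that the comparison is only against the immediately preceding kept frame, so content that alternates between two screens is retained on every switch.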
@@ -13,6 +13,7 @@ import shutil
 
 from meetus.frame_extractor import FrameExtractor
 from meetus.ocr_processor import OCRProcessor
+from meetus.vision_processor import VisionProcessor
 from meetus.transcript_merger import TranscriptMerger
 
 logger = logging.getLogger(__name__)
@@ -98,20 +99,23 @@ def main():
     formatter_class=argparse.RawDescriptionHelpFormatter,
     epilog="""
 Examples:
-  # Run Whisper + full processing in one command
+  # Run Whisper + vision analysis (recommended for code/dashboards)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+
+  # Use vision with specific context hint
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+
+  # Traditional OCR approach
   python process_meeting.py samples/meeting.mkv --run-whisper
 
-  # Process video with existing Whisper transcript
-  python process_meeting.py samples/meeting.mkv --transcript output/meeting.json
+  # Re-run analysis using cached frames and transcript
+  python process_meeting.py samples/meeting.mkv --use-vision
 
-  # Use scene detection instead of interval
-  python process_meeting.py samples/meeting.mkv --run-whisper --scene-detection
+  # Force reprocessing (ignore cache)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
 
-  # Use different Whisper model and OCR engine
-  python process_meeting.py samples/meeting.mkv --run-whisper --whisper-model small --ocr-engine easyocr
+  # Use scene detection for fewer frames
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
 
-  # Extract frames only (no transcript)
-  python process_meeting.py samples/meeting.mkv --extract-only
 """
 )
@@ -177,6 +181,31 @@ Examples:
         default='tesseract'
     )
+
+    parser.add_argument(
+        '--use-vision',
+        action='store_true',
+        help='Use local vision model (Ollama) instead of OCR for better context understanding'
+    )
+
+    parser.add_argument(
+        '--vision-model',
+        help='Vision model to use with Ollama (default: llava:13b)',
+        default='llava:13b'
+    )
+
+    parser.add_argument(
+        '--vision-context',
+        choices=['meeting', 'dashboard', 'code', 'console'],
+        help='Context hint for vision analysis (default: meeting)',
+        default='meeting'
+    )
+
+    parser.add_argument(
+        '--no-cache',
+        action='store_true',
+        help='Disable caching - reprocess everything even if outputs exist'
+    )
+
     parser.add_argument(
         '--no-deduplicate',
         action='store_true',
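The four flags added in this hunk compose with the existing CLI as ordinary argparse options. A minimal parser containing just the new flags behaves like this (standalone sketch; the real parser has many more options):

```python
import argparse

# Reduced parser with only the options added in this commit
parser = argparse.ArgumentParser()
parser.add_argument('video')
parser.add_argument('--use-vision', action='store_true')
parser.add_argument('--vision-model', default='llava:13b')
parser.add_argument('--vision-context',
                    choices=['meeting', 'dashboard', 'code', 'console'],
                    default='meeting')
parser.add_argument('--no-cache', action='store_true')

args = parser.parse_args(['meeting.mkv', '--use-vision', '--vision-context', 'code'])
# args.use_vision is True, args.vision_context is 'code',
# args.vision_model keeps its default, args.no_cache is False
```

Because `--vision-context` uses `choices`, an unknown hint fails fast at parse time instead of silently falling back inside the prompt lookup.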
@@ -221,8 +250,17 @@ Examples:
     if args.output is None:
         args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
 
-    # Run Whisper if requested
+    # Define cache paths
+    whisper_cache = output_dir / f"{video_path.stem}.json"
+    analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
+    frames_cache_dir = Path(args.frames_dir)
+
+    # Check for cached Whisper transcript
     if args.run_whisper:
+        if not args.no_cache and whisper_cache.exists():
+            logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
+            args.transcript = str(whisper_cache)
+        else:
             logger.info("=" * 80)
             logger.info("STEP 0: Running Whisper Transcription")
             logger.info("=" * 80)
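The cache path defined above keys the analysis file on both the video stem and the analysis backend, so switching between `--use-vision` and OCR never loads the wrong cache. The naming rule in isolation (the `analysis_cache_path` helper is illustrative, not a function in the commit):

```python
from pathlib import Path

def analysis_cache_path(output_dir: Path, stem: str, use_vision: bool) -> Path:
    """Reproduce the cache-file naming used in the hunk above:
    <stem>_vision.json for vision analysis, <stem>_ocr.json for OCR."""
    return output_dir / f"{stem}_{'vision' if use_vision else 'ocr'}.json"
```

Keeping the backend in the filename is what makes the "switch from OCR to vision using existing frames" workflow safe: only the frames and transcript are shared between backends.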
```diff
@@ -234,14 +272,35 @@ Examples:
     logger.info("MEETING PROCESSOR")
     logger.info("=" * 80)
     logger.info(f"Video: {video_path.name}")
-    logger.info(f"OCR Engine: {args.ocr_engine}")
+    logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
+    if args.use_vision:
+        logger.info(f"Vision Model: {args.vision_model}")
+        logger.info(f"Context: {args.vision_context}")
     logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
     if args.transcript:
         logger.info(f"Transcript: {args.transcript}")
+    logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
     logger.info("=" * 80)

-    # Step 1: Extract frames
+    # Step 1: Extract frames (with caching)
     logger.info("Step 1: Extracting frames from video...")
+
+    # Check if frames already exist
+    existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
+
+    if not args.no_cache and existing_frames and len(existing_frames) > 0:
+        logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
+        # Build frames_info from existing files
+        frames_info = []
+        for frame_path in sorted(existing_frames):
+            # Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
+            try:
+                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
+                timestamp = float(timestamp_str)
+            except:
+                timestamp = 0.0
+            frames_info.append((str(frame_path), timestamp))
+    else:
         extractor = FrameExtractor(str(video_path), args.frames_dir)

         if args.scene_detection:
@@ -255,7 +314,30 @@ Examples:

     logger.info(f"✓ Extracted {len(frames_info)} frames")

-    # Step 2: Run OCR on frames
+    # Step 2: Run analysis on frames (with caching)
+    if not args.no_cache and analysis_cache.exists():
+        logger.info(f"✓ Found cached analysis results: {analysis_cache}")
+        with open(analysis_cache, 'r', encoding='utf-8') as f:
+            screen_segments = json.load(f)
+        logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
+    else:
+        if args.use_vision:
+            # Use vision model
+            logger.info("Step 2: Running vision analysis on extracted frames...")
+            try:
+                vision = VisionProcessor(model=args.vision_model)
+                screen_segments = vision.process_frames(
+                    frames_info,
+                    context=args.vision_context,
+                    deduplicate=not args.no_deduplicate
+                )
+                logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
+            except ImportError as e:
+                logger.error(f"{e}")
+                sys.exit(1)
+        else:
+            # Use OCR
             logger.info("Step 2: Running OCR on extracted frames...")
             try:
                 ocr = OCRProcessor(engine=args.ocr_engine)
@@ -263,7 +345,7 @@ Examples:
                     frames_info,
                     deduplicate=not args.no_deduplicate
                 )
-                logger.info(f"✓ Processed {len(screen_segments)} frames with text content")
+                logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")

             except ImportError as e:
                 logger.error(f"{e}")
@@ -271,11 +353,10 @@ Examples:
                 logger.error(f"  pip install {args.ocr_engine}")
                 sys.exit(1)

-    # Save OCR results as JSON
-    ocr_output = output_dir / f"{video_path.stem}_ocr.json"
-    with open(ocr_output, 'w', encoding='utf-8') as f:
+    # Save analysis results as JSON
+    with open(analysis_cache, 'w', encoding='utf-8') as f:
         json.dump(screen_segments, f, indent=2, ensure_ascii=False)
-    logger.info(f"✓ Saved OCR results to: {ocr_output}")
+    logger.info(f"✓ Saved analysis results to: {analysis_cache}")

     if args.extract_only:
         logger.info("Done! (extract-only mode)")
```
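The cached-frame branch above rebuilds `frames_info` by parsing timestamps out of the frame filenames rather than re-extracting frames. A self-contained sketch of that parsing (filenames are made up for illustration; the commit uses a bare `except`, tightened here to `ValueError`):

```python
from pathlib import Path

def timestamp_from_frame(frame_path: Path) -> float:
    """Parse the trailing '<seconds>s' token from a cached frame filename,
    e.g. 'video_00001_12.34s.jpg' -> 12.34; fall back to 0.0 on mismatch."""
    try:
        return float(frame_path.stem.split('_')[-1].rstrip('s'))
    except ValueError:
        return 0.0

# Rebuild (path, timestamp) pairs sorted by filename, as the cached branch does
frames = [Path("meeting_00002_30.00s.jpg"), Path("meeting_00001_12.34s.jpg")]
frames_info = [(str(p), timestamp_from_frame(p)) for p in sorted(frames)]
```

Note the fallback to `0.0` silently mis-orders frames whose names don't match the pattern, which is why the cache key (`{stem}_*.jpg`) restricts the glob to frames produced for this specific video.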
```diff
@@ -2,13 +2,17 @@
 opencv-python>=4.8.0
 Pillow>=10.0.0

-# OCR engines (install at least one)
-# Tesseract (recommended, lightweight)
+# Vision analysis (recommended for better results)
+# Requires Ollama to be installed: https://ollama.ai/download
+ollama>=0.1.0
+
+# OCR engines (alternative to vision analysis)
+# Tesseract (lightweight, basic text extraction)
 pytesseract>=0.3.10

-# Alternative OCR engines (optional, install as needed)
+# Alternative OCR engines (optional)
 # easyocr>=1.7.0
 # paddleocr>=2.7.0

-# For Whisper transcription (if not already installed)
+# For Whisper transcription (recommended)
 # openai-whisper>=20230918
```
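The newly added `ollama` dependency is the Python client that `VisionProcessor` would use to send a frame plus a prompt to a local LLaVA model. The commit doesn't show `VisionProcessor` internals, so this is only a sketch of how the ollama chat API is typically invoked with images; the prompt wording, frame path, and helper name are illustrative:

```python
def build_vision_request(frame_path: str, context: str, model: str = "llava:13b") -> dict:
    """Assemble the chat request the ollama Python client expects for one
    frame; the 'images' field takes a list of image file paths (or bytes)."""
    prompt = (
        f"This frame is from a meeting about: {context}. "
        "Describe any visible screen content (code, dashboards, slides) concisely."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [frame_path]}],
    }

request = build_vision_request("frames/meeting_00001_12.34s.jpg", "quarterly infra review")
# With a local Ollama server running and the model pulled, the call would be:
#   import ollama
#   reply = ollama.chat(**request)
#   description = reply["message"]["content"]
```

Passing meeting context in the prompt is what lets the vision path outperform OCR on dashboards and consoles: the model can say what a chart shows, not just transcribe its labels.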