Compare commits

...

10 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Mariano Gabriel | eb8b1f4f11 | updated readme | 2025-12-04 20:24:52 -03:00 |
| Mariano Gabriel | 331cccb15f | updated readme | 2025-12-04 20:15:16 -03:00 |
| Mariano Gabriel | 7d7ec15ff7 | add whisperx support | 2025-12-03 06:48:45 -03:00 |
| Mariano Gabriel | 7b919beda6 | add whisperx support | 2025-12-02 02:33:39 -03:00 |
| Mariano Gabriel | 118ef04223 | embed images | 2025-10-28 08:02:45 -03:00 |
| Mariano Gabriel | b1e1daf278 | scene detection quality and caching | 2025-10-28 05:52:31 -03:00 |
| Mariano Gabriel | c871af2def | group text | 2025-10-23 14:49:14 -03:00 |
| Mariano Gabriel | cdf7ad1199 | update prompts | 2025-10-20 17:36:31 -03:00 |
| Mariano Gabriel | b9c3cbfbab | take turns using the GPU | 2025-10-20 01:12:13 -03:00 |
| Mariano Gabriel | cd7b0aed07 | refactor | 2025-10-20 00:03:41 -03:00 |
21 changed files with 2331 additions and 606 deletions

9
.gitignore vendored

@@ -2,10 +2,11 @@
samples/*
!samples/.gitkeep
# Output files
# Output directories (timestamped folders for each video)
output/*
!output/.gitkeep
# Extracted frames
frames/
__pycache__
# Python cache
__pycache__
*.pyc
.pytest_cache/

407
README.md

@@ -1,34 +1,21 @@
# Meeting Processor
Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.
## Overview
This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)
- **Audio transcription** (Whisper or WhisperX with speaker diarization)
- **Screen content extraction** via FFmpeg scene detection
- **Frame embedding** for direct LLM analysis
### Vision Analysis vs OCR
- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
- **OCR**: Traditional text extraction - faster but less context-aware
The result is a rich, timestamped transcript that provides full context for AI summarization.
The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.
## Installation
### 1. System Dependencies
**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for lighter model:
ollama pull llava:7b
```
**FFmpeg** (for scene detection):
**FFmpeg** (required for scene detection and frame extraction):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg
@@ -37,210 +24,152 @@ sudo apt-get install ffmpeg
brew install ffmpeg
```
**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Arch Linux
sudo pacman -S tesseract
```
### 2. Python Dependencies
```bash
pip install -r requirements.txt
```
### 3. Whisper (for audio transcription)
### 3. Whisper or WhisperX (for audio transcription)
**Standard Whisper:**
```bash
pip install openai-whisper
```
### 4. Optional: Install Alternative OCR Engines
If you prefer OCR over vision analysis:
**WhisperX** (recommended - includes speaker diarization):
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr
# PaddleOCR (better for code/terminal screens)
pip install paddleocr
pip install whisperx
```
For speaker diarization, you'll need a HuggingFace token with access to pyannote models.
## Quick Start
### Recommended: Vision Analysis (Best for Code/Dashboards)
### Recommended Usage
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
```
This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder
1. Run WhisperX transcription with speaker diarization
2. Extract frames at scene changes (threshold 10 = moderately sensitive)
3. Create an enhanced transcript with frame file references
4. Save everything to `output/` folder
The `--embed-images` flag adds frame paths to the transcript (e.g., `Frame: frames/video_00257.jpg`). This keeps the transcript small, while the frames themselves stay in the `frames/` folder for the LLM to access.
### Re-run with Cached Results
Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision
# Uses cached transcript and frames
python process_meeting.py samples/meeting.mkv --embed-images
# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
# Skip only specific cached items
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
### Traditional OCR (Faster, Less Context-Aware)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
```
## Usage Examples
### Vision Analysis with Context Hints
### Scene Detection Options
```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# Default threshold (15)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize
# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
# More sensitive (more frames, threshold: 5)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --diarize
# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
# Less sensitive (fewer frames, threshold: 30)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 30 --diarize
```
### Different Vision Models
```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```
### Extract frames at different intervals
### Fixed Interval Extraction (alternative to scene detection)
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
python process_meeting.py samples/meeting.mkv --embed-images --interval 10 --diarize
# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```
### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```
### Traditional OCR (if you prefer)
```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper
# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
python process_meeting.py samples/meeting.mkv --embed-images --interval 3 --diarize
```
### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv
# Iterate on scene threshold (reuse whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision
# Re-run whisper only
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
```
### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --output-dir my_outputs/
```
### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --verbose
```
## Output Files
All output files are saved to the `output/` directory by default:
Each video gets its own timestamped output directory:
- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
- **`frames/`** - Extracted video frames (JPG files)
```
output/
└── 20241019_143022-meeting/
    ├── manifest.json           # Processing configuration
    ├── meeting_enhanced.txt    # Enhanced transcript for AI
    ├── meeting.json            # Whisper/WhisperX transcript
    └── frames/                 # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
```
### Caching Behavior
The tool automatically caches intermediate results to speed up re-runs:
- **Whisper transcript**: Cached as `output/<video>.json`
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
The tool automatically reuses the most recent output directory for the same video:
- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Subsequent runs**: Reuses the same directory and cached results
- **Cached items**: Whisper transcript, extracted frames, analysis results
Re-running with the same video will use cached results unless `--no-cache` is specified.
**Fine-grained cache control:**
- `--no-cache`: Force complete reprocessing
- `--skip-cache-frames`: Re-extract frames only
- `--skip-cache-whisper`: Re-run transcription only
- `--skip-cache-analysis`: Re-run analysis only
This allows you to iterate on scene detection thresholds without re-running Whisper!
## Workflow for Meeting Analysis
### Complete Workflow (One Command!)
```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
# Output will be in output/alo-intro1_enhanced.txt
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
```
### Typical Iterative Workflow
```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
# All use cached frames and transcript!
```
### Traditional Workflow (Separate Steps)
```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
--transcript output/alo-intro1.json \
--use-vision \
--scene-detection
# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
# Adjust scene threshold (keeps cached whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
```
### Example Prompt for Claude
@@ -260,64 +189,54 @@ Please summarize this meeting transcript. Pay special attention to:
```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
[--whisper-model {tiny,base,small,medium,large}]
[--output OUTPUT] [--output-dir OUTPUT_DIR]
[--frames-dir FRAMES_DIR] [--interval INTERVAL]
[--scene-detection]
[--ocr-engine {tesseract,easyocr,paddleocr}]
[--no-deduplicate] [--extract-only]
[--format {detailed,compact}] [--verbose]
video
[--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
[--interval INTERVAL] [--scene-detection]
[--scene-threshold SCENE_THRESHOLD]
[--embed-images] [--embed-quality EMBED_QUALITY]
[--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
[--skip-cache-analysis] [--no-deduplicate]
[--extract-only] [--format {detailed,compact}]
[--verbose] video
Options:
video Path to video file
--transcript, -t Path to Whisper transcript (JSON or TXT)
--run-whisper Run Whisper transcription before processing
--whisper-model Whisper model: tiny, base, small, medium, large (default: base)
--output, -o Output file for enhanced transcript
--output-dir Directory for output files (default: output/)
--frames-dir Directory to save extracted frames (default: frames/)
--interval Extract frame every N seconds (default: 5)
--scene-detection Use scene detection instead of interval extraction
--ocr-engine OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
--no-deduplicate Disable text deduplication
--extract-only Only extract frames and OCR, skip transcript merging
--format Output format: detailed or compact (default: detailed)
--verbose, -v Enable verbose logging (DEBUG level)
Main Options:
video Path to video file
--diarize Use WhisperX with speaker diarization
--embed-images Add frame file references to transcript (recommended)
Frame Extraction:
--scene-detection Use FFmpeg scene detection (recommended)
--scene-threshold Detection sensitivity 0-100 (default: 15, lower=more sensitive)
--interval Extract frame every N seconds (alternative to scene detection)
Caching:
--no-cache Force complete reprocessing
--skip-cache-frames Re-extract frames only
--skip-cache-whisper Re-run transcription only
--skip-cache-analysis Re-run analysis only
Other:
--run-whisper Run Whisper (without diarization)
--whisper-model Whisper model: tiny, base, small, medium, large (default: medium)
--transcript, -t Path to existing Whisper transcript (JSON or TXT)
--output, -o Output file for enhanced transcript
--output-dir Directory for output files (default: output/)
--verbose, -v Enable verbose logging
```
## Tips for Best Results
### Vision vs OCR: When to Use Each
**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds.
### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
### Scene Detection Threshold
- Lower values (5-10): More sensitive, captures more frames
- Default (15): Good balance for most meetings
- Higher values (20-30): Less sensitive, fewer frames
### Whisper vs WhisperX
- **Whisper** (`--run-whisper`): Standard transcription, fast
- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token
### Deduplication
- Enabled by default - removes similar consecutive frames
@@ -325,73 +244,75 @@ Options:
## Troubleshooting
### Vision Model Issues
**"ollama package not installed"**
```bash
pip install ollama
```
**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```
**Vision analysis is slow**
- Use lighter model: `--vision-model llava:7b`
- Reduce frame count: `--scene-detection` or `--interval 10`
- Check if Ollama is using GPU (much faster)
**Poor vision analysis results**
- Try different context hint: `--vision-context code` or `--vision-context dashboard`
- Use larger model: `--vision-model llava:13b`
- Ensure frames are clear (check video resolution)
### OCR Issues
**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr # Don't forget system package!
```
**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try different OCR engine: `--ocr-engine easyocr`
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames
### General Issues
### Frame Extraction Issues
**"No frames extracted"**
- Check video file is valid: `ffmpeg -i video.mkv`
- Try lower interval: `--interval 3`
- Check disk space in frames directory
- Try lower scene threshold: `--scene-threshold 5`
- Try interval extraction: `--interval 3`
- Check disk space in output directory
**Scene detection not working**
- Fallback to interval extraction automatically
- Ensure FFmpeg is installed
- Falls back to interval extraction automatically
- Try manual interval: `--interval 5`
### Whisper/WhisperX Issues
**WhisperX diarization not working**
- Ensure you have a HuggingFace token set
- Token needs access to pyannote models
- Fall back to standard Whisper without `--diarize`
### Cache Issues
**Cache not being used**
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use `--verbose` to see what's being cached/loaded
**Want to re-run specific steps**
- `--skip-cache-frames`: Re-extract frames
- `--skip-cache-whisper`: Re-run transcription
- `--skip-cache-analysis`: Re-run analysis
- `--no-cache`: Force complete reprocessing
## Experimental Features
### OCR and Vision Analysis
OCR (`--ocr-engine`) and Vision analysis (`--use-vision`) options are available but experimental. The recommended approach is to use `--embed-images` which embeds frame references directly in the transcript, letting your LLM analyze the images.
```bash
# Experimental: OCR extraction
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
# Experimental: Vision model analysis
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Experimental: Hybrid OpenCV + OCR
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
```
## Project Structure
```
meetus/
├── meetus/ # Main package
├── meetus/ # Main package
│ ├── __init__.py
│ ├── frame_extractor.py # Video frame extraction
│ ├── ocr_processor.py # OCR processing
│ └── transcript_merger.py # Transcript merging
├── process_meeting.py # Main CLI script
├── requirements.txt # Python dependencies
└── README.md # This file
│ ├── workflow.py # Processing orchestrator
│ ├── output_manager.py # Output directory & manifest management
│ ├── cache_manager.py # Caching logic
│ ├── frame_extractor.py # Video frame extraction (FFmpeg scene detection)
│ ├── vision_processor.py # Vision model analysis (experimental)
│ ├── ocr_processor.py # OCR processing (experimental)
│ └── transcript_merger.py # Transcript merging
├── process_meeting.py # Main CLI script
├── requirements.txt # Python dependencies
├── output/ # Timestamped output directories
│ └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video
├── samples/ # Sample videos (gitignored)
└── README.md # This file
```
## License


@@ -0,0 +1,80 @@
# 01 - Scene Detection Sensitivity, Image Quality, and Granular Caching
## Date
2025-10-28
## Context
The last run on the zaca-run-scrapers sample (a Zed editor walkthrough) detected only 19 frames, with gaps of 7+ minutes. Whisper did not run because the `--run-whisper` flag was not passed, and JPEG compression left code and text hard to read.
## Problems Identified
1. **Scene detection too conservative** - Default threshold of 30.0 missed file switches and scrolling in clean UI (Zed vs VS Code)
2. **No whisper transcription** - User expected it to run but `--run-whisper` is opt-in
3. **Poor JPEG quality** - Default compression made code/text hard to read for OCR/vision
4. **Subprocess-based FFmpeg** - Using shell commands instead of Python library
5. **All-or-nothing caching** - `--no-cache` regenerates everything including slow whisper transcription
## Changes Made
### 1. Scene Detection Sensitivity
**Files:** `meetus/frame_extractor.py`, `process_meeting.py`, `meetus/workflow.py`
- Lowered default threshold: `30.0` → `15.0` (more sensitive for clean UIs)
- Added `--scene-threshold` CLI argument (0-100, lower = more sensitive)
- Added threshold to manifest for tracking
- Updated docstring with usage guidelines:
- 15.0: Good for clean UIs like Zed
- 20-30: Busy UIs like VS Code
- 5-10: Very subtle changes
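
As a rough sketch of how a 0-100 threshold could be handed to FFmpeg: the `select` filter exposes a `scene` score between 0 and 1, so the CLI value needs scaling. The helper names below are hypothetical, and dividing by 100 is an assumption about how `frame_extractor.py` maps the value, not something this changelog confirms.

```python
import shlex


def scene_select_expr(threshold: float) -> str:
    """Build an FFmpeg select expression from a 0-100 threshold (assumed scale)."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # FFmpeg's `scene` score ranges from 0.0 to 1.0, so scale the CLI value down.
    return f"gt(scene,{threshold / 100:.3f})"


def scene_detect_command(video: str, out_pattern: str, threshold: float = 15.0) -> list[str]:
    """Assemble an ffmpeg argument list that keeps only scene-change frames."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select='{scene_select_expr(threshold)}',showinfo",
        "-vsync", "vfr",  # one output frame per selected frame; newer builds prefer -fps_mode vfr
        "-q:v", "2",      # high JPEG quality, matching the -q:v 2 setting in this changelog
        out_pattern,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(scene_detect_command("samples/video.mkv", "frames/frame_%05d.jpg")))
```

Lower thresholds yield a smaller cutoff in the `gt(scene, ...)` comparison, which is why they capture more frames.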
### 2. JPEG Quality Improvements
**Files:** `meetus/frame_extractor.py`
- **Interval extraction**: Added `cv2.IMWRITE_JPEG_QUALITY, 95` (line 60)
- **Scene detection**: Added `-q:v 2` to FFmpeg (best quality, line 94)
### 3. Migration to ffmpeg-python
**Files:** `meetus/frame_extractor.py`, `requirements.txt`
- Replaced `subprocess.run()` with `ffmpeg-python` library
- Cleaner, more Pythonic API
- Better error handling with `ffmpeg.Error`
- Added to requirements.txt
### 4. Granular Cache Control
**Files:** `process_meeting.py`, `meetus/workflow.py`, `meetus/cache_manager.py`
Added three new flags for selective cache invalidation:
- `--skip-cache-frames`: Regenerate frames (useful when tuning scene threshold)
- `--skip-cache-whisper`: Rerun whisper transcription
- `--skip-cache-analysis`: Rerun OCR/vision analysis
**Key design:**
- `--no-cache`: Still works as before (new directory + regenerate everything)
- New flags: Reuse existing output directory but selectively invalidate caches
- Frames are cleaned up when regenerating to avoid stale data
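
The flag semantics above can be sketched as a small decision helper. This is a minimal illustration only: the `CacheFlags` class and its method names are hypothetical and are not taken from `cache_manager.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheFlags:
    """Hypothetical container mirroring the four CLI cache flags."""
    no_cache: bool = False
    skip_cache_frames: bool = False
    skip_cache_whisper: bool = False
    skip_cache_analysis: bool = False

    def use_cache(self, kind: str) -> bool:
        """Return True if cached results of `kind` may be reused."""
        if self.no_cache:
            return False  # --no-cache invalidates everything
        skipped = {
            "frames": self.skip_cache_frames,
            "whisper": self.skip_cache_whisper,
            "analysis": self.skip_cache_analysis,
        }
        return not skipped[kind]

    def new_output_dir(self) -> bool:
        """Only --no-cache starts a fresh output directory."""
        return self.no_cache
```

With `CacheFlags(skip_cache_frames=True, skip_cache_analysis=True)`, the Whisper transcript is still reused while frames and analysis are regenerated in the existing directory.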
## Typical Workflow
```bash
# First run - generate everything including whisper (expensive, once)
python process_meeting.py samples/video.mkv --run-whisper --scene-detection --use-vision
# Iterate on scene threshold without re-running whisper
python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 10 --use-vision --skip-cache-frames --skip-cache-analysis
# Try even more sensitive
python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 5 --use-vision --skip-cache-frames --skip-cache-analysis
```
## Notes
- Whisper is the most expensive and reliable step → always cache it during iteration
- Scene detection needs tuning per UI style (Zed vs VS Code)
- Vision analysis should regenerate when frames change
- Walking through code (file switches, scrolling) should trigger scene changes
## Files Modified
- `meetus/frame_extractor.py` - Scene threshold, quality, ffmpeg-python
- `meetus/workflow.py` - Cache flags, frame cleanup
- `meetus/cache_manager.py` - Granular cache checks
- `process_meeting.py` - CLI arguments
- `requirements.txt` - Added ffmpeg-python


@@ -0,0 +1,111 @@
# 02 - Hybrid OpenCV + OCR + LLM Approach
## Date
2025-10-28
## Context
Vision models (llava) were hallucinating text content badly - showing HTML code when there was none, inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.
## Problem
- **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
- **Pure OCR**: Accurate text but messy output, lost indentation/formatting
- **Need**: Accurate text extraction + preserved code structure
## Solution: Three-Stage Hybrid Approach
### Stage 1: OpenCV Text Detection
Use morphological operations to find text regions:
- Adaptive thresholding (handles varying lighting)
- Dilation with horizontal kernel to connect text lines
- Contour detection to find bounding boxes
- Filter by area and aspect ratio
- Merge overlapping regions
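
The final step, merging overlapping regions, needs no OpenCV at all. A minimal sketch follows, assuming boxes are `(x, y, width, height)` tuples such as `cv2.boundingRect` returns; the function names are illustrative and not taken from `hybrid_processor.py`.

```python
Box = tuple[int, int, int, int]  # (x, y, width, height)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)


def merge_overlapping(boxes: list[Box]) -> list[Box]:
    """Repeatedly merge overlapping boxes until none overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: list[Box] = []
        for box in merged:
            for i, kept in enumerate(result):
                if overlaps(box, kept):
                    result[i] = union(box, kept)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged
```

The outer loop repeats because a merged box can grow enough to touch a region it did not overlap before.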
### Stage 2: Region-Based OCR
- Sort regions by reading order (top-to-bottom, left-to-right)
- Crop each region from original image
- Run OCR on cropped regions (more accurate than full frame)
- Tesseract with PSM 6 mode to preserve layout
- Preserve indentation in cleaning step
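
Reading-order sorting can be sketched as follows. The `row_tolerance` parameter is an assumption introduced for illustration: boxes whose top edges lie within that many pixels of the first box in a row are treated as the same row.

```python
def sort_reading_order(boxes, row_tolerance=10):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right within a row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Compare against the first box of the current row to decide row membership.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]
```

A plain sort on `(y, x)` would place a box at `y=10` before a box at `y=8` to its left only by accident of pixel jitter; grouping into rows first avoids that.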
### Stage 3: Optional LLM Cleanup
- Take accurate OCR output (no hallucination)
- Use lightweight LLM (llama3.2:3b for speed) to:
- Fix obvious OCR errors (l→1, O→0)
- Restore code indentation and structure
- Preserve exact text content
- No added explanations or hallucinated content
## Benefits
- **Accurate**: OCR reads actual pixels, no hallucination
- **Fast**: OpenCV detection is instant, focused OCR is quick
- **Structured**: Regions separated with headers showing position
- **Formatted**: Optional LLM cleanup preserves/restores code structure
- **Deterministic**: Same input = same output (unlike vision models)
## Implementation
**New file:** `meetus/hybrid_processor.py`
- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
- Region sorting for proper reading order
- Visual separators between regions
**CLI flags:**
```bash
--use-hybrid # Enable hybrid mode
--hybrid-llm-cleanup # Add LLM post-processing (optional)
--hybrid-llm-model MODEL # LLM model (default: llama3.2:3b)
```
**OCR improvements:**
- Tesseract PSM 6 mode for better layout preservation
- Modified text cleaning to keep indentation
- `preserve_layout` parameter
## Usage
```bash
# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection
# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v
# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
```
## Output Format
```
[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}
============================================================
[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
```
## Performance
- **Without LLM cleanup**: Very fast (~2-3s per frame)
- **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
- **Accuracy**: Much better than vision model hallucinations
## When to Use What
| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |
## Files Modified
- `meetus/hybrid_processor.py` - New hybrid processor
- `meetus/ocr_processor.py` - Layout preservation
- `meetus/workflow.py` - Hybrid mode integration
- `process_meeting.py` - CLI flags and examples


@@ -0,0 +1,100 @@
# 03 - Embed Images for LLM Analysis
## Date
2025-10-28
## Context
Hybrid OCR approach was fast and accurate but formatting was messy. Vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.
## Problem
- OCR/vision models either hallucinate or produce messy text
- Code formatting/indentation is hard to preserve
- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
- Need to keep file size reasonable (~200KB per image is too big)
## Solution: Image Embedding
Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
- See the actual screen content (no hallucination)
- Understand code structure, layout, and formatting visually
- Have full audio transcript context for each frame
- Analyze dashboards, terminals, editors with perfect accuracy
## Implementation
**Quality Optimization:**
- Default JPEG quality: 80 (good tradeoff between size and readability)
- Configurable via `--embed-quality` (0-100)
- Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)
**Format:**
```
[MM:SS] SPEAKER:
Audio transcript text here
[MM:SS] SCREEN CONTENT:
IMAGE (base64, 52KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
TEXT:
| Optional OCR text for reference
```
**Features:**
- Base64 encoding for easy embedding
- Size tracking and reporting
- Optional text content alongside images
- Works with scene detection for smart frame selection
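
A minimal sketch of the encoding step, assuming the frame has already been re-compressed to the target JPEG quality and read into memory. The function name is hypothetical, and reporting the size of the encoded string rather than the raw file is an assumption.

```python
import base64


def embed_jpeg(jpeg_bytes: bytes) -> str:
    """Format JPEG bytes as an IMAGE block like the one shown above."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = len(encoded) / 1024  # size of the base64 text, not the raw file
    return (
        f"IMAGE (base64, {size_kb:.0f}KB):\n"
        f"<image>data:image/jpeg;base64,{encoded}</image>"
    )
```

Any JPEG starts with the bytes `FF D8 FF`, which is why the encoded payload in the format example begins with `/9j/`.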
## Usage
```bash
# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v
# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v
# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v
# Iterate on scene threshold (reuse whisper)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
```
## File Sizes
**Example for 20 frames:**
- Quality 60: ~30-50KB per image = 0.6-1MB total
- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
- Quality 90: ~80-120KB per image = 1.6-2.4MB total
- Original: ~200KB per image = 4MB total
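
The totals above are simple multiplication, sketched below. One caveat worth keeping in mind: base64 emits 4 bytes for every 3 input bytes, so embedded images are about a third larger than the JPEG files they come from. The changelog does not say whether its per-image figures are measured before or after encoding, so the second helper is illustrative only.

```python
def total_size_mb(frames: int, kb_low: float, kb_high: float) -> tuple[float, float]:
    """Total size range in MB for `frames` images of kb_low..kb_high each."""
    return (frames * kb_low / 1000, frames * kb_high / 1000)


def base64_size_kb(raw_kb: float) -> float:
    """Size after base64 encoding: 4 output bytes per 3 input bytes."""
    return raw_kb * 4 / 3
```

For 20 frames at quality 80, `total_size_mb(20, 40, 80)` reproduces the 0.8-1.6MB range listed above.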
## Benefits
- **No hallucination**: LLM sees actual pixels
- **Perfect formatting**: Code structure preserved visually
- **Full context**: Audio transcript + visual frame together
- **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
- **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller files vs original (~40-80KB vs ~200KB per image)
- **Simple workflow**: One file contains everything
## Use Cases
- **Code walkthroughs:** LLM can see actual code structure and indentation
- **Dashboard analysis:** Charts, graphs, metrics visible to LLM
- **Terminal sessions:** Commands and output in proper context
- **UI reviews:** Actual interface visible with audio commentary
## Files Modified
- `meetus/transcript_merger.py` - Image encoding and embedding
- `meetus/workflow.py` - Wire through config
- `process_meeting.py` - CLI flags
- `meetus/output_manager.py` - Cleaner directory naming (date + increment)
## Output Directory Naming
Also changed output directory format for clarity:
- Old: `20251028_054553-video` (confusing timestamps)
- New: `20251028-001-video` (clear date + run number)
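
A minimal sketch of the new naming scheme, assuming the name is built from the date, a zero-padded run number, and the video stem. `run_dir_name` is a hypothetical helper, not a function from the codebase.

```python
from datetime import date


def run_dir_name(day: date, run_number: int, video_name: str) -> str:
    """Build a directory name of the form YYYYMMDD-NNN-videoname."""
    return f"{day:%Y%m%d}-{run_number:03d}-{video_name}"


print(run_dir_name(date(2025, 10, 28), 1, "video"))
```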


@@ -0,0 +1,78 @@
# 04 - Fix Whisper Cache Loading
## Date
2025-10-28
## Problem
The enhanced transcript omitted the audio segments from cached Whisper transcripts when the tool was run without the `--run-whisper` flag.
Example command that failed:
```bash
python process_meeting.py samples/zaca-run-scrapers.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
```
Result: Enhanced transcript only contained embedded images, no audio segments (0 "SPEAKER" entries).
## Root Cause
In `workflow.py`, the `_run_whisper()` method was checking the `run_whisper` flag **before** checking the cache:
```python
def _run_whisper(self) -> Optional[str]:
    if not self.config.run_whisper:
        return self.config.transcript_path  # Returns None if --transcript not specified

    # Cache check NEVER REACHED if run_whisper is False
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)
```
This meant:
- User runs command without `--run-whisper`
- Method returns None immediately
- Cached whisper transcript is never discovered
- No audio segments in enhanced output
## Solution
Reorder the logic to check cache **first**, regardless of flags:
```python
def _run_whisper(self) -> Optional[str]:
    """Run Whisper transcription if requested, or use cached/provided transcript."""
    # First, check cache (regardless of run_whisper flag)
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)

    # If no cache and not running whisper, use provided transcript path (if any)
    if not self.config.run_whisper:
        return self.config.transcript_path

    # If no cache and run_whisper is True, run whisper transcription
    # ... rest of whisper code
```
## New Behavior
1. Cache is checked first (regardless of `--run-whisper` flag)
2. If cached whisper exists, use it
3. If no cache and `--run-whisper` not specified, use `--transcript` path (or None)
4. If no cache and `--run-whisper` specified, run whisper
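
The resolution order above can be sketched as a standalone function. This is an illustrative reduction, not the actual `_run_whisper()` method: `resolve_transcript` and its `transcribe` callback are hypothetical names standing in for the cache manager and the Whisper call.

```python
from typing import Callable, Optional


def resolve_transcript(
    cached: Optional[str],
    run_whisper: bool,
    transcript_path: Optional[str],
    transcribe: Callable[[], str],
) -> Optional[str]:
    """Pick the transcript source in the order: cache, provided path, fresh run."""
    if cached is not None:
        return cached            # steps 1-2: the cache wins regardless of flags
    if not run_whisper:
        return transcript_path   # step 3: may be None if --transcript was not given
    return transcribe()          # step 4: no cache and --run-whisper was requested


def fresh_run() -> str:
    return "fresh.json"


print(resolve_transcript("cached.json", False, None, fresh_run))
print(resolve_transcript(None, False, "given.json", fresh_run))
print(resolve_transcript(None, True, None, fresh_run))
```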
## Benefits
✓ Cached whisper transcripts are always discovered and used
✓ User can iterate on frame extraction/analysis without re-running whisper
✓ Enhanced transcripts now properly include both audio + visual content
✓ Granular cache flags (`--skip-cache-frames`, `--skip-cache-whisper`) work as expected
## Use Case
```bash
# First run: Generate whisper transcript + extract frames
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
# Second run: Iterate on scene threshold without re-running whisper
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
# Now correctly includes cached whisper transcript in enhanced output!
```
## Files Modified
- `meetus/workflow.py` - Reordered logic in `_run_whisper()` method (lines 172-181)


@@ -0,0 +1,124 @@
# 05 - Reference Frame Files Instead of Embedding
## Date
2025-10-28
## Context
Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.
## Problem
- Enhanced transcript with embedded base64 images was 3.7MB
- Large file size makes it slow to read/process
- Difficult to inspect individual frames
- Harder to share and version control
## Solution: Reference Frame Paths
Instead of embedding base64 image data, reference the frame files by their relative paths.
### Before (Embedded):
```
[00:08] SCREEN CONTENT:
IMAGE (base64, 85KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
```
File size: 3.7MB
### After (Referenced):
```
[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg
```
File size: ~50KB
## Implementation
**Directory Structure:**
```
output/20251028-003-zaca-run-scrapers/
├── frames/
│ ├── zaca-run-scrapers_00257.jpg
│ ├── zaca-run-scrapers_00487.jpg
│ └── ...
├── zaca-run-scrapers.json (whisper transcript)
└── zaca-run-scrapers_enhanced.txt (references frames/ directory)
```
**Enhanced Transcript Format:**
```
================================================================================
ENHANCED MEETING TRANSCRIPT
Audio transcript + Screen frames
================================================================================
[00:30] SPEAKER:
Bueno, te dio un tour para el proyecto...
[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg
[01:00] SPEAKER:
Mayormente en Scrapping lo que tenemos...
[01:15] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00487.jpg
TEXT:
| Code snippet from screen (if OCR was used)
```
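
To show how a consumer might use the referenced format, here is a hedged sketch that collects `(seconds, frame_path)` pairs from an enhanced transcript. It assumes `[MM:SS]` timestamps and a `Frame:` line directly under each `SCREEN CONTENT` header, as in the example above; `frame_references` is a hypothetical helper.

```python
import re
from typing import List, Optional, Tuple

# Matches headers such as "[01:15] SCREEN CONTENT:" (MM:SS assumed).
HEADER = re.compile(r"^\[(\d{2}):(\d{2})\] SCREEN CONTENT:$")


def frame_references(transcript: str) -> List[Tuple[int, str]]:
    """Return (timestamp_in_seconds, relative_frame_path) for each screen entry."""
    refs: List[Tuple[int, str]] = []
    pending_seconds: Optional[int] = None
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            pending_seconds = int(match.group(1)) * 60 + int(match.group(2))
        elif line.startswith("Frame: ") and pending_seconds is not None:
            refs.append((pending_seconds, line[len("Frame: "):]))
            pending_seconds = None
    return refs


sample = (
    "[01:00] SPEAKER:\n"
    "...\n"
    "[01:15] SCREEN CONTENT:\n"
    "Frame: frames/zaca-run-scrapers_00487.jpg\n"
)
print(frame_references(sample))
```

The returned paths are relative to the run's output directory, so a caller would join them with that directory before loading an image.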
## Benefits
- **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
- **Easier to inspect**: Can view individual frames directly
- **LLM can access images**: Frame paths allow LLM to load images on demand
- **Better version control**: Text files are small and diffable
- **Cleaner structure**: Frames organized in dedicated directory
- **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)
## Flags
**`--embed-images`**: Skip OCR/vision analysis, just reference frame files
- Faster (no analysis needed)
- Lets LLM analyze raw images
- Enhanced transcript only contains frame references
**Without `--embed-images`**: Run OCR/vision analysis
- Extracts text from frames
- Enhanced transcript includes both frame reference AND extracted text
- Useful for code/dashboard analysis
## Usage
```bash
# Reference frames only (no OCR, faster)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v
# Reference frames + OCR text extraction
python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v
# Adjust frame quality (smaller files)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
```
## Files Modified
- `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
- `process_meeting.py` - Updated help text and examples to reflect frame referencing
- All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)
## Workflow Example
```bash
# First run: Generate everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v
# Result:
# - output/20251028-004-meeting/
# - frames/ (40 frames, ~80KB each)
# - meeting.json (whisper transcript)
# - meeting_enhanced.txt (~50KB, references frames/)
# LLM can now:
# 1. Read enhanced transcript
# 2. See timeline of audio + screen changes
# 3. Load individual frames as needed from frames/ directory
```

162
meetus/cache_manager.py Normal file

@@ -0,0 +1,162 @@
"""
Manage caching for frames, transcripts, and analysis results.
"""
from pathlib import Path
import json
import logging
from typing import List, Tuple, Dict, Optional
logger = logging.getLogger(__name__)
class CacheManager:
    """Manage caching of intermediate processing results."""

    def __init__(self, output_dir: Path, frames_dir: Path, video_name: str, use_cache: bool = True,
                 skip_cache_frames: bool = False, skip_cache_whisper: bool = False,
                 skip_cache_analysis: bool = False):
        """
        Initialize cache manager.

        Args:
            output_dir: Output directory for cached files
            frames_dir: Directory for cached frames
            video_name: Name of the video (stem)
            use_cache: Whether to use caching globally
            skip_cache_frames: Skip cached frames specifically
            skip_cache_whisper: Skip cached whisper specifically
            skip_cache_analysis: Skip cached analysis specifically
        """
        self.output_dir = output_dir
        self.frames_dir = frames_dir
        self.video_name = video_name
        self.use_cache = use_cache
        self.skip_cache_frames = skip_cache_frames
        self.skip_cache_whisper = skip_cache_whisper
        self.skip_cache_analysis = skip_cache_analysis

    def get_whisper_cache(self) -> Optional[Path]:
        """
        Check for cached Whisper transcript.

        Returns:
            Path to cached transcript or None
        """
        if not self.use_cache or self.skip_cache_whisper:
            return None

        cache_path = self.output_dir / f"{self.video_name}.json"
        if cache_path.exists():
            logger.info(f"✓ Found cached Whisper transcript: {cache_path.name}")

            # Debug: Show cached transcript info (json is already imported at module level)
            try:
                with open(cache_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                if 'segments' in data:
                    logger.debug(f"Cached transcript has {len(data['segments'])} segments")
            except Exception as e:
                logger.debug(f"Could not parse cached whisper for debug: {e}")

            return cache_path

        return None

    def get_frames_cache(self) -> Optional[List[Tuple[str, float]]]:
        """
        Check for cached frames.

        Returns:
            List of (frame_path, timestamp) tuples or None
        """
        if not self.use_cache or self.skip_cache_frames or not self.frames_dir.exists():
            return None

        existing_frames = list(self.frames_dir.glob("*.jpg"))
        if not existing_frames:
            return None

        logger.info(f"✓ Found {len(existing_frames)} cached frames in {self.frames_dir.name}/")
        logger.debug(f"Frame filenames: {[f.name for f in sorted(existing_frames)[:3]]}...")

        # Build frames_info from existing files
        frames_info = []
        for frame_path in sorted(existing_frames):
            # Try to extract timestamp from filename (e.g., frame_00001_12.34s.jpg)
            try:
                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
                timestamp = float(timestamp_str)
            except ValueError:
                # Last filename component is not numeric; fall back to 0.0
                timestamp = 0.0
            frames_info.append((str(frame_path), timestamp))

        return frames_info

    def get_analysis_cache(self, analysis_type: str) -> Optional[List[Dict]]:
        """
        Check for cached analysis results.

        Args:
            analysis_type: 'vision' or 'ocr'

        Returns:
            List of analysis results or None
        """
        if not self.use_cache or self.skip_cache_analysis:
            return None

        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"
        if cache_path.exists():
            logger.info(f"✓ Found cached {analysis_type} analysis: {cache_path.name}")
            with open(cache_path, 'r', encoding='utf-8') as f:
                results = json.load(f)
            logger.info(f"✓ Loaded {len(results)} analyzed frames from cache")

            # Debug: Show first cached result
            if results:
                logger.debug(f"First cached result: timestamp={results[0].get('timestamp')}, text_length={len(results[0].get('text', ''))}")

            return results

        return None

    def save_analysis(self, analysis_type: str, results: List[Dict]):
        """
        Save analysis results to cache.

        Args:
            analysis_type: 'vision' or 'ocr'
            results: Analysis results to save
        """
        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"
        with open(cache_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        logger.info(f"✓ Saved {analysis_type} analysis to: {cache_path.name}")

    def cache_exists(self, analysis_type: Optional[str] = None) -> Dict[str, bool]:
        """
        Check what caches exist.

        Args:
            analysis_type: Optional specific analysis type to check

        Returns:
            Dictionary of cache status
        """
        status = {
            "whisper": (self.output_dir / f"{self.video_name}.json").exists(),
            # Match any cached frame, as get_frames_cache() does; scene-detection
            # frames are named <video>_NNNNN.jpg and would be missed by "frame_*.jpg"
            "frames": self.frames_dir.exists() and any(self.frames_dir.glob("*.jpg")),
        }

        if analysis_type:
            status[analysis_type] = (self.output_dir / f"{self.video_name}_{analysis_type}.json").exists()
        else:
            status["vision"] = (self.output_dir / f"{self.video_name}_vision.json").exists()
            status["ocr"] = (self.output_dir / f"{self.video_name}_ocr.json").exists()

        return status


@@ -6,9 +6,9 @@ import cv2
import os
from pathlib import Path
from typing import List, Tuple, Optional
import subprocess
import json
import logging
import re
logger = logging.getLogger(__name__)
@@ -16,17 +16,19 @@ logger = logging.getLogger(__name__)
class FrameExtractor:
"""Extract frames from video files."""
def __init__(self, video_path: str, output_dir: str = "frames"):
def __init__(self, video_path: str, output_dir: str = "frames", quality: int = 75):
"""
Initialize frame extractor.
Args:
video_path: Path to video file
output_dir: Directory to save extracted frames
quality: JPEG quality for saved frames (0-100)
"""
self.video_path = video_path
self.output_dir = Path(output_dir)
self.output_dir.mkdir(parents=True, exist_ok=True)
self.quality = quality
def extract_by_interval(self, interval_seconds: int = 5) -> List[Tuple[str, float]]:
"""
@@ -56,7 +58,16 @@ class FrameExtractor:
frame_filename = f"frame_{saved_count:05d}_{timestamp:.2f}s.jpg"
frame_path = self.output_dir / frame_filename
cv2.imwrite(str(frame_path), frame)
# Downscale to 1600px width for smaller file size (but still readable)
height, width = frame.shape[:2]
if width > 1600:
ratio = 1600 / width
new_width = 1600
new_height = int(height * ratio)
frame = cv2.resize(frame, (new_width, new_height), interpolation=cv2.INTER_LANCZOS4)
# Save with configured quality (matches embed quality)
cv2.imwrite(str(frame_path), frame, [cv2.IMWRITE_JPEG_QUALITY, self.quality])
frames_info.append((str(frame_path), timestamp))
saved_count += 1
@@ -66,48 +77,80 @@ class FrameExtractor:
logger.info(f"Extracted {saved_count} frames at {interval_seconds}s intervals")
return frames_info
def extract_scene_changes(self, threshold: float = 30.0) -> List[Tuple[str, float]]:
def extract_scene_changes(self, threshold: float = 15.0) -> List[Tuple[str, float]]:
"""
Extract frames only on scene changes using FFmpeg.
More efficient than interval-based extraction.
Args:
threshold: Scene change detection threshold (0-100, lower = more sensitive)
Default: 15.0 (good for clean UIs like Zed)
Higher values (20-30) for busy UIs like VS Code
Lower values (5-10) for very subtle changes
Returns:
List of (frame_path, timestamp) tuples
"""
try:
import ffmpeg
except ImportError:
raise ImportError("ffmpeg-python not installed. Run: pip install ffmpeg-python")
video_name = Path(self.video_path).stem
output_pattern = self.output_dir / f"{video_name}_%05d.jpg"
# Use FFmpeg's scene detection filter
cmd = [
'ffmpeg',
'-i', self.video_path,
'-vf', f'select=gt(scene\\,{threshold/100}),showinfo',
'-vsync', 'vfr',
'-frame_pts', '1',
str(output_pattern),
'-loglevel', 'info'
]
try:
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
# Use FFmpeg's scene detection filter with downscaling
stream = ffmpeg.input(self.video_path)
stream = ffmpeg.filter(stream, 'select', f'gt(scene,{threshold/100})')
stream = ffmpeg.filter(stream, 'showinfo')
# Scale to 1600px width (maintains aspect ratio, still readable)
# Use simple conditional: if width > 1600, scale to 1600, else keep original
stream = ffmpeg.filter(stream, 'scale', w='min(1600,iw)', h=-1)
# Parse output to get frame timestamps
# Convert JPEG quality (0-100) to FFmpeg qscale (2-31, lower=better)
# Rough mapping: qscale ≈ (100 - quality) / 10 + 2, clamped to 2-31 (e.g. quality 75 → 4, quality 60 → 6)
qscale = max(2, min(31, int((100 - self.quality) / 10 + 2)))
stream = ffmpeg.output(
stream,
str(output_pattern),
vsync='vfr',
frame_pts=1,
**{'q:v': str(qscale)} # Matches configured quality
)
# Run with stderr capture to get showinfo output
_, stderr = ffmpeg.run(stream, capture_stderr=True, overwrite_output=True)
stderr = stderr.decode('utf-8')
# Parse FFmpeg output to get frame timestamps from showinfo filter
frames_info = []
for img in sorted(self.output_dir.glob(f"{video_name}_*.jpg")):
# Extract timestamp from filename or use FFprobe
frames_info.append((str(img), 0.0)) # Timestamp extraction can be enhanced
# Extract timestamps from stderr (showinfo outputs there)
timestamp_pattern = r'pts_time:([\d.]+)'
timestamps = re.findall(timestamp_pattern, stderr)
# Match frames to timestamps
frame_files = sorted(self.output_dir.glob(f"{video_name}_*.jpg"))
for idx, img in enumerate(frame_files):
# Use extracted timestamp or fallback to index-based estimate
timestamp = float(timestamps[idx]) if idx < len(timestamps) else idx * 5.0
frames_info.append((str(img), timestamp))
logger.info(f"Extracted {len(frames_info)} frames at scene changes")
return frames_info
except subprocess.CalledProcessError as e:
logger.error(f"FFmpeg error: {e.stderr}")
except ffmpeg.Error as e:
logger.error(f"FFmpeg error: {e.stderr.decode() if e.stderr else str(e)}")
# Fallback to interval extraction
logger.warning("Falling back to interval extraction...")
return self.extract_by_interval()
except Exception as e:
logger.error(f"Unexpected error during scene extraction: {e}")
logger.warning("Falling back to interval extraction...")
return self.extract_by_interval()
def get_video_duration(self) -> float:
"""Get video duration in seconds."""

355
meetus/hybrid_processor.py Normal file

@@ -0,0 +1,355 @@
"""
Hybrid frame analysis: OpenCV text detection + OCR for accurate extraction.
Better than pure vision models which tend to hallucinate text content.
"""
from typing import List, Tuple, Dict, Optional
from pathlib import Path
import logging
import cv2
import numpy as np
from difflib import SequenceMatcher
logger = logging.getLogger(__name__)
class HybridProcessor:
    """Combine OpenCV text detection with OCR for accurate text extraction."""

    def __init__(self, ocr_engine: str = "tesseract", min_confidence: float = 0.5,
                 use_llm_cleanup: bool = False, llm_model: Optional[str] = None):
        """
        Initialize hybrid processor.

        Args:
            ocr_engine: OCR engine to use ('tesseract', 'easyocr', 'paddleocr')
            min_confidence: Minimum confidence for text detection (0-1)
            use_llm_cleanup: Use LLM to clean up OCR output and preserve formatting
            llm_model: Ollama model for cleanup (default: llama3.2:3b for speed)
        """
        from .ocr_processor import OCRProcessor

        self.ocr = OCRProcessor(engine=ocr_engine)
        self.min_confidence = min_confidence
        self.use_llm_cleanup = use_llm_cleanup
        self.llm_model = llm_model or "llama3.2:3b"
        self._llm_client = None

        if use_llm_cleanup:
            self._init_llm()

    def _init_llm(self):
        """Initialize Ollama client for LLM cleanup."""
        try:
            import ollama
            self._llm_client = ollama
            logger.info(f"LLM cleanup enabled using {self.llm_model}")
        except ImportError:
            logger.warning("ollama package not installed. LLM cleanup disabled.")
            self.use_llm_cleanup = False

    def _cleanup_with_llm(self, raw_text: str) -> str:
        """
        Use LLM to clean up OCR output and preserve code formatting.

        Args:
            raw_text: Raw OCR output

        Returns:
            Cleaned up text with proper formatting
        """
        if not self.use_llm_cleanup or not self._llm_client:
            return raw_text

        prompt = """You are cleaning up OCR output from a code editor screenshot.
Your task:
1. Fix any obvious OCR errors (l→1, O→0, etc.)
2. Preserve or restore code indentation and structure
3. Keep the exact text content - don't add explanations or comments
4. If it's code, maintain proper spacing and formatting
5. Return ONLY the cleaned text, nothing else
OCR Text:
"""

        try:
            response = self._llm_client.generate(
                model=self.llm_model,
                prompt=prompt + raw_text,
                options={"temperature": 0.1}  # Low temperature for accuracy
            )
            cleaned = response['response'].strip()
            logger.debug(f"LLM cleanup: {len(raw_text)} → {len(cleaned)} chars")
            return cleaned
        except Exception as e:
            logger.warning(f"LLM cleanup failed: {e}, using raw OCR output")
            return raw_text

    def detect_text_regions(self, image_path: str, min_area: int = 100) -> List[Tuple[int, int, int, int]]:
        """
        Detect text regions in image using OpenCV.

        Args:
            image_path: Path to image file
            min_area: Minimum area for text region (pixels)

        Returns:
            List of bounding boxes (x, y, w, h)
        """
        # Read image
        img = cv2.imread(image_path)
        if img is None:
            logger.warning(f"Could not read image: {image_path}")
            return []

        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

        # Method 1: Morphological operations to find text regions
        # Works well for solid text blocks
        regions = self._detect_by_morphology(gray, min_area)

        if not regions:
            logger.debug(f"No text regions detected in {Path(image_path).name}")

        return regions

    def _detect_by_morphology(self, gray: np.ndarray, min_area: int) -> List[Tuple[int, int, int, int]]:
        """
        Detect text regions using morphological operations.
        Fast and works well for solid text blocks (code editors, terminals).

        Args:
            gray: Grayscale image
            min_area: Minimum area for region

        Returns:
            List of bounding boxes (x, y, w, h)
        """
        # Apply adaptive threshold to handle varying lighting
        binary = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY_INV, 11, 2
        )

        # Morphological operations to connect text regions
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))  # Horizontal kernel for text lines
        dilated = cv2.dilate(binary, kernel, iterations=2)

        # Find contours
        contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        # Filter and extract bounding boxes
        regions = []
        for contour in contours:
            x, y, w, h = cv2.boundingRect(contour)
            area = w * h
            # Filter by area and minimum width/height
            if area > min_area and w > 20 and h > 10:  # Reasonable text dimensions
                regions.append((x, y, w, h))

        # Merge overlapping regions
        regions = self._merge_overlapping_regions(regions)

        logger.debug(f"Detected {len(regions)} text regions using morphology")
        return regions

    def _merge_overlapping_regions(
        self, regions: List[Tuple[int, int, int, int]],
        overlap_threshold: float = 0.3
    ) -> List[Tuple[int, int, int, int]]:
        """
        Merge overlapping bounding boxes.

        Args:
            regions: List of (x, y, w, h) tuples
            overlap_threshold: Minimum overlap ratio to merge

        Returns:
            Merged regions
        """
        if not regions:
            return []

        # Sort by y-coordinate (top to bottom)
        regions = sorted(regions, key=lambda r: r[1])

        merged = []
        current = list(regions[0])

        for region in regions[1:]:
            x, y, w, h = region
            cx, cy, cw, ch = current

            # Check for overlap
            x_overlap = max(0, min(cx + cw, x + w) - max(cx, x))
            y_overlap = max(0, min(cy + ch, y + h) - max(cy, y))
            overlap_area = x_overlap * y_overlap

            current_area = cw * ch
            region_area = w * h
            min_area = min(current_area, region_area)

            if overlap_area / min_area > overlap_threshold:
                # Merge regions into their common bounding box
                new_x = min(cx, x)
                new_y = min(cy, y)
                new_x2 = max(cx + cw, x + w)
                new_y2 = max(cy + ch, y + h)
                current = [new_x, new_y, new_x2 - new_x, new_y2 - new_y]
            else:
                merged.append(tuple(current))
                current = list(region)

        merged.append(tuple(current))
        return merged

    def extract_text_from_region(self, image_path: str, region: Tuple[int, int, int, int]) -> str:
        """
        Extract text from a specific region using OCR.

        Args:
            image_path: Path to image file
            region: Bounding box (x, y, w, h)

        Returns:
            Extracted text
        """
        from PIL import Image
        import tempfile

        # Load image and crop region
        img = Image.open(image_path)
        x, y, w, h = region
        cropped = img.crop((x, y, x + w, y + h))

        # Reserve a temp file name, then close the handle before writing so the
        # OCR engine can reopen the file (required on Windows)
        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
            tmp_path = Path(tmp.name)

        try:
            cropped.save(tmp_path)
            text = self.ocr.extract_text(str(tmp_path))
        finally:
            # Clean up temp file even if OCR fails
            tmp_path.unlink()

        return text

    def analyze_frame(self, image_path: str) -> str:
        """
        Analyze a frame: detect text regions and OCR them.

        Args:
            image_path: Path to image file

        Returns:
            Combined text from all detected regions
        """
        # Detect text regions
        regions = self.detect_text_regions(image_path)

        if not regions:
            # Fallback to full-frame OCR if no regions detected
            logger.debug(f"No regions detected, using full-frame OCR for {Path(image_path).name}")
            raw_text = self.ocr.extract_text(image_path)
            return self._cleanup_with_llm(raw_text) if self.use_llm_cleanup else raw_text

        # Sort regions by reading order (top-to-bottom, left-to-right)
        regions = self._sort_regions_by_reading_order(regions)

        # Extract text from each region
        texts = []
        for idx, region in enumerate(regions):
            x, y, w, h = region
            text = self.extract_text_from_region(image_path, region)
            if text.strip():
                # Add visual separator with region info
                section_header = f"[Region {idx+1} at y={y}]"
                texts.append(f"{section_header}\n{text.strip()}")
                logger.debug(f"Region {idx+1}/{len(regions)} (y={y}): Extracted {len(text)} chars")

        combined = ("\n\n" + "="*60 + "\n\n").join(texts)
        logger.debug(f"Total extracted from {len(regions)} regions: {len(combined)} chars")

        # Apply LLM cleanup if enabled
        if self.use_llm_cleanup:
            combined = self._cleanup_with_llm(combined)

        return combined

    def _sort_regions_by_reading_order(self, regions: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int, int, int]]:
        """
        Sort regions in reading order (top-to-bottom, left-to-right).

        Args:
            regions: List of (x, y, w, h) tuples

        Returns:
            Sorted regions
        """
        # Sort primarily by y (top to bottom), secondarily by x (left to right)
        # Group regions that fall in the same 20px horizontal band
        sorted_regions = sorted(regions, key=lambda r: (r[1] // 20, r[0]))
        return sorted_regions

    def process_frames(
        self,
        frames_info: List[Tuple[str, float]],
        deduplicate: bool = True,
        similarity_threshold: float = 0.85
    ) -> List[Dict]:
        """
        Process multiple frames with hybrid analysis.

        Args:
            frames_info: List of (frame_path, timestamp) tuples
            deduplicate: Whether to remove similar consecutive analyses
            similarity_threshold: Threshold for considering analyses as duplicates (0-1)

        Returns:
            List of dicts with 'timestamp', 'text', and 'frame_path'
        """
        results = []
        prev_text = ""

        total = len(frames_info)
        logger.info(f"Starting hybrid analysis of {total} frames...")

        for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
            logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
            text = self.analyze_frame(frame_path)

            if not text:
                logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
                continue

            # Debug: Show what was extracted
            logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
            logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")

            # Deduplicate similar consecutive frames
            if deduplicate and prev_text:
                similarity = self._text_similarity(prev_text, text)
                logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
                if similarity > similarity_threshold:
                    logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue

            results.append({
                'timestamp': timestamp,
                'text': text,
                'frame_path': frame_path
            })
            prev_text = text

        logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
        return results

    def _text_similarity(self, text1: str, text2: str) -> float:
        """
        Calculate similarity between two texts.

        Returns:
            Similarity score between 0 and 1
        """
        return SequenceMatcher(None, text1, text2).ratio()


@@ -53,20 +53,25 @@ class OCRProcessor:
else:
raise ValueError(f"Unknown OCR engine: {self.engine}")
def extract_text(self, image_path: str) -> str:
def extract_text(self, image_path: str, preserve_layout: bool = True) -> str:
"""
Extract text from a single image.
Args:
image_path: Path to image file
preserve_layout: Try to preserve whitespace and layout
Returns:
Extracted text
"""
if self.engine == "tesseract":
from PIL import Image
import pytesseract
image = Image.open(image_path)
text = self._ocr_engine.image_to_string(image)
# Use PSM 6 (uniform block of text) to preserve layout better
config = '--psm 6' if preserve_layout else ''
text = pytesseract.image_to_string(image, config=config)
elif self.engine == "easyocr":
result = self._ocr_engine.readtext(image_path, detail=0)
@@ -81,12 +86,31 @@ class OCRProcessor:
return self._clean_text(text)
def _clean_text(self, text: str) -> str:
"""Clean up OCR output."""
# Remove excessive whitespace
text = re.sub(r'\n\s*\n', '\n', text)
text = re.sub(r' +', ' ', text)
return text.strip()
def _clean_text(self, text: str, preserve_indentation: bool = True) -> str:
"""
Clean up OCR output.
Args:
text: Raw OCR text
preserve_indentation: Keep leading whitespace on lines
Returns:
Cleaned text
"""
if preserve_indentation:
# Remove excessive blank lines but preserve indentation
lines = text.split('\n')
cleaned_lines = []
for line in lines:
# Keep line if it has content or is single empty line
if line.strip() or (cleaned_lines and cleaned_lines[-1].strip()):
cleaned_lines.append(line)
return '\n'.join(cleaned_lines).strip()
else:
# Original aggressive cleaning
text = re.sub(r'\n\s*\n', '\n', text)
text = re.sub(r' +', ' ', text)
return text.strip()
def process_frames(
self,
@@ -108,18 +132,24 @@ class OCRProcessor:
results = []
prev_text = ""
for frame_path, timestamp in frames_info:
logger.debug(f"Processing frame at {timestamp:.2f}s...")
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
logger.debug(f"Processing frame {idx}/{len(frames_info)} at {timestamp:.2f}s...")
text = self.extract_text(frame_path)
if not text:
logger.debug(f"No text extracted from frame at {timestamp:.2f}s")
continue
# Debug: Show what was extracted
logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
# Deduplicate similar consecutive frames
if deduplicate:
if deduplicate and prev_text:
similarity = self._text_similarity(prev_text, text)
logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
if similarity > similarity_threshold:
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
continue
results.append({

155
meetus/output_manager.py Normal file

@@ -0,0 +1,155 @@
"""
Manage output directories and manifest files.
Creates dated, numbered folders (YYYYMMDD-NNN-videoname) for each video and tracks processing options.
"""
from pathlib import Path
from datetime import datetime
import json
import logging
from typing import Dict, Any, Optional
logger = logging.getLogger(__name__)
class OutputManager:
    """Manage output directories and manifest files for video processing."""

    def __init__(self, video_path: Path, base_output_dir: str = "output", use_cache: bool = True):
        """
        Initialize output manager.

        Args:
            video_path: Path to the video file being processed
            base_output_dir: Base directory for all outputs
            use_cache: Whether to use existing directories if found
        """
        self.video_path = video_path
        self.base_output_dir = Path(base_output_dir)
        self.use_cache = use_cache

        # Find or create output directory
        self.output_dir = self._get_or_create_output_dir()
        self.frames_dir = self.output_dir / "frames"
        self.frames_dir.mkdir(exist_ok=True)

        logger.info(f"Output directory: {self.output_dir}")

    def _get_or_create_output_dir(self) -> Path:
        """
        Get existing output directory or create a new one with incremental number.

        Returns:
            Path to output directory
        """
        video_name = self.video_path.stem

        # Look for existing directories if caching is enabled
        if self.use_cache and self.base_output_dir.exists():
            existing_dirs = sorted([
                d for d in self.base_output_dir.iterdir()
                if d.is_dir() and d.name.endswith(f"-{video_name}")
            ], reverse=True)  # Most recent first

            if existing_dirs:
                logger.info(f"Found existing output: {existing_dirs[0].name}")
                return existing_dirs[0]

        # Create new directory with date + incremental number
        date_str = datetime.now().strftime("%Y%m%d")

        # Find existing runs for today
        if self.base_output_dir.exists():
            existing_today = [
                d for d in self.base_output_dir.iterdir()
                if d.is_dir() and d.name.startswith(date_str) and d.name.endswith(f"-{video_name}")
            ]

            # Extract run numbers and find max
            run_numbers = []
            for d in existing_today:
                # Format: YYYYMMDD-NNN-videoname
                parts = d.name.split('-')
                if len(parts) >= 2 and parts[1].isdigit():
                    run_numbers.append(int(parts[1]))

            next_run = max(run_numbers) + 1 if run_numbers else 1
        else:
            next_run = 1

        dir_name = f"{date_str}-{next_run:03d}-{video_name}"
        output_dir = self.base_output_dir / dir_name
        output_dir.mkdir(parents=True, exist_ok=True)
        logger.info(f"Created new output directory: {dir_name}")

        return output_dir

    def get_path(self, filename: str) -> Path:
        """Get full path for a file in the output directory."""
        return self.output_dir / filename

    def get_frames_path(self, filename: str) -> Path:
        """Get full path for a file in the frames directory."""
        return self.frames_dir / filename

    def save_manifest(self, config: Dict[str, Any]):
        """
        Save processing configuration to manifest.json.

        Args:
            config: Dictionary of processing options
        """
        manifest_path = self.output_dir / "manifest.json"

        manifest = {
            "video": {
                "name": self.video_path.name,
                "path": str(self.video_path.absolute()),
            },
            "processed_at": datetime.now().isoformat(),
            "configuration": config,
            "outputs": {
                "frames": str(self.frames_dir.relative_to(self.output_dir)),
                "enhanced_transcript": f"{self.video_path.stem}_enhanced.txt",
                "whisper_transcript": f"{self.video_path.stem}.json" if config.get("run_whisper") else None,
                "analysis": f"{self.video_path.stem}_{'vision' if config.get('use_vision') else 'ocr'}.json"
            }
        }

        with open(manifest_path, 'w', encoding='utf-8') as f:
            json.dump(manifest, f, indent=2, ensure_ascii=False)

        logger.info(f"Saved manifest: {manifest_path}")
def load_manifest(self) -> Optional[Dict[str, Any]]:
"""
Load existing manifest if it exists.
Returns:
Manifest dictionary or None
"""
manifest_path = self.output_dir / "manifest.json"
if manifest_path.exists():
with open(manifest_path, 'r', encoding='utf-8') as f:
return json.load(f)
return None
def list_outputs(self) -> Dict[str, Any]:
"""
List all output files in the directory.
Returns:
Dictionary of output files and their status
"""
video_name = self.video_path.stem
return {
"output_dir": str(self.output_dir),
"manifest": (self.output_dir / "manifest.json").exists(),
"enhanced_transcript": (self.output_dir / f"{video_name}_enhanced.txt").exists(),
"whisper_transcript": (self.output_dir / f"{video_name}.json").exists(),
"vision_analysis": (self.output_dir / f"{video_name}_vision.json").exists(),
"ocr_analysis": (self.output_dir / f"{video_name}_ocr.json").exists(),
"hybrid_analysis": (self.output_dir / f"{video_name}_hybrid.json").exists(),
"frames": len(list(self.frames_dir.glob("*.jpg"))) if self.frames_dir.exists() else 0
}
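
The run-directory naming used by `_get_or_create_output_dir` (`YYYYMMDD-NNN-videoname`) can be sketched in isolation as follows. This is a minimal illustration, not the module's code: `next_run_dir_name` and its `today` parameter are hypothetical names, and the sketch compares the video name exactly rather than with `endswith`, which is an assumption about the desired behavior.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional
import tempfile


def next_run_dir_name(base_dir: Path, video_name: str, today: Optional[str] = None) -> str:
    """Return the next YYYYMMDD-NNN-<video_name> directory name under base_dir."""
    date_str = today or datetime.now().strftime("%Y%m%d")
    run_numbers = []
    if base_dir.exists():
        for d in base_dir.iterdir():
            # Split into date, run number, and the rest (the video name may contain '-')
            parts = d.name.split("-", 2)
            if (d.is_dir() and len(parts) == 3 and parts[0] == date_str
                    and parts[1].isdigit() and parts[2] == video_name):
                run_numbers.append(int(parts[1]))
    next_run = max(run_numbers, default=0) + 1
    return f"{date_str}-{next_run:03d}-{video_name}"


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "20251028-001-standup").mkdir()
    (base / "20251028-002-standup").mkdir()
    (base / "20251028-001-weekly-standup").mkdir()  # a different video, not counted
    print(next_run_dir_name(base, "standup", today="20251028"))  # 20251028-003-standup
```

Because the run number is zero-padded to three digits, sorting directory names in reverse order puts the most recent run first, which is what the cache lookup relies on.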

5
meetus/prompts/code.txt Normal file

@@ -0,0 +1,5 @@
You are analyzing a code screenshot from a meeting recording.
Provide a brief description of what's being shown (1-2 sentences about the context), then extract the visible code exactly as it appears, preserving all formatting, indentation, and structure.
If there's no code visible, just describe what you see on screen.


@@ -0,0 +1,5 @@
You are analyzing console/terminal output from a meeting recording.
Provide a brief description of what's happening (1-2 sentences), then extract the visible commands and output exactly as shown, preserving formatting.
Include any error messages, warnings, or important status information.


@@ -0,0 +1,5 @@
You are analyzing a dashboard/monitoring panel from a meeting recording.
Provide a brief description of what's being monitored (1-2 sentences), then list the key panels, metrics, and their current values. Include any alerts, warnings, or notable trends.
Keep it concise and focused on the important information.


@@ -0,0 +1,10 @@
You are analyzing a screen capture from a meeting recording.
Provide a brief description of what's being shown (1-2 sentences about the context). Then extract the key information:
- Any visible text, titles, or headings
- Code (preserve exact formatting if present)
- Metrics, data points, or dashboard information
- Terminal/console commands and output
- Application or UI elements
Be concise but capture all important details that help understand what was being discussed.
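
As a rough sketch of how these prompt files are consumed, the snippet below loads a prompt by context name, falls back to a generic prompt when the file is missing, and prepends the audio context when one is supplied. The fallback text and the audio prefix follow `_load_prompt` and `analyze_frame` in the vision processor diff; `build_prompt` and `DEFAULT_PROMPT` are hypothetical names used only for illustration.

```python
from pathlib import Path
import tempfile

# Same fallback text as the vision processor uses when a prompt file is missing
DEFAULT_PROMPT = "Analyze this image and describe what you see in detail."


def build_prompt(prompts_dir: Path, context: str, audio_context: str = "") -> str:
    """Load <context>.txt from prompts_dir and optionally prefix the audio context."""
    prompt_file = prompts_dir / f"{context}.txt"
    if prompt_file.exists():
        prompt = prompt_file.read_text(encoding="utf-8").strip()
    else:
        prompt = DEFAULT_PROMPT
    if audio_context:
        prompt = (
            "Audio context (what's being discussed around this time):\n"
            f"{audio_context}\n\n{prompt}"
        )
    return prompt


with tempfile.TemporaryDirectory() as tmp:
    prompts = Path(tmp)
    (prompts / "code.txt").write_text(
        "You are analyzing a code screenshot from a meeting recording.\n",
        encoding="utf-8",
    )
    print(build_prompt(prompts, "code", audio_context="Let's look at the retry loop."))
    print(build_prompt(prompts, "slides"))  # no slides.txt, so the fallback is used
```

Keeping prompts in text files means a wording change needs no code edit, and a new context only needs a new `<context>.txt` file.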


@@ -6,6 +6,8 @@ from typing import List, Dict, Optional
import json
from pathlib import Path
import logging
import base64
from io import BytesIO
logger = logging.getLogger(__name__)
@@ -13,11 +15,18 @@ logger = logging.getLogger(__name__)
class TranscriptMerger:
"""Merge audio transcripts with screen OCR text."""
def __init__(self):
"""Initialize transcript merger."""
pass
def __init__(self, embed_images: bool = False, embed_quality: int = 80):
"""
Initialize transcript merger.
def load_whisper_transcript(self, transcript_path: str) -> List[Dict]:
Args:
embed_images: Whether to embed frame images as base64
embed_quality: JPEG quality for embedded images (0-100)
"""
self.embed_images = embed_images
self.embed_quality = embed_quality
def load_whisper_transcript(self, transcript_path: str, group_interval: Optional[int] = None) -> List[Dict]:
"""
Load Whisper transcript from file.
@@ -25,6 +34,7 @@ class TranscriptMerger:
Args:
transcript_path: Path to transcript file
group_interval: If specified, group audio segments into intervals (in seconds)
Returns:
List of dicts with 'timestamp' (optional) and 'text'
@@ -35,28 +45,39 @@ class TranscriptMerger:
with open(path, 'r', encoding='utf-8') as f:
data = json.load(f)
# Handle different Whisper output formats
# Handle different Whisper/WhisperX output formats
segments = []
if isinstance(data, dict) and 'segments' in data:
# Standard Whisper JSON format
return [
# Standard Whisper/WhisperX JSON format
segments = [
{
'timestamp': seg.get('start', 0),
'text': seg['text'].strip(),
'speaker': seg.get('speaker'), # WhisperX diarization
'type': 'audio'
}
for seg in data['segments']
]
elif isinstance(data, list):
# List of segments
return [
segments = [
{
'timestamp': seg.get('start', seg.get('timestamp', 0)),
'text': seg['text'].strip(),
'speaker': seg.get('speaker'), # WhisperX diarization
'type': 'audio'
}
for seg in data
]
# Group by interval if requested, but skip if we have speaker diarization
# (merge_transcripts will group by speaker instead)
has_speakers = any(seg.get('speaker') for seg in segments)
if group_interval and segments and not has_speakers:
segments = self.group_audio_by_intervals(segments, group_interval)
return segments
else:
# Plain text file - no timestamps
with open(path, 'r', encoding='utf-8') as f:
@@ -68,6 +89,76 @@ class TranscriptMerger:
'type': 'audio'
}]
def group_audio_by_intervals(self, segments: List[Dict], interval_seconds: int = 30) -> List[Dict]:
"""
Group audio segments into regular time intervals.
Instead of one entry per transcript segment, this creates intervals (e.g., every 30 seconds)
with all text spoken during that interval concatenated together.
Args:
segments: List of audio segments with timestamps
interval_seconds: Duration of each interval in seconds
Returns:
List of grouped segments with interval timestamps
"""
if not segments:
return []
# Find the max timestamp to determine how many intervals we need
max_timestamp = max(seg['timestamp'] for seg in segments)
num_intervals = int(max_timestamp / interval_seconds) + 1
# Create interval buckets
intervals = []
for i in range(num_intervals):
interval_start = i * interval_seconds
interval_end = (i + 1) * interval_seconds
# Collect all text in this interval
texts = []
for seg in segments:
if interval_start <= seg['timestamp'] < interval_end:
texts.append(seg['text'])
# Only create interval if there's text
if texts:
intervals.append({
'timestamp': interval_start,
'text': ' '.join(texts),
'type': 'audio'
})
logger.info(f"Grouped {len(segments)} segments into {len(intervals)} intervals of {interval_seconds}s")
return intervals
def _encode_image_base64(self, image_path: str) -> tuple[str, int]:
"""
Encode image as base64 (image already at target quality/size).
Args:
image_path: Path to image file
Returns:
Tuple of (base64_string, size_in_bytes)
"""
try:
# Read file directly (already at target quality/resolution)
with open(image_path, 'rb') as f:
img_bytes = f.read()
# Encode to base64
b64_string = base64.b64encode(img_bytes).decode('utf-8')
logger.debug(f"Encoded {Path(image_path).name}: {len(img_bytes)} bytes")
return b64_string, len(img_bytes)
except Exception as e:
logger.error(f"Failed to encode image {image_path}: {e}")
return "", 0
def merge_transcripts(
self,
audio_segments: List[Dict],
@@ -75,13 +166,14 @@ class TranscriptMerger:
) -> List[Dict]:
"""
Merge audio and screen transcripts by timestamp.
Groups consecutive audio from same speaker until a screen frame interrupts.
Args:
audio_segments: List of audio transcript segments
screen_segments: List of screen OCR segments
Returns:
Merged list sorted by timestamp
Merged list sorted by timestamp, with audio grouped by speaker
"""
# Mark segment types
for seg in audio_segments:
@@ -93,7 +185,46 @@ class TranscriptMerger:
all_segments = audio_segments + screen_segments
all_segments.sort(key=lambda x: x['timestamp'])
return all_segments
# Group consecutive audio segments by speaker (screen frames break groups)
grouped = []
current_group = None
for seg in all_segments:
if seg['type'] == 'screen':
# Screen frame: flush current group and add frame
if current_group:
grouped.append(current_group)
current_group = None
grouped.append(seg)
else:
# Audio segment
speaker = seg.get('speaker')
if current_group is None:
# Start new group
current_group = {
'timestamp': seg['timestamp'],
'text': seg['text'],
'speaker': speaker,
'type': 'audio'
}
elif speaker == current_group.get('speaker'):
# Same speaker, append text
current_group['text'] += ' ' + seg['text']
else:
# Speaker changed, flush and start new group
grouped.append(current_group)
current_group = {
'timestamp': seg['timestamp'],
'text': seg['text'],
'speaker': speaker,
'type': 'audio'
}
# Don't forget last group
if current_group:
grouped.append(current_group)
return grouped
def format_for_claude(
self,
@@ -120,7 +251,7 @@ class TranscriptMerger:
lines = []
lines.append("=" * 80)
lines.append("ENHANCED MEETING TRANSCRIPT")
lines.append("Audio transcript + Screen content")
lines.append("Audio transcript + Screen frames")
lines.append("=" * 80)
lines.append("")
@@ -128,15 +259,27 @@ class TranscriptMerger:
timestamp = self._format_timestamp(seg['timestamp'])
if seg['type'] == 'audio':
lines.append(f"[{timestamp}] SPEAKER:")
speaker = seg.get('speaker') or 'SPEAKER'  # 'speaker' key is present but None without diarization
lines.append(f"[{timestamp}] {speaker}:")
lines.append(f" {seg['text']}")
lines.append("")
else: # screen
lines.append(f"[{timestamp}] SCREEN CONTENT:")
# Indent screen text for visibility
screen_text = seg['text'].replace('\n', '\n | ')
lines.append(f" | {screen_text}")
# Show frame path if available
if 'frame_path' in seg:
# Get just the filename relative to the enhanced transcript
frame_path = Path(seg['frame_path'])
relative_path = f"frames/{frame_path.name}"
lines.append(f" Frame: {relative_path}")
# Include text content if available (fallback or additional context)
if 'text' in seg and seg['text'].strip():
screen_text = seg['text'].replace('\n', '\n | ')
lines.append(f" TEXT:")
lines.append(f" | {screen_text}")
lines.append("")
return "\n".join(lines)
@@ -147,7 +290,10 @@ class TranscriptMerger:
for seg in segments:
timestamp = self._format_timestamp(seg['timestamp'])
prefix = "SPEAKER" if seg['type'] == 'audio' else "SCREEN"
if seg['type'] == 'audio':
prefix = seg.get('speaker') or 'SPEAKER'  # 'speaker' key is present but None without diarization
else:
prefix = "SCREEN"
text = seg['text'].replace('\n', ' ')[:200] # Truncate long screen text
lines.append(f"[{timestamp}] {prefix}: {text}")
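
The grouping rule in `merge_transcripts` (consecutive audio from the same speaker is concatenated, and any screen frame ends the current group) can be tried on a small input. The following is a simplified sketch rather than the method itself: `group_by_speaker` is a hypothetical standalone function, and the segment values are made up for illustration.

```python
from typing import Dict, List


def group_by_speaker(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive same-speaker audio segments; screen frames break groups."""
    grouped: List[Dict] = []
    current = None
    for seg in sorted(segments, key=lambda s: s["timestamp"]):
        if seg["type"] == "screen":
            # A screen frame flushes the open audio group and is emitted as-is
            if current:
                grouped.append(current)
                current = None
            grouped.append(seg)
        elif current is not None and seg.get("speaker") == current["speaker"]:
            current["text"] += " " + seg["text"]
        else:
            # First audio segment, or the speaker changed
            if current:
                grouped.append(current)
            current = {
                "timestamp": seg["timestamp"],
                "text": seg["text"],
                "speaker": seg.get("speaker"),
                "type": "audio",
            }
    if current:
        grouped.append(current)
    return grouped


segments = [
    {"timestamp": 0.0, "text": "Hi everyone,", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 4.0, "text": "let's start.", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 6.0, "text": "", "frame_path": "frames/frame_0001.jpg", "type": "screen"},
    {"timestamp": 8.0, "text": "Here is the dashboard.", "speaker": "SPEAKER_00", "type": "audio"},
    {"timestamp": 12.0, "text": "Looks good.", "speaker": "SPEAKER_01", "type": "audio"},
]

for entry in group_by_speaker(segments):
    label = (entry.get("speaker") or "SPEAKER") if entry["type"] == "audio" else "SCREEN"
    print(f"[{entry['timestamp']:>5.1f}] {label}: {entry['text']}")
```

The first two segments collapse into one entry, while the third `SPEAKER_00` segment stays separate because the screen frame at 6.0 s sits between them.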


@@ -6,6 +6,7 @@ from typing import List, Tuple, Dict, Optional
from pathlib import Path
import logging
from difflib import SequenceMatcher
import os
logger = logging.getLogger(__name__)
@@ -13,15 +14,24 @@ logger = logging.getLogger(__name__)
class VisionProcessor:
"""Process frames using local vision models via Ollama."""
def __init__(self, model: str = "llava:13b"):
def __init__(self, model: str = "llava:13b", prompts_dir: Optional[str] = None):
"""
Initialize vision processor.
Args:
model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
prompts_dir: Directory containing prompt files (default: meetus/prompts/)
"""
self.model = model
self._client = None
# Set prompts directory
if prompts_dir:
self.prompts_dir = Path(prompts_dir)
else:
# Default to meetus/prompts/ relative to this file
self.prompts_dir = Path(__file__).parent / "prompts"
self._init_client()
def _init_client(self):
@@ -53,61 +63,44 @@ class VisionProcessor:
"Also install Ollama: https://ollama.ai/download"
)
def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
def _load_prompt(self, context: str) -> str:
"""
Load prompt from file.
Args:
context: Context name (meeting, dashboard, code, console)
Returns:
Prompt text
"""
prompt_file = self.prompts_dir / f"{context}.txt"
if prompt_file.exists():
with open(prompt_file, 'r', encoding='utf-8') as f:
return f.read().strip()
else:
# Fallback to default prompt
logger.warning(f"Prompt file not found: {prompt_file}, using default")
return "Analyze this image and describe what you see in detail."
def analyze_frame(self, image_path: str, context: str = "meeting", audio_context: str = "") -> str:
"""
Analyze a single frame using local vision model.
Args:
image_path: Path to image file
context: Context hint for analysis (meeting, dashboard, code, console)
audio_context: Optional audio transcript around this timestamp for context
Returns:
Analyzed content description
"""
# Context-specific prompts
prompts = {
"meeting": """Analyze this screen capture from a meeting recording. Extract:
1. Any visible text (titles, labels, headings)
2. Key metrics, numbers, or data points shown
3. Dashboard panels or visualizations (describe what they show)
4. Code snippets (preserve formatting and context)
5. Console/terminal output (commands and results)
6. Application names or UI elements
# Load prompt from file
prompt = self._load_prompt(context)
Focus on information that would help someone understand what was being discussed.
Be concise but include all important details. If there's code, preserve it exactly.""",
"dashboard": """Analyze this dashboard/monitoring panel. Extract:
1. Panel titles and metrics names
2. Current values and units
3. Trends (up/down/stable)
4. Alerts or warnings
5. Time ranges shown
6. Any anomalies or notable patterns
Format as structured data.""",
"code": """Analyze this code screenshot. Extract:
1. Programming language
2. File name or path (if visible)
3. Code content (preserve exact formatting)
4. Comments
5. Function/class names
6. Any error messages or warnings
Preserve code exactly as shown.""",
"console": """Analyze this console/terminal output. Extract:
1. Commands executed
2. Output/results
3. Error messages
4. Warnings or status messages
5. File paths or URLs
Preserve formatting and structure."""
}
prompt = prompts.get(context, prompts["meeting"])
# Add audio context if available
if audio_context:
prompt = f"Audio context (what's being discussed around this time):\n{audio_context}\n\n{prompt}"
try:
# Use Ollama's chat API with vision
@@ -135,7 +128,8 @@ Preserve formatting and structure."""
frames_info: List[Tuple[str, float]],
context: str = "meeting",
deduplicate: bool = True,
similarity_threshold: float = 0.85
similarity_threshold: float = 0.85,
audio_segments: Optional[List[Dict]] = None
) -> List[Dict]:
"""
Process multiple frames with vision analysis.
@@ -158,17 +152,25 @@ Preserve formatting and structure."""
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
text = self.analyze_frame(frame_path, context)
# Get audio context around this timestamp (±30 seconds)
audio_context = self._get_audio_context(timestamp, audio_segments, window=30)
text = self.analyze_frame(frame_path, context, audio_context)
if not text:
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
continue
# Debug: Show what was extracted
logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
# Deduplicate similar consecutive frames
if deduplicate:
if deduplicate and prev_text:
similarity = self._text_similarity(prev_text, text)
logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
if similarity > similarity_threshold:
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
continue
results.append({
@@ -182,6 +184,29 @@ Preserve formatting and structure."""
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
return results
def _get_audio_context(self, timestamp: float, audio_segments: Optional[List[Dict]], window: int = 30) -> str:
"""
Get audio transcript around a given timestamp.
Args:
timestamp: Target timestamp in seconds
audio_segments: List of audio segments with 'timestamp' and 'text' keys
window: Time window in seconds (±window around timestamp)
Returns:
Concatenated audio text from the time window
"""
if not audio_segments:
return ""
relevant = [seg for seg in audio_segments
if abs(seg.get('timestamp', 0) - timestamp) <= window]
if not relevant:
return ""
return " ".join([seg['text'] for seg in relevant])
def _text_similarity(self, text1: str, text2: str) -> float:
"""
Calculate similarity between two texts.
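
The two helpers that `process_frames` relies on can be illustrated on their own: picking the audio text within ±`window` seconds of a frame, and skipping a frame whose extracted text is nearly identical to the previous one. This is a hedged sketch. The body of `_text_similarity` is not shown in this diff, so using `difflib.SequenceMatcher.ratio()` is an assumption based on the module's import, and `audio_window` and `drop_near_duplicates` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def audio_window(timestamp: float, audio_segments: List[Dict], window: float = 30.0) -> str:
    """Concatenate the text of segments whose timestamp lies within ±window seconds."""
    return " ".join(
        seg["text"] for seg in audio_segments
        if abs(seg.get("timestamp", 0) - timestamp) <= window
    )


def drop_near_duplicates(texts: List[str], threshold: float = 0.85) -> List[str]:
    """Keep a text only if it differs enough from the last kept text."""
    kept: List[str] = []
    prev = ""
    for text in texts:
        # Assumption: similarity is the SequenceMatcher ratio in [0, 1]
        if prev and SequenceMatcher(None, prev, text).ratio() > threshold:
            continue
        kept.append(text)
        prev = text
    return kept


audio = [
    {"timestamp": 0.0, "text": "Welcome."},
    {"timestamp": 40.0, "text": "Opening the CPU panel."},
    {"timestamp": 75.0, "text": "It spiked at noon."},
    {"timestamp": 120.0, "text": "Next topic."},
]
print(audio_window(50.0, audio))

frames_text = ["CPU usage: 41%", "CPU usage: 42%", "Traceback (most recent call last):"]
print(drop_near_duplicates(frames_text))
```

Here the frame at 50 s picks up the segments at 40 s and 75 s, and the second CPU reading is dropped because its similarity ratio to the first (about 0.93) exceeds the 0.85 threshold.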

523
meetus/workflow.py Normal file

@@ -0,0 +1,523 @@
"""
Orchestrate the video processing workflow.
Coordinates frame extraction, analysis, and transcript merging.
"""
from pathlib import Path
import logging
import os
import subprocess
import shutil
from typing import Dict, Any, Optional
from .output_manager import OutputManager
from .cache_manager import CacheManager
from .frame_extractor import FrameExtractor
from .ocr_processor import OCRProcessor
from .vision_processor import VisionProcessor
from .transcript_merger import TranscriptMerger
logger = logging.getLogger(__name__)
class WorkflowConfig:
"""Configuration for the processing workflow."""
def __init__(self, **kwargs):
"""Initialize configuration from keyword arguments."""
# Video and paths
self.video_path = Path(kwargs['video'])
self.transcript_path = kwargs.get('transcript')
self.output_dir = kwargs.get('output_dir', 'output')
self.custom_output = kwargs.get('output')
# Whisper options
self.run_whisper = kwargs.get('run_whisper', False)
self.whisper_model = kwargs.get('whisper_model', 'medium')
self.diarize = kwargs.get('diarize', False)
# Frame extraction
self.scene_detection = kwargs.get('scene_detection', False)
self.scene_threshold = kwargs.get('scene_threshold', 15.0)
self.interval = kwargs.get('interval', 5)
# Analysis options
self.use_vision = kwargs.get('use_vision', False)
self.use_hybrid = kwargs.get('use_hybrid', False)
self.hybrid_llm_cleanup = kwargs.get('hybrid_llm_cleanup', False)
self.hybrid_llm_model = kwargs.get('hybrid_llm_model', 'llama3.2:3b')
self.vision_model = kwargs.get('vision_model', 'llava:13b')
self.vision_context = kwargs.get('vision_context', 'meeting')
self.ocr_engine = kwargs.get('ocr_engine', 'tesseract')
# Validation: can't use both vision and hybrid
if self.use_vision and self.use_hybrid:
raise ValueError("Cannot use both --use-vision and --use-hybrid. Choose one.")
# Validation: LLM cleanup requires hybrid mode
if self.hybrid_llm_cleanup and not self.use_hybrid:
raise ValueError("--hybrid-llm-cleanup requires --use-hybrid")
# Processing options
self.no_deduplicate = kwargs.get('no_deduplicate', False)
self.no_cache = kwargs.get('no_cache', False)
self.skip_cache_frames = kwargs.get('skip_cache_frames', False)
self.skip_cache_whisper = kwargs.get('skip_cache_whisper', False)
self.skip_cache_analysis = kwargs.get('skip_cache_analysis', False)
self.extract_only = kwargs.get('extract_only', False)
self.format = kwargs.get('format', 'detailed')
self.embed_images = kwargs.get('embed_images', False)
self.embed_quality = kwargs.get('embed_quality', 80)
def to_dict(self) -> Dict[str, Any]:
"""Convert config to dictionary for manifest."""
return {
"whisper": {
"enabled": self.run_whisper or self.diarize,  # --diarize also triggers transcription (via WhisperX)
"model": self.whisper_model,
"diarize": self.diarize
},
"frame_extraction": {
"method": "scene_detection" if self.scene_detection else "interval",
"interval_seconds": self.interval if not self.scene_detection else None,
"scene_threshold": self.scene_threshold if self.scene_detection else None
},
"analysis": {
"method": "vision" if self.use_vision else ("hybrid" if self.use_hybrid else "ocr"),
"vision_model": self.vision_model if self.use_vision else None,
"vision_context": self.vision_context if self.use_vision else None,
"ocr_engine": self.ocr_engine if (not self.use_vision) else None,
"deduplication": not self.no_deduplicate
},
"output_format": self.format
}
class ProcessingWorkflow:
"""Orchestrate the complete video processing workflow."""
def __init__(self, config: WorkflowConfig):
"""
Initialize workflow.
Args:
config: Workflow configuration
"""
self.config = config
self.output_mgr = OutputManager(
config.video_path,
config.output_dir,
use_cache=not config.no_cache
)
self.cache_mgr = CacheManager(
self.output_mgr.output_dir,
self.output_mgr.frames_dir,
config.video_path.stem,
use_cache=not config.no_cache,
skip_cache_frames=config.skip_cache_frames,
skip_cache_whisper=config.skip_cache_whisper,
skip_cache_analysis=config.skip_cache_analysis
)
def run(self) -> Dict[str, Any]:
"""
Run the complete processing workflow.
Returns:
Dictionary with output paths and status
"""
logger.info("=" * 80)
logger.info("MEETING PROCESSOR")
logger.info("=" * 80)
logger.info(f"Video: {self.config.video_path.name}")
# Determine analysis method
if self.config.use_vision:
analysis_method = f"Vision Model ({self.config.vision_model})"
logger.info(f"Analysis: {analysis_method}")
logger.info(f"Context: {self.config.vision_context}")
elif self.config.use_hybrid:
analysis_method = f"Hybrid (OpenCV + {self.config.ocr_engine})"
logger.info(f"Analysis: {analysis_method}")
else:
analysis_method = f"OCR ({self.config.ocr_engine})"
logger.info(f"Analysis: {analysis_method}")
logger.info(f"Frame extraction: {'Scene detection' if self.config.scene_detection else f'Every {self.config.interval}s'}")
logger.info(f"Caching: {'Disabled' if self.config.no_cache else 'Enabled'}")
logger.info("=" * 80)
# Step 0: Whisper transcription
transcript_path = self._run_whisper()
# Step 1: Extract frames
frames_info = self._extract_frames()
if not frames_info:
logger.error("No frames extracted")
raise RuntimeError("Frame extraction failed")
# Step 2: Analyze frames
screen_segments = self._analyze_frames(frames_info)
if self.config.extract_only:
logger.info("Done! (extract-only mode)")
return self._build_result(transcript_path, screen_segments)
# Step 3: Merge with transcript
enhanced_transcript = self._merge_transcripts(transcript_path, screen_segments)
# Save manifest
self.output_mgr.save_manifest(self.config.to_dict())
# Build final result
return self._build_result(transcript_path, screen_segments, enhanced_transcript)
def _run_whisper(self) -> Optional[str]:
"""Run Whisper transcription if requested, or use cached/provided transcript."""
# First, check cache (regardless of run_whisper flag)
cached = self.cache_mgr.get_whisper_cache()
if cached:
return str(cached)
# If no cache and not running whisper/diarize, use provided transcript path (if any)
if not self.config.run_whisper and not self.config.diarize:
return self.config.transcript_path
logger.info("=" * 80)
logger.info("STEP 0: Running Whisper Transcription")
logger.info("=" * 80)
# Determine which transcription tool to use
use_diarize = getattr(self.config, 'diarize', False)
if use_diarize:
if not shutil.which("whisperx"):
logger.error("WhisperX is not installed. Install it with: pip install whisperx")
raise RuntimeError("WhisperX not installed (required for --diarize)")
transcribe_cmd = "whisperx"
else:
if not shutil.which("whisper"):
logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
raise RuntimeError("Whisper not installed")
transcribe_cmd = "whisper"
# Unload Ollama model to free GPU memory for Whisper (if using vision)
if self.config.use_vision:
logger.info("Freeing GPU memory for Whisper...")
try:
subprocess.run(["ollama", "stop", self.config.vision_model],
capture_output=True, check=False)
logger.info("✓ Ollama model unloaded")
except Exception as e:
logger.warning(f"Could not unload Ollama model: {e}")
if use_diarize:
logger.info(f"Running WhisperX transcription with diarization (model: {self.config.whisper_model})...")
else:
logger.info(f"Running Whisper transcription (model: {self.config.whisper_model})...")
logger.info("This may take a few minutes depending on video length...")
# Build command
cmd = [
transcribe_cmd,
str(self.config.video_path),
"--model", self.config.whisper_model,
"--output_format", "json",
"--output_dir", str(self.output_mgr.output_dir),
]
if use_diarize:
cmd.append("--diarize")
try:
# Set up environment with cuDNN library path for whisperx
env = os.environ.copy()
if use_diarize:
import site
site_packages = site.getsitepackages()[0]
cudnn_path = Path(site_packages) / "nvidia" / "cudnn" / "lib"
if cudnn_path.exists():
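# Prepend cuDNN; avoid a trailing separator when LD_LIBRARY_PATH is unset,
# since an empty entry can be treated as the current directory by the loader
existing_ld_path = env.get("LD_LIBRARY_PATH", "")
env["LD_LIBRARY_PATH"] = f"{cudnn_path}{os.pathsep}{existing_ld_path}" if existing_ld_path else str(cudnn_path)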
subprocess.run(cmd, check=True, capture_output=True, text=True, env=env)
transcript_path = self.output_mgr.get_path(f"{self.config.video_path.stem}.json")
if transcript_path.exists():
logger.info(f"✓ Whisper transcription completed: {transcript_path.name}")
# Debug: Show transcript preview
try:
import json
with open(transcript_path, 'r', encoding='utf-8') as f:
whisper_data = json.load(f)
if 'segments' in whisper_data:
logger.debug(f"Whisper produced {len(whisper_data['segments'])} segments")
if whisper_data['segments']:
logger.debug(f"First segment: {whisper_data['segments'][0]}")
logger.debug(f"Last segment: {whisper_data['segments'][-1]}")
if 'text' in whisper_data:
text_preview = whisper_data['text'][:200] + "..." if len(whisper_data.get('text', '')) > 200 else whisper_data.get('text', '')
logger.debug(f"Transcript preview: {text_preview}")
except Exception as e:
logger.debug(f"Could not parse whisper output for debug: {e}")
logger.info("")
return str(transcript_path)
else:
logger.error("Whisper completed but transcript file not found")
raise RuntimeError("Whisper output missing")
except subprocess.CalledProcessError as e:
logger.error(f"Whisper failed: {e.stderr}")
raise
def _extract_frames(self):
"""Extract frames from video."""
logger.info("Step 1: Extracting frames from video...")
# Check cache
cached_frames = self.cache_mgr.get_frames_cache()
if cached_frames:
return cached_frames
# Clean up old frames if regenerating
if self.config.skip_cache_frames and self.output_mgr.frames_dir.exists():
old_frames = list(self.output_mgr.frames_dir.glob("*.jpg"))
if old_frames:
logger.info(f"Cleaning up {len(old_frames)} old frames...")
for old_frame in old_frames:
old_frame.unlink()
logger.info("✓ Cleanup complete")
# Extract frames (use embed quality so saved files match embedded images)
if self.config.scene_detection:
logger.info(f"Extracting frames with scene detection (threshold={self.config.scene_threshold})...")
else:
logger.info(f"Extracting frames every {self.config.interval}s...")
extractor = FrameExtractor(
str(self.config.video_path),
str(self.output_mgr.frames_dir),
quality=self.config.embed_quality
)
if self.config.scene_detection:
frames_info = extractor.extract_scene_changes(threshold=self.config.scene_threshold)
else:
frames_info = extractor.extract_by_interval(self.config.interval)
logger.info(f"✓ Extracted {len(frames_info)} frames")
return frames_info
def _analyze_frames(self, frames_info):
"""Analyze frames with vision, hybrid, or OCR."""
# Skip analysis if just embedding images
if self.config.embed_images:
logger.info("Step 2: Skipping analysis (images will be embedded)")
# Create minimal segments with just frame paths and timestamps
screen_segments = [
{
'timestamp': timestamp,
'text': '', # No text extraction needed
'frame_path': frame_path
}
for frame_path, timestamp in frames_info
]
logger.info(f"✓ Prepared {len(screen_segments)} frames for embedding")
return screen_segments
# Determine analysis type
if self.config.use_vision:
analysis_type = 'vision'
elif self.config.use_hybrid:
analysis_type = 'hybrid'
else:
analysis_type = 'ocr'
# Check cache
cached_analysis = self.cache_mgr.get_analysis_cache(analysis_type)
if cached_analysis:
return cached_analysis
if self.config.use_vision:
return self._run_vision_analysis(frames_info)
elif self.config.use_hybrid:
return self._run_hybrid_analysis(frames_info)
else:
return self._run_ocr_analysis(frames_info)
def _run_vision_analysis(self, frames_info):
"""Run vision analysis on frames."""
logger.info("Step 2: Running vision analysis on extracted frames...")
logger.info(f"Loading vision model {self.config.vision_model} to GPU...")
# Load audio segments for context if transcript exists
audio_segments = []
transcript_path = self.config.transcript_path or self._get_cached_transcript()
if transcript_path:
transcript_file = Path(transcript_path)
if transcript_file.exists():
logger.info("Loading audio transcript for context...")
merger = TranscriptMerger()
audio_segments = merger.load_whisper_transcript(str(transcript_file))
logger.info(f"✓ Loaded {len(audio_segments)} audio segments for context")
try:
vision = VisionProcessor(model=self.config.vision_model)
screen_segments = vision.process_frames(
frames_info,
context=self.config.vision_context,
deduplicate=not self.config.no_deduplicate,
audio_segments=audio_segments
)
logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
# Debug: Show sample analysis results
if screen_segments:
logger.debug(f"First analysis result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
logger.debug(f"First analysis text preview: {screen_segments[0].get('text', '')[:200]}...")
if len(screen_segments) > 1:
logger.debug(f"Last analysis result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
# Cache results
self.cache_mgr.save_analysis('vision', screen_segments)
return screen_segments
except ImportError as e:
logger.error(f"{e}")
raise
def _get_cached_transcript(self) -> Optional[str]:
"""Get cached Whisper transcript if available."""
cached = self.cache_mgr.get_whisper_cache()
return str(cached) if cached else None
def _run_hybrid_analysis(self, frames_info):
"""Run hybrid analysis on frames (OpenCV + OCR)."""
if self.config.hybrid_llm_cleanup:
logger.info("Step 2: Running hybrid analysis (OpenCV + OCR + LLM cleanup)...")
else:
logger.info("Step 2: Running hybrid analysis (OpenCV text detection + OCR)...")
try:
from .hybrid_processor import HybridProcessor
hybrid = HybridProcessor(
ocr_engine=self.config.ocr_engine,
use_llm_cleanup=self.config.hybrid_llm_cleanup,
llm_model=self.config.hybrid_llm_model
)
screen_segments = hybrid.process_frames(
frames_info,
deduplicate=not self.config.no_deduplicate
)
logger.info(f"✓ Processed {len(screen_segments)} frames with hybrid analysis")
# Debug: Show sample hybrid results
if screen_segments:
logger.debug(f"First hybrid result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
logger.debug(f"First hybrid text preview: {screen_segments[0].get('text', '')[:200]}...")
if len(screen_segments) > 1:
logger.debug(f"Last hybrid result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
# Cache results
self.cache_mgr.save_analysis('hybrid', screen_segments)
return screen_segments
except ImportError as e:
logger.error(f"{e}")
raise
def _run_ocr_analysis(self, frames_info):
"""Run OCR analysis on frames."""
logger.info("Step 2: Running OCR on extracted frames...")
try:
ocr = OCRProcessor(engine=self.config.ocr_engine)
screen_segments = ocr.process_frames(
frames_info,
deduplicate=not self.config.no_deduplicate
)
logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
# Debug: Show sample OCR results
if screen_segments:
logger.debug(f"First OCR result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
logger.debug(f"First OCR text preview: {screen_segments[0].get('text', '')[:200]}...")
if len(screen_segments) > 1:
logger.debug(f"Last OCR result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
# Cache results
self.cache_mgr.save_analysis('ocr', screen_segments)
return screen_segments
except ImportError as e:
logger.error(f"{e}")
logger.error(f"To install {self.config.ocr_engine}:")
logger.error(f" pip install {self.config.ocr_engine}")
raise
def _merge_transcripts(self, transcript_path, screen_segments):
"""Merge audio and screen transcripts."""
merger = TranscriptMerger(
embed_images=self.config.embed_images,
embed_quality=self.config.embed_quality
)
# Load audio transcript if available
audio_segments = []
if transcript_path:
logger.info("Step 3: Merging with Whisper transcript...")
transcript_file = Path(transcript_path)
if not transcript_file.exists():
logger.warning(f"Transcript not found: {transcript_path}")
logger.info("Proceeding with screen content only...")
else:
# Group audio into 30-second intervals for cleaner reference timestamps
audio_segments = merger.load_whisper_transcript(str(transcript_file), group_interval=30)
logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
else:
logger.info("No transcript provided, using screen content only...")
# Merge and format
merged = merger.merge_transcripts(audio_segments, screen_segments)
formatted = merger.format_for_claude(merged, format_style=self.config.format)
# Save output
if self.config.custom_output:
output_path = self.config.custom_output
else:
output_path = self.output_mgr.get_path(f"{self.config.video_path.stem}_enhanced.txt")
merger.save_transcript(formatted, str(output_path))
logger.info("=" * 80)
logger.info("✓ PROCESSING COMPLETE!")
logger.info("=" * 80)
logger.info(f"Output directory: {self.output_mgr.output_dir}")
logger.info(f"Enhanced transcript: {Path(output_path).name}")
logger.info("")
return output_path
def _build_result(self, transcript_path=None, screen_segments=None, enhanced_transcript=None):
"""Build result dictionary."""
# Determine analysis filename
if self.config.use_vision:
analysis_type = 'vision'
elif self.config.use_hybrid:
analysis_type = 'hybrid'
else:
analysis_type = 'ocr'
return {
"output_dir": str(self.output_mgr.output_dir),
"transcript": transcript_path,
"analysis": f"{self.config.video_path.stem}_{analysis_type}.json",
"frames_count": len(screen_segments) if screen_segments else 0,
"enhanced_transcript": enhanced_transcript,
"manifest": str(self.output_mgr.get_path("manifest.json"))
}
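
Review note on the `group_interval=30` argument passed to `load_whisper_transcript` above: the diff does not show how the merger groups segments, so the following is only a minimal sketch of one plausible approach. It assumes Whisper-style segments carrying `start` and `text` keys; the helper name `group_segments` and the output keys `timestamp` and `text` are hypothetical and not taken from `TranscriptMerger`.

```python
from typing import Dict, List


def group_segments(segments: List[Dict], group_interval: float = 30.0) -> List[Dict]:
    """Merge transcript segments into fixed-width time buckets (hypothetical helper)."""
    grouped: List[Dict] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        # The bucket timestamp is the segment start rounded down to the interval.
        bucket = int(seg["start"] // group_interval) * group_interval
        text = seg["text"].strip()
        if grouped and grouped[-1]["timestamp"] == bucket:
            # Same bucket as the previous segment: append the text.
            grouped[-1]["text"] += " " + text
        else:
            grouped.append({"timestamp": bucket, "text": text})
    return grouped


# Illustrative input only, not real transcript data.
segments = [
    {"start": 2.0, "text": " Hello everyone."},
    {"start": 14.5, "text": " Let's look at the dashboard."},
    {"start": 31.2, "text": " Here is the error rate."},
]
grouped = group_segments(segments)
```

With this approach, the first two segments share the bucket starting at 0 seconds and the third lands in the bucket starting at 30 seconds, which gives the merged transcript fewer, coarser reference timestamps.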


@@ -1,34 +1,19 @@
#!/usr/bin/env python3
"""
Process meeting recordings to extract audio + screen content.
Combines Whisper transcripts with OCR from screen shares.
Combines Whisper transcripts with vision analysis or OCR from screen shares.
"""
import argparse
from pathlib import Path
import sys
import json
import logging
import subprocess
import shutil
from meetus.frame_extractor import FrameExtractor
from meetus.ocr_processor import OCRProcessor
from meetus.vision_processor import VisionProcessor
from meetus.transcript_merger import TranscriptMerger
logger = logging.getLogger(__name__)
from meetus.workflow import WorkflowConfig, ProcessingWorkflow
def setup_logging(verbose: bool = False):
"""
Configure logging for the application.
Args:
verbose: If True, set DEBUG level, otherwise INFO
"""
"""Configure logging for the application."""
level = logging.DEBUG if verbose else logging.INFO
# Configure root logger
logging.basicConfig(
level=level,
format='%(asctime)s - %(levelname)s - %(message)s',
@@ -41,158 +26,121 @@ def setup_logging(verbose: bool = False):
logging.getLogger('paddleocr').setLevel(logging.WARNING)
def run_whisper(video_path: Path, model: str = "base", output_dir: str = "output") -> Path:
"""
Run Whisper transcription on video file.
Args:
video_path: Path to video file
model: Whisper model to use (tiny, base, small, medium, large)
output_dir: Directory to save output
Returns:
Path to generated JSON transcript
"""
# Check if whisper is installed
if not shutil.which("whisper"):
logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
sys.exit(1)
logger.info(f"Running Whisper transcription (model: {model})...")
logger.info("This may take a few minutes depending on video length...")
# Run whisper command
cmd = [
"whisper",
str(video_path),
"--model", model,
"--output_format", "json",
"--output_dir", output_dir
]
try:
result = subprocess.run(
cmd,
check=True,
capture_output=True,
text=True
)
# Whisper outputs to <output_dir>/<video_stem>.json
transcript_path = Path(output_dir) / f"{video_path.stem}.json"
if transcript_path.exists():
logger.info(f"✓ Whisper transcription completed: {transcript_path}")
return transcript_path
else:
logger.error("Whisper completed but transcript file not found")
sys.exit(1)
except subprocess.CalledProcessError as e:
logger.error(f"Whisper failed: {e.stderr}")
sys.exit(1)
def main():
parser = argparse.ArgumentParser(
description="Extract screen content from meeting recordings and merge with transcripts",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run Whisper + vision analysis (recommended for code/dashboards)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Reference frames for LLM analysis (recommended - transcript includes frame paths)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
# Use vision with specific context hint
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# Adjust frame extraction quality (lower = smaller files)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection
# Traditional OCR approach
python process_meeting.py samples/meeting.mkv --run-whisper
# Hybrid approach: OpenCV + OCR (extracts text from frames)
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --scene-detection
# Re-run analysis using cached frames and transcript
python process_meeting.py samples/meeting.mkv --use-vision
# Hybrid + LLM cleanup (best for code formatting)
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --hybrid-llm-cleanup --scene-detection
# Force reprocessing (ignore cache)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
# Use scene detection for fewer frames
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
# Iterate on scene threshold (reuse whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
"""
)
# Required arguments
parser.add_argument(
'video',
help='Path to video file'
)
# Whisper options
parser.add_argument(
'--transcript', '-t',
help='Path to Whisper transcript (JSON or TXT)',
default=None
)
parser.add_argument(
'--run-whisper',
action='store_true',
help='Run Whisper transcription before processing'
)
parser.add_argument(
'--whisper-model',
choices=['tiny', 'base', 'small', 'medium', 'large'],
help='Whisper model to use (default: base)',
default='base'
help='Whisper model to use (default: medium)',
default='medium'
)
parser.add_argument(
'--diarize',
action='store_true',
help='Use WhisperX with speaker diarization (requires whisperx and HuggingFace token)'
)
# Output options
parser.add_argument(
'--output', '-o',
help='Output file for enhanced transcript (default: output/<video>_enhanced.txt)',
help='Output file for enhanced transcript (default: auto-generated in output directory)',
default=None
)
parser.add_argument(
'--output-dir',
help='Directory for output files (default: output/)',
help='Base directory for outputs (default: output/)',
default='output'
)
parser.add_argument(
'--frames-dir',
help='Directory to save extracted frames (default: frames/)',
default='frames'
)
# Frame extraction options
parser.add_argument(
'--interval',
type=int,
help='Extract frame every N seconds (default: 5)',
default=5
)
parser.add_argument(
'--scene-detection',
action='store_true',
help='Use scene detection instead of interval extraction'
)
parser.add_argument(
'--scene-threshold',
type=float,
help='Scene detection threshold (0-100, lower=more sensitive, default: 15)',
default=15.0
)
# Analysis options
parser.add_argument(
'--ocr-engine',
choices=['tesseract', 'easyocr', 'paddleocr'],
help='OCR engine to use (default: tesseract)',
default='tesseract'
)
parser.add_argument(
'--use-vision',
action='store_true',
help='Use local vision model (Ollama) instead of OCR for better context understanding'
)
parser.add_argument(
'--use-hybrid',
action='store_true',
help='Use hybrid approach: OpenCV text detection + OCR (more accurate than vision models)'
)
parser.add_argument(
'--hybrid-llm-cleanup',
action='store_true',
help='Use LLM to clean up OCR output and preserve code formatting (requires --use-hybrid)'
)
parser.add_argument(
'--hybrid-llm-model',
help='LLM model for cleanup (default: llama3.2:3b)',
default='llama3.2:3b'
)
parser.add_argument(
'--vision-model',
help='Vision model to use with Ollama (default: llava:13b)',
default='llava:13b'
)
parser.add_argument(
'--vision-context',
choices=['meeting', 'dashboard', 'code', 'console'],
@@ -200,31 +148,56 @@ Examples:
default='meeting'
)
# Processing options
parser.add_argument(
'--no-cache',
action='store_true',
help='Disable caching - reprocess everything even if outputs exist'
)
parser.add_argument(
'--skip-cache-frames',
action='store_true',
help='Skip cached frames, re-extract from video (but keep whisper/analysis cache)'
)
parser.add_argument(
'--skip-cache-whisper',
action='store_true',
help='Skip cached whisper transcript, re-run transcription (but keep frames/analysis cache)'
)
parser.add_argument(
'--skip-cache-analysis',
action='store_true',
help='Skip cached analysis, re-run OCR/vision (but keep frames/whisper cache)'
)
parser.add_argument(
'--no-deduplicate',
action='store_true',
help='Disable text deduplication'
)
parser.add_argument(
'--extract-only',
action='store_true',
help='Only extract frames and OCR, skip transcript merging'
help='Only extract frames and analyze, skip transcript merging'
)
parser.add_argument(
'--format',
choices=['detailed', 'compact'],
help='Output format style (default: detailed)',
default='detailed'
)
parser.add_argument(
'--embed-images',
action='store_true',
help='Skip OCR/vision analysis and reference frame files directly (faster, lets LLM analyze images)'
)
parser.add_argument(
'--embed-quality',
type=int,
help='JPEG quality for extracted frames (default: 80, lower = smaller files)',
default=80
)
# Logging
parser.add_argument(
'--verbose', '-v',
action='store_true',
@@ -236,166 +209,38 @@ Examples:
# Setup logging
setup_logging(args.verbose)
# Validate video path
video_path = Path(args.video)
if not video_path.exists():
logger.error(f"Video file not found: {args.video}")
sys.exit(1)
try:
# Create workflow configuration
config = WorkflowConfig(**vars(args))
# Create output directory
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
# Run processing workflow
workflow = ProcessingWorkflow(config)
result = workflow.run()
# Set default output path
if args.output is None:
args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
# Print final summary
print("\n" + "=" * 80)
print("✓ SUCCESS!")
print("=" * 80)
print(f"Output directory: {result['output_dir']}")
if result.get('enhanced_transcript'):
print("Enhanced transcript ready for AI summarization!")
print("=" * 80)
# Define cache paths
whisper_cache = output_dir / f"{video_path.stem}.json"
analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
frames_cache_dir = Path(args.frames_dir)
return 0
# Check for cached Whisper transcript
if args.run_whisper:
if not args.no_cache and whisper_cache.exists():
logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
args.transcript = str(whisper_cache)
else:
logger.info("=" * 80)
logger.info("STEP 0: Running Whisper Transcription")
logger.info("=" * 80)
transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
args.transcript = str(transcript_path)
logger.info("")
logger.info("=" * 80)
logger.info("MEETING PROCESSOR")
logger.info("=" * 80)
logger.info(f"Video: {video_path.name}")
logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
if args.use_vision:
logger.info(f"Vision Model: {args.vision_model}")
logger.info(f"Context: {args.vision_context}")
logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
if args.transcript:
logger.info(f"Transcript: {args.transcript}")
logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
logger.info("=" * 80)
# Step 1: Extract frames (with caching)
logger.info("Step 1: Extracting frames from video...")
# Check if frames already exist
existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
if not args.no_cache and existing_frames and len(existing_frames) > 0:
logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
# Build frames_info from existing files
frames_info = []
for frame_path in sorted(existing_frames):
# Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
try:
timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
timestamp = float(timestamp_str)
except:
timestamp = 0.0
frames_info.append((str(frame_path), timestamp))
else:
extractor = FrameExtractor(str(video_path), args.frames_dir)
if args.scene_detection:
frames_info = extractor.extract_scene_changes()
else:
frames_info = extractor.extract_by_interval(args.interval)
if not frames_info:
logger.error("No frames extracted")
sys.exit(1)
logger.info(f"✓ Extracted {len(frames_info)} frames")
# Step 2: Run analysis on frames (with caching)
if not args.no_cache and analysis_cache.exists():
logger.info(f"✓ Found cached analysis results: {analysis_cache}")
with open(analysis_cache, 'r', encoding='utf-8') as f:
screen_segments = json.load(f)
logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
else:
if args.use_vision:
# Use vision model
logger.info("Step 2: Running vision analysis on extracted frames...")
try:
vision = VisionProcessor(model=args.vision_model)
screen_segments = vision.process_frames(
frames_info,
context=args.vision_context,
deduplicate=not args.no_deduplicate
)
logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
except ImportError as e:
logger.error(f"{e}")
sys.exit(1)
else:
# Use OCR
logger.info("Step 2: Running OCR on extracted frames...")
try:
ocr = OCRProcessor(engine=args.ocr_engine)
screen_segments = ocr.process_frames(
frames_info,
deduplicate=not args.no_deduplicate
)
logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
except ImportError as e:
logger.error(f"{e}")
logger.error(f"To install {args.ocr_engine}:")
logger.error(f" pip install {args.ocr_engine}")
sys.exit(1)
# Save analysis results as JSON
with open(analysis_cache, 'w', encoding='utf-8') as f:
json.dump(screen_segments, f, indent=2, ensure_ascii=False)
logger.info(f"✓ Saved analysis results to: {analysis_cache}")
if args.extract_only:
logger.info("Done! (extract-only mode)")
return
# Step 3: Merge with transcript (if provided)
merger = TranscriptMerger()
if args.transcript:
logger.info("Step 3: Merging with Whisper transcript...")
transcript_path = Path(args.transcript)
if not transcript_path.exists():
logger.warning(f"Transcript not found: {args.transcript}")
logger.info("Proceeding with screen content only...")
audio_segments = []
else:
audio_segments = merger.load_whisper_transcript(str(transcript_path))
logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
else:
logger.info("No transcript provided, using screen content only...")
audio_segments = []
# Merge and format
merged = merger.merge_transcripts(audio_segments, screen_segments)
formatted = merger.format_for_claude(merged, format_style=args.format)
# Save output
merger.save_transcript(formatted, args.output)
logger.info("=" * 80)
logger.info("✓ PROCESSING COMPLETE!")
logger.info("=" * 80)
logger.info(f"Enhanced transcript: {args.output}")
logger.info(f"OCR data: {ocr_output}")
logger.info(f"Frames: {args.frames_dir}/")
logger.info("")
logger.info("You can now use the enhanced transcript with Claude for summarization!")
except FileNotFoundError as e:
logging.error(f"File not found: {e}")
return 1
except RuntimeError as e:
logging.error(f"Processing failed: {e}")
return 1
except KeyboardInterrupt:
logging.warning("\nProcessing interrupted by user")
return 130
except Exception as e:
logging.exception(f"Unexpected error: {e}")
return 1
if __name__ == '__main__':
main()
sys.exit(main())
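
Review note on the new `main()`: it builds the configuration with `WorkflowConfig(**vars(args))` and returns an exit code that is passed to `sys.exit`. `WorkflowConfig` itself is not shown in this part of the diff, so the sketch below uses a hypothetical `DemoConfig` dataclass and a reduced set of arguments to illustrate the pattern. It is not the project's implementation.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DemoConfig:
    """Hypothetical stand-in for WorkflowConfig; the real class is not shown."""
    video: Path
    scene_detection: bool = False
    scene_threshold: float = 15.0

    def __post_init__(self):
        # Validate early so the caller can map the failure to an exit code.
        self.video = Path(self.video)
        if not self.video.exists():
            raise FileNotFoundError(self.video)


def demo_main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("video")
    parser.add_argument("--scene-detection", action="store_true")
    parser.add_argument("--scene-threshold", type=float, default=15.0)
    args = parser.parse_args(argv)
    try:
        # argparse stores --scene-threshold as the attribute scene_threshold,
        # so vars(args) can be unpacked directly as keyword arguments.
        DemoConfig(**vars(args))
    except FileNotFoundError:
        return 1
    return 0


# "." always exists, so this call is expected to succeed.
exit_code = demo_main([".", "--scene-detection", "--scene-threshold", "5"])
```

One consequence of this pattern is that every argparse destination must match a constructor parameter of the config class, so adding a CLI flag without a matching field raises a `TypeError` at startup.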

View File

@@ -1,6 +1,7 @@
# Core dependencies
opencv-python>=4.8.0
Pillow>=10.0.0
ffmpeg-python>=0.2.0
# Vision analysis (recommended for better results)
# Requires Ollama to be installed: https://ollama.ai/download