Compare commits: a999bc9093...main (10 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | eb8b1f4f11 | | |
| | 331cccb15f | | |
| | 7d7ec15ff7 | | |
| | 7b919beda6 | | |
| | 118ef04223 | | |
| | b1e1daf278 | | |
| | c871af2def | | |
| | cdf7ad1199 | | |
| | b9c3cbfbab | | |
| | cd7b0aed07 | | |
9
.gitignore
vendored
@@ -2,10 +2,11 @@
|
|||||||
samples/*
|
samples/*
|
||||||
!samples/.gitkeep
|
!samples/.gitkeep
|
||||||
|
|
||||||
# Output files
|
# Output directories (timestamped folders for each video)
|
||||||
output/*
|
output/*
|
||||||
!output/.gitkeep
|
!output/.gitkeep
|
||||||
|
|
||||||
# Extracted frames
|
# Python cache
|
||||||
frames/
|
__pycache__
|
||||||
__pycache__
|
*.pyc
|
||||||
|
.pytest_cache/
|
||||||
407
README.md
@@ -1,34 +1,21 @@
|
|||||||
# Meeting Processor
|
# Meeting Processor
|
||||||
|
|
||||||
Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
|
Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.
|
||||||
|
|
||||||
## Overview
|
## Overview
|
||||||
|
|
||||||
This tool enhances meeting transcripts by combining:
|
This tool enhances meeting transcripts by combining:
|
||||||
- **Audio transcription** (from Whisper)
|
- **Audio transcription** (Whisper or WhisperX with speaker diarization)
|
||||||
- **Screen content analysis** (Vision models or OCR)
|
- **Screen content extraction** via FFmpeg scene detection
|
||||||
|
- **Frame embedding** for direct LLM analysis
|
||||||
|
|
||||||
### Vision Analysis vs OCR
|
The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.
|
||||||
|
|
||||||
- **Vision Models** (recommended): Uses local LLaVA model via Ollama to understand context - great for dashboards, code, consoles
|
|
||||||
- **OCR**: Traditional text extraction - faster but less context-aware
|
|
||||||
|
|
||||||
The result is a rich, timestamped transcript that provides full context for AI summarization.
|
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
### 1. System Dependencies
|
### 1. System Dependencies
|
||||||
|
|
||||||
**Ollama** (required for vision analysis):
|
**FFmpeg** (required for scene detection and frame extraction):
|
||||||
```bash
|
|
||||||
# Install from https://ollama.ai/download
|
|
||||||
# Then pull a vision model:
|
|
||||||
ollama pull llava:13b
|
|
||||||
# or for lighter model:
|
|
||||||
ollama pull llava:7b
|
|
||||||
```
|
|
||||||
|
|
||||||
**FFmpeg** (for scene detection):
|
|
||||||
```bash
|
```bash
|
||||||
# Ubuntu/Debian
|
# Ubuntu/Debian
|
||||||
sudo apt-get install ffmpeg
|
sudo apt-get install ffmpeg
|
||||||
@@ -37,210 +24,152 @@ sudo apt-get install ffmpeg
|
|||||||
brew install ffmpeg
|
brew install ffmpeg
|
||||||
```
|
```
|
||||||
|
|
||||||
**Tesseract OCR** (optional, if not using vision):
|
|
||||||
```bash
|
|
||||||
# Ubuntu/Debian
|
|
||||||
sudo apt-get install tesseract-ocr
|
|
||||||
|
|
||||||
# macOS
|
|
||||||
brew install tesseract
|
|
||||||
|
|
||||||
# Arch Linux
|
|
||||||
sudo pacman -S tesseract
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. Python Dependencies
|
### 2. Python Dependencies
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
### 3. Whisper (for audio transcription)
|
### 3. Whisper or WhisperX (for audio transcription)
|
||||||
|
|
||||||
|
**Standard Whisper:**
|
||||||
```bash
|
```bash
|
||||||
pip install openai-whisper
|
pip install openai-whisper
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Optional: Install Alternative OCR Engines
|
**WhisperX** (recommended - includes speaker diarization):
|
||||||
|
|
||||||
If you prefer OCR over vision analysis:
|
|
||||||
```bash
|
```bash
|
||||||
# EasyOCR (better for rotated/handwritten text)
|
pip install whisperx
|
||||||
pip install easyocr
|
|
||||||
|
|
||||||
# PaddleOCR (better for code/terminal screens)
|
|
||||||
pip install paddleocr
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
For speaker diarization, you'll need a HuggingFace token with access to pyannote models.
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
### Recommended: Vision Analysis (Best for Code/Dashboards)
|
### Recommended Usage
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
|
||||||
```
|
```
|
||||||
|
|
||||||
This will:
|
This will:
|
||||||
1. Run Whisper transcription (audio → text)
|
1. Run WhisperX transcription with speaker diarization
|
||||||
2. Extract frames every 5 seconds
|
2. Extract frames at scene changes (threshold 10 = moderately sensitive)
|
||||||
3. Use LLaVA vision model to analyze frames with context
|
3. Create an enhanced transcript with frame file references
|
||||||
4. Merge audio + screen content
|
4. Save everything to `output/` folder
|
||||||
5. Save everything to `output/` folder
|
|
||||||
|
The `--embed-images` flag adds frame paths to the transcript (e.g., `Frame: frames/video_00257.jpg`), keeping the transcript small while frames stay in `frames/` folder for LLM access.
|
||||||
|
|
||||||
### Re-run with Cached Results
|
### Re-run with Cached Results
|
||||||
|
|
||||||
Already ran it once? Re-run instantly using cached results:
|
Already ran it once? Re-run instantly using cached results:
|
||||||
```bash
|
```bash
|
||||||
# Uses cached transcript, frames, and analysis
|
# Uses cached transcript and frames
|
||||||
python process_meeting.py samples/meeting.mkv --use-vision
|
python process_meeting.py samples/meeting.mkv --embed-images
|
||||||
|
|
||||||
# Force reprocessing
|
# Skip only specific cached items
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
|
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
|
||||||
```
|
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
|
||||||
|
|
||||||
### Traditional OCR (Faster, Less Context-Aware)
|
# Force complete reprocessing
|
||||||
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
|
||||||
```bash
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Usage Examples
|
## Usage Examples
|
||||||
|
|
||||||
### Vision Analysis with Context Hints
|
### Scene Detection Options
|
||||||
```bash
|
```bash
|
||||||
# For code-heavy meetings
|
# Default threshold (15)
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize
|
||||||
|
|
||||||
# For dashboard/monitoring meetings (Grafana, GCP, etc.)
|
# More sensitive (more frames, threshold: 5)
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --diarize
|
||||||
|
|
||||||
# For console/terminal sessions
|
# Less sensitive (fewer frames, threshold: 30)
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 30 --diarize
|
||||||
```
|
```
|
||||||
|
|
||||||
### Different Vision Models
|
### Fixed Interval Extraction (alternative to scene detection)
|
||||||
```bash
|
|
||||||
# Lighter/faster model (7B parameters)
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
|
|
||||||
|
|
||||||
# Default model (13B parameters, better quality)
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
|
|
||||||
|
|
||||||
# Alternative models
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
|
|
||||||
```
|
|
||||||
|
|
||||||
### Extract frames at different intervals
|
|
||||||
```bash
|
```bash
|
||||||
# Every 10 seconds
|
# Every 10 seconds
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
|
python process_meeting.py samples/meeting.mkv --embed-images --interval 10 --diarize
|
||||||
|
|
||||||
# Every 3 seconds (more detailed)
|
# Every 3 seconds (more detailed)
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
|
python process_meeting.py samples/meeting.mkv --embed-images --interval 3 --diarize
|
||||||
```
|
|
||||||
|
|
||||||
### Use scene detection (smarter, fewer frames)
|
|
||||||
```bash
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
|
|
||||||
```
|
|
||||||
|
|
||||||
### Traditional OCR (if you prefer)
|
|
||||||
```bash
|
|
||||||
# Tesseract (default)
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper
|
|
||||||
|
|
||||||
# EasyOCR
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
|
|
||||||
|
|
||||||
# PaddleOCR
|
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Caching Examples
|
### Caching Examples
|
||||||
```bash
|
```bash
|
||||||
# First run - processes everything
|
# First run - processes everything
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
|
||||||
|
|
||||||
# Second run - uses cached transcript and frames, only re-merges
|
# Iterate on scene threshold (reuse whisper transcript)
|
||||||
python process_meeting.py samples/meeting.mkv
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
|
||||||
|
|
||||||
# Switch from OCR to vision using existing frames
|
# Re-run whisper only
|
||||||
python process_meeting.py samples/meeting.mkv --use-vision
|
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
|
||||||
|
|
||||||
# Force complete reprocessing
|
# Force complete reprocessing
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache
|
||||||
```
|
```
|
||||||
|
|
||||||
### Custom output location
|
### Custom output location
|
||||||
```bash
|
```bash
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --output-dir my_outputs/
|
||||||
```
|
```
|
||||||
|
|
||||||
### Enable verbose logging
|
### Enable verbose logging
|
||||||
```bash
|
```bash
|
||||||
# Show detailed debug information
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --verbose
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Output Files
|
## Output Files
|
||||||
|
|
||||||
All output files are saved to the `output/` directory by default:
|
Each video gets its own timestamped output directory:
|
||||||
|
|
||||||
- **`output/<video>_enhanced.txt`** - Enhanced transcript ready for AI summarization
|
```
|
||||||
- **`output/<video>.json`** - Whisper transcript (if `--run-whisper` was used)
|
output/
|
||||||
- **`output/<video>_vision.json`** - Vision analysis results with timestamps (if `--use-vision`)
|
└── 20241019_143022-meeting/
|
||||||
- **`output/<video>_ocr.json`** - OCR results with timestamps (if using OCR)
|
├── manifest.json # Processing configuration
|
||||||
- **`frames/`** - Extracted video frames (JPG files)
|
├── meeting_enhanced.txt # Enhanced transcript for AI
|
||||||
|
├── meeting.json # Whisper/WhisperX transcript
|
||||||
|
└── frames/ # Extracted video frames
|
||||||
|
├── frame_00001_5.00s.jpg
|
||||||
|
├── frame_00002_10.00s.jpg
|
||||||
|
└── ...
|
||||||
|
```
|
||||||
|
|
||||||
### Caching Behavior
|
### Caching Behavior
|
||||||
|
|
||||||
The tool automatically caches intermediate results to speed up re-runs:
|
The tool automatically reuses the most recent output directory for the same video:
|
||||||
- **Whisper transcript**: Cached as `output/<video>.json`
|
- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
|
||||||
- **Extracted frames**: Cached in `frames/<video>_*.jpg`
|
- **Subsequent runs**: Reuses the same directory and cached results
|
||||||
- **Analysis results**: Cached as `output/<video>_vision.json` or `output/<video>_ocr.json`
|
- **Cached items**: Whisper transcript, extracted frames, analysis results
|
||||||
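The directory-reuse rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: `find_latest_output_dir` is a hypothetical helper, and it assumes directory names follow the `YYYYMMDD_HHMMSS-<video>` pattern shown in the output listing, so that sorting names lexicographically also sorts them by time.

```python
from pathlib import Path
from typing import Optional

TIMESTAMP_LEN = len("YYYYMMDD_HHMMSS")  # 15 characters


def find_latest_output_dir(output_root: Path, video_stem: str) -> Optional[Path]:
    """Return the newest `YYYYMMDD_HHMMSS-<video_stem>` directory, or None.

    Hypothetical helper for illustration; the real lookup may differ.
    """
    suffix = f"-{video_stem}"
    candidates = [
        p for p in output_root.iterdir()
        if p.is_dir()
        and p.name.endswith(suffix)
        # Reject names such as "<timestamp>-other-meeting" for stem "meeting".
        and len(p.name) == TIMESTAMP_LEN + len(suffix)
    ]
    # Timestamps are zero-padded, so the lexicographic maximum is the most recent.
    return max(candidates, key=lambda p: p.name, default=None)
```

A `None` result would correspond to the "first run" case, where a new timestamped directory is created.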
|
|
||||||
Re-running with the same video will use cached results unless `--no-cache` is specified.
|
**Fine-grained cache control:**
|
||||||
|
- `--no-cache`: Force complete reprocessing
|
||||||
|
- `--skip-cache-frames`: Re-extract frames only
|
||||||
|
- `--skip-cache-whisper`: Re-run transcription only
|
||||||
|
- `--skip-cache-analysis`: Re-run analysis only
|
||||||
|
|
||||||
|
This allows you to iterate on scene detection thresholds without re-running Whisper!
|
||||||
|
|
||||||
## Workflow for Meeting Analysis
|
## Workflow for Meeting Analysis
|
||||||
|
|
||||||
### Complete Workflow (One Command!)
|
### Complete Workflow (One Command!)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Process everything in one step with vision analysis
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
|
||||||
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
|
|
||||||
|
|
||||||
# Output will be in output/alo-intro1_enhanced.txt
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Typical Iterative Workflow
|
### Typical Iterative Workflow
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# First run - full processing
|
# First run - full processing
|
||||||
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize
|
||||||
|
|
||||||
# Review results, then re-run with different context if needed
|
# Adjust scene threshold (keeps cached whisper transcript)
|
||||||
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
|
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
|
||||||
|
|
||||||
# Or switch to a different vision model
|
|
||||||
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
|
|
||||||
|
|
||||||
# All use cached frames and transcript!
|
|
||||||
```
|
|
||||||
|
|
||||||
### Traditional Workflow (Separate Steps)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
|
|
||||||
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
|
|
||||||
|
|
||||||
# 2. Process video to extract screen content with vision
|
|
||||||
python process_meeting.py samples/alo-intro1.mkv \
|
|
||||||
--transcript output/alo-intro1.json \
|
|
||||||
--use-vision \
|
|
||||||
--scene-detection
|
|
||||||
|
|
||||||
# 3. Use the enhanced transcript with AI
|
|
||||||
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Example Prompt for Claude
|
### Example Prompt for Claude
|
||||||
@@ -260,64 +189,54 @@ Please summarize this meeting transcript. Pay special attention to:
|
|||||||
```
|
```
|
||||||
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
|
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
|
||||||
[--whisper-model {tiny,base,small,medium,large}]
|
[--whisper-model {tiny,base,small,medium,large}]
|
||||||
[--output OUTPUT] [--output-dir OUTPUT_DIR]
|
[--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
|
||||||
[--frames-dir FRAMES_DIR] [--interval INTERVAL]
|
[--interval INTERVAL] [--scene-detection]
|
||||||
[--scene-detection]
|
[--scene-threshold SCENE_THRESHOLD]
|
||||||
[--ocr-engine {tesseract,easyocr,paddleocr}]
|
[--embed-images] [--embed-quality EMBED_QUALITY]
|
||||||
[--no-deduplicate] [--extract-only]
|
[--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
|
||||||
[--format {detailed,compact}] [--verbose]
|
[--skip-cache-analysis] [--no-deduplicate]
|
||||||
video
|
[--extract-only] [--format {detailed,compact}]
|
||||||
|
[--verbose] video
|
||||||
|
|
||||||
Options:
|
Main Options:
|
||||||
video Path to video file
|
video Path to video file
|
||||||
--transcript, -t Path to Whisper transcript (JSON or TXT)
|
--diarize Use WhisperX with speaker diarization
|
||||||
--run-whisper Run Whisper transcription before processing
|
--embed-images Add frame file references to transcript (recommended)
|
||||||
--whisper-model Whisper model: tiny, base, small, medium, large (default: base)
|
|
||||||
--output, -o Output file for enhanced transcript
|
Frame Extraction:
|
||||||
--output-dir Directory for output files (default: output/)
|
--scene-detection Use FFmpeg scene detection (recommended)
|
||||||
--frames-dir Directory to save extracted frames (default: frames/)
|
--scene-threshold Detection sensitivity 0-100 (default: 15, lower=more sensitive)
|
||||||
--interval Extract frame every N seconds (default: 5)
|
--interval Extract frame every N seconds (alternative to scene detection)
|
||||||
--scene-detection Use scene detection instead of interval extraction
|
|
||||||
--ocr-engine OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
|
Caching:
|
||||||
--no-deduplicate Disable text deduplication
|
--no-cache Force complete reprocessing
|
||||||
--extract-only Only extract frames and OCR, skip transcript merging
|
--skip-cache-frames Re-extract frames only
|
||||||
--format Output format: detailed or compact (default: detailed)
|
--skip-cache-whisper Re-run transcription only
|
||||||
--verbose, -v Enable verbose logging (DEBUG level)
|
--skip-cache-analysis Re-run analysis only
|
||||||
|
|
||||||
|
Other:
|
||||||
|
--run-whisper Run Whisper (without diarization)
|
||||||
|
--whisper-model Whisper model: tiny, base, small, medium, large (default: medium)
|
||||||
|
--transcript, -t Path to existing Whisper transcript (JSON or TXT)
|
||||||
|
--output, -o Output file for enhanced transcript
|
||||||
|
--output-dir Directory for output files (default: output/)
|
||||||
|
--verbose, -v Enable verbose logging
|
||||||
```
|
```
|
||||||
|
|
||||||
## Tips for Best Results
|
## Tips for Best Results
|
||||||
|
|
||||||
### Vision vs OCR: When to Use Each
|
|
||||||
|
|
||||||
**Use Vision Models (`--use-vision`) when:**
|
|
||||||
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
|
|
||||||
- ✅ Code walkthroughs or debugging sessions
|
|
||||||
- ✅ Complex layouts with mixed content
|
|
||||||
- ✅ Need contextual understanding, not just text extraction
|
|
||||||
- ✅ Working with charts, graphs, or visualizations
|
|
||||||
- ⚠️ Trade-off: Slower (requires GPU/CPU for local model)
|
|
||||||
|
|
||||||
**Use OCR when:**
|
|
||||||
- ✅ Simple text extraction from slides or documents
|
|
||||||
- ✅ Need maximum speed
|
|
||||||
- ✅ Limited computational resources
|
|
||||||
- ✅ Presentations with mostly text
|
|
||||||
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
|
|
||||||
|
|
||||||
### Context Hints for Vision Analysis
|
|
||||||
- **`--vision-context meeting`**: General purpose (default)
|
|
||||||
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
|
|
||||||
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
|
|
||||||
- **`--vision-context console`**: Captures commands, output, error messages
|
|
||||||
|
|
||||||
### Scene Detection vs Interval
|
### Scene Detection vs Interval
|
||||||
- **Scene detection**: Better for presentations with distinct slides. More efficient.
|
- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
|
||||||
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
|
- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds.
|
||||||
|
|
||||||
### Vision Model Selection
|
### Scene Detection Threshold
|
||||||
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
|
- Lower values (5-10): More sensitive, captures more frames
|
||||||
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
|
- Default (15): Good balance for most meetings
|
||||||
- **`bakllava`**: Alternative with different strengths
|
- Higher values (20-30): Less sensitive, fewer frames
|
||||||
|
|
||||||
|
### Whisper vs WhisperX
|
||||||
|
- **Whisper** (`--run-whisper`): Standard transcription, fast
|
||||||
|
- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token
|
||||||
|
|
||||||
### Deduplication
|
### Deduplication
|
||||||
- Enabled by default - removes similar consecutive frames
|
- Enabled by default - removes similar consecutive frames
|
||||||
@@ -325,73 +244,75 @@ Options:
|
|||||||
|
|
||||||
## Troubleshooting
|
## Troubleshooting
|
||||||
|
|
||||||
### Vision Model Issues
|
### Frame Extraction Issues
|
||||||
|
|
||||||
**"ollama package not installed"**
|
|
||||||
```bash
|
|
||||||
pip install ollama
|
|
||||||
```
|
|
||||||
|
|
||||||
**"Ollama not found" or connection errors**
|
|
||||||
```bash
|
|
||||||
# Install Ollama first: https://ollama.ai/download
|
|
||||||
# Then pull a vision model:
|
|
||||||
ollama pull llava:13b
|
|
||||||
```
|
|
||||||
|
|
||||||
**Vision analysis is slow**
|
|
||||||
- Use lighter model: `--vision-model llava:7b`
|
|
||||||
- Reduce frame count: `--scene-detection` or `--interval 10`
|
|
||||||
- Check if Ollama is using GPU (much faster)
|
|
||||||
|
|
||||||
**Poor vision analysis results**
|
|
||||||
- Try different context hint: `--vision-context code` or `--vision-context dashboard`
|
|
||||||
- Use larger model: `--vision-model llava:13b`
|
|
||||||
- Ensure frames are clear (check video resolution)
|
|
||||||
|
|
||||||
### OCR Issues
|
|
||||||
|
|
||||||
**"pytesseract not installed"**
|
|
||||||
```bash
|
|
||||||
pip install pytesseract
|
|
||||||
sudo apt-get install tesseract-ocr # Don't forget system package!
|
|
||||||
```
|
|
||||||
|
|
||||||
**Poor OCR quality**
|
|
||||||
- **Solution**: Switch to vision analysis with `--use-vision`
|
|
||||||
- Or try different OCR engine: `--ocr-engine easyocr`
|
|
||||||
- Check if video resolution is sufficient
|
|
||||||
- Use `--no-deduplicate` to keep more frames
|
|
||||||
|
|
||||||
### General Issues
|
|
||||||
|
|
||||||
**"No frames extracted"**
|
**"No frames extracted"**
|
||||||
- Check video file is valid: `ffmpeg -i video.mkv`
|
- Check video file is valid: `ffmpeg -i video.mkv`
|
||||||
- Try lower interval: `--interval 3`
|
- Try lower scene threshold: `--scene-threshold 5`
|
||||||
- Check disk space in frames directory
|
- Try interval extraction: `--interval 3`
|
||||||
|
- Check disk space in output directory
|
||||||
|
|
||||||
**Scene detection not working**
|
**Scene detection not working**
|
||||||
- Fallback to interval extraction automatically
|
|
||||||
- Ensure FFmpeg is installed
|
- Ensure FFmpeg is installed
|
||||||
|
- Falls back to interval extraction automatically
|
||||||
- Try manual interval: `--interval 5`
|
- Try manual interval: `--interval 5`
|
||||||
|
|
||||||
|
### Whisper/WhisperX Issues
|
||||||
|
|
||||||
|
**WhisperX diarization not working**
|
||||||
|
- Ensure you have a HuggingFace token set
|
||||||
|
- Token needs access to pyannote models
|
||||||
|
- Fall back to standard Whisper without `--diarize`
|
||||||
|
|
||||||
|
### Cache Issues
|
||||||
|
|
||||||
**Cache not being used**
|
**Cache not being used**
|
||||||
- Ensure you're using the same video filename
|
- Ensure you're using the same video filename
|
||||||
- Check that output directory contains cached files
|
- Check that output directory contains cached files
|
||||||
- Use `--verbose` to see what's being cached/loaded
|
- Use `--verbose` to see what's being cached/loaded
|
||||||
|
|
||||||
|
**Want to re-run specific steps**
|
||||||
|
- `--skip-cache-frames`: Re-extract frames
|
||||||
|
- `--skip-cache-whisper`: Re-run transcription
|
||||||
|
- `--skip-cache-analysis`: Re-run analysis
|
||||||
|
- `--no-cache`: Force complete reprocessing
|
||||||
|
|
||||||
|
## Experimental Features
|
||||||
|
|
||||||
|
### OCR and Vision Analysis
|
||||||
|
|
||||||
|
OCR (`--ocr-engine`) and Vision analysis (`--use-vision`) options are available but experimental. The recommended approach is to use `--embed-images`, which embeds frame references directly in the transcript, letting your LLM analyze the images.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Experimental: OCR extraction
|
||||||
|
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
|
||||||
|
|
||||||
|
# Experimental: Vision model analysis
|
||||||
|
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
|
||||||
|
|
||||||
|
# Experimental: Hybrid OpenCV + OCR
|
||||||
|
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
|
||||||
|
```
|
||||||
|
|
||||||
## Project Structure
|
## Project Structure
|
||||||
|
|
||||||
```
|
```
|
||||||
meetus/
|
meetus/
|
||||||
├── meetus/ # Main package
|
├── meetus/ # Main package
|
||||||
│ ├── __init__.py
|
│ ├── __init__.py
|
||||||
│ ├── frame_extractor.py # Video frame extraction
|
│ ├── workflow.py # Processing orchestrator
|
||||||
│ ├── ocr_processor.py # OCR processing
|
│ ├── output_manager.py # Output directory & manifest management
|
||||||
│ └── transcript_merger.py # Transcript merging
|
│ ├── cache_manager.py # Caching logic
|
||||||
├── process_meeting.py # Main CLI script
|
│ ├── frame_extractor.py # Video frame extraction (FFmpeg scene detection)
|
||||||
├── requirements.txt # Python dependencies
|
│ ├── vision_processor.py # Vision model analysis (experimental)
|
||||||
└── README.md # This file
|
│ ├── ocr_processor.py # OCR processing (experimental)
|
||||||
|
│ └── transcript_merger.py # Transcript merging
|
||||||
|
├── process_meeting.py # Main CLI script
|
||||||
|
├── requirements.txt # Python dependencies
|
||||||
|
├── output/ # Timestamped output directories
|
||||||
|
│ └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video
|
||||||
|
├── samples/ # Sample videos (gitignored)
|
||||||
|
└── README.md # This file
|
||||||
```
|
```
|
||||||
|
|
||||||
## License
|
## License
|
||||||
|
|||||||
80
def/01-scene-detection-quality-caching.md
Normal file
@@ -0,0 +1,80 @@
|
|||||||
|
# 01 - Scene Detection Sensitivity, Image Quality, and Granular Caching
|
||||||
|
|
||||||
|
## Date
|
||||||
|
2025-10-28
|
||||||
|
|
||||||
|
## Context
|
||||||
|
The last run on the zaca-run-scrapers sample (a Zed editor walkthrough) detected only 19 frames, with gaps of 7+ minutes between them. Whisper did not run because the flag was not passed. JPEG compression quality was too poor for code/text readability.
|
||||||
|
|
||||||
|
## Problems Identified
|
||||||
|
1. **Scene detection too conservative** - Default threshold of 30.0 missed file switches and scrolling in clean UI (Zed vs VS Code)
|
||||||
|
2. **No whisper transcription** - User expected it to run but `--run-whisper` is opt-in
|
||||||
|
3. **Poor JPEG quality** - Default compression made code/text hard to read for OCR/vision
|
||||||
|
4. **Subprocess-based FFmpeg** - Using shell commands instead of Python library
|
||||||
|
5. **All-or-nothing caching** - `--no-cache` regenerates everything including slow whisper transcription
|
||||||
|
|
||||||
|
## Changes Made
|
||||||
|
|
||||||
|
### 1. Scene Detection Sensitivity
|
||||||
|
**Files:** `meetus/frame_extractor.py`, `process_meeting.py`, `meetus/workflow.py`
|
||||||
|
|
||||||
|
- Lowered default threshold: `30.0` → `15.0` (more sensitive for clean UIs)
|
||||||
|
- Added `--scene-threshold` CLI argument (0-100, lower = more sensitive)
|
||||||
|
- Added threshold to manifest for tracking
|
||||||
|
- Updated docstring with usage guidelines:
|
||||||
|
- 15.0: Good for clean UIs like Zed
|
||||||
|
- 20-30: Busy UIs like VS Code
|
||||||
|
- 5-10: Very subtle changes
|
||||||
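To make the threshold semantics concrete, here is a rough sketch of how a 0-100 value could be turned into an FFmpeg scene-change filter. It rests on two assumptions: that the CLI value maps linearly onto FFmpeg's 0.0-1.0 `scene` score, and that a plain argument list is an acceptable stand-in for the ffmpeg-python calls the project actually uses. The function names are hypothetical.

```python
from typing import List


def scene_select_expr(threshold: float) -> str:
    """Convert a 0-100 threshold into an FFmpeg `select` filter expression."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # Assumption: the CLI value is a percentage of FFmpeg's 0.0-1.0 scene score.
    return f"select='gt(scene,{threshold / 100:.3f})'"


def build_scene_detection_args(video: str, out_pattern: str,
                               threshold: float = 15.0) -> List[str]:
    """Assemble (but do not run) an ffmpeg command that keeps scene-change frames."""
    return [
        "ffmpeg", "-i", video,
        "-vf", scene_select_expr(threshold),
        # Emit only the selected frames; newer FFmpeg releases prefer `-fps_mode vfr`.
        "-vsync", "vfr",
        # High JPEG quality (the `-q:v 2` setting mentioned in these notes).
        "-q:v", "2",
        out_pattern,
    ]
```

With this mapping, lowering the threshold from 15 to 5 lowers the required scene score from 0.150 to 0.050, so smaller visual changes are enough to trigger a frame capture.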
|
|
||||||
|
### 2. JPEG Quality Improvements
|
||||||
|
**Files:** `meetus/frame_extractor.py`
|
||||||
|
|
||||||
|
- **Interval extraction**: Added `cv2.IMWRITE_JPEG_QUALITY, 95` (line 60)
|
||||||
|
- **Scene detection**: Added `-q:v 2` to FFmpeg (best quality, line 94)
|
||||||
|
|
||||||
|
### 3. Migration to ffmpeg-python
|
||||||
|
**Files:** `meetus/frame_extractor.py`, `requirements.txt`
|
||||||
|
|
||||||
|
- Replaced `subprocess.run()` with `ffmpeg-python` library
|
||||||
|
- Cleaner, more Pythonic API
|
||||||
|
- Better error handling with `ffmpeg.Error`
|
||||||
|
- Added to requirements.txt
|
||||||
|
|
||||||
|
### 4. Granular Cache Control
|
||||||
|
**Files:** `process_meeting.py`, `meetus/workflow.py`, `meetus/cache_manager.py`
|
||||||
|
|
||||||
|
Added three new flags for selective cache invalidation:
|
||||||
|
- `--skip-cache-frames`: Regenerate frames (useful when tuning scene threshold)
|
||||||
|
- `--skip-cache-whisper`: Rerun whisper transcription
|
||||||
|
- `--skip-cache-analysis`: Rerun OCR/vision analysis
|
||||||
|
|
||||||
|
**Key design:**
|
||||||
|
- `--no-cache`: Still works as before (new directory + regenerate everything)
|
||||||
|
- New flags: Reuse existing output directory but selectively invalidate caches
|
||||||
|
- Frames are cleaned up when regenerating to avoid stale data
|
||||||
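The interaction between `--no-cache` and the three selective flags can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in `meetus/cache_manager.py`: `CachePolicy` and `resolve_cache_policy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    """Which cached artifacts a run may reuse (hypothetical structure)."""
    new_output_dir: bool
    use_cached_frames: bool
    use_cached_whisper: bool
    use_cached_analysis: bool


def resolve_cache_policy(no_cache: bool = False,
                         skip_frames: bool = False,
                         skip_whisper: bool = False,
                         skip_analysis: bool = False) -> CachePolicy:
    """Map CLI flags onto a cache policy."""
    if no_cache:
        # --no-cache: new directory, regenerate everything.
        return CachePolicy(True, False, False, False)
    # Selective flags: keep the existing directory, invalidate only what was asked.
    return CachePolicy(
        new_output_dir=False,
        use_cached_frames=not skip_frames,
        use_cached_whisper=not skip_whisper,
        use_cached_analysis=not skip_analysis,
    )
```

For example, `resolve_cache_policy(skip_frames=True, skip_analysis=True)` keeps the Whisper transcript cached while frames and analysis are regenerated, which matches the threshold-tuning use case.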
|
|
||||||
|
## Typical Workflow
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# First run - generate everything including whisper (expensive, once)
|
||||||
|
python process_meeting.py samples/video.mkv --run-whisper --scene-detection --use-vision
|
||||||
|
|
||||||
|
# Iterate on scene threshold without re-running whisper
|
||||||
|
python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 10 --use-vision --skip-cache-frames --skip-cache-analysis
|
||||||
|
|
||||||
|
# Try even more sensitive
|
||||||
|
python process_meeting.py samples/video.mkv --scene-detection --scene-threshold 5 --use-vision --skip-cache-frames --skip-cache-analysis
|
||||||
|
```
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
- Whisper is the most expensive and reliable step → always cache it during iteration
|
||||||
|
- Scene detection needs tuning per UI style (Zed vs VS Code)
|
||||||
|
- Vision analysis should regenerate when frames change
|
||||||
|
- Walking through code (file switches, scrolling) should trigger scene changes
|
||||||
|
|
||||||
|
## Files Modified
|
||||||
|
- `meetus/frame_extractor.py` - Scene threshold, quality, ffmpeg-python
|
||||||
|
- `meetus/workflow.py` - Cache flags, frame cleanup
|
||||||
|
- `meetus/cache_manager.py` - Granular cache checks
|
||||||
|
- `process_meeting.py` - CLI arguments
|
||||||
|
- `requirements.txt` - Added ffmpeg-python
|
||||||
111
def/02-hybrid-opencv-ocr-llm.md
Normal file
@@ -0,0 +1,111 @@
# 02 - Hybrid OpenCV + OCR + LLM Approach

## Date
2025-10-28

## Context
Vision models (llava) were hallucinating text content badly - showing HTML code when there was none, inventing text that didn't exist. Pure OCR was fast and accurate but lost code formatting and structure.

## Problem
- **Vision models**: Hallucinate text content, can't be trusted for accurate extraction
- **Pure OCR**: Accurate text but messy output, lost indentation/formatting
- **Need**: Accurate text extraction + preserved code structure

## Solution: Three-Stage Hybrid Approach

### Stage 1: OpenCV Text Detection
Use morphological operations to find text regions:
- Adaptive thresholding (handles varying lighting)
- Dilation with a horizontal kernel to connect characters into text lines
- Contour detection to find bounding boxes
- Filter by area and aspect ratio
- Merge overlapping regions

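The last two steps of Stage 1 (filtering and merging) can be sketched without OpenCV. This is a minimal illustration, not the code in `meetus/hybrid_processor.py`: the `(x, y, w, h)` box layout is the one `cv2.boundingRect` returns, while `filter_boxes`, `merge_overlapping`, and the threshold values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), as returned by cv2.boundingRect


def filter_boxes(boxes: List[Box], min_area: int = 100,
                 min_aspect: float = 1.5) -> List[Box]:
    """Keep boxes that are large enough and wider than tall (text-line shaped).

    The thresholds are illustrative assumptions, not tuned values.
    """
    return [
        (x, y, w, h) for (x, y, w, h) in boxes
        if w * h >= min_area and h > 0 and w / h >= min_aspect
    ]


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def merge_overlapping(boxes: List[Box]) -> List[Box]:
    """Repeatedly union overlapping boxes until no two boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, other in enumerate(result):
                if overlaps(box, other):
                    # Replace the stored box with the union of both rectangles
                    x1 = min(box[0], other[0])
                    y1 = min(box[1], other[1])
                    x2 = max(box[0] + box[2], other[0] + other[2])
                    y2 = max(box[1] + box[3], other[1] + other[3])
                    result[i] = (x1, y1, x2 - x1, y2 - y1)
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged
```

The merge loop runs until a full pass makes no change, so chains of boxes that only overlap transitively still collapse into one region.
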
### Stage 2: Region-Based OCR
- Sort regions by reading order (top-to-bottom, left-to-right)
- Crop each region from original image
- Run OCR on cropped regions (more accurate than full frame)
- Tesseract with PSM 6 mode to preserve layout
- Preserve indentation in cleaning step

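Reading-order sorting can be sketched as grouping boxes into rows by their top edge and then ordering each row left to right. This is an assumed approach for illustration; `sort_reading_order` and the `row_tolerance` value are hypothetical and not taken from `HybridProcessor`.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def sort_reading_order(boxes: List[Box], row_tolerance: int = 10) -> List[Box]:
    """Sort boxes top-to-bottom, then left-to-right within a row.

    A box joins the current row when its top edge is within `row_tolerance`
    pixels of the first box in that row. The tolerance is an illustrative
    assumption.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]
```

A plain sort on `(y, x)` would put a right-hand region that sits two pixels higher before its left-hand neighbour, which is why the rows are grouped first.
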
### Stage 3: Optional LLM Cleanup
- Take accurate OCR output (no hallucination)
- Use lightweight LLM (llama3.2:3b for speed) to:
  - Fix obvious OCR errors (l→1, O→0)
  - Restore code indentation and structure
  - Preserve exact text content
  - No added explanations or hallucinated content

## Benefits
✓ **Accurate**: OCR reads actual pixels, no hallucination
✓ **Fast**: OpenCV detection is instant, focused OCR is quick
✓ **Structured**: Regions separated with headers showing position
✓ **Formatted**: Optional LLM cleanup preserves/restores code structure
✓ **Deterministic**: Same input = same output for the OpenCV + OCR stages (unlike vision models)

## Implementation

**New file:** `meetus/hybrid_processor.py`
- `HybridProcessor` class with OpenCV detection + OCR + optional LLM
- Region sorting for proper reading order
- Visual separators between regions

**CLI flags:**
```bash
--use-hybrid              # Enable hybrid mode
--hybrid-llm-cleanup      # Add LLM post-processing (optional)
--hybrid-llm-model MODEL  # LLM model (default: llama3.2:3b)
```

**OCR improvements:**
- Tesseract PSM 6 mode for better layout preservation
- Modified text cleaning to keep indentation
- `preserve_layout` parameter

## Usage

```bash
# Basic hybrid (OpenCV + OCR)
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection

# With LLM cleanup for best code formatting
python process_meeting.py samples/video.mkv --use-hybrid --hybrid-llm-cleanup --scene-detection -v

# Iterate on threshold
python process_meeting.py samples/video.mkv --use-hybrid --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
```

## Output Format

```
[Region 1 at y=120]
function calculateTotal(items) {
  return items.reduce((sum, item) => sum + item.price, 0);
}

============================================================

[Region 2 at y=450]
const result = calculateTotal(cartItems);
console.log('Total:', result);
```

## Performance
- **Without LLM cleanup**: Very fast (~2-3s per frame)
- **With LLM cleanup**: Slower but still faster than vision models (~5-8s per frame)
- **Accuracy**: Much better than vision model hallucinations

## When to Use What

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Hybrid** | Code/terminal text extraction | Accurate, fast, no hallucination | Formatting may be messy |
| **Hybrid + LLM** | Code with preserved structure | Accurate + formatted | Slower, needs Ollama |
| **Vision** | Understanding layout/context | Semantic understanding | Hallucinates text |
| **Pure OCR** | Simple text, no structure needed | Fast, simple | Full-frame, no region detection |

## Files Modified
- `meetus/hybrid_processor.py` - New hybrid processor
- `meetus/ocr_processor.py` - Layout preservation
- `meetus/workflow.py` - Hybrid mode integration
- `process_meeting.py` - CLI flags and examples
100
def/03-embed-images-for-llm.md
Normal file
@@ -0,0 +1,100 @@
# 03 - Embed Images for LLM Analysis

## Date
2025-10-28

## Context
Hybrid OCR approach was fast and accurate but formatting was messy. Vision models hallucinated text. Rather than fighting with text extraction, a better approach is to embed the actual frame images in the enhanced transcript and let the end-user's LLM analyze them with full audio context.

## Problem
- OCR/vision models either hallucinate or produce messy text
- Code formatting/indentation is hard to preserve
- User wants to analyze frames with their own LLM (Claude, GPT, etc.)
- Need to keep file size reasonable (~200KB per image is too big)

## Solution: Image Embedding

Instead of extracting text, embed the actual frame images as base64 in the enhanced transcript. The LLM can then:
- See the actual screen content (no hallucination)
- Understand code structure, layout, and formatting visually
- Have full audio transcript context for each frame
- Analyze dashboards, terminals, editors with perfect accuracy

## Implementation

**Quality Optimization:**
- Default JPEG quality: 80 (good tradeoff between size and readability)
- Configurable via `--embed-quality` (0-100)
- Typical sizes at quality 80: ~40-80KB per image (vs 200KB original)

**Format:**
```
[MM:SS] SPEAKER:
Audio transcript text here

[MM:SS] SCREEN CONTENT:
IMAGE (base64, 52KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>

TEXT:
| Optional OCR text for reference
```

**Features:**
- Base64 encoding for easy embedding
- Size tracking and reporting
- Optional text content alongside images
- Works with scene detection for smart frame selection

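A minimal sketch of how one `SCREEN CONTENT` entry in the format above could be produced. `format_screen_block` is a hypothetical helper, not the function in `meetus/transcript_merger.py`, and reporting the size of the JPEG bytes (rather than of the base64 text) is an assumption.

```python
import base64


def format_screen_block(timestamp_s: float, jpeg_bytes: bytes, text: str = "") -> str:
    """Render one SCREEN CONTENT entry with the image inlined as a data URI."""
    minutes, seconds = divmod(int(timestamp_s), 60)
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    size_kb = len(jpeg_bytes) // 1024  # assumed: size of the JPEG, not of the base64 text
    lines = [
        f"[{minutes:02d}:{seconds:02d}] SCREEN CONTENT:",
        f"IMAGE (base64, {size_kb}KB):",
        f"<image>data:image/jpeg;base64,{encoded}</image>",
    ]
    if text:
        # Optional OCR text, one "| "-prefixed line per source line
        lines.append("TEXT:")
        lines.extend(f"| {line}" for line in text.splitlines())
    return "\n".join(lines)
```

JPEG files start with the bytes `FF D8 FF`, which is why every embedded image begins with `/9j/` after encoding.
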
## Usage

```bash
# Basic: Embed images at quality 80 (default)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection --no-cache -v

# Lower quality for smaller files (still readable)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection --no-cache -v

# Higher quality for detailed code
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 90 --scene-detection --no-cache -v

# Iterate on scene threshold (reuse whisper)
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis -v
```

## File Sizes

**Example for 20 frames:**
- Quality 60: ~30-50KB per image = 0.6-1MB total
- Quality 80: ~40-80KB per image = 0.8-1.6MB total (recommended)
- Quality 90: ~80-120KB per image = 1.6-2.4MB total
- Original: ~200KB per image = 4MB total

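Assuming the per-image figures above are JPEG file sizes, the embedded text ends up somewhat larger, because base64 encodes every 3 bytes as 4 characters. A rough estimate (the helper name `embedded_size_kb` is hypothetical):

```python
import math


def embedded_size_kb(frame_count: int, avg_jpeg_kb: float) -> float:
    """Approximate size of the base64 text for `frame_count` embedded frames.

    Base64 turns every 3 input bytes into 4 output characters, so the
    embedded text is about 4/3 the size of the raw JPEG data.
    """
    raw_bytes = frame_count * avg_jpeg_kb * 1024
    encoded_bytes = 4 * math.ceil(raw_bytes / 3)
    return encoded_bytes / 1024
```

For example, 20 frames averaging 60KB come to about 1600KB of base64 text, compared with 1200KB of JPEG data.
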
## Benefits

✓ **No hallucination**: LLM sees actual pixels
✓ **Perfect formatting**: Code structure preserved visually
✓ **Full context**: Audio transcript + visual frame together
✓ **User's choice**: Use your preferred LLM (Claude, GPT, etc.)
✓ **Reasonable size**: Quality 80 gives roughly 2.5-5x smaller files than the ~200KB originals
✓ **Simple workflow**: One file contains everything

## Use Cases

**Code walkthroughs:** LLM can see actual code structure and indentation
**Dashboard analysis:** Charts, graphs, metrics visible to LLM
**Terminal sessions:** Commands and output in proper context
**UI reviews:** Actual interface visible with audio commentary

## Files Modified

- `meetus/transcript_merger.py` - Image encoding and embedding
- `meetus/workflow.py` - Wire through config
- `process_meeting.py` - CLI flags
- `meetus/output_manager.py` - Cleaner directory naming (date + increment)

## Output Directory Naming

Also changed output directory format for clarity:
- Old: `20251028_054553-video` (confusing timestamps)
- New: `20251028-001-video` (clear date + run number)
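The date + run number scheme can be sketched as follows. This is an illustration only: `next_run_dir` is a hypothetical helper, and counting runs per day across all videos (rather than per video) is an assumption, not something confirmed by `meetus/output_manager.py`.

```python
from datetime import date
from pathlib import Path


def next_run_dir(output_root: Path, video_name: str, today: date) -> Path:
    """Return `<root>/<YYYYMMDD>-<NNN>-<video_name>` with the next free run number."""
    prefix = today.strftime("%Y%m%d")
    # Collect the run numbers already used for this date
    existing = [
        int(p.name.split("-")[1])
        for p in output_root.glob(f"{prefix}-*")
        if p.is_dir() and p.name.split("-")[1].isdigit()
    ]
    run_number = max(existing, default=0) + 1
    return output_root / f"{prefix}-{run_number:03d}-{video_name}"
```

The function only computes the path; creating the directory is left to the caller.
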
78
def/04-fix-whisper-cache-loading.md
Normal file
@@ -0,0 +1,78 @@
# 04 - Fix Whisper Cache Loading

## Date
2025-10-28

## Problem
Enhanced transcript was not including the audio segments from cached whisper transcripts when running without the `--run-whisper` flag.

Example command that failed:
```bash
python process_meeting.py samples/zaca-run-scrapers.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
```

Result: Enhanced transcript only contained embedded images, no audio segments (0 "SPEAKER" entries).

## Root Cause
In `workflow.py`, the `_run_whisper()` method was checking the `run_whisper` flag **before** checking the cache:

```python
def _run_whisper(self) -> Optional[str]:
    if not self.config.run_whisper:
        return self.config.transcript_path  # Returns None if --transcript not specified

    # Cache check NEVER REACHED if run_whisper is False
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)
```

This meant:
- User runs command without `--run-whisper`
- Method returns None immediately
- Cached whisper transcript is never discovered
- No audio segments in enhanced output

## Solution
Reorder the logic to check cache **first**, regardless of flags:

```python
def _run_whisper(self) -> Optional[str]:
    """Run Whisper transcription if requested, or use cached/provided transcript."""
    # First, check cache (regardless of run_whisper flag)
    cached = self.cache_mgr.get_whisper_cache()
    if cached:
        return str(cached)

    # If no cache and not running whisper, use provided transcript path (if any)
    if not self.config.run_whisper:
        return self.config.transcript_path

    # If no cache and run_whisper is True, run whisper transcription
    # ... rest of whisper code
```

## New Behavior
1. Cache is checked first (regardless of `--run-whisper` flag)
2. If cached whisper exists, use it
3. If no cache and `--run-whisper` not specified, use `--transcript` path (or None)
4. If no cache and `--run-whisper` specified, run whisper

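The four rules can be checked in isolation with a small pure function. This is a sketch for illustration: `resolve_transcript` and its `transcribe` callback are hypothetical stand-ins for the method and the Whisper call in `meetus/workflow.py`.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_transcript(
    cached: Optional[Path],
    run_whisper: bool,
    transcript_path: Optional[str],
    transcribe: Callable[[], str],
) -> Optional[str]:
    """Pick the transcript source using the order listed above."""
    if cached is not None:   # rules 1-2: the cache wins, whatever the flags say
        return str(cached)
    if not run_whisper:      # rule 3: fall back to --transcript (may be None)
        return transcript_path
    return transcribe()      # rule 4: no cache and --run-whisper given
```

Passing the transcription step in as a callback keeps the decision logic testable without running Whisper.
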
## Benefits
✓ Cached whisper transcripts are always discovered and used
✓ User can iterate on frame extraction/analysis without re-running whisper
✓ Enhanced transcripts now properly include both audio + visual content
✓ Granular cache flags (`--skip-cache-frames`, `--skip-cache-whisper`) work as expected

## Use Case
```bash
# First run: Generate whisper transcript + extract frames
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v

# Second run: Iterate on scene threshold without re-running whisper
python process_meeting.py samples/video.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames -v
# Now correctly includes cached whisper transcript in enhanced output!
```

## Files Modified
- `meetus/workflow.py` - Reordered logic in `_run_whisper()` method (lines 172-181)
124
def/05-reference-frames-instead-of-embedding.md
Normal file
@@ -0,0 +1,124 @@
# 05 - Reference Frame Files Instead of Embedding

## Date
2025-10-28

## Context
Embedding base64 images made the enhanced transcript files very large (3.7MB for ~40 frames). This made them harder to work with and slower to process.

## Problem
- Enhanced transcript with embedded base64 images was 3.7MB
- Large file size makes it slow to read/process
- Difficult to inspect individual frames
- Harder to share and version control

## Solution: Reference Frame Paths
Instead of embedding base64 image data, reference the frame files by their relative paths.

### Before (Embedded):
```
[00:08] SCREEN CONTENT:
IMAGE (base64, 85KB):
<image>data:image/jpeg;base64,/9j/4AAQSkZJRg...</image>
```
File size: 3.7MB

### After (Referenced):
```
[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg
```
File size: ~50KB

## Implementation

**Directory Structure:**
```
output/20251028-003-zaca-run-scrapers/
├── frames/
│   ├── zaca-run-scrapers_00257.jpg
│   ├── zaca-run-scrapers_00487.jpg
│   └── ...
├── zaca-run-scrapers.json (whisper transcript)
└── zaca-run-scrapers_enhanced.txt (references frames/ directory)
```

**Enhanced Transcript Format:**
```
================================================================================
ENHANCED MEETING TRANSCRIPT
Audio transcript + Screen frames
================================================================================

[00:30] SPEAKER:
Bueno, te dio un tour para el proyecto...

[00:08] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00257.jpg

[01:00] SPEAKER:
Mayormente en Scrapping lo que tenemos...

[01:15] SCREEN CONTENT:
Frame: frames/zaca-run-scrapers_00487.jpg
TEXT:
| Code snippet from screen (if OCR was used)
```

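A minimal sketch of producing one referenced `SCREEN CONTENT` entry. `format_frame_reference` is a hypothetical helper for illustration; the actual formatting lives in `_format_detailed()` in `meetus/transcript_merger.py`, which is not shown here.

```python
from pathlib import Path


def format_frame_reference(timestamp_s: float, frame_path: Path, output_dir: Path,
                           text: str = "") -> str:
    """Render one SCREEN CONTENT entry that points at the frame file."""
    minutes, seconds = divmod(int(timestamp_s), 60)
    # Store the path relative to the output directory so the folder can be moved
    relative = frame_path.relative_to(output_dir).as_posix()
    lines = [f"[{minutes:02d}:{seconds:02d}] SCREEN CONTENT:", f"Frame: {relative}"]
    if text:
        lines.append("TEXT:")
        lines.extend(f"| {line}" for line in text.splitlines())
    return "\n".join(lines)
```

Keeping the path relative means the transcript and its `frames/` directory stay valid when the whole output folder is copied or shared.
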
## Benefits

✓ **Much smaller files**: ~50KB vs 3.7MB (74x smaller!)
✓ **Easier to inspect**: Can view individual frames directly
✓ **LLM can access images**: Frame paths allow LLM to load images on demand
✓ **Better version control**: Text files are small and diffable
✓ **Cleaner structure**: Frames organized in dedicated directory
✓ **Flexible**: Can still do OCR/vision analysis if needed (adds TEXT section)

## Flags

**`--embed-images`**: Skip OCR/vision analysis, just reference frame files
- Faster (no analysis needed)
- Lets LLM analyze raw images
- Enhanced transcript only contains frame references

**Without `--embed-images`**: Run OCR/vision analysis
- Extracts text from frames
- Enhanced transcript includes both frame reference AND extracted text
- Useful for code/dashboard analysis

## Usage

```bash
# Reference frames only (no OCR, faster)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --scene-detection -v

# Reference frames + OCR text extraction
python process_meeting.py samples/video.mkv --run-whisper --use-hybrid --scene-detection -v

# Adjust frame quality (smaller files)
python process_meeting.py samples/video.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection -v
```

## Files Modified

- `meetus/transcript_merger.py` - Modified `_format_detailed()` to output frame paths instead of base64
- `process_meeting.py` - Updated help text and examples to reflect frame referencing
- All processors (OCR, vision, hybrid) already include `frame_path` in results (no changes needed)

## Workflow Example

```bash
# First run: Generate everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection -v

# Result:
# - output/20251028-004-meeting/
#   - frames/ (40 frames, ~80KB each)
#   - meeting.json (whisper transcript)
#   - meeting_enhanced.txt (~50KB, references frames/)

# LLM can now:
# 1. Read enhanced transcript
# 2. See timeline of audio + screen changes
# 3. Load individual frames as needed from frames/ directory
```
162
meetus/cache_manager.py
Normal file
@@ -0,0 +1,162 @@
"""
Manage caching for frames, transcripts, and analysis results.
"""
from pathlib import Path
import json
import logging
from typing import List, Tuple, Dict, Optional

logger = logging.getLogger(__name__)


class CacheManager:
    """Manage caching of intermediate processing results."""

    def __init__(self, output_dir: Path, frames_dir: Path, video_name: str, use_cache: bool = True,
                 skip_cache_frames: bool = False, skip_cache_whisper: bool = False,
                 skip_cache_analysis: bool = False):
        """
        Initialize cache manager.

        Args:
            output_dir: Output directory for cached files
            frames_dir: Directory for cached frames
            video_name: Name of the video (stem)
            use_cache: Whether to use caching globally
            skip_cache_frames: Skip cached frames specifically
            skip_cache_whisper: Skip cached whisper specifically
            skip_cache_analysis: Skip cached analysis specifically
        """
        self.output_dir = output_dir
        self.frames_dir = frames_dir
        self.video_name = video_name
        self.use_cache = use_cache
        self.skip_cache_frames = skip_cache_frames
        self.skip_cache_whisper = skip_cache_whisper
        self.skip_cache_analysis = skip_cache_analysis

    def get_whisper_cache(self) -> Optional[Path]:
        """
        Check for cached Whisper transcript.

        Returns:
            Path to cached transcript or None
        """
        if not self.use_cache or self.skip_cache_whisper:
            return None

        cache_path = self.output_dir / f"{self.video_name}.json"
        if cache_path.exists():
            logger.info(f"✓ Found cached Whisper transcript: {cache_path.name}")

            # Debug: Show cached transcript info
            try:
                import json
                with open(cache_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)
                    if 'segments' in data:
                        logger.debug(f"Cached transcript has {len(data['segments'])} segments")
            except Exception as e:
                logger.debug(f"Could not parse cached whisper for debug: {e}")

            return cache_path

        return None

    def get_frames_cache(self) -> Optional[List[Tuple[str, float]]]:
        """
        Check for cached frames.

        Returns:
            List of (frame_path, timestamp) tuples or None
        """
        if not self.use_cache or self.skip_cache_frames or not self.frames_dir.exists():
            return None

        existing_frames = list(self.frames_dir.glob("*.jpg"))

        if not existing_frames:
            return None

        logger.info(f"✓ Found {len(existing_frames)} cached frames in {self.frames_dir.name}/")
        logger.debug(f"Frame filenames: {[f.name for f in sorted(existing_frames)[:3]]}...")

        # Build frames_info from existing files
        frames_info = []
        for frame_path in sorted(existing_frames):
            # Try to extract timestamp from filename (e.g., frame_00001_12.34s.jpg)
            try:
                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
                timestamp = float(timestamp_str)
            except ValueError:
                timestamp = 0.0
            frames_info.append((str(frame_path), timestamp))

        return frames_info

    def get_analysis_cache(self, analysis_type: str) -> Optional[List[Dict]]:
        """
        Check for cached analysis results.

        Args:
            analysis_type: 'vision' or 'ocr'

        Returns:
            List of analysis results or None
        """
        if not self.use_cache or self.skip_cache_analysis:
            return None

        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"

        if cache_path.exists():
            logger.info(f"✓ Found cached {analysis_type} analysis: {cache_path.name}")
            with open(cache_path, 'r', encoding='utf-8') as f:
                results = json.load(f)
            logger.info(f"✓ Loaded {len(results)} analyzed frames from cache")

            # Debug: Show first cached result
            if results:
                logger.debug(f"First cached result: timestamp={results[0].get('timestamp')}, text_length={len(results[0].get('text', ''))}")

            return results

        return None

    def save_analysis(self, analysis_type: str, results: List[Dict]):
        """
        Save analysis results to cache.

        Args:
            analysis_type: 'vision' or 'ocr'
            results: Analysis results to save
        """
        cache_path = self.output_dir / f"{self.video_name}_{analysis_type}.json"

        with open(cache_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

        logger.info(f"✓ Saved {analysis_type} analysis to: {cache_path.name}")

    def cache_exists(self, analysis_type: Optional[str] = None) -> Dict[str, bool]:
        """
        Check what caches exist.

        Args:
            analysis_type: Optional specific analysis type to check

        Returns:
            Dictionary of cache status
        """
        status = {
            "whisper": (self.output_dir / f"{self.video_name}.json").exists(),
            "frames": len(list(self.frames_dir.glob("*.jpg"))) > 0 if self.frames_dir.exists() else False,  # same pattern as get_frames_cache()
        }

        if analysis_type:
            status[analysis_type] = (self.output_dir / f"{self.video_name}_{analysis_type}.json").exists()
        else:
            status["vision"] = (self.output_dir / f"{self.video_name}_vision.json").exists()
            status["ocr"] = (self.output_dir / f"{self.video_name}_ocr.json").exists()

        return status
@@ -6,9 +6,9 @@ import cv2
 import os
 from pathlib import Path
 from typing import List, Tuple, Optional
-import subprocess
 import json
 import logging
+import re
 
 logger = logging.getLogger(__name__)
 
@@ -16,17 +16,19 @@ logger = logging.getLogger(__name__)
 class FrameExtractor:
     """Extract frames from video files."""
 
-    def __init__(self, video_path: str, output_dir: str = "frames"):
+    def __init__(self, video_path: str, output_dir: str = "frames", quality: int = 75):
         """
         Initialize frame extractor.
 
         Args:
             video_path: Path to video file
             output_dir: Directory to save extracted frames
+            quality: JPEG quality for saved frames (0-100)
         """
         self.video_path = video_path
         self.output_dir = Path(output_dir)
         self.output_dir.mkdir(parents=True, exist_ok=True)
+        self.quality = quality
 
     def extract_by_interval(self, interval_seconds: int = 5) -> List[Tuple[str, float]]:
         """
@@ -56,7 +58,16 @@ class FrameExtractor:
                 frame_filename = f"frame_{saved_count:05d}_{timestamp:.2f}s.jpg"
                 frame_path = self.output_dir / frame_filename
 
-                cv2.imwrite(str(frame_path), frame)
+                # Downscale to 1600px width for smaller file size (but still readable)
+                height, width = frame.shape[:2]
+                if width > 1600:
+                    ratio = 1600 / width
+                    new_width = 1600
+                    new_height = int(height * ratio)
+                    frame = cv2.resize(frame, (new_width, new_height), interpolation=cv2.INTER_LANCZOS4)
+
+                # Save with configured quality (matches embed quality)
+                cv2.imwrite(str(frame_path), frame, [cv2.IMWRITE_JPEG_QUALITY, self.quality])
                 frames_info.append((str(frame_path), timestamp))
                 saved_count += 1
 
@@ -66,48 +77,80 @@ class FrameExtractor:
|
|||||||
logger.info(f"Extracted {saved_count} frames at {interval_seconds}s intervals")
|
logger.info(f"Extracted {saved_count} frames at {interval_seconds}s intervals")
|
||||||
return frames_info
|
return frames_info
|
||||||
|
|
||||||
def extract_scene_changes(self, threshold: float = 30.0) -> List[Tuple[str, float]]:
|
def extract_scene_changes(self, threshold: float = 15.0) -> List[Tuple[str, float]]:
|
||||||
"""
|
"""
|
||||||
Extract frames only on scene changes using FFmpeg.
|
Extract frames only on scene changes using FFmpeg.
|
||||||
More efficient than interval-based extraction.
|
More efficient than interval-based extraction.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
threshold: Scene change detection threshold (0-100, lower = more sensitive)
|
threshold: Scene change detection threshold (0-100, lower = more sensitive)
|
||||||
|
Default: 15.0 (good for clean UIs like Zed)
|
||||||
|
Higher values (20-30) for busy UIs like VS Code
|
||||||
|
Lower values (5-10) for very subtle changes
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
List of (frame_path, timestamp) tuples
|
List of (frame_path, timestamp) tuples
|
||||||
"""
|
"""
|
||||||
|
try:
|
||||||
|
import ffmpeg
|
||||||
|
except ImportError:
|
||||||
|
raise ImportError("ffmpeg-python not installed. Run: pip install ffmpeg-python")
|
||||||
|
|
||||||
video_name = Path(self.video_path).stem
|
video_name = Path(self.video_path).stem
|
||||||
output_pattern = self.output_dir / f"{video_name}_%05d.jpg"
|
output_pattern = self.output_dir / f"{video_name}_%05d.jpg"
|
||||||
|
|
||||||
# Use FFmpeg's scene detection filter
|
|
||||||
cmd = [
|
|
||||||
'ffmpeg',
|
|
||||||
'-i', self.video_path,
|
|
||||||
'-vf', f'select=gt(scene\\,{threshold/100}),showinfo',
|
|
||||||
'-vsync', 'vfr',
|
|
||||||
'-frame_pts', '1',
|
|
||||||
str(output_pattern),
|
|
||||||
'-loglevel', 'info'
|
|
||||||
]
|
|
||||||
|
|
||||||
try:
|
try:
|
||||||
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
|
# Use FFmpeg's scene detection filter with downscaling
|
||||||
|
stream = ffmpeg.input(self.video_path)
|
||||||
|
stream = ffmpeg.filter(stream, 'select', f'gt(scene,{threshold/100})')
|
||||||
|
stream = ffmpeg.filter(stream, 'showinfo')
|
||||||
|
# Scale to 1600px width (maintains aspect ratio, still readable)
|
||||||
|
# Use simple conditional: if width > 1600, scale to 1600, else keep original
|
||||||
|
stream = ffmpeg.filter(stream, 'scale', w='min(1600,iw)', h=-1)
|
||||||
|
|
||||||
# Parse output to get frame timestamps
|
# Convert JPEG quality (0-100) to FFmpeg qscale (2-31, lower=better)
|
||||||
|
# Rough mapping: qscale ≈ (100 - quality) / 10, clamped to 2-31
|
||||||
|
qscale = max(2, min(31, int((100 - self.quality) / 10 + 2)))
|
||||||
|
|
||||||
|
stream = ffmpeg.output(
|
||||||
|
stream,
|
||||||
|
str(output_pattern),
|
||||||
|
vsync='vfr',
|
||||||
|
frame_pts=1,
|
||||||
|
**{'q:v': str(qscale)} # Matches configured quality
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run with stderr capture to get showinfo output
|
||||||
|
_, stderr = ffmpeg.run(stream, capture_stderr=True, overwrite_output=True)
|
||||||
|
stderr = stderr.decode('utf-8')
|
||||||
|
|
||||||
|
# Parse FFmpeg output to get frame timestamps from showinfo filter
|
||||||
frames_info = []
|
frames_info = []
|
||||||
for img in sorted(self.output_dir.glob(f"{video_name}_*.jpg")):
|
|
||||||
# Extract timestamp from filename or use FFprobe
|
# Extract timestamps from stderr (showinfo outputs there)
|
||||||
frames_info.append((str(img), 0.0)) # Timestamp extraction can be enhanced
|
timestamp_pattern = r'pts_time:([\d.]+)'
|
||||||
|
timestamps = re.findall(timestamp_pattern, stderr)
|
||||||
|
|
||||||
|
# Match frames to timestamps
|
||||||
|
frame_files = sorted(self.output_dir.glob(f"{video_name}_*.jpg"))
|
||||||
|
|
||||||
|
for idx, img in enumerate(frame_files):
|
||||||
|
# Use extracted timestamp or fallback to index-based estimate
|
||||||
|
timestamp = float(timestamps[idx]) if idx < len(timestamps) else idx * 5.0
|
||||||
|
frames_info.append((str(img), timestamp))
|
||||||
|
|
||||||
logger.info(f"Extracted {len(frames_info)} frames at scene changes")
|
logger.info(f"Extracted {len(frames_info)} frames at scene changes")
|
||||||
return frames_info
|
return frames_info
|
||||||
|
|
||||||
except subprocess.CalledProcessError as e:
|
except ffmpeg.Error as e:
|
||||||
logger.error(f"FFmpeg error: {e.stderr}")
|
logger.error(f"FFmpeg error: {e.stderr.decode() if e.stderr else str(e)}")
|
||||||
# Fallback to interval extraction
|
# Fallback to interval extraction
|
||||||
logger.warning("Falling back to interval extraction...")
|
logger.warning("Falling back to interval extraction...")
|
||||||
return self.extract_by_interval()
|
return self.extract_by_interval()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Unexpected error during scene extraction: {e}")
|
||||||
|
logger.warning("Falling back to interval extraction...")
|
||||||
|
return self.extract_by_interval()
|
||||||
|
|
||||||
def get_video_duration(self) -> float:
|
def get_video_duration(self) -> float:
|
||||||
"""Get video duration in seconds."""
|
"""Get video duration in seconds."""
|
||||||
|
|||||||
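The quality mapping and the timestamp matching in the hunk above can be tried in isolation. This is a minimal sketch, not the project's code: `quality_to_qscale` and `match_frames_to_timestamps` are hypothetical helper names, and the sample stderr is illustrative (real `showinfo` lines carry many more fields).

```python
import re
from typing import List, Tuple

# Same pattern the diff uses to pull timestamps out of showinfo output.
PTS_TIME = re.compile(r"pts_time:([\d.]+)")


def quality_to_qscale(quality: int) -> int:
    """Map a 0-100 JPEG quality to an FFmpeg q:v value (lower is better)."""
    return max(2, min(31, int((100 - quality) / 10 + 2)))


def match_frames_to_timestamps(
    frame_files: List[str], stderr: str, fallback_step: float = 5.0
) -> List[Tuple[str, float]]:
    """Pair frame files with showinfo timestamps in order of appearance.

    Frames without a parsed timestamp get an index-based estimate.
    """
    timestamps = PTS_TIME.findall(stderr)
    return [
        (path, float(timestamps[i]) if i < len(timestamps) else i * fallback_step)
        for i, path in enumerate(frame_files)
    ]


# Illustrative stderr only (assumed shape, shortened).
sample_stderr = (
    "[Parsed_showinfo_1 @ 0x1] n:   0 pts:  12800 pts_time:12.8 fmt:yuv420p\n"
    "[Parsed_showinfo_1 @ 0x1] n:   1 pts:  47250 pts_time:47.25 fmt:yuv420p\n"
)
frames = ["demo_1.jpg", "demo_2.jpg", "demo_3.jpg"]
pairs = match_frames_to_timestamps(frames, sample_stderr)
```

Two things are worth noting. With this formula the result never exceeds 12 (at quality 0), so the upper clamp of 31 is not reached in practice. The pairing also assumes the sorted file list is in the same order as the `showinfo` lines; if the output filenames are not zero-padded, a lexicographic sort may not preserve that order.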
355
meetus/hybrid_processor.py
Normal file
@@ -0,0 +1,355 @@
|
|||||||
|
"""
|
||||||
|
Hybrid frame analysis: OpenCV text detection + OCR for accurate extraction.
|
||||||
|
Better than pure vision models, which tend to hallucinate text content.
|
||||||
|
"""
|
||||||
|
from typing import List, Tuple, Dict, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
import logging
|
||||||
|
import cv2
|
||||||
|
import numpy as np
|
||||||
|
from difflib import SequenceMatcher
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class HybridProcessor:
|
||||||
|
"""Combine OpenCV text detection with OCR for accurate text extraction."""
|
||||||
|
|
||||||
|
def __init__(self, ocr_engine: str = "tesseract", min_confidence: float = 0.5,
|
||||||
|
use_llm_cleanup: bool = False, llm_model: Optional[str] = None):
|
||||||
|
"""
|
||||||
|
Initialize hybrid processor.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
ocr_engine: OCR engine to use ('tesseract', 'easyocr', 'paddleocr')
|
||||||
|
min_confidence: Minimum confidence for text detection (0-1)
|
||||||
|
use_llm_cleanup: Use LLM to clean up OCR output and preserve formatting
|
||||||
|
llm_model: Ollama model for cleanup (default: llama3.2:3b for speed)
|
||||||
|
"""
|
||||||
|
from .ocr_processor import OCRProcessor
|
||||||
|
|
||||||
|
self.ocr = OCRProcessor(engine=ocr_engine)
|
||||||
|
self.min_confidence = min_confidence
|
||||||
|
self.use_llm_cleanup = use_llm_cleanup
|
||||||
|
self.llm_model = llm_model or "llama3.2:3b"
|
||||||
|
self._llm_client = None
|
||||||
|
|
||||||
|
if use_llm_cleanup:
|
||||||
|
self._init_llm()
|
||||||
|
|
||||||
|
def _init_llm(self):
|
||||||
|
"""Initialize Ollama client for LLM cleanup."""
|
||||||
|
try:
|
||||||
|
import ollama
|
||||||
|
self._llm_client = ollama
|
||||||
|
logger.info(f"LLM cleanup enabled using {self.llm_model}")
|
||||||
|
except ImportError:
|
||||||
|
logger.warning("ollama package not installed. LLM cleanup disabled.")
|
||||||
|
self.use_llm_cleanup = False
|
||||||
|
|
||||||
|
def _cleanup_with_llm(self, raw_text: str) -> str:
|
||||||
|
"""
|
||||||
|
Use LLM to clean up OCR output and preserve code formatting.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
raw_text: Raw OCR output
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Cleaned up text with proper formatting
|
||||||
|
"""
|
||||||
|
if not self.use_llm_cleanup or not self._llm_client:
|
||||||
|
return raw_text
|
||||||
|
|
||||||
|
prompt = """You are cleaning up OCR output from a code editor screenshot.
|
||||||
|
|
||||||
|
Your task:
|
||||||
|
1. Fix any obvious OCR errors (l→1, O→0, etc.)
|
||||||
|
2. Preserve or restore code indentation and structure
|
||||||
|
3. Keep the exact text content - don't add explanations or comments
|
||||||
|
4. If it's code, maintain proper spacing and formatting
|
||||||
|
5. Return ONLY the cleaned text, nothing else
|
||||||
|
|
||||||
|
OCR Text:
|
||||||
|
"""
|
||||||
|
|
||||||
|
try:
|
||||||
|
response = self._llm_client.generate(
|
||||||
|
model=self.llm_model,
|
||||||
|
prompt=prompt + raw_text,
|
||||||
|
options={"temperature": 0.1} # Low temperature for accuracy
|
||||||
|
)
|
||||||
|
cleaned = response['response'].strip()
|
||||||
|
logger.debug(f"LLM cleanup: {len(raw_text)} → {len(cleaned)} chars")
|
||||||
|
return cleaned
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"LLM cleanup failed: {e}, using raw OCR output")
|
||||||
|
return raw_text
|
||||||
|
|
||||||
|
def detect_text_regions(self, image_path: str, min_area: int = 100) -> List[Tuple[int, int, int, int]]:
|
||||||
|
"""
|
||||||
|
Detect text regions in image using OpenCV.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to image file
|
||||||
|
min_area: Minimum area for text region (pixels)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of bounding boxes (x, y, w, h)
|
||||||
|
"""
|
||||||
|
# Read image
|
||||||
|
img = cv2.imread(image_path)
|
||||||
|
if img is None:
|
||||||
|
logger.warning(f"Could not read image: {image_path}")
|
||||||
|
return []
|
||||||
|
|
||||||
|
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
|
||||||
|
|
||||||
|
# Method 1: Morphological operations to find text regions
|
||||||
|
# Works well for solid text blocks
|
||||||
|
regions = self._detect_by_morphology(gray, min_area)
|
||||||
|
|
||||||
|
if not regions:
|
||||||
|
logger.debug(f"No text regions detected in {Path(image_path).name}")
|
||||||
|
|
||||||
|
return regions
|
||||||
|
|
||||||
|
def _detect_by_morphology(self, gray: np.ndarray, min_area: int) -> List[Tuple[int, int, int, int]]:
|
||||||
|
"""
|
||||||
|
Detect text regions using morphological operations.
|
||||||
|
Fast and works well for solid text blocks (code editors, terminals).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
gray: Grayscale image
|
||||||
|
min_area: Minimum area for region
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of bounding boxes (x, y, w, h)
|
||||||
|
"""
|
||||||
|
# Apply adaptive threshold to handle varying lighting
|
||||||
|
binary = cv2.adaptiveThreshold(
|
||||||
|
gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
|
||||||
|
cv2.THRESH_BINARY_INV, 11, 2
|
||||||
|
)
|
||||||
|
|
||||||
|
# Morphological operations to connect text regions
|
||||||
|
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3)) # Horizontal kernel for text lines
|
||||||
|
dilated = cv2.dilate(binary, kernel, iterations=2)
|
||||||
|
|
||||||
|
# Find contours
|
||||||
|
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
|
||||||
|
|
||||||
|
# Filter and extract bounding boxes
|
||||||
|
regions = []
|
||||||
|
for contour in contours:
|
||||||
|
x, y, w, h = cv2.boundingRect(contour)
|
||||||
|
area = w * h
|
||||||
|
|
||||||
|
# Filter by area and minimum width/height
|
||||||
|
if area > min_area and w > 20 and h > 10: # Reasonable text dimensions
|
||||||
|
regions.append((x, y, w, h))
|
||||||
|
|
||||||
|
# Merge overlapping regions
|
||||||
|
regions = self._merge_overlapping_regions(regions)
|
||||||
|
|
||||||
|
logger.debug(f"Detected {len(regions)} text regions using morphology")
|
||||||
|
return regions
|
||||||
|
|
||||||
|
def _merge_overlapping_regions(
|
||||||
|
self, regions: List[Tuple[int, int, int, int]],
|
||||||
|
overlap_threshold: float = 0.3
|
||||||
|
) -> List[Tuple[int, int, int, int]]:
|
||||||
|
"""
|
||||||
|
Merge overlapping bounding boxes.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
regions: List of (x, y, w, h) tuples
|
||||||
|
overlap_threshold: Minimum overlap ratio to merge
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Merged regions
|
||||||
|
"""
|
||||||
|
if not regions:
|
||||||
|
return []
|
||||||
|
|
||||||
|
# Sort by y-coordinate (top to bottom)
|
||||||
|
regions = sorted(regions, key=lambda r: r[1])
|
||||||
|
|
||||||
|
merged = []
|
||||||
|
current = list(regions[0])
|
||||||
|
|
||||||
|
for region in regions[1:]:
|
||||||
|
x, y, w, h = region
|
||||||
|
cx, cy, cw, ch = current
|
||||||
|
|
||||||
|
# Check for overlap
|
||||||
|
x_overlap = max(0, min(cx + cw, x + w) - max(cx, x))
|
||||||
|
y_overlap = max(0, min(cy + ch, y + h) - max(cy, y))
|
||||||
|
overlap_area = x_overlap * y_overlap
|
||||||
|
|
||||||
|
current_area = cw * ch
|
||||||
|
region_area = w * h
|
||||||
|
min_area = min(current_area, region_area)
|
||||||
|
|
||||||
|
if min_area > 0 and overlap_area / min_area > overlap_threshold:
|
||||||
|
# Merge regions
|
||||||
|
new_x = min(cx, x)
|
||||||
|
new_y = min(cy, y)
|
||||||
|
new_x2 = max(cx + cw, x + w)
|
||||||
|
new_y2 = max(cy + ch, y + h)
|
||||||
|
current = [new_x, new_y, new_x2 - new_x, new_y2 - new_y]
|
||||||
|
else:
|
||||||
|
merged.append(tuple(current))
|
||||||
|
current = list(region)
|
||||||
|
|
||||||
|
merged.append(tuple(current))
|
||||||
|
return merged
|
||||||
|
|
||||||
|
def extract_text_from_region(self, image_path: str, region: Tuple[int, int, int, int]) -> str:
|
||||||
|
"""
|
||||||
|
Extract text from a specific region using OCR.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to image file
|
||||||
|
region: Bounding box (x, y, w, h)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Extracted text
|
||||||
|
"""
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
# Load image and crop region
|
||||||
|
img = Image.open(image_path)
|
||||||
|
x, y, w, h = region
|
||||||
|
cropped = img.crop((x, y, x + w, y + h))
|
||||||
|
|
||||||
|
# Save to temp file for OCR (or use in-memory)
|
||||||
|
import tempfile
|
||||||
|
with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
|
||||||
|
cropped.save(tmp.name)
|
||||||
|
text = self.ocr.extract_text(tmp.name)
|
||||||
|
|
||||||
|
# Clean up temp file
|
||||||
|
Path(tmp.name).unlink()
|
||||||
|
|
||||||
|
return text
|
||||||
|
|
||||||
|
def analyze_frame(self, image_path: str) -> str:
|
||||||
|
"""
|
||||||
|
Analyze a frame: detect text regions and OCR them.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to image file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Combined text from all detected regions
|
||||||
|
"""
|
||||||
|
# Detect text regions
|
||||||
|
regions = self.detect_text_regions(image_path)
|
||||||
|
|
||||||
|
if not regions:
|
||||||
|
# Fallback to full-frame OCR if no regions detected
|
||||||
|
logger.debug(f"No regions detected, using full-frame OCR for {Path(image_path).name}")
|
||||||
|
raw_text = self.ocr.extract_text(image_path)
|
||||||
|
return self._cleanup_with_llm(raw_text) if self.use_llm_cleanup else raw_text
|
||||||
|
|
||||||
|
# Sort regions by reading order (top-to-bottom, left-to-right)
|
||||||
|
regions = self._sort_regions_by_reading_order(regions)
|
||||||
|
|
||||||
|
# Extract text from each region
|
||||||
|
texts = []
|
||||||
|
for idx, region in enumerate(regions):
|
||||||
|
x, y, w, h = region
|
||||||
|
text = self.extract_text_from_region(image_path, region)
|
||||||
|
if text.strip():
|
||||||
|
# Add visual separator with region info
|
||||||
|
section_header = f"[Region {idx+1} at y={y}]"
|
||||||
|
texts.append(f"{section_header}\n{text.strip()}")
|
||||||
|
logger.debug(f"Region {idx+1}/{len(regions)} (y={y}): Extracted {len(text)} chars")
|
||||||
|
|
||||||
|
combined = ("\n\n" + "="*60 + "\n\n").join(texts)
|
||||||
|
logger.debug(f"Total extracted from {len(regions)} regions: {len(combined)} chars")
|
||||||
|
|
||||||
|
# Apply LLM cleanup if enabled
|
||||||
|
if self.use_llm_cleanup:
|
||||||
|
combined = self._cleanup_with_llm(combined)
|
||||||
|
|
||||||
|
return combined
|
||||||
|
|
||||||
|
def _sort_regions_by_reading_order(self, regions: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int, int, int]]:
|
||||||
|
"""
|
||||||
|
Sort regions in reading order (top-to-bottom, left-to-right).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
regions: List of (x, y, w, h) tuples
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Sorted regions
|
||||||
|
"""
|
||||||
|
# Sort primarily by y (top to bottom), secondarily by x (left to right)
|
||||||
|
# Bucket regions into 20px-tall bands so boxes on roughly the same line sort left to right
|
||||||
|
sorted_regions = sorted(regions, key=lambda r: (r[1] // 20, r[0]))
|
||||||
|
return sorted_regions
|
||||||
|
|
||||||
|
def process_frames(
|
||||||
|
self,
|
||||||
|
frames_info: List[Tuple[str, float]],
|
||||||
|
deduplicate: bool = True,
|
||||||
|
similarity_threshold: float = 0.85
|
||||||
|
) -> List[Dict]:
|
||||||
|
"""
|
||||||
|
Process multiple frames with hybrid analysis.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
frames_info: List of (frame_path, timestamp) tuples
|
||||||
|
deduplicate: Whether to remove similar consecutive analyses
|
||||||
|
similarity_threshold: Threshold for considering analyses as duplicates (0-1)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of dicts with 'timestamp', 'text', and 'frame_path'
|
||||||
|
"""
|
||||||
|
results = []
|
||||||
|
prev_text = ""
|
||||||
|
|
||||||
|
total = len(frames_info)
|
||||||
|
logger.info(f"Starting hybrid analysis of {total} frames...")
|
||||||
|
|
||||||
|
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
|
||||||
|
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
|
||||||
|
|
||||||
|
text = self.analyze_frame(frame_path)
|
||||||
|
|
||||||
|
if not text:
|
||||||
|
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Debug: Show what was extracted
|
||||||
|
logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
|
||||||
|
logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
|
||||||
|
|
||||||
|
# Deduplicate similar consecutive frames
|
||||||
|
if deduplicate and prev_text:
|
||||||
|
similarity = self._text_similarity(prev_text, text)
|
||||||
|
logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
|
||||||
|
if similarity > similarity_threshold:
|
||||||
|
logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
||||||
|
continue
|
||||||
|
|
||||||
|
results.append({
|
||||||
|
'timestamp': timestamp,
|
||||||
|
'text': text,
|
||||||
|
'frame_path': frame_path
|
||||||
|
})
|
||||||
|
|
||||||
|
prev_text = text
|
||||||
|
|
||||||
|
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
|
||||||
|
return results
|
||||||
|
|
||||||
|
def _text_similarity(self, text1: str, text2: str) -> float:
|
||||||
|
"""
|
||||||
|
Calculate similarity between two texts.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Similarity score between 0 and 1
|
||||||
|
"""
|
||||||
|
return SequenceMatcher(None, text1, text2).ratio()
|
||||||
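The region-merging pass in `_merge_overlapping_regions` can be tested on its own. This is a minimal standalone sketch under the same rule (intersection area divided by the smaller box's area, compared against a threshold); `overlap_ratio` and `merge_boxes` are hypothetical names, not part of the module.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the smaller box's area (0.0 if degenerate)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x_overlap = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    y_overlap = max(0, min(ay + ah, by + bh) - max(ay, by))
    smaller = min(aw * ah, bw * bh)
    return (x_overlap * y_overlap) / smaller if smaller > 0 else 0.0


def merge_boxes(boxes: List[Box], threshold: float = 0.3) -> List[Box]:
    """Single top-to-bottom pass: grow the current box while the next one overlaps it."""
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[1])  # sort by y
    merged: List[Box] = []
    current = ordered[0]
    for box in ordered[1:]:
        if overlap_ratio(current, box) > threshold:
            # Replace current with the bounding box of both.
            x1 = min(current[0], box[0])
            y1 = min(current[1], box[1])
            x2 = max(current[0] + current[2], box[0] + box[2])
            y2 = max(current[1] + current[3], box[1] + box[3])
            current = (x1, y1, x2 - x1, y2 - y1)
        else:
            merged.append(current)
            current = box
    merged.append(current)
    return merged


boxes = [(10, 10, 100, 20), (50, 15, 100, 20), (10, 200, 80, 20)]
result = merge_boxes(boxes)
```

Because each box is compared only with the running merged box, two overlapping boxes separated in the sort order by a non-overlapping one stay unmerged. That is a property of the single-pass approach and may be acceptable for line-oriented text such as editors and terminals.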
@@ -53,20 +53,25 @@ class OCRProcessor:
|
|||||||
else:
|
else:
|
||||||
raise ValueError(f"Unknown OCR engine: {self.engine}")
|
raise ValueError(f"Unknown OCR engine: {self.engine}")
|
||||||
|
|
||||||
def extract_text(self, image_path: str) -> str:
|
def extract_text(self, image_path: str, preserve_layout: bool = True) -> str:
|
||||||
"""
|
"""
|
||||||
Extract text from a single image.
|
Extract text from a single image.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
image_path: Path to image file
|
image_path: Path to image file
|
||||||
|
preserve_layout: Try to preserve whitespace and layout
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Extracted text
|
Extracted text
|
||||||
"""
|
"""
|
||||||
if self.engine == "tesseract":
|
if self.engine == "tesseract":
|
||||||
from PIL import Image
|
from PIL import Image
|
||||||
|
import pytesseract
|
||||||
image = Image.open(image_path)
|
image = Image.open(image_path)
|
||||||
text = self._ocr_engine.image_to_string(image)
|
|
||||||
|
# Use PSM 6 (uniform block of text) to preserve layout better
|
||||||
|
config = '--psm 6' if preserve_layout else ''
|
||||||
|
text = pytesseract.image_to_string(image, config=config)
|
||||||
|
|
||||||
elif self.engine == "easyocr":
|
elif self.engine == "easyocr":
|
||||||
result = self._ocr_engine.readtext(image_path, detail=0)
|
result = self._ocr_engine.readtext(image_path, detail=0)
|
||||||
@@ -81,12 +86,31 @@ class OCRProcessor:
|
|||||||
|
|
||||||
return self._clean_text(text)
|
return self._clean_text(text)
|
||||||
|
|
||||||
def _clean_text(self, text: str) -> str:
|
def _clean_text(self, text: str, preserve_indentation: bool = True) -> str:
|
||||||
"""Clean up OCR output."""
|
"""
|
||||||
# Remove excessive whitespace
|
Clean up OCR output.
|
||||||
text = re.sub(r'\n\s*\n', '\n', text)
|
|
||||||
text = re.sub(r' +', ' ', text)
|
Args:
|
||||||
return text.strip()
|
text: Raw OCR text
|
||||||
|
preserve_indentation: Keep leading whitespace on lines
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Cleaned text
|
||||||
|
"""
|
||||||
|
if preserve_indentation:
|
||||||
|
# Remove excessive blank lines but preserve indentation
|
||||||
|
lines = text.split('\n')
|
||||||
|
cleaned_lines = []
|
||||||
|
for line in lines:
|
||||||
|
# Keep line if it has content or is a single empty line
|
||||||
|
if line.strip() or (cleaned_lines and cleaned_lines[-1].strip()):
|
||||||
|
cleaned_lines.append(line)
|
||||||
|
return '\n'.join(cleaned_lines).strip()
|
||||||
|
else:
|
||||||
|
# Original aggressive cleaning
|
||||||
|
text = re.sub(r'\n\s*\n', '\n', text)
|
||||||
|
text = re.sub(r' +', ' ', text)
|
||||||
|
return text.strip()
|
||||||
|
|
||||||
def process_frames(
|
def process_frames(
|
||||||
self,
|
self,
|
||||||
@@ -108,18 +132,24 @@ class OCRProcessor:
|
|||||||
results = []
|
results = []
|
||||||
prev_text = ""
|
prev_text = ""
|
||||||
|
|
||||||
for frame_path, timestamp in frames_info:
|
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
|
||||||
logger.debug(f"Processing frame at {timestamp:.2f}s...")
|
logger.debug(f"Processing frame {idx}/{len(frames_info)} at {timestamp:.2f}s...")
|
||||||
text = self.extract_text(frame_path)
|
text = self.extract_text(frame_path)
|
||||||
|
|
||||||
if not text:
|
if not text:
|
||||||
|
logger.debug(f"No text extracted from frame at {timestamp:.2f}s")
|
||||||
continue
|
continue
|
||||||
|
|
||||||
|
# Debug: Show what was extracted
|
||||||
|
logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
|
||||||
|
logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
|
||||||
|
|
||||||
# Deduplicate similar consecutive frames
|
# Deduplicate similar consecutive frames
|
||||||
if deduplicate:
|
if deduplicate and prev_text:
|
||||||
similarity = self._text_similarity(prev_text, text)
|
similarity = self._text_similarity(prev_text, text)
|
||||||
|
logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
|
||||||
if similarity > similarity_threshold:
|
if similarity > similarity_threshold:
|
||||||
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
||||||
continue
|
continue
|
||||||
|
|
||||||
results.append({
|
results.append({
|
||||||
|
|||||||
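The difference between the two cleaning modes in `_clean_text` is easiest to see on a small input. The sketch below restates the logic as a free function (`clean_ocr_text` is a hypothetical name) so it can run without the OCR dependencies.

```python
import re


def clean_ocr_text(text: str, preserve_indentation: bool = True) -> str:
    """Collapse runs of blank lines; optionally keep leading whitespace on lines."""
    if preserve_indentation:
        kept = []
        for line in text.split("\n"):
            # Keep non-blank lines, and a blank line only if the previously kept line had content.
            if line.strip() or (kept and kept[-1].strip()):
                kept.append(line)
        return "\n".join(kept).strip()
    # Aggressive mode: drop blank lines and squeeze runs of spaces.
    text = re.sub(r"\n\s*\n", "\n", text)
    text = re.sub(r" +", " ", text)
    return text.strip()


raw = "def f():\n    x = 1\n\n\n\n    return x\n"
preserved = clean_ocr_text(raw)
flattened = clean_ocr_text(raw, preserve_indentation=False)
```

In the preserving mode the three blank lines collapse to one and the four-space indentation survives; in the aggressive mode the blank lines disappear and each indent shrinks to a single space. The final `strip()` also removes any indentation on the very first line, which matters only if a frame's text starts mid-block.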
155
meetus/output_manager.py
Normal file
@@ -0,0 +1,155 @@
|
|||||||
|
"""
|
||||||
|
Manage output directories and manifest files.
|
||||||
|
Creates timestamped folders for each video and tracks processing options.
|
||||||
|
"""
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Any, Optional
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class OutputManager:
|
||||||
|
"""Manage output directories and manifest files for video processing."""
|
||||||
|
|
||||||
|
def __init__(self, video_path: Path, base_output_dir: str = "output", use_cache: bool = True):
|
||||||
|
"""
|
||||||
|
Initialize output manager.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
video_path: Path to the video file being processed
|
||||||
|
base_output_dir: Base directory for all outputs
|
||||||
|
use_cache: Whether to use existing directories if found
|
||||||
|
"""
|
||||||
|
self.video_path = video_path
|
||||||
|
self.base_output_dir = Path(base_output_dir)
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
|
# Find or create output directory
|
||||||
|
self.output_dir = self._get_or_create_output_dir()
|
||||||
|
self.frames_dir = self.output_dir / "frames"
|
||||||
|
self.frames_dir.mkdir(exist_ok=True)
|
||||||
|
|
||||||
|
logger.info(f"Output directory: {self.output_dir}")
|
||||||
|
|
||||||
|
def _get_or_create_output_dir(self) -> Path:
|
||||||
|
"""
|
||||||
|
Get existing output directory or create a new one with incremental number.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Path to output directory
|
||||||
|
"""
|
||||||
|
video_name = self.video_path.stem
|
||||||
|
|
||||||
|
# Look for existing directories if caching is enabled
|
||||||
|
if self.use_cache and self.base_output_dir.exists():
|
||||||
|
existing_dirs = sorted([
|
||||||
|
d for d in self.base_output_dir.iterdir()
|
||||||
|
if d.is_dir() and d.name.endswith(f"-{video_name}")
|
||||||
|
], reverse=True) # Most recent first
|
||||||
|
|
||||||
|
if existing_dirs:
|
||||||
|
logger.info(f"Found existing output: {existing_dirs[0].name}")
|
||||||
|
return existing_dirs[0]
|
||||||
|
|
||||||
|
# Create new directory with date + incremental number
|
||||||
|
date_str = datetime.now().strftime("%Y%m%d")
|
||||||
|
|
||||||
|
# Find existing runs for today
|
||||||
|
if self.base_output_dir.exists():
|
||||||
|
existing_today = [
|
||||||
|
d for d in self.base_output_dir.iterdir()
|
||||||
|
if d.is_dir() and d.name.startswith(date_str) and d.name.endswith(f"-{video_name}")
|
||||||
|
]
|
||||||
|
|
||||||
|
# Extract run numbers and find max
|
||||||
|
run_numbers = []
|
||||||
|
for d in existing_today:
|
||||||
|
# Format: YYYYMMDD-NNN-videoname
|
||||||
|
parts = d.name.split('-')
|
||||||
|
if len(parts) >= 2 and parts[1].isdigit():
|
||||||
|
run_numbers.append(int(parts[1]))
|
||||||
|
|
||||||
|
next_run = max(run_numbers) + 1 if run_numbers else 1
|
||||||
|
else:
|
||||||
|
next_run = 1
|
||||||
|
|
||||||
|
dir_name = f"{date_str}-{next_run:03d}-{video_name}"
|
||||||
|
output_dir = self.base_output_dir / dir_name
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
logger.info(f"Created new output directory: {dir_name}")
|
||||||
|
|
||||||
|
return output_dir
|
||||||
|
|
||||||
|
def get_path(self, filename: str) -> Path:
|
||||||
|
"""Get full path for a file in the output directory."""
|
||||||
|
return self.output_dir / filename
|
||||||
|
|
||||||
|
def get_frames_path(self, filename: str) -> Path:
|
||||||
|
"""Get full path for a file in the frames directory."""
|
||||||
|
return self.frames_dir / filename
|
||||||
|
|
||||||
|
def save_manifest(self, config: Dict[str, Any]):
|
||||||
|
"""
|
||||||
|
Save processing configuration to manifest.json.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config: Dictionary of processing options
|
||||||
|
"""
|
||||||
|
manifest_path = self.output_dir / "manifest.json"
|
||||||
|
|
||||||
|
manifest = {
|
||||||
|
"video": {
|
||||||
|
"name": self.video_path.name,
|
||||||
|
"path": str(self.video_path.absolute()),
|
||||||
|
},
|
||||||
|
"processed_at": datetime.now().isoformat(),
|
||||||
|
"configuration": config,
|
||||||
|
"outputs": {
|
||||||
|
"frames": str(self.frames_dir.relative_to(self.output_dir)),
|
||||||
|
"enhanced_transcript": f"{self.video_path.stem}_enhanced.txt",
|
||||||
|
"whisper_transcript": f"{self.video_path.stem}.json" if config.get("run_whisper") else None,
|
||||||
|
"analysis": f"{self.video_path.stem}_{'vision' if config.get('use_vision') else 'ocr'}.json"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
with open(manifest_path, 'w', encoding='utf-8') as f:
|
||||||
|
json.dump(manifest, f, indent=2, ensure_ascii=False)
|
||||||
|
|
||||||
|
logger.info(f"Saved manifest: {manifest_path}")
|
||||||
|
|
||||||
|
def load_manifest(self) -> Optional[Dict[str, Any]]:
|
||||||
|
"""
|
||||||
|
Load existing manifest if it exists.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Manifest dictionary or None
|
||||||
|
"""
|
||||||
|
manifest_path = self.output_dir / "manifest.json"
|
||||||
|
|
||||||
|
if manifest_path.exists():
|
||||||
|
with open(manifest_path, 'r', encoding='utf-8') as f:
|
||||||
|
return json.load(f)
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
def list_outputs(self) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
List all output files in the directory.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary of output files and their status
|
||||||
|
"""
|
||||||
|
video_name = self.video_path.stem
|
||||||
|
|
||||||
|
return {
|
||||||
|
"output_dir": str(self.output_dir),
|
||||||
|
"manifest": (self.output_dir / "manifest.json").exists(),
|
||||||
|
"enhanced_transcript": (self.output_dir / f"{video_name}_enhanced.txt").exists(),
|
||||||
|
"whisper_transcript": (self.output_dir / f"{video_name}.json").exists(),
|
||||||
|
"vision_analysis": (self.output_dir / f"{video_name}_vision.json").exists(),
|
||||||
|
"ocr_analysis": (self.output_dir / f"{video_name}_ocr.json").exists(),
|
||||||
|
"frames": len(list(self.frames_dir.glob("*.jpg"))) if self.frames_dir.exists() else 0
|
||||||
|
}
|
||||||
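The `YYYYMMDD-NNN-videoname` numbering used by `_get_or_create_output_dir` can be sketched as a small function. This is an illustration under assumptions: `next_run_dir_name` is a hypothetical helper, and the date is passed in rather than read from the clock so the behaviour is reproducible.

```python
import tempfile
from pathlib import Path


def next_run_dir_name(base: Path, video_name: str, date_str: str) -> str:
    """Return 'YYYYMMDD-NNN-videoname' using the next free run number for that date."""
    run_numbers = []
    if base.exists():
        for d in base.iterdir():
            if d.is_dir() and d.name.startswith(date_str) and d.name.endswith(f"-{video_name}"):
                parts = d.name.split("-")
                # parts[1] is the zero-padded run number in YYYYMMDD-NNN-videoname.
                if len(parts) >= 2 and parts[1].isdigit():
                    run_numbers.append(int(parts[1]))
    next_run = max(run_numbers) + 1 if run_numbers else 1
    return f"{date_str}-{next_run:03d}-{video_name}"


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = next_run_dir_name(base, "standup", "20250101")
    (base / first).mkdir()
    second = next_run_dir_name(base, "standup", "20250101")
    other_day = next_run_dir_name(base, "standup", "20250102")
```

One caveat of suffix matching: a video named `standup` also matches a directory such as `20250101-001-team-standup`, so both the run counter and the cache lookup can pick up another video's output. Matching on the full `date-NNN-name` pattern would avoid that if it turns out to matter.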
5
meetus/prompts/code.txt
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
You are analyzing a code screenshot from a meeting recording.
|
||||||
|
|
||||||
|
Provide a brief description of what's being shown (1-2 sentences about the context), then extract the visible code exactly as it appears, preserving all formatting, indentation, and structure.
|
||||||
|
|
||||||
|
If there's no code visible, just describe what you see on screen.
|
||||||
5
meetus/prompts/console.txt
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
You are analyzing console/terminal output from a meeting recording.
|
||||||
|
|
||||||
|
Provide a brief description of what's happening (1-2 sentences), then extract the visible commands and output exactly as shown, preserving formatting.
|
||||||
|
|
||||||
|
Include any error messages, warnings, or important status information.
|
||||||
5
meetus/prompts/dashboard.txt
Normal file
@@ -0,0 +1,5 @@
|
|||||||
|
You are analyzing a dashboard/monitoring panel from a meeting recording.
|
||||||
|
|
||||||
|
Provide a brief description of what's being monitored (1-2 sentences), then list the key panels, metrics, and their current values. Include any alerts, warnings, or notable trends.
|
||||||
|
|
||||||
|
Keep it concise and focused on the important information.
|
||||||
10
meetus/prompts/meeting.txt
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
You are analyzing a screen capture from a meeting recording.
|
||||||
|
|
||||||
|
Provide a brief description of what's being shown (1-2 sentences about the context). Then extract the key information:
|
||||||
|
- Any visible text, titles, or headings
|
||||||
|
- Code (preserve exact formatting if present)
|
||||||
|
- Metrics, data points, or dashboard information
|
||||||
|
- Terminal/console commands and output
|
||||||
|
- Application or UI elements
|
||||||
|
|
||||||
|
Be concise but capture all important details that help understand what was being discussed.
|
||||||
@@ -6,6 +6,8 @@ from typing import List, Dict, Optional
|
|||||||
import json
|
import json
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import logging
|
import logging
|
||||||
|
import base64
|
||||||
|
from io import BytesIO
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -13,11 +15,18 @@ logger = logging.getLogger(__name__)
|
|||||||
class TranscriptMerger:
|
class TranscriptMerger:
|
||||||
"""Merge audio transcripts with screen OCR text."""
|
"""Merge audio transcripts with screen OCR text."""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self, embed_images: bool = False, embed_quality: int = 80):
|
||||||
"""Initialize transcript merger."""
|
"""
|
||||||
pass
|
Initialize transcript merger.
|
||||||
|
|
||||||
def load_whisper_transcript(self, transcript_path: str) -> List[Dict]:
|
Args:
|
||||||
|
embed_images: Whether to embed frame images as base64
|
||||||
|
embed_quality: JPEG quality for embedded images (0-100)
|
||||||
|
"""
|
||||||
|
self.embed_images = embed_images
|
||||||
|
self.embed_quality = embed_quality
|
||||||
|
|
||||||
|
def load_whisper_transcript(self, transcript_path: str, group_interval: Optional[int] = None) -> List[Dict]:
|
||||||
"""
|
"""
|
||||||
Load Whisper transcript from file.
|
Load Whisper transcript from file.
|
||||||
|
|
||||||
@@ -25,6 +34,7 @@ class TranscriptMerger:
|
|||||||
|
|
||||||
Args:
|
Args:
|
||||||
transcript_path: Path to transcript file
|
transcript_path: Path to transcript file
|
||||||
|
group_interval: If specified, group audio segments into intervals (in seconds)
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
List of dicts with 'timestamp' (optional) and 'text'
|
List of dicts with 'timestamp' (optional) and 'text'
|
||||||
@@ -35,28 +45,39 @@ class TranscriptMerger:
|
|||||||
with open(path, 'r', encoding='utf-8') as f:
|
with open(path, 'r', encoding='utf-8') as f:
|
||||||
data = json.load(f)
|
data = json.load(f)
|
||||||
|
|
||||||
# Handle different Whisper output formats
|
# Handle different Whisper/WhisperX output formats
|
||||||
|
segments = []
|
||||||
if isinstance(data, dict) and 'segments' in data:
|
if isinstance(data, dict) and 'segments' in data:
|
||||||
# Standard Whisper JSON format
|
# Standard Whisper/WhisperX JSON format
|
||||||
return [
|
segments = [
|
||||||
{
|
{
|
||||||
'timestamp': seg.get('start', 0),
|
'timestamp': seg.get('start', 0),
|
||||||
'text': seg['text'].strip(),
|
'text': seg['text'].strip(),
|
||||||
|
'speaker': seg.get('speaker'), # WhisperX diarization
|
||||||
'type': 'audio'
|
'type': 'audio'
|
||||||
}
|
}
|
||||||
for seg in data['segments']
|
for seg in data['segments']
|
||||||
]
|
]
|
||||||
elif isinstance(data, list):
|
elif isinstance(data, list):
|
||||||
# List of segments
|
# List of segments
|
||||||
return [
|
segments = [
|
||||||
{
|
{
|
||||||
'timestamp': seg.get('start', seg.get('timestamp', 0)),
|
'timestamp': seg.get('start', seg.get('timestamp', 0)),
|
||||||
'text': seg['text'].strip(),
|
'text': seg['text'].strip(),
|
||||||
|
'speaker': seg.get('speaker'), # WhisperX diarization
|
||||||
'type': 'audio'
|
'type': 'audio'
|
||||||
}
|
}
|
||||||
for seg in data
|
for seg in data
|
||||||
]
|
]
|
||||||
|
|
||||||
|
# Group by interval if requested, but skip if we have speaker diarization
|
||||||
|
# (merge_transcripts will group by speaker instead)
|
||||||
|
has_speakers = any(seg.get('speaker') for seg in segments)
|
||||||
|
if group_interval and segments and not has_speakers:
|
||||||
|
segments = self.group_audio_by_intervals(segments, group_interval)
|
||||||
|
|
||||||
|
return segments
|
||||||
|
|
||||||
else:
|
else:
|
||||||
# Plain text file - no timestamps
|
# Plain text file - no timestamps
|
||||||
with open(path, 'r', encoding='utf-8') as f:
|
with open(path, 'r', encoding='utf-8') as f:
|
||||||
@@ -68,6 +89,76 @@ class TranscriptMerger:
|
|||||||
'type': 'audio'
|
'type': 'audio'
|
||||||
}]
|
}]
|
||||||
|
|
||||||
|
def group_audio_by_intervals(self, segments: List[Dict], interval_seconds: int = 30) -> List[Dict]:
|
||||||
|
"""
|
||||||
|
Group audio segments into regular time intervals.
|
||||||
|
|
||||||
|
Instead of word-level timestamps, this creates intervals (e.g., every 30 seconds)
|
||||||
|
with all text spoken during that interval concatenated together.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
segments: List of audio segments with timestamps
|
||||||
|
interval_seconds: Duration of each interval in seconds
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of grouped segments with interval timestamps
|
||||||
|
"""
|
||||||
|
if not segments:
|
||||||
|
return []
|
||||||
|
|
||||||
|
# Find the max timestamp to determine how many intervals we need
|
||||||
|
max_timestamp = max(seg['timestamp'] for seg in segments)
|
||||||
|
num_intervals = int(max_timestamp / interval_seconds) + 1
|
||||||
|
|
||||||
|
# Create interval buckets
|
||||||
|
intervals = []
|
||||||
|
for i in range(num_intervals):
|
||||||
|
interval_start = i * interval_seconds
|
||||||
|
interval_end = (i + 1) * interval_seconds
|
||||||
|
|
||||||
|
# Collect all text in this interval
|
||||||
|
texts = []
|
||||||
|
for seg in segments:
|
||||||
|
if interval_start <= seg['timestamp'] < interval_end:
|
||||||
|
texts.append(seg['text'])
|
||||||
|
|
||||||
|
# Only create interval if there's text
|
||||||
|
if texts:
|
||||||
|
intervals.append({
|
||||||
|
'timestamp': interval_start,
|
||||||
|
'text': ' '.join(texts),
|
||||||
|
'type': 'audio'
|
||||||
|
})
|
||||||
|
|
||||||
|
logger.info(f"Grouped {len(segments)} segments into {len(intervals)} intervals of {interval_seconds}s")
|
||||||
|
return intervals
|
||||||
|
|
||||||
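Review note: `group_audio_by_intervals` above rescans every segment for each interval. A minimal sketch of the same bucketing in a single pass follows, assuming non-negative timestamps; the standalone function name `group_by_interval` is hypothetical and not part of this commit.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_interval(segments: List[Dict], interval_seconds: int = 30) -> List[Dict]:
    """Bucket segments by floor(timestamp / interval_seconds) in one pass."""
    buckets = defaultdict(list)
    for seg in segments:
        # Index of the interval this segment starts in (0, 1, 2, ...)
        buckets[int(seg["timestamp"] // interval_seconds)].append(seg["text"])

    # Emit only non-empty intervals, in chronological order
    return [
        {"timestamp": idx * interval_seconds, "text": " ".join(texts), "type": "audio"}
        for idx, texts in sorted(buckets.items())
    ]
```

As in the method above, empty intervals produce no entry and text within an interval keeps its input order.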
|
def _encode_image_base64(self, image_path: str) -> tuple[str, int]:
|
||||||
|
"""
|
||||||
|
Encode image as base64 (image already at target quality/size).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
image_path: Path to image file
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (base64_string, size_in_bytes)
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
# Read file directly (already at target quality/resolution)
|
||||||
|
with open(image_path, 'rb') as f:
|
||||||
|
img_bytes = f.read()
|
||||||
|
|
||||||
|
# Encode to base64
|
||||||
|
b64_string = base64.b64encode(img_bytes).decode('utf-8')
|
||||||
|
|
||||||
|
logger.debug(f"Encoded {Path(image_path).name}: {len(img_bytes)} bytes")
|
||||||
|
|
||||||
|
return b64_string, len(img_bytes)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to encode image {image_path}: {e}")
|
||||||
|
return "", 0
|
||||||
|
|
||||||
def merge_transcripts(
|
def merge_transcripts(
|
||||||
self,
|
self,
|
||||||
audio_segments: List[Dict],
|
audio_segments: List[Dict],
|
||||||
@@ -75,13 +166,14 @@ class TranscriptMerger:
|
|||||||
) -> List[Dict]:
|
) -> List[Dict]:
|
||||||
"""
|
"""
|
||||||
Merge audio and screen transcripts by timestamp.
|
Merge audio and screen transcripts by timestamp.
|
||||||
|
Groups consecutive audio from same speaker until a screen frame interrupts.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
audio_segments: List of audio transcript segments
|
audio_segments: List of audio transcript segments
|
||||||
screen_segments: List of screen OCR segments
|
screen_segments: List of screen OCR segments
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Merged list sorted by timestamp
|
Merged list sorted by timestamp, with audio grouped by speaker
|
||||||
"""
|
"""
|
||||||
# Mark segment types
|
# Mark segment types
|
||||||
for seg in audio_segments:
|
for seg in audio_segments:
|
||||||
@@ -93,7 +185,46 @@ class TranscriptMerger:
|
|||||||
all_segments = audio_segments + screen_segments
|
all_segments = audio_segments + screen_segments
|
||||||
all_segments.sort(key=lambda x: x['timestamp'])
|
all_segments.sort(key=lambda x: x['timestamp'])
|
||||||
|
|
||||||
return all_segments
|
# Group consecutive audio segments by speaker (screen frames break groups)
|
||||||
|
grouped = []
|
||||||
|
current_group = None
|
||||||
|
|
||||||
|
for seg in all_segments:
|
||||||
|
if seg['type'] == 'screen':
|
||||||
|
# Screen frame: flush current group and add frame
|
||||||
|
if current_group:
|
||||||
|
grouped.append(current_group)
|
||||||
|
current_group = None
|
||||||
|
grouped.append(seg)
|
||||||
|
else:
|
||||||
|
# Audio segment
|
||||||
|
speaker = seg.get('speaker')
|
||||||
|
if current_group is None:
|
||||||
|
# Start new group
|
||||||
|
current_group = {
|
||||||
|
'timestamp': seg['timestamp'],
|
||||||
|
'text': seg['text'],
|
||||||
|
'speaker': speaker,
|
||||||
|
'type': 'audio'
|
||||||
|
}
|
||||||
|
elif speaker == current_group.get('speaker'):
|
||||||
|
# Same speaker, append text
|
||||||
|
current_group['text'] += ' ' + seg['text']
|
||||||
|
else:
|
||||||
|
# Speaker changed, flush and start new group
|
||||||
|
grouped.append(current_group)
|
||||||
|
current_group = {
|
||||||
|
'timestamp': seg['timestamp'],
|
||||||
|
'text': seg['text'],
|
||||||
|
'speaker': speaker,
|
||||||
|
'type': 'audio'
|
||||||
|
}
|
||||||
|
|
||||||
|
# Don't forget last group
|
||||||
|
if current_group:
|
||||||
|
grouped.append(current_group)
|
||||||
|
|
||||||
|
return grouped
|
||||||
|
|
||||||
def format_for_claude(
|
def format_for_claude(
|
||||||
self,
|
self,
|
||||||
@@ -120,7 +251,7 @@ class TranscriptMerger:
|
|||||||
lines = []
|
lines = []
|
||||||
lines.append("=" * 80)
|
lines.append("=" * 80)
|
||||||
lines.append("ENHANCED MEETING TRANSCRIPT")
|
lines.append("ENHANCED MEETING TRANSCRIPT")
|
||||||
lines.append("Audio transcript + Screen content")
|
lines.append("Audio transcript + Screen frames")
|
||||||
lines.append("=" * 80)
|
lines.append("=" * 80)
|
||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
@@ -128,15 +259,27 @@ class TranscriptMerger:
|
|||||||
timestamp = self._format_timestamp(seg['timestamp'])
|
timestamp = self._format_timestamp(seg['timestamp'])
|
||||||
|
|
||||||
if seg['type'] == 'audio':
|
if seg['type'] == 'audio':
|
||||||
lines.append(f"[{timestamp}] SPEAKER:")
|
speaker = seg.get('speaker') or 'SPEAKER'  # key exists but is None without diarization
|
||||||
|
lines.append(f"[{timestamp}] {speaker}:")
|
||||||
lines.append(f" {seg['text']}")
|
lines.append(f" {seg['text']}")
|
||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
else: # screen
|
else: # screen
|
||||||
lines.append(f"[{timestamp}] SCREEN CONTENT:")
|
lines.append(f"[{timestamp}] SCREEN CONTENT:")
|
||||||
# Indent screen text for visibility
|
|
||||||
screen_text = seg['text'].replace('\n', '\n | ')
|
# Show frame path if available
|
||||||
lines.append(f" | {screen_text}")
|
if 'frame_path' in seg:
|
||||||
|
# Get just the filename relative to the enhanced transcript
|
||||||
|
frame_path = Path(seg['frame_path'])
|
||||||
|
relative_path = f"frames/{frame_path.name}"
|
||||||
|
lines.append(f" Frame: {relative_path}")
|
||||||
|
|
||||||
|
# Include text content if available (fallback or additional context)
|
||||||
|
if 'text' in seg and seg['text'].strip():
|
||||||
|
screen_text = seg['text'].replace('\n', '\n | ')
|
||||||
|
lines.append(f" TEXT:")
|
||||||
|
lines.append(f" | {screen_text}")
|
||||||
|
|
||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
return "\n".join(lines)
|
return "\n".join(lines)
|
||||||
@@ -147,7 +290,10 @@ class TranscriptMerger:
|
|||||||
|
|
||||||
for seg in segments:
|
for seg in segments:
|
||||||
timestamp = self._format_timestamp(seg['timestamp'])
|
timestamp = self._format_timestamp(seg['timestamp'])
|
||||||
prefix = "SPEAKER" if seg['type'] == 'audio' else "SCREEN"
|
if seg['type'] == 'audio':
|
||||||
|
prefix = seg.get('speaker') or 'SPEAKER'  # key exists but is None without diarization
|
||||||
|
else:
|
||||||
|
prefix = "SCREEN"
|
||||||
text = seg['text'].replace('\n', ' ')[:200] # Truncate long screen text
|
text = seg['text'].replace('\n', ' ')[:200] # Truncate long screen text
|
||||||
lines.append(f"[{timestamp}] {prefix}: {text}")
|
lines.append(f"[{timestamp}] {prefix}: {text}")
|
||||||
|
|
||||||
|
|||||||
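Review note: the speaker grouping added to `merge_transcripts` above can be exercised in isolation. The sketch below is a standalone approximation, not the committed method; `group_by_speaker` and `speaker_label` are hypothetical names. It also illustrates that a segment whose `speaker` key is present but `None` (a transcript without diarization) needs an explicit fallback label, since `dict.get` only applies its default when the key is missing.

```python
from typing import Dict, List, Optional


def group_by_speaker(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive audio segments from one speaker; screen frames break groups."""
    grouped: List[Dict] = []
    current: Optional[Dict] = None

    for seg in sorted(segments, key=lambda s: s["timestamp"]):
        if seg["type"] == "screen":
            # A screen frame flushes the open audio group, then is emitted as-is
            if current is not None:
                grouped.append(current)
                current = None
            grouped.append(seg)
        elif current is not None and seg.get("speaker") == current["speaker"]:
            # Same speaker keeps talking: extend the open group
            current["text"] += " " + seg["text"]
        else:
            # First audio segment or a speaker change: start a new group
            if current is not None:
                grouped.append(current)
            current = {
                "timestamp": seg["timestamp"],
                "text": seg["text"],
                "speaker": seg.get("speaker"),
                "type": "audio",
            }

    if current is not None:
        grouped.append(current)
    return grouped


def speaker_label(seg: Dict) -> str:
    """Return the diarized speaker, or a generic label when it is missing or None."""
    return seg.get("speaker") or "SPEAKER"
```

Each group keeps the timestamp of its first segment, matching the behavior in the diff.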
@@ -6,6 +6,7 @@ from typing import List, Tuple, Dict, Optional
|
|||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import logging
|
import logging
|
||||||
from difflib import SequenceMatcher
|
from difflib import SequenceMatcher
|
||||||
|
import os
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -13,15 +14,24 @@ logger = logging.getLogger(__name__)
|
|||||||
class VisionProcessor:
|
class VisionProcessor:
|
||||||
"""Process frames using local vision models via Ollama."""
|
"""Process frames using local vision models via Ollama."""
|
||||||
|
|
||||||
def __init__(self, model: str = "llava:13b"):
|
def __init__(self, model: str = "llava:13b", prompts_dir: Optional[str] = None):
|
||||||
"""
|
"""
|
||||||
Initialize vision processor.
|
Initialize vision processor.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
|
model: Ollama vision model to use (llava:13b, llava:7b, llava-llama3, bakllava)
|
||||||
|
prompts_dir: Directory containing prompt files (default: meetus/prompts/)
|
||||||
"""
|
"""
|
||||||
self.model = model
|
self.model = model
|
||||||
self._client = None
|
self._client = None
|
||||||
|
|
||||||
|
# Set prompts directory
|
||||||
|
if prompts_dir:
|
||||||
|
self.prompts_dir = Path(prompts_dir)
|
||||||
|
else:
|
||||||
|
# Default to meetus/prompts/ relative to this file
|
||||||
|
self.prompts_dir = Path(__file__).parent / "prompts"
|
||||||
|
|
||||||
self._init_client()
|
self._init_client()
|
||||||
|
|
||||||
def _init_client(self):
|
def _init_client(self):
|
||||||
@@ -53,61 +63,44 @@ class VisionProcessor:
|
|||||||
"Also install Ollama: https://ollama.ai/download"
|
"Also install Ollama: https://ollama.ai/download"
|
||||||
)
|
)
|
||||||
|
|
||||||
def analyze_frame(self, image_path: str, context: str = "meeting") -> str:
|
def _load_prompt(self, context: str) -> str:
|
||||||
|
"""
|
||||||
|
Load prompt from file.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
context: Context name (meeting, dashboard, code, console)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Prompt text
|
||||||
|
"""
|
||||||
|
prompt_file = self.prompts_dir / f"{context}.txt"
|
||||||
|
|
||||||
|
if prompt_file.exists():
|
||||||
|
with open(prompt_file, 'r', encoding='utf-8') as f:
|
||||||
|
return f.read().strip()
|
||||||
|
else:
|
||||||
|
# Fallback to default prompt
|
||||||
|
logger.warning(f"Prompt file not found: {prompt_file}, using default")
|
||||||
|
return "Analyze this image and describe what you see in detail."
|
||||||
|
|
||||||
|
def analyze_frame(self, image_path: str, context: str = "meeting", audio_context: str = "") -> str:
|
||||||
"""
|
"""
|
||||||
Analyze a single frame using local vision model.
|
Analyze a single frame using local vision model.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
image_path: Path to image file
|
image_path: Path to image file
|
||||||
context: Context hint for analysis (meeting, dashboard, code, console)
|
context: Context hint for analysis (meeting, dashboard, code, console)
|
||||||
|
audio_context: Optional audio transcript around this timestamp for context
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Analyzed content description
|
Analyzed content description
|
||||||
"""
|
"""
|
||||||
# Context-specific prompts
|
# Load prompt from file
|
||||||
prompts = {
|
prompt = self._load_prompt(context)
|
||||||
"meeting": """Analyze this screen capture from a meeting recording. Extract:
|
|
||||||
1. Any visible text (titles, labels, headings)
|
|
||||||
2. Key metrics, numbers, or data points shown
|
|
||||||
3. Dashboard panels or visualizations (describe what they show)
|
|
||||||
4. Code snippets (preserve formatting and context)
|
|
||||||
5. Console/terminal output (commands and results)
|
|
||||||
6. Application names or UI elements
|
|
||||||
|
|
||||||
Focus on information that would help someone understand what was being discussed.
|
# Add audio context if available
|
||||||
Be concise but include all important details. If there's code, preserve it exactly.""",
|
if audio_context:
|
||||||
|
prompt = f"Audio context (what's being discussed around this time):\n{audio_context}\n\n{prompt}"
|
||||||
"dashboard": """Analyze this dashboard/monitoring panel. Extract:
|
|
||||||
1. Panel titles and metrics names
|
|
||||||
2. Current values and units
|
|
||||||
3. Trends (up/down/stable)
|
|
||||||
4. Alerts or warnings
|
|
||||||
5. Time ranges shown
|
|
||||||
6. Any anomalies or notable patterns
|
|
||||||
|
|
||||||
Format as structured data.""",
|
|
||||||
|
|
||||||
"code": """Analyze this code screenshot. Extract:
|
|
||||||
1. Programming language
|
|
||||||
2. File name or path (if visible)
|
|
||||||
3. Code content (preserve exact formatting)
|
|
||||||
4. Comments
|
|
||||||
5. Function/class names
|
|
||||||
6. Any error messages or warnings
|
|
||||||
|
|
||||||
Preserve code exactly as shown.""",
|
|
||||||
|
|
||||||
"console": """Analyze this console/terminal output. Extract:
|
|
||||||
1. Commands executed
|
|
||||||
2. Output/results
|
|
||||||
3. Error messages
|
|
||||||
4. Warnings or status messages
|
|
||||||
5. File paths or URLs
|
|
||||||
|
|
||||||
Preserve formatting and structure."""
|
|
||||||
}
|
|
||||||
|
|
||||||
prompt = prompts.get(context, prompts["meeting"])
|
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Use Ollama's chat API with vision
|
# Use Ollama's chat API with vision
|
||||||
@@ -135,7 +128,8 @@ Preserve formatting and structure."""
|
|||||||
frames_info: List[Tuple[str, float]],
|
frames_info: List[Tuple[str, float]],
|
||||||
context: str = "meeting",
|
context: str = "meeting",
|
||||||
deduplicate: bool = True,
|
deduplicate: bool = True,
|
||||||
similarity_threshold: float = 0.85
|
similarity_threshold: float = 0.85,
|
||||||
|
audio_segments: Optional[List[Dict]] = None
|
||||||
) -> List[Dict]:
|
) -> List[Dict]:
|
||||||
"""
|
"""
|
||||||
Process multiple frames with vision analysis.
|
Process multiple frames with vision analysis.
|
||||||
@@ -158,17 +152,25 @@ Preserve formatting and structure."""
|
|||||||
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
|
for idx, (frame_path, timestamp) in enumerate(frames_info, 1):
|
||||||
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
|
logger.info(f"Analyzing frame {idx}/{total} at {timestamp:.2f}s...")
|
||||||
|
|
||||||
text = self.analyze_frame(frame_path, context)
|
# Get audio context around this timestamp (±30 seconds)
|
||||||
|
audio_context = self._get_audio_context(timestamp, audio_segments, window=30)
|
||||||
|
|
||||||
|
text = self.analyze_frame(frame_path, context, audio_context)
|
||||||
|
|
||||||
if not text:
|
if not text:
|
||||||
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
|
logger.warning(f"No content extracted from frame at {timestamp:.2f}s")
|
||||||
continue
|
continue
|
||||||
|
|
||||||
|
# Debug: Show what was extracted
|
||||||
|
logger.debug(f"Frame {idx} ({timestamp:.2f}s): Extracted {len(text)} chars")
|
||||||
|
logger.debug(f"Content preview: {text[:150]}{'...' if len(text) > 150 else ''}")
|
||||||
|
|
||||||
# Deduplicate similar consecutive frames
|
# Deduplicate similar consecutive frames
|
||||||
if deduplicate:
|
if deduplicate and prev_text:
|
||||||
similarity = self._text_similarity(prev_text, text)
|
similarity = self._text_similarity(prev_text, text)
|
||||||
|
logger.debug(f"Similarity to previous frame: {similarity:.2f} (threshold: {similarity_threshold})")
|
||||||
if similarity > similarity_threshold:
|
if similarity > similarity_threshold:
|
||||||
logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
logger.debug(f"⊘ Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
|
||||||
continue
|
continue
|
||||||
|
|
||||||
results.append({
|
results.append({
|
||||||
@@ -182,6 +184,29 @@ Preserve formatting and structure."""
|
|||||||
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
|
logger.info(f"Extracted content from {len(results)} frames (deduplication: {deduplicate})")
|
||||||
return results
|
return results
|
||||||
|
|
||||||
|
def _get_audio_context(self, timestamp: float, audio_segments: Optional[List[Dict]], window: int = 30) -> str:
|
||||||
|
"""
|
||||||
|
Get audio transcript around a given timestamp.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
timestamp: Target timestamp in seconds
|
||||||
|
audio_segments: List of audio segments with 'timestamp' and 'text' keys
|
||||||
|
window: Time window in seconds (±window around timestamp)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Concatenated audio text from the time window
|
||||||
|
"""
|
||||||
|
if not audio_segments:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
relevant = [seg for seg in audio_segments
|
||||||
|
if abs(seg.get('timestamp', 0) - timestamp) <= window]
|
||||||
|
|
||||||
|
if not relevant:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
return " ".join([seg['text'] for seg in relevant])
|
||||||
|
|
||||||
def _text_similarity(self, text1: str, text2: str) -> float:
|
def _text_similarity(self, text1: str, text2: str) -> float:
|
||||||
"""
|
"""
|
||||||
Calculate similarity between two texts.
|
Calculate similarity between two texts.
|
||||||
|
|||||||
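Review note: the audio-context path added to `VisionProcessor` above (select segments within ±`window` seconds of the frame, then prepend them to the prompt) can be sketched without Ollama. The free functions `audio_context` and `build_prompt` below are hypothetical stand-ins for `_get_audio_context` and the prompt assembly in `analyze_frame`.

```python
from typing import Dict, List, Optional


def audio_context(timestamp: float, segments: Optional[List[Dict]], window: float = 30) -> str:
    """Join the text of all segments within +/- window seconds of timestamp."""
    if not segments:
        return ""
    nearby = [
        seg["text"]
        for seg in segments
        if abs(seg.get("timestamp", 0) - timestamp) <= window
    ]
    return " ".join(nearby)


def build_prompt(base_prompt: str, context_text: str) -> str:
    """Prepend the audio context to the base prompt when there is any."""
    if not context_text:
        return base_prompt
    return (
        "Audio context (what's being discussed around this time):\n"
        f"{context_text}\n\n{base_prompt}"
    )
```

Note that the window check is inclusive, so a segment exactly `window` seconds away from the frame is still included.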
523
meetus/workflow.py
Normal file
@@ -0,0 +1,523 @@
|
|||||||
|
"""
|
||||||
|
Orchestrate the video processing workflow.
|
||||||
|
Coordinates frame extraction, analysis, and transcript merging.
|
||||||
|
"""
|
||||||
|
from pathlib import Path
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import subprocess
|
||||||
|
import shutil
|
||||||
|
from typing import Dict, Any, Optional
|
||||||
|
|
||||||
|
from .output_manager import OutputManager
|
||||||
|
from .cache_manager import CacheManager
|
||||||
|
from .frame_extractor import FrameExtractor
|
||||||
|
from .ocr_processor import OCRProcessor
|
||||||
|
from .vision_processor import VisionProcessor
|
||||||
|
from .transcript_merger import TranscriptMerger
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class WorkflowConfig:
|
||||||
|
"""Configuration for the processing workflow."""
|
||||||
|
|
||||||
|
def __init__(self, **kwargs):
|
||||||
|
"""Initialize configuration from keyword arguments."""
|
||||||
|
# Video and paths
|
||||||
|
self.video_path = Path(kwargs['video'])
|
||||||
|
self.transcript_path = kwargs.get('transcript')
|
||||||
|
self.output_dir = kwargs.get('output_dir', 'output')
|
||||||
|
self.custom_output = kwargs.get('output')
|
||||||
|
|
||||||
|
# Whisper options
|
||||||
|
self.run_whisper = kwargs.get('run_whisper', False)
|
||||||
|
self.whisper_model = kwargs.get('whisper_model', 'medium')
|
||||||
|
self.diarize = kwargs.get('diarize', False)
|
||||||
|
|
||||||
|
# Frame extraction
|
||||||
|
self.scene_detection = kwargs.get('scene_detection', False)
|
||||||
|
self.scene_threshold = kwargs.get('scene_threshold', 15.0)
|
||||||
|
self.interval = kwargs.get('interval', 5)
|
||||||
|
|
||||||
|
# Analysis options
|
||||||
|
self.use_vision = kwargs.get('use_vision', False)
|
||||||
|
self.use_hybrid = kwargs.get('use_hybrid', False)
|
||||||
|
self.hybrid_llm_cleanup = kwargs.get('hybrid_llm_cleanup', False)
|
||||||
|
self.hybrid_llm_model = kwargs.get('hybrid_llm_model', 'llama3.2:3b')
|
||||||
|
self.vision_model = kwargs.get('vision_model', 'llava:13b')
|
||||||
|
self.vision_context = kwargs.get('vision_context', 'meeting')
|
||||||
|
self.ocr_engine = kwargs.get('ocr_engine', 'tesseract')
|
||||||
|
|
||||||
|
# Validation: can't use both vision and hybrid
|
||||||
|
if self.use_vision and self.use_hybrid:
|
||||||
|
raise ValueError("Cannot use both --use-vision and --use-hybrid. Choose one.")
|
||||||
|
|
||||||
|
# Validation: LLM cleanup requires hybrid mode
|
||||||
|
if self.hybrid_llm_cleanup and not self.use_hybrid:
|
||||||
|
raise ValueError("--hybrid-llm-cleanup requires --use-hybrid")
|
||||||
|
|
||||||
|
# Processing options
|
||||||
|
self.no_deduplicate = kwargs.get('no_deduplicate', False)
|
||||||
|
self.no_cache = kwargs.get('no_cache', False)
|
||||||
|
self.skip_cache_frames = kwargs.get('skip_cache_frames', False)
|
||||||
|
self.skip_cache_whisper = kwargs.get('skip_cache_whisper', False)
|
||||||
|
self.skip_cache_analysis = kwargs.get('skip_cache_analysis', False)
|
||||||
|
self.extract_only = kwargs.get('extract_only', False)
|
||||||
|
self.format = kwargs.get('format', 'detailed')
|
||||||
|
self.embed_images = kwargs.get('embed_images', False)
|
||||||
|
self.embed_quality = kwargs.get('embed_quality', 80)
|
||||||
|
|
||||||
|
def to_dict(self) -> Dict[str, Any]:
|
||||||
|
"""Convert config to dictionary for manifest."""
|
||||||
|
return {
|
||||||
|
"whisper": {
|
||||||
|
"enabled": self.run_whisper,
|
||||||
|
"model": self.whisper_model
|
||||||
|
},
|
||||||
|
"frame_extraction": {
|
||||||
|
"method": "scene_detection" if self.scene_detection else "interval",
|
||||||
|
"interval_seconds": self.interval if not self.scene_detection else None,
|
||||||
|
"scene_threshold": self.scene_threshold if self.scene_detection else None
|
||||||
|
},
|
||||||
|
"analysis": {
|
||||||
|
"method": "vision" if self.use_vision else ("hybrid" if self.use_hybrid else "ocr"),
|
||||||
|
"vision_model": self.vision_model if self.use_vision else None,
|
||||||
|
"vision_context": self.vision_context if self.use_vision else None,
|
||||||
|
"ocr_engine": self.ocr_engine if (not self.use_vision) else None,
|
||||||
|
"deduplication": not self.no_deduplicate
|
||||||
|
},
|
||||||
|
"output_format": self.format
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class ProcessingWorkflow:
|
||||||
|
"""Orchestrate the complete video processing workflow."""
|
||||||
|
|
||||||
|
def __init__(self, config: WorkflowConfig):
|
||||||
|
"""
|
||||||
|
Initialize workflow.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config: Workflow configuration
|
||||||
|
"""
|
||||||
|
self.config = config
|
||||||
|
self.output_mgr = OutputManager(
|
||||||
|
config.video_path,
|
||||||
|
config.output_dir,
|
||||||
|
use_cache=not config.no_cache
|
||||||
|
)
|
||||||
|
self.cache_mgr = CacheManager(
|
||||||
|
self.output_mgr.output_dir,
|
||||||
|
self.output_mgr.frames_dir,
|
||||||
|
config.video_path.stem,
|
||||||
|
use_cache=not config.no_cache,
|
||||||
|
skip_cache_frames=config.skip_cache_frames,
|
||||||
|
skip_cache_whisper=config.skip_cache_whisper,
|
||||||
|
skip_cache_analysis=config.skip_cache_analysis
|
||||||
|
)
|
||||||
|
|
||||||
|
def run(self) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Run the complete processing workflow.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dictionary with output paths and status
|
||||||
|
"""
|
||||||
|
logger.info("=" * 80)
|
||||||
|
logger.info("MEETING PROCESSOR")
|
||||||
|
logger.info("=" * 80)
|
||||||
|
logger.info(f"Video: {self.config.video_path.name}")
|
||||||
|
|
||||||
|
# Determine analysis method
|
||||||
|
if self.config.use_vision:
|
||||||
|
analysis_method = f"Vision Model ({self.config.vision_model})"
|
||||||
|
logger.info(f"Analysis: {analysis_method}")
|
||||||
|
logger.info(f"Context: {self.config.vision_context}")
|
||||||
|
elif self.config.use_hybrid:
|
||||||
|
analysis_method = f"Hybrid (OpenCV + {self.config.ocr_engine})"
|
||||||
|
logger.info(f"Analysis: {analysis_method}")
|
||||||
|
else:
|
||||||
|
analysis_method = f"OCR ({self.config.ocr_engine})"
|
||||||
|
logger.info(f"Analysis: {analysis_method}")
|
||||||
|
|
||||||
|
logger.info(f"Frame extraction: {'Scene detection' if self.config.scene_detection else f'Every {self.config.interval}s'}")
|
||||||
|
logger.info(f"Caching: {'Disabled' if self.config.no_cache else 'Enabled'}")
|
||||||
|
logger.info("=" * 80)
|
||||||
|
|
||||||
|
# Step 0: Whisper transcription
|
||||||
|
transcript_path = self._run_whisper()
|
||||||
|
|
||||||
|
# Step 1: Extract frames
|
||||||
|
frames_info = self._extract_frames()
|
||||||
|
|
||||||
|
if not frames_info:
|
||||||
|
logger.error("No frames extracted")
|
||||||
|
raise RuntimeError("Frame extraction failed")
|
||||||
|
|
||||||
|
# Step 2: Analyze frames
|
||||||
|
screen_segments = self._analyze_frames(frames_info)
|
||||||
|
|
||||||
|
if self.config.extract_only:
|
||||||
|
logger.info("Done! (extract-only mode)")
|
||||||
|
return self._build_result(transcript_path, screen_segments)
|
||||||
|
|
||||||
|
# Step 3: Merge with transcript
|
||||||
|
enhanced_transcript = self._merge_transcripts(transcript_path, screen_segments)
|
||||||
|
|
||||||
|
# Save manifest
|
||||||
|
self.output_mgr.save_manifest(self.config.to_dict())
|
||||||
|
|
||||||
|
# Build final result
|
||||||
|
return self._build_result(transcript_path, screen_segments, enhanced_transcript)
|
||||||
|
|
||||||
|
def _run_whisper(self) -> Optional[str]:
|
||||||
|
"""Run Whisper transcription if requested, or use cached/provided transcript."""
|
||||||
|
# First, check cache (regardless of run_whisper flag)
|
||||||
|
cached = self.cache_mgr.get_whisper_cache()
|
||||||
|
if cached:
|
||||||
|
return str(cached)
|
||||||
|
|
||||||
|
# If no cache and not running whisper/diarize, use provided transcript path (if any)
|
||||||
|
if not self.config.run_whisper and not self.config.diarize:
|
||||||
|
return self.config.transcript_path
|
||||||
|
|
||||||
|
logger.info("=" * 80)
|
||||||
|
logger.info("STEP 0: Running Whisper Transcription")
|
||||||
|
logger.info("=" * 80)
|
||||||
|
|
||||||
|
# Determine which transcription tool to use
|
||||||
|
use_diarize = getattr(self.config, 'diarize', False)
|
||||||
|
|
||||||
|
if use_diarize:
|
||||||
|
if not shutil.which("whisperx"):
|
||||||
|
logger.error("WhisperX is not installed. Install it with: pip install whisperx")
|
||||||
|
raise RuntimeError("WhisperX not installed (required for --diarize)")
|
||||||
|
transcribe_cmd = "whisperx"
|
||||||
|
else:
|
||||||
|
if not shutil.which("whisper"):
|
||||||
|
logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
|
||||||
|
raise RuntimeError("Whisper not installed")
|
||||||
|
transcribe_cmd = "whisper"
|
||||||
|
|
||||||
|
# Unload Ollama model to free GPU memory for Whisper (if using vision)
|
||||||
|
if self.config.use_vision:
|
||||||
|
logger.info("Freeing GPU memory for Whisper...")
|
||||||
|
try:
|
||||||
|
subprocess.run(["ollama", "stop", self.config.vision_model],
|
||||||
|
capture_output=True, check=False)
|
||||||
|
logger.info("✓ Ollama model unloaded")
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Could not unload Ollama model: {e}")
|
||||||
|
|
||||||
|
if use_diarize:
|
||||||
|
logger.info(f"Running WhisperX transcription with diarization (model: {self.config.whisper_model})...")
|
||||||
|
else:
|
||||||
|
logger.info(f"Running Whisper transcription (model: {self.config.whisper_model})...")
|
||||||
|
logger.info("This may take a few minutes depending on video length...")
|
||||||
|
|
||||||
|
# Build command
|
||||||
|
cmd = [
|
||||||
|
transcribe_cmd,
|
||||||
|
str(self.config.video_path),
|
||||||
|
"--model", self.config.whisper_model,
|
||||||
|
"--output_format", "json",
|
||||||
|
"--output_dir", str(self.output_mgr.output_dir),
|
||||||
|
]
|
||||||
|
if use_diarize:
|
||||||
|
cmd.append("--diarize")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Set up environment with cuDNN library path for whisperx
|
||||||
|
env = os.environ.copy()
|
||||||
|
if use_diarize:
|
||||||
|
import site
|
||||||
|
site_packages = site.getsitepackages()[0]
|
||||||
|
cudnn_path = Path(site_packages) / "nvidia" / "cudnn" / "lib"
|
||||||
|
if cudnn_path.exists():
|
||||||
|
env["LD_LIBRARY_PATH"] = str(cudnn_path) + ":" + env.get("LD_LIBRARY_PATH", "")
|
||||||
|
|
||||||
|
subprocess.run(cmd, check=True, capture_output=True, text=True, env=env)
|
||||||
|
|
||||||
|
transcript_path = self.output_mgr.get_path(f"{self.config.video_path.stem}.json")
|
||||||
|
|
||||||
|
if transcript_path.exists():
|
||||||
|
logger.info(f"✓ Whisper transcription completed: {transcript_path.name}")
|
||||||
|
|
||||||
|
# Debug: Show transcript preview
|
||||||
|
try:
|
||||||
|
import json
|
||||||
|
with open(transcript_path, 'r', encoding='utf-8') as f:
|
||||||
|
whisper_data = json.load(f)
|
||||||
|
|
||||||
|
if 'segments' in whisper_data:
|
||||||
|
logger.debug(f"Whisper produced {len(whisper_data['segments'])} segments")
|
||||||
|
if whisper_data['segments']:
|
||||||
|
logger.debug(f"First segment: {whisper_data['segments'][0]}")
|
||||||
|
logger.debug(f"Last segment: {whisper_data['segments'][-1]}")
|
||||||
|
|
||||||
|
if 'text' in whisper_data:
|
||||||
|
text_preview = whisper_data['text'][:200] + "..." if len(whisper_data.get('text', '')) > 200 else whisper_data.get('text', '')
|
||||||
|
logger.debug(f"Transcript preview: {text_preview}")
|
||||||
|
except Exception as e:
|
||||||
|
logger.debug(f"Could not parse whisper output for debug: {e}")
|
||||||
|
|
||||||
|
logger.info("")
|
||||||
|
return str(transcript_path)
|
||||||
|
else:
|
||||||
|
logger.error("Whisper completed but transcript file not found")
|
||||||
|
raise RuntimeError("Whisper output missing")
|
||||||
|
|
||||||
|
except subprocess.CalledProcessError as e:
|
||||||
|
logger.error(f"Whisper failed: {e.stderr}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def _extract_frames(self):
|
||||||
|
"""Extract frames from video."""
|
||||||
|
logger.info("Step 1: Extracting frames from video...")
|
||||||
|
|
||||||
|
# Check cache
|
||||||
|
cached_frames = self.cache_mgr.get_frames_cache()
|
||||||
|
if cached_frames:
|
||||||
|
return cached_frames
|
||||||
|
|
||||||
|
# Clean up old frames if regenerating
|
||||||
|
if self.config.skip_cache_frames and self.output_mgr.frames_dir.exists():
|
||||||
|
old_frames = list(self.output_mgr.frames_dir.glob("*.jpg"))
|
||||||
|
if old_frames:
|
||||||
|
logger.info(f"Cleaning up {len(old_frames)} old frames...")
|
||||||
|
for old_frame in old_frames:
|
||||||
|
old_frame.unlink()
|
||||||
|
logger.info("✓ Cleanup complete")
|
||||||
|
|
||||||
|
# Extract frames (use embed quality so saved files match embedded images)
|
||||||
|
if self.config.scene_detection:
|
||||||
|
logger.info(f"Extracting frames with scene detection (threshold={self.config.scene_threshold})...")
|
||||||
|
else:
|
||||||
|
logger.info(f"Extracting frames every {self.config.interval}s...")
|
||||||
|
|
||||||
|
extractor = FrameExtractor(
|
||||||
|
str(self.config.video_path),
|
||||||
|
str(self.output_mgr.frames_dir),
|
||||||
|
quality=self.config.embed_quality
|
||||||
|
)
|
||||||
|
|
||||||
|
if self.config.scene_detection:
|
||||||
|
frames_info = extractor.extract_scene_changes(threshold=self.config.scene_threshold)
|
||||||
|
else:
|
||||||
|
frames_info = extractor.extract_by_interval(self.config.interval)
|
||||||
|
|
||||||
|
logger.info(f"✓ Extracted {len(frames_info)} frames")
|
||||||
|
return frames_info
|
||||||
|
|
+    def _analyze_frames(self, frames_info):
+        """Analyze frames with vision, hybrid, or OCR."""
+        # Skip analysis if just embedding images
+        if self.config.embed_images:
+            logger.info("Step 2: Skipping analysis (images will be embedded)")
+            # Create minimal segments with just frame paths and timestamps
+            screen_segments = [
+                {
+                    'timestamp': timestamp,
+                    'text': '',  # No text extraction needed
+                    'frame_path': frame_path
+                }
+                for frame_path, timestamp in frames_info
+            ]
+            logger.info(f"✓ Prepared {len(screen_segments)} frames for embedding")
+            return screen_segments
+
+        # Determine analysis type
+        if self.config.use_vision:
+            analysis_type = 'vision'
+        elif self.config.use_hybrid:
+            analysis_type = 'hybrid'
+        else:
+            analysis_type = 'ocr'
+
+        # Check cache
+        cached_analysis = self.cache_mgr.get_analysis_cache(analysis_type)
+        if cached_analysis:
+            return cached_analysis
+
+        if self.config.use_vision:
+            return self._run_vision_analysis(frames_info)
+        elif self.config.use_hybrid:
+            return self._run_hybrid_analysis(frames_info)
+        else:
+            return self._run_ocr_analysis(frames_info)
+
+    def _run_vision_analysis(self, frames_info):
+        """Run vision analysis on frames."""
+        logger.info("Step 2: Running vision analysis on extracted frames...")
+        logger.info(f"Loading vision model {self.config.vision_model} to GPU...")
+
+        # Load audio segments for context if transcript exists
+        audio_segments = []
+        transcript_path = self.config.transcript_path or self._get_cached_transcript()
+
+        if transcript_path:
+            transcript_file = Path(transcript_path)
+            if transcript_file.exists():
+                logger.info("Loading audio transcript for context...")
+                merger = TranscriptMerger()
+                audio_segments = merger.load_whisper_transcript(str(transcript_file))
+                logger.info(f"✓ Loaded {len(audio_segments)} audio segments for context")
+
+        try:
+            vision = VisionProcessor(model=self.config.vision_model)
+            screen_segments = vision.process_frames(
+                frames_info,
+                context=self.config.vision_context,
+                deduplicate=not self.config.no_deduplicate,
+                audio_segments=audio_segments
+            )
+            logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
+
+            # Debug: Show sample analysis results
+            if screen_segments:
+                logger.debug(f"First analysis result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
+                logger.debug(f"First analysis text preview: {screen_segments[0].get('text', '')[:200]}...")
+                if len(screen_segments) > 1:
+                    logger.debug(f"Last analysis result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
+
+            # Cache results
+            self.cache_mgr.save_analysis('vision', screen_segments)
+            return screen_segments
+
+        except ImportError as e:
+            logger.error(f"{e}")
+            raise
+
+    def _get_cached_transcript(self) -> Optional[str]:
+        """Get cached Whisper transcript if available."""
+        cached = self.cache_mgr.get_whisper_cache()
+        return str(cached) if cached else None
+
+    def _run_hybrid_analysis(self, frames_info):
+        """Run hybrid analysis on frames (OpenCV + OCR)."""
+        if self.config.hybrid_llm_cleanup:
+            logger.info("Step 2: Running hybrid analysis (OpenCV + OCR + LLM cleanup)...")
+        else:
+            logger.info("Step 2: Running hybrid analysis (OpenCV text detection + OCR)...")
+
+        try:
+            from .hybrid_processor import HybridProcessor
+
+            hybrid = HybridProcessor(
+                ocr_engine=self.config.ocr_engine,
+                use_llm_cleanup=self.config.hybrid_llm_cleanup,
+                llm_model=self.config.hybrid_llm_model
+            )
+            screen_segments = hybrid.process_frames(
+                frames_info,
+                deduplicate=not self.config.no_deduplicate
+            )
+            logger.info(f"✓ Processed {len(screen_segments)} frames with hybrid analysis")
+
+            # Debug: Show sample hybrid results
+            if screen_segments:
+                logger.debug(f"First hybrid result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
+                logger.debug(f"First hybrid text preview: {screen_segments[0].get('text', '')[:200]}...")
+                if len(screen_segments) > 1:
+                    logger.debug(f"Last hybrid result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
+
+            # Cache results
+            self.cache_mgr.save_analysis('hybrid', screen_segments)
+            return screen_segments
+
+        except ImportError as e:
+            logger.error(f"{e}")
+            raise
+
+    def _run_ocr_analysis(self, frames_info):
+        """Run OCR analysis on frames."""
+        logger.info("Step 2: Running OCR on extracted frames...")
+
+        try:
+            ocr = OCRProcessor(engine=self.config.ocr_engine)
+            screen_segments = ocr.process_frames(
+                frames_info,
+                deduplicate=not self.config.no_deduplicate
+            )
+            logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
+
+            # Debug: Show sample OCR results
+            if screen_segments:
+                logger.debug(f"First OCR result: timestamp={screen_segments[0].get('timestamp')}, text_length={len(screen_segments[0].get('text', ''))}")
+                logger.debug(f"First OCR text preview: {screen_segments[0].get('text', '')[:200]}...")
+                if len(screen_segments) > 1:
+                    logger.debug(f"Last OCR result: timestamp={screen_segments[-1].get('timestamp')}, text_length={len(screen_segments[-1].get('text', ''))}")
+
+            # Cache results
+            self.cache_mgr.save_analysis('ocr', screen_segments)
+            return screen_segments
+
+        except ImportError as e:
+            logger.error(f"{e}")
+            logger.error(f"To install {self.config.ocr_engine}:")
+            logger.error(f" pip install {self.config.ocr_engine}")
+            raise
+
+    def _merge_transcripts(self, transcript_path, screen_segments):
+        """Merge audio and screen transcripts."""
+        merger = TranscriptMerger(
+            embed_images=self.config.embed_images,
+            embed_quality=self.config.embed_quality
+        )
+
+        # Load audio transcript if available
+        audio_segments = []
+        if transcript_path:
+            logger.info("Step 3: Merging with Whisper transcript...")
+            transcript_file = Path(transcript_path)
+
+            if not transcript_file.exists():
+                logger.warning(f"Transcript not found: {transcript_path}")
+                logger.info("Proceeding with screen content only...")
+            else:
+                # Group audio into 30-second intervals for cleaner reference timestamps
+                audio_segments = merger.load_whisper_transcript(str(transcript_file), group_interval=30)
+                logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
+        else:
+            logger.info("No transcript provided, using screen content only...")
+
+        # Merge and format
+        merged = merger.merge_transcripts(audio_segments, screen_segments)
+        formatted = merger.format_for_claude(merged, format_style=self.config.format)
+
+        # Save output
+        if self.config.custom_output:
+            output_path = self.config.custom_output
+        else:
+            output_path = self.output_mgr.get_path(f"{self.config.video_path.stem}_enhanced.txt")
+
+        merger.save_transcript(formatted, str(output_path))
+
+        logger.info("=" * 80)
+        logger.info("✓ PROCESSING COMPLETE!")
+        logger.info("=" * 80)
+        logger.info(f"Output directory: {self.output_mgr.output_dir}")
+        logger.info(f"Enhanced transcript: {Path(output_path).name}")
+        logger.info("")
+
+        return output_path
+
+    def _build_result(self, transcript_path=None, screen_segments=None, enhanced_transcript=None):
+        """Build result dictionary."""
+        # Determine analysis filename
+        if self.config.use_vision:
+            analysis_type = 'vision'
+        elif self.config.use_hybrid:
+            analysis_type = 'hybrid'
+        else:
+            analysis_type = 'ocr'
+
+        return {
+            "output_dir": str(self.output_mgr.output_dir),
+            "transcript": transcript_path,
+            "analysis": f"{self.config.video_path.stem}_{analysis_type}.json",
+            "frames_count": len(screen_segments) if screen_segments else 0,
+            "enhanced_transcript": enhanced_transcript,
+            "manifest": str(self.output_mgr.get_path("manifest.json"))
+        }
||||||
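The mode selection in `_analyze_frames` and `_build_result` can be exercised on its own. The following is a minimal sketch, not code from the repository: `pick_analysis_type` and `minimal_segments` are hypothetical helper names, and the precedence (vision over hybrid, OCR as the fallback) is read off the if/elif/else chains in the methods above.

```python
from typing import Dict, List, Tuple


def pick_analysis_type(use_vision: bool, use_hybrid: bool) -> str:
    """Mirror the if/elif/else chain: vision wins over hybrid, OCR is the fallback."""
    if use_vision:
        return "vision"
    if use_hybrid:
        return "hybrid"
    return "ocr"


def minimal_segments(frames_info: List[Tuple[str, float]]) -> List[Dict[str, object]]:
    """Build text-free segments from (frame_path, timestamp) pairs, as the embed path does."""
    return [
        {"timestamp": timestamp, "text": "", "frame_path": frame_path}
        for frame_path, timestamp in frames_info
    ]
```

Because the same precedence appears in two methods, a single shared helper of this kind would be one way to keep the cache key and the reported analysis filename in agreement.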
@@ -1,34 +1,19 @@
 #!/usr/bin/env python3
 """
 Process meeting recordings to extract audio + screen content.
-Combines Whisper transcripts with OCR from screen shares.
+Combines Whisper transcripts with vision analysis or OCR from screen shares.
 """
 import argparse
-from pathlib import Path
 import sys
-import json
 import logging
-import subprocess
-import shutil
 
-from meetus.frame_extractor import FrameExtractor
-from meetus.ocr_processor import OCRProcessor
-from meetus.vision_processor import VisionProcessor
-from meetus.transcript_merger import TranscriptMerger
+from meetus.workflow import WorkflowConfig, ProcessingWorkflow
 
-logger = logging.getLogger(__name__)
-
 
 def setup_logging(verbose: bool = False):
-    """
-    Configure logging for the application.
-
-    Args:
-        verbose: If True, set DEBUG level, otherwise INFO
-    """
+    """Configure logging for the application."""
     level = logging.DEBUG if verbose else logging.INFO
 
-    # Configure root logger
     logging.basicConfig(
         level=level,
         format='%(asctime)s - %(levelname)s - %(message)s',
@@ -41,158 +26,121 @@ def setup_logging(verbose: bool = False):
     logging.getLogger('paddleocr').setLevel(logging.WARNING)
 
 
-def run_whisper(video_path: Path, model: str = "base", output_dir: str = "output") -> Path:
-    """
-    Run Whisper transcription on video file.
-
-    Args:
-        video_path: Path to video file
-        model: Whisper model to use (tiny, base, small, medium, large)
-        output_dir: Directory to save output
-
-    Returns:
-        Path to generated JSON transcript
-    """
-    # Check if whisper is installed
-    if not shutil.which("whisper"):
-        logger.error("Whisper is not installed. Install it with: pip install openai-whisper")
-        sys.exit(1)
-
-    logger.info(f"Running Whisper transcription (model: {model})...")
-    logger.info("This may take a few minutes depending on video length...")
-
-    # Run whisper command
-    cmd = [
-        "whisper",
-        str(video_path),
-        "--model", model,
-        "--output_format", "json",
-        "--output_dir", output_dir
-    ]
-
-    try:
-        result = subprocess.run(
-            cmd,
-            check=True,
-            capture_output=True,
-            text=True
-        )
-
-        # Whisper outputs to <output_dir>/<video_stem>.json
-        transcript_path = Path(output_dir) / f"{video_path.stem}.json"
-
-        if transcript_path.exists():
-            logger.info(f"✓ Whisper transcription completed: {transcript_path}")
-            return transcript_path
-        else:
-            logger.error("Whisper completed but transcript file not found")
-            sys.exit(1)
-
-    except subprocess.CalledProcessError as e:
-        logger.error(f"Whisper failed: {e.stderr}")
-        sys.exit(1)
-
-
 def main():
     parser = argparse.ArgumentParser(
         description="Extract screen content from meeting recordings and merge with transcripts",
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""
 Examples:
-  # Run Whisper + vision analysis (recommended for code/dashboards)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
+  # Reference frames for LLM analysis (recommended - transcript includes frame paths)
+  python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
 
-  # Use vision with specific context hint
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
+  # Adjust frame extraction quality (lower = smaller files)
+  python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --embed-quality 60 --scene-detection
 
-  # Traditional OCR approach
-  python process_meeting.py samples/meeting.mkv --run-whisper
+  # Hybrid approach: OpenCV + OCR (extracts text from frames)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --scene-detection
 
-  # Re-run analysis using cached frames and transcript
-  python process_meeting.py samples/meeting.mkv --use-vision
+  # Hybrid + LLM cleanup (best for code formatting)
+  python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid --hybrid-llm-cleanup --scene-detection
 
-  # Force reprocessing (ignore cache)
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
-
-  # Use scene detection for fewer frames
-  python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
+  # Iterate on scene threshold (reuse whisper transcript)
+  python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis
 """
     )
 
+    # Required arguments
     parser.add_argument(
         'video',
         help='Path to video file'
     )
 
+    # Whisper options
     parser.add_argument(
         '--transcript', '-t',
         help='Path to Whisper transcript (JSON or TXT)',
         default=None
     )
-
     parser.add_argument(
         '--run-whisper',
         action='store_true',
         help='Run Whisper transcription before processing'
     )
-
     parser.add_argument(
         '--whisper-model',
         choices=['tiny', 'base', 'small', 'medium', 'large'],
-        help='Whisper model to use (default: base)',
-        default='base'
+        help='Whisper model to use (default: medium)',
+        default='medium'
+    )
+    parser.add_argument(
+        '--diarize',
+        action='store_true',
+        help='Use WhisperX with speaker diarization (requires whisperx and HuggingFace token)'
     )
 
+    # Output options
     parser.add_argument(
         '--output', '-o',
-        help='Output file for enhanced transcript (default: output/<video>_enhanced.txt)',
+        help='Output file for enhanced transcript (default: auto-generated in output directory)',
         default=None
     )
-
     parser.add_argument(
         '--output-dir',
-        help='Directory for output files (default: output/)',
+        help='Base directory for outputs (default: output/)',
         default='output'
     )
 
-    parser.add_argument(
-        '--frames-dir',
-        help='Directory to save extracted frames (default: frames/)',
-        default='frames'
-    )
-
+    # Frame extraction options
     parser.add_argument(
         '--interval',
         type=int,
         help='Extract frame every N seconds (default: 5)',
         default=5
     )
-
     parser.add_argument(
         '--scene-detection',
         action='store_true',
         help='Use scene detection instead of interval extraction'
     )
+    parser.add_argument(
+        '--scene-threshold',
+        type=float,
+        help='Scene detection threshold (0-100, lower=more sensitive, default: 15)',
+        default=15.0
+    )
 
+    # Analysis options
     parser.add_argument(
         '--ocr-engine',
         choices=['tesseract', 'easyocr', 'paddleocr'],
         help='OCR engine to use (default: tesseract)',
         default='tesseract'
     )
-
     parser.add_argument(
         '--use-vision',
         action='store_true',
         help='Use local vision model (Ollama) instead of OCR for better context understanding'
     )
-
+    parser.add_argument(
+        '--use-hybrid',
+        action='store_true',
+        help='Use hybrid approach: OpenCV text detection + OCR (more accurate than vision models)'
+    )
+    parser.add_argument(
+        '--hybrid-llm-cleanup',
+        action='store_true',
+        help='Use LLM to clean up OCR output and preserve code formatting (requires --use-hybrid)'
+    )
+    parser.add_argument(
+        '--hybrid-llm-model',
+        help='LLM model for cleanup (default: llama3.2:3b)',
+        default='llama3.2:3b'
+    )
     parser.add_argument(
         '--vision-model',
         help='Vision model to use with Ollama (default: llava:13b)',
         default='llava:13b'
     )
-
     parser.add_argument(
         '--vision-context',
         choices=['meeting', 'dashboard', 'code', 'console'],
@@ -200,31 +148,56 @@ Examples:
         default='meeting'
     )
 
+    # Processing options
     parser.add_argument(
         '--no-cache',
         action='store_true',
         help='Disable caching - reprocess everything even if outputs exist'
     )
-
+    parser.add_argument(
+        '--skip-cache-frames',
+        action='store_true',
+        help='Skip cached frames, re-extract from video (but keep whisper/analysis cache)'
+    )
+    parser.add_argument(
+        '--skip-cache-whisper',
+        action='store_true',
+        help='Skip cached whisper transcript, re-run transcription (but keep frames/analysis cache)'
+    )
+    parser.add_argument(
+        '--skip-cache-analysis',
+        action='store_true',
+        help='Skip cached analysis, re-run OCR/vision (but keep frames/whisper cache)'
+    )
     parser.add_argument(
         '--no-deduplicate',
         action='store_true',
         help='Disable text deduplication'
     )
-
     parser.add_argument(
         '--extract-only',
         action='store_true',
-        help='Only extract frames and OCR, skip transcript merging'
+        help='Only extract frames and analyze, skip transcript merging'
     )
-
     parser.add_argument(
         '--format',
         choices=['detailed', 'compact'],
         help='Output format style (default: detailed)',
         default='detailed'
     )
+    parser.add_argument(
+        '--embed-images',
+        action='store_true',
+        help='Skip OCR/vision analysis and reference frame files directly (faster, lets LLM analyze images)'
+    )
+    parser.add_argument(
+        '--embed-quality',
+        type=int,
+        help='JPEG quality for extracted frames (default: 80, lower = smaller files)',
+        default=80
+    )
 
+    # Logging
     parser.add_argument(
         '--verbose', '-v',
         action='store_true',
||||||
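The help strings for `--no-cache` and the three `--skip-cache-*` switches describe per-stage cache control. The sketch below shows one plausible way those four flags could combine; it is an assumption for illustration, not the logic in `meetus.workflow`. `CacheFlags` and `stages_using_cache` are hypothetical names, and the rule that `--no-cache` overrides the per-stage switches is inferred from the help text only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CacheFlags:
    """Hypothetical container for the four cache-related CLI switches."""
    no_cache: bool = False
    skip_cache_frames: bool = False
    skip_cache_whisper: bool = False
    skip_cache_analysis: bool = False


def stages_using_cache(flags: CacheFlags) -> Dict[str, bool]:
    """Return, per stage, whether cached output may be reused.

    Assumption: --no-cache disables reuse everywhere, and each
    --skip-cache-* switch disables reuse for its own stage only.
    """
    if flags.no_cache:
        return {"frames": False, "whisper": False, "analysis": False}
    return {
        "frames": not flags.skip_cache_frames,
        "whisper": not flags.skip_cache_whisper,
        "analysis": not flags.skip_cache_analysis,
    }
```

Under these assumptions, the "iterate on scene threshold" example from the epilog (`--skip-cache-frames --skip-cache-analysis`) would re-extract frames and redo analysis while still reusing the Whisper transcript.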
@@ -236,166 +209,38 @@ Examples:
     # Setup logging
     setup_logging(args.verbose)
 
-    # Validate video path
-    video_path = Path(args.video)
-    if not video_path.exists():
-        logger.error(f"Video file not found: {args.video}")
-        sys.exit(1)
+    try:
+        # Create workflow configuration
+        config = WorkflowConfig(**vars(args))
 
-    # Create output directory
-    output_dir = Path(args.output_dir)
-    output_dir.mkdir(parents=True, exist_ok=True)
+        # Run processing workflow
+        workflow = ProcessingWorkflow(config)
+        result = workflow.run()
 
-    # Set default output path
-    if args.output is None:
-        args.output = str(output_dir / f"{video_path.stem}_enhanced.txt")
+        # Print final summary
+        print("\n" + "=" * 80)
+        print("✓ SUCCESS!")
+        print("=" * 80)
+        print(f"Output directory: {result['output_dir']}")
+        if result.get('enhanced_transcript'):
+            print(f"Enhanced transcript ready for AI summarization!")
+        print("=" * 80)
 
-    # Define cache paths
-    whisper_cache = output_dir / f"{video_path.stem}.json"
-    analysis_cache = output_dir / f"{video_path.stem}_{'vision' if args.use_vision else 'ocr'}.json"
-    frames_cache_dir = Path(args.frames_dir)
+        return 0
 
-    # Check for cached Whisper transcript
-    if args.run_whisper:
-        if not args.no_cache and whisper_cache.exists():
-            logger.info(f"✓ Found cached Whisper transcript: {whisper_cache}")
-            args.transcript = str(whisper_cache)
-        else:
-            logger.info("=" * 80)
-            logger.info("STEP 0: Running Whisper Transcription")
-            logger.info("=" * 80)
-            transcript_path = run_whisper(video_path, args.whisper_model, str(output_dir))
-            args.transcript = str(transcript_path)
-            logger.info("")
-
-    logger.info("=" * 80)
-    logger.info("MEETING PROCESSOR")
-    logger.info("=" * 80)
-    logger.info(f"Video: {video_path.name}")
-    logger.info(f"Analysis: {'Vision Model' if args.use_vision else f'OCR ({args.ocr_engine})'}")
-    if args.use_vision:
-        logger.info(f"Vision Model: {args.vision_model}")
-        logger.info(f"Context: {args.vision_context}")
-    logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
-    if args.transcript:
-        logger.info(f"Transcript: {args.transcript}")
-    logger.info(f"Caching: {'Disabled' if args.no_cache else 'Enabled'}")
-    logger.info("=" * 80)
-
-    # Step 1: Extract frames (with caching)
-    logger.info("Step 1: Extracting frames from video...")
-
-    # Check if frames already exist
-    existing_frames = list(frames_cache_dir.glob(f"{video_path.stem}_*.jpg")) if frames_cache_dir.exists() else []
-
-    if not args.no_cache and existing_frames and len(existing_frames) > 0:
-        logger.info(f"✓ Found {len(existing_frames)} cached frames in {args.frames_dir}/")
-        # Build frames_info from existing files
-        frames_info = []
-        for frame_path in sorted(existing_frames):
-            # Try to extract timestamp from filename (e.g., video_00001_12.34s.jpg)
-            try:
-                timestamp_str = frame_path.stem.split('_')[-1].rstrip('s')
-                timestamp = float(timestamp_str)
-            except:
-                timestamp = 0.0
-            frames_info.append((str(frame_path), timestamp))
-    else:
-        extractor = FrameExtractor(str(video_path), args.frames_dir)
-
-        if args.scene_detection:
-            frames_info = extractor.extract_scene_changes()
-        else:
-            frames_info = extractor.extract_by_interval(args.interval)
-
-    if not frames_info:
-        logger.error("No frames extracted")
-        sys.exit(1)
-
-    logger.info(f"✓ Extracted {len(frames_info)} frames")
-
-    # Step 2: Run analysis on frames (with caching)
-    if not args.no_cache and analysis_cache.exists():
-        logger.info(f"✓ Found cached analysis results: {analysis_cache}")
-        with open(analysis_cache, 'r', encoding='utf-8') as f:
-            screen_segments = json.load(f)
-        logger.info(f"✓ Loaded {len(screen_segments)} analyzed frames from cache")
-    else:
-        if args.use_vision:
-            # Use vision model
-            logger.info("Step 2: Running vision analysis on extracted frames...")
-            try:
-                vision = VisionProcessor(model=args.vision_model)
-                screen_segments = vision.process_frames(
-                    frames_info,
-                    context=args.vision_context,
-                    deduplicate=not args.no_deduplicate
-                )
-                logger.info(f"✓ Analyzed {len(screen_segments)} frames with vision model")
-
-            except ImportError as e:
-                logger.error(f"{e}")
-                sys.exit(1)
-        else:
-            # Use OCR
-            logger.info("Step 2: Running OCR on extracted frames...")
-            try:
-                ocr = OCRProcessor(engine=args.ocr_engine)
-                screen_segments = ocr.process_frames(
-                    frames_info,
-                    deduplicate=not args.no_deduplicate
-                )
-                logger.info(f"✓ Processed {len(screen_segments)} frames with OCR")
-
-            except ImportError as e:
-                logger.error(f"{e}")
-                logger.error(f"To install {args.ocr_engine}:")
-                logger.error(f" pip install {args.ocr_engine}")
-                sys.exit(1)
-
-        # Save analysis results as JSON
-        with open(analysis_cache, 'w', encoding='utf-8') as f:
-            json.dump(screen_segments, f, indent=2, ensure_ascii=False)
-        logger.info(f"✓ Saved analysis results to: {analysis_cache}")
-
-    if args.extract_only:
-        logger.info("Done! (extract-only mode)")
-        return
-
-    # Step 3: Merge with transcript (if provided)
-    merger = TranscriptMerger()
-
-    if args.transcript:
-        logger.info("Step 3: Merging with Whisper transcript...")
-        transcript_path = Path(args.transcript)
-
-        if not transcript_path.exists():
-            logger.warning(f"Transcript not found: {args.transcript}")
-            logger.info("Proceeding with screen content only...")
-            audio_segments = []
-        else:
-            audio_segments = merger.load_whisper_transcript(str(transcript_path))
-            logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
-    else:
-        logger.info("No transcript provided, using screen content only...")
-        audio_segments = []
-
-    # Merge and format
-    merged = merger.merge_transcripts(audio_segments, screen_segments)
-    formatted = merger.format_for_claude(merged, format_style=args.format)
-
-    # Save output
-    merger.save_transcript(formatted, args.output)
-
-    logger.info("=" * 80)
-    logger.info("✓ PROCESSING COMPLETE!")
-    logger.info("=" * 80)
-    logger.info(f"Enhanced transcript: {args.output}")
-    logger.info(f"OCR data: {ocr_output}")
-    logger.info(f"Frames: {args.frames_dir}/")
-    logger.info("")
-    logger.info("You can now use the enhanced transcript with Claude for summarization!")
+    except FileNotFoundError as e:
+        logging.error(f"File not found: {e}")
+        return 1
+    except RuntimeError as e:
+        logging.error(f"Processing failed: {e}")
+        return 1
+    except KeyboardInterrupt:
+        logging.warning("\nProcessing interrupted by user")
+        return 130
+    except Exception as e:
+        logging.exception(f"Unexpected error: {e}")
+        return 1
 
 
 if __name__ == '__main__':
-    main()
+    sys.exit(main())
|||||||
@@ -1,6 +1,7 @@
 # Core dependencies
 opencv-python>=4.8.0
 Pillow>=10.0.0
+ffmpeg-python>=0.2.0
 
 # Vision analysis (recommended for better results)
 # Requires Ollama to be installed: https://ollama.ai/download