updated readme

Commit 331cccb15f (parent 7d7ec15ff7)
Author: Mariano Gabriel
Date: 2025-12-04 20:15:16 -03:00

README.md
# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.

## Overview

This tool enhances meeting transcripts by combining:

- **Audio transcription** (Whisper or WhisperX with speaker diarization)
- **Screen content extraction** via FFmpeg scene detection
- **Frame embedding** for direct LLM analysis

The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.
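The merge step can be pictured with a minimal sketch. The data shapes here are hypothetical (field names like `start`, `end`, `text` are assumptions, not the tool's actual schema): each frame is attached to the transcript segment whose time range contains it.

```python
# Hypothetical sketch of the transcript/frame merge idea.
# Field names ("start", "end", "text") are assumptions, not the tool's actual schema.

def merge(segments, frames):
    """Attach each (timestamp, path) frame to the transcript segment it falls in."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.1f}s] {seg['text']}")
        for ts, path in frames:
            if seg["start"] <= ts < seg["end"]:
                lines.append(f"  [frame @ {ts:.1f}s: {path}]")
    return "\n".join(lines)

segments = [
    {"start": 0.0, "end": 6.0, "text": "Let's look at the dashboard."},
    {"start": 6.0, "end": 12.0, "text": "The error rate spiked at nine."},
]
frames = [
    (5.0, "frames/frame_00001_5.00s.jpg"),
    (10.0, "frames/frame_00002_10.00s.jpg"),
]

print(merge(segments, frames))
```

The real output format may differ; the point is that frame references are interleaved at the right timestamps so an LLM sees what was on screen while each sentence was spoken.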
## Installation

### 1. System Dependencies

**FFmpeg** (required for scene detection and frame extraction):

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```
### 2. Python Dependencies

```bash
pip install -r requirements.txt
```
### 3. Whisper or WhisperX (for audio transcription)

**Standard Whisper:**

```bash
pip install openai-whisper
```

**WhisperX** (recommended - includes speaker diarization):

```bash
pip install whisperx
```

For speaker diarization, you'll need a HuggingFace token with access to the pyannote models.
## Quick Start

### Recommended: Embed Frames with Scene Detection

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
```

This will:

1. Run Whisper transcription (audio → text)
2. Extract frames at scene changes (smarter than fixed intervals)
3. Embed frame references in the transcript for LLM analysis
4. Save everything to the `output/` folder

### With Speaker Diarization (WhisperX)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
```

This uses WhisperX to identify different speakers in the transcript.

### Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:

```bash
# Uses cached transcript and frames
python process_meeting.py samples/meeting.mkv --embed-images

# Skip only specific cached items
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
```
## Usage Examples

### Scene Detection Options

```bash
# Default scene detection (threshold: 15)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# More sensitive (more frames captured, threshold: 5)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5

# Less sensitive (fewer frames, threshold: 30)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30
```

### Fixed Interval Extraction (alternative to scene detection)

```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3
```

### Frame Quality Options

```bash
# Default quality (80)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Lower quality for smaller files (60)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60
```

### Caching Examples

```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Iterate on scene threshold (reuse whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

# Re-run whisper only
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
```

### Custom output location

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/
```

### Enable verbose logging

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose
```
## Output Files

```
output/
└── 20241019_143022-meeting/
    ├── manifest.json            # Processing configuration
    ├── meeting_enhanced.txt     # Enhanced transcript for AI
    ├── meeting.json             # Whisper/WhisperX transcript
    └── frames/                  # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
```
### Caching Behavior

The tool automatically reuses the most recent output directory for the same video:

- **First run**: Creates a new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Subsequent runs**: Reuses the same directory and cached results
- **Cached items**: Whisper transcript, extracted frames, analysis results

**Fine-grained cache control:**

- `--no-cache`: Force complete reprocessing
- `--skip-cache-frames`: Re-extract frames only
- `--skip-cache-whisper`: Re-run transcription only
- `--skip-cache-analysis`: Re-run analysis only

This allows you to iterate on scene detection thresholds without re-running Whisper!
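The flag interaction can be sketched as a small decision function. This is an illustrative model only; the actual logic lives in `cache_manager.py` and may differ:

```python
# Illustrative model of the cache-control flags; not the tool's actual implementation.

def steps_to_rerun(no_cache=False, skip_frames=False, skip_whisper=False, skip_analysis=False):
    """Return which pipeline steps run fresh instead of loading cached results."""
    if no_cache:
        # --no-cache trumps everything: redo the whole pipeline.
        return {"whisper", "frames", "analysis"}
    rerun = set()
    if skip_frames:
        rerun.add("frames")
    if skip_whisper:
        rerun.add("whisper")
    if skip_analysis:
        rerun.add("analysis")
    return rerun

# Iterating on scene thresholds: redo frames + analysis, keep the transcript.
print(steps_to_rerun(skip_frames=True, skip_analysis=True))
```

With no flags set, every step loads from cache; transcription, the slowest step, only reruns when explicitly requested.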
## Workflow for Meeting Analysis

### Complete Workflow (One Command!)

```bash
# Process everything in one step with scene detection
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# With speaker diarization
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
```

### Typical Iterative Workflow

```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Adjust scene threshold (keeps cached whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis

# Try different frame quality
python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis
```
### Example Prompt for Claude

```
Please summarize this meeting transcript. Pay special attention to:
...
```
```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--interval INTERVAL] [--scene-detection]
                          [--scene-threshold SCENE_THRESHOLD]
                          [--embed-images] [--embed-quality EMBED_QUALITY]
                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
                          [--skip-cache-analysis] [--no-deduplicate]
                          [--extract-only] [--format {detailed,compact}]
                          [--verbose]
                          video

Main Options:
  video                  Path to video file
  --run-whisper          Run Whisper transcription before processing
  --whisper-model        Whisper model: tiny, base, small, medium, large (default: medium)
  --diarize              Use WhisperX with speaker diarization
  --embed-images         Embed frame references for LLM analysis (recommended)
  --embed-quality        JPEG quality for frames (default: 80)

Frame Extraction:
  --scene-detection      Use FFmpeg scene detection (recommended)
  --scene-threshold      Detection sensitivity 0-100 (default: 15, lower=more sensitive)
  --interval             Extract frame every N seconds (alternative to scene detection)

Caching:
  --no-cache             Force complete reprocessing
  --skip-cache-frames    Re-extract frames only
  --skip-cache-whisper   Re-run transcription only
  --skip-cache-analysis  Re-run analysis only

Other:
  --transcript, -t       Path to existing Whisper transcript (JSON or TXT)
  --output, -o           Output file for enhanced transcript
  --output-dir           Directory for output files (default: output/)
  --verbose, -v          Enable verbose logging
```
## Tips for Best Results

### Scene Detection vs Interval

- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures a frame every N seconds.

### Scene Detection Threshold

- Lower values (5-10): More sensitive, captures more frames
- Default (15): Good balance for most meetings
- Higher values (20-30): Less sensitive, fewer frames
### Whisper vs WhisperX
- **Whisper** (`--run-whisper`): Standard transcription, fast
- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires HuggingFace token
### Frame Quality
- Default quality (80) works well for most cases
- Use `--embed-quality 60` for smaller files if storage is a concern
### Deduplication

- Enabled by default - removes similar consecutive frames
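Deduplication can be as simple as comparing coarse brightness fingerprints of consecutive frames. This is a toy sketch of the idea, not the tool's actual algorithm:

```python
# Toy dedup sketch: drop a frame when its fingerprint nearly matches the previous kept one.
# Real frames would be decoded images; here each "frame" is a tiny grayscale pixel list.

def fingerprint(pixels):
    """Average-hash: 1 where a pixel is brighter than the frame's mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def deduplicate(frames, max_diff=2):
    """Keep a frame only if its hash differs enough from the last kept frame's."""
    kept, last = [], None
    for name, pixels in frames:
        fp = fingerprint(pixels)
        if last is None or sum(a != b for a, b in zip(fp, last)) > max_diff:
            kept.append(name)
            last = fp
    return kept

frames = [
    ("f1", [10, 200, 10, 200]),
    ("f2", [12, 198, 11, 199]),   # near-duplicate of f1 -> dropped
    ("f3", [200, 10, 200, 10]),   # distinct content -> kept
]
print(deduplicate(frames))        # ['f1', 'f3']
```

Passing `--no-deduplicate` would correspond to skipping this comparison and keeping every extracted frame.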
## Troubleshooting

### Frame Extraction Issues
**"No frames extracted"**

- Check video file is valid: `ffmpeg -i video.mkv`
- Try a lower scene threshold: `--scene-threshold 5`
- Try interval extraction: `--interval 3`
- Check disk space in the output directory

**Scene detection not working**

- Ensure FFmpeg is installed
- Falls back to interval extraction automatically
- Try a manual interval: `--interval 5`
### Whisper/WhisperX Issues
**WhisperX diarization not working**
- Ensure you have a HuggingFace token set
- Token needs access to pyannote models
- Fall back to standard Whisper without `--diarize`
### Cache Issues
**Cache not being used**

- Ensure you're using the same video filename
- Check that the output directory contains cached files
- Use `--verbose` to see what's being cached/loaded
**Want to re-run specific steps**
- `--skip-cache-frames`: Re-extract frames
- `--skip-cache-whisper`: Re-run transcription
- `--skip-cache-analysis`: Re-run analysis
- `--no-cache`: Force complete reprocessing
## Experimental Features
### OCR and Vision Analysis
OCR (`--ocr-engine`) and Vision analysis (`--use-vision`) options are available but experimental. The recommended approach is to use `--embed-images` which embeds frame references directly in the transcript, letting your LLM analyze the images.
```bash
# Experimental: OCR extraction
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract
# Experimental: Vision model analysis
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Experimental: Hybrid OpenCV + OCR
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
```
## Project Structure

```
meetus/
├── meetus/
│   ├── workflow.py           # Processing orchestrator
│   ├── output_manager.py     # Output directory & manifest management
│   ├── cache_manager.py      # Caching logic
│   ├── frame_extractor.py    # Video frame extraction (FFmpeg scene detection)
│   ├── vision_processor.py   # Vision model analysis (experimental)
│   ├── ocr_processor.py      # OCR processing (experimental)
│   └── transcript_merger.py  # Transcript merging
├── process_meeting.py        # Main CLI script
├── requirements.txt          # Python dependencies
├── output/                   # Timestamped output directories
│   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
├── samples/                  # Sample videos (gitignored)
└── README.md                 # This file
```
## License

For personal use.