# Meeting Processor
Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.
## Overview
This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content analysis** (Vision models or OCR)
### Vision Analysis vs OCR
- **Vision Models** (recommended): Use a local LLaVA model via Ollama to understand context - great for dashboards, code, and consoles
- **OCR**: Traditional text extraction - faster but less context-aware
The result is a rich, timestamped transcript that provides full context for AI summarization.
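Conceptually, the merge step interleaves the two timestamped streams - spoken segments and per-frame screen analyses - into one chronological transcript. A minimal sketch (field names like `start`, `timestamp`, and `description` are illustrative, not the tool's actual schema):

```python
def merge_transcript(segments, frames):
    """Interleave Whisper segments and frame analyses chronologically.

    segments: [{"start": seconds, "text": ...}]   (illustrative shape)
    frames:   [{"timestamp": seconds, "description": ...}]
    """
    events = [(s["start"], f"[{s['start']:.0f}s] SPEECH: {s['text']}") for s in segments]
    events += [(f["timestamp"], f"[{f['timestamp']:.0f}s] SCREEN: {f['description']}") for f in frames]
    events.sort(key=lambda e: e[0])
    return "\n".join(line for _, line in events)
```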
## Installation
### 1. System Dependencies
**Ollama** (required for vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for lighter model:
ollama pull llava:7b
```
**FFmpeg** (for scene detection):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
```
**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Arch Linux
sudo pacman -S tesseract
```
### 2. Python Dependencies
```bash
pip install -r requirements.txt
```
### 3. Whisper (for audio transcription)
```bash
pip install openai-whisper
```
### 4. Optional: Install Alternative OCR Engines
If you prefer OCR over vision analysis:
```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr
# PaddleOCR (better for code/terminal screens)
pip install paddleocr
```
## Quick Start
### Recommended: Vision Analysis (Best for Code/Dashboards)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
```
This will:
1. Run Whisper transcription (audio → text)
2. Extract frames every 5 seconds
3. Use LLaVA vision model to analyze frames with context
4. Merge audio + screen content
5. Save everything to `output/` folder
### Re-run with Cached Results
Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision
# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
### Traditional OCR (Faster, Less Context-Aware)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper
```
## Usage Examples
### Vision Analysis with Context Hints
```bash
# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code
# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard
# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console
```
### Different Vision Models
```bash
# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b
# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b
# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava
```
### Extract frames at different intervals
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10
# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3
```
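Interval extraction is simple to reason about: one frame every N seconds, named to match the `frame_NNNNN_SS.SSs.jpg` pattern shown under Output Files. A sketch of the schedule (the actual extractor may differ):

```python
def frame_schedule(duration, interval=5.0):
    """Return (timestamp, filename) pairs for interval extraction,
    following the frame_NNNNN_SS.SSs.jpg naming from the output tree."""
    return [
        (n * interval, f"frame_{n:05d}_{n * interval:.2f}s.jpg")
        for n in range(1, int(duration // interval) + 1)
    ]
```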
### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
```
### Traditional OCR (if you prefer)
```bash
# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper
# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr
# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr
```
### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv
# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision
# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache
```
### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/
```
### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose
```
## Output Files
Each video gets its own timestamped output directory:
```
output/
└── 20241019_143022-meeting/
    ├── manifest.json            # Processing configuration
    ├── meeting_enhanced.txt     # Enhanced transcript for AI
    ├── meeting.json             # Whisper transcript
    ├── meeting_vision.json      # Vision analysis results
    └── frames/                  # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
```
### Manifest File
Each processing run creates a `manifest.json` that tracks:
- Video information (name, path)
- Processing timestamp
- Configuration used (Whisper model, vision settings, etc.)
- Output file locations
Example manifest:
```json
{
  "video": {
    "name": "meeting.mkv",
    "path": "/full/path/to/meeting.mkv"
  },
  "processed_at": "2024-10-19T14:30:22",
  "configuration": {
    "whisper": {"enabled": true, "model": "base"},
    "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"}
  }
}
```
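Writing such a manifest is straightforward with the standard library. A hedged sketch (`write_manifest` is a hypothetical helper; the tool's exact keys may differ from this example):

```python
import datetime
import json
from pathlib import Path

def write_manifest(out_dir, video_path, config):
    """Write a manifest.json like the example above into out_dir."""
    manifest = {
        "video": {"name": Path(video_path).name, "path": str(Path(video_path).resolve())},
        "processed_at": datetime.datetime.now().isoformat(timespec="seconds"),
        "configuration": config,
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```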
### Caching Behavior
The tool automatically reuses the most recent output directory for the same video:
- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Subsequent runs**: Reuses the same directory and cached results
- **Cached items**: Whisper transcript, extracted frames, analysis results
- **Force new run**: Use `--no-cache` to create a fresh directory
This means you can instantly switch between OCR and vision analysis without re-extracting frames!
## Workflow for Meeting Analysis
### Complete Workflow (One Command!)
```bash
# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection
# Output lands in the timestamped directory, e.g. output/<timestamp>-alo-intro1/alo-intro1_enhanced.txt
```
### Typical Iterative Workflow
```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision
# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code
# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b
# All use cached frames and transcript!
```
### Traditional Workflow (Separate Steps)
```bash
# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output
# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection
# 3. Use the enhanced transcript with AI
# Copy the content from output/alo-intro1_enhanced.txt and paste into Claude or your LLM
```
### Example Prompt for Claude
```
Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented
[Paste enhanced transcript here]
```
## Command Reference
```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
                          [--scene-detection] [--use-vision]
                          [--vision-model VISION_MODEL]
                          [--vision-context {meeting,code,dashboard,console}]
                          [--ocr-engine {tesseract,easyocr,paddleocr}]
                          [--no-deduplicate] [--no-cache] [--extract-only]
                          [--format {detailed,compact}] [--verbose]
                          video

Options:
  video              Path to video file
  --transcript, -t   Path to Whisper transcript (JSON or TXT)
  --run-whisper      Run Whisper transcription before processing
  --whisper-model    Whisper model: tiny, base, small, medium, large (default: base)
  --output, -o       Output file for enhanced transcript
  --output-dir       Directory for output files (default: output/)
  --frames-dir       Directory to save extracted frames (default: frames/)
  --interval         Extract frame every N seconds (default: 5)
  --scene-detection  Use scene detection instead of interval extraction
  --use-vision       Analyze frames with a local vision model instead of OCR
  --vision-model     Ollama vision model (default: llava:13b)
  --vision-context   Prompt hint: meeting, code, dashboard, console (default: meeting)
  --ocr-engine       OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
  --no-deduplicate   Disable text deduplication
  --no-cache         Ignore cached results and create a fresh output directory
  --extract-only     Only extract frames and OCR, skip transcript merging
  --format           Output format: detailed or compact (default: detailed)
  --verbose, -v      Enable verbose logging (DEBUG level)
```
## Tips for Best Results
### Vision vs OCR: When to Use Each
**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ Need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: Slower (runs a local model on your GPU/CPU)
**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ Need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: Less context-aware, may miss visual relationships
### Context Hints for Vision Analysis
- **`--vision-context meeting`**: General purpose (default)
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
- **`--vision-context console`**: Captures commands, output, error messages
**Customizing Prompts:**
Prompts are stored as editable text files in `meetus/prompts/`:
- `meeting.txt` - General meeting analysis
- `code.txt` - Code screenshot analysis
- `dashboard.txt` - Dashboard/monitoring analysis
- `console.txt` - Terminal/console analysis
Just edit these files to customize how the vision model analyzes your frames!
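Loading the right prompt for a context hint amounts to a file lookup with a sensible fallback. A sketch (`load_prompt` and the fallback behavior are illustrative, not necessarily the tool's implementation):

```python
from pathlib import Path

def load_prompt(context="meeting", prompt_dir=Path("meetus/prompts")):
    """Load the editable prompt for a --vision-context hint, falling back
    to the general meeting prompt if no file exists for the hint."""
    path = Path(prompt_dir) / f"{context}.txt"
    if not path.exists():
        path = Path(prompt_dir) / "meeting.txt"
    return path.read_text()
```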
### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.
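Under the hood, scene detection can be done with FFmpeg's `select` filter, which scores each frame's difference from the previous one. A sketch of building that command (the 0.4 threshold is a common starting point, not necessarily the tool's setting; run the result with `subprocess.run(cmd, check=True)`):

```python
from pathlib import Path

def scene_detect_cmd(video, frames_dir, threshold=0.4):
    """Build an ffmpeg command that keeps only frames whose scene-change
    score exceeds `threshold`; -vsync vfr drops the filtered-out frames."""
    return [
        "ffmpeg", "-i", str(video),
        "-vf", f"select='gt(scene,{threshold})'",
        "-vsync", "vfr",
        str(Path(frames_dir) / "frame_%05d.jpg"),
    ]
```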
### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths
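A single frame analysis through the `ollama` Python package looks roughly like this - the `images` field attaches the frame to the prompt. This is a hedged sketch assuming a running Ollama server with the model pulled; `build_messages` and `analyze_frame` are illustrative names, not the tool's API:

```python
def build_messages(prompt, image_path):
    """Payload for ollama.chat: 'images' attaches the frame to the prompt."""
    return [{"role": "user", "content": prompt, "images": [image_path]}]

def analyze_frame(image_path, prompt, model="llava:13b"):
    import ollama  # pip install ollama; requires a running Ollama server
    response = ollama.chat(model=model, messages=build_messages(prompt, image_path))
    return response["message"]["content"]
```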
### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly
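One simple way to deduplicate is to compare each frame's extracted text against the last kept one. A sketch using `difflib` (the 0.9 similarity threshold is illustrative, not the tool's actual setting):

```python
from difflib import SequenceMatcher

def deduplicate(frame_texts, threshold=0.9):
    """Drop consecutive frame texts that are near-duplicates of the
    previously kept one."""
    kept = []
    for text in frame_texts:
        if kept and SequenceMatcher(None, kept[-1], text).ratio() >= threshold:
            continue  # too similar to the last kept frame; skip it
        kept.append(text)
    return kept
```

This is why subtly changing slides may warrant `--no-deduplicate`: small diffs score as near-duplicates and get dropped.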
## Troubleshooting
### Vision Model Issues
**"ollama package not installed"**
```bash
pip install ollama
```
**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```
**Vision analysis is slow**
- Use lighter model: `--vision-model llava:7b`
- Reduce frame count: `--scene-detection` or `--interval 10`
- Check if Ollama is using GPU (much faster)
**Poor vision analysis results**
- Try different context hint: `--vision-context code` or `--vision-context dashboard`
- Use larger model: `--vision-model llava:13b`
- Ensure frames are clear (check video resolution)
### OCR Issues
**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr # Don't forget system package!
```
**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try different OCR engine: `--ocr-engine easyocr`
- Check if video resolution is sufficient
- Use `--no-deduplicate` to keep more frames
### General Issues
**"No frames extracted"**
- Check video file is valid: `ffmpeg -i video.mkv`
- Try lower interval: `--interval 3`
- Check disk space in frames directory
**Scene detection not working**
- The tool falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try manual interval: `--interval 5`
**Cache not being used**
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use `--verbose` to see what's being cached/loaded
## Project Structure
```
meetus/
├── meetus/                    # Main package
│   ├── __init__.py
│   ├── workflow.py            # Processing orchestrator
│   ├── output_manager.py      # Output directory & manifest management
│   ├── cache_manager.py       # Caching logic
│   ├── frame_extractor.py     # Video frame extraction
│   ├── vision_processor.py    # Vision model analysis (Ollama/LLaVA)
│   ├── ocr_processor.py       # OCR processing
│   ├── transcript_merger.py   # Transcript merging
│   └── prompts/               # Vision analysis prompts (editable!)
│       ├── meeting.txt        # General meeting analysis
│       ├── code.txt           # Code screenshot analysis
│       ├── dashboard.txt      # Dashboard/monitoring analysis
│       └── console.txt        # Terminal/console analysis
├── process_meeting.py         # Main CLI script (thin wrapper)
├── requirements.txt           # Python dependencies
├── output/                    # Timestamped output directories
│   ├── .gitkeep
│   └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video
├── samples/                   # Sample videos (gitignored)
└── README.md                  # This file
```
The code is modular and easy to extend - each module has a single responsibility.
## License
For personal use.