# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.

## Overview

This tool enhances meeting transcripts by combining:

- **Audio transcription** (Whisper or WhisperX with speaker diarization)
- **Screen content extraction** via FFmpeg scene detection
- **Frame embedding** for direct LLM analysis

The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.

## Installation

### 1. System Dependencies

**Ollama** (required only for the experimental vision analysis):
```bash
# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or for a lighter model:
ollama pull llava:7b
```

**FFmpeg** (required for scene detection and frame extraction):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

**Tesseract OCR** (optional, if not using vision):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Arch Linux
sudo pacman -S tesseract
```

### 2. Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Whisper or WhisperX (for audio transcription)

**Standard Whisper:**
```bash
pip install openai-whisper
```

**WhisperX** (recommended - includes speaker diarization):
```bash
pip install whisperx
```

For speaker diarization, you'll need a HuggingFace token with access to the pyannote models.

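This README doesn't specify how the token is supplied; one common convention (an assumption here, not confirmed by the tool) is to export it as an environment variable before running:

```shell
# Hypothetical: make a HuggingFace token available to WhisperX diarization.
# The variable name HF_TOKEN and the token value are placeholders.
export HF_TOKEN="hf_example_token"
```
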
## Quick Start

### Recommended: Embed Frames with Scene Detection

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
```

This will:
1. Run Whisper transcription (audio → text)
2. Extract frames at scene changes (smarter than fixed intervals)
3. Embed frame references in the transcript for LLM analysis
4. Save everything to the `output/` folder

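Conceptually, the merge step interleaves transcript segments and frame references on a single timeline. A minimal sketch (hypothetical `merge` helper, not the tool's actual implementation):

```python
def merge(segments, frames):
    """Interleave transcript segments and frame references by timestamp.

    segments: list of (start_seconds, text); frames: list of (seconds, path).
    Frames sort before speech at the same timestamp.
    """
    events = [(t, "say", s) for t, s in segments] + [(t, "frame", p) for t, p in frames]
    lines = []
    for t, kind, payload in sorted(events, key=lambda e: (e[0], e[1] == "say")):
        if kind == "say":
            lines.append(f"[{t:.2f}s] {payload}")
        else:
            lines.append(f"[{t:.2f}s] [FRAME: {payload}]")
    return lines

for line in merge([(0.0, "Hello everyone"), (6.2, "Here is the dashboard")],
                  [(5.0, "frames/frame_00001_5.00s.jpg")]):
    print(line)
```

The enhanced transcript then carries both what was said and what was on screen at that moment.
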
### With Speaker Diarization (WhisperX)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
```

This uses WhisperX to identify different speakers in the transcript.

### Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript and frames
python process_meeting.py samples/meeting.mkv --embed-images

# Skip only specific cached items
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
```

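The cache flags compose in a predictable way. A sketch of the decision each pipeline step might make (hypothetical helper, not the tool's actual code in `cache_manager.py`):

```python
def use_cache_for(step: str, no_cache: bool, skips: set) -> bool:
    """Return True if the cached result for `step` should be reused."""
    if no_cache:              # --no-cache invalidates every step
        return False
    return step not in skips  # --skip-cache-<step> invalidates just that step

print(use_cache_for("frames", False, {"frames"}))   # False: re-extract frames
print(use_cache_for("whisper", False, {"frames"}))  # True: reuse transcript
```
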
## Usage Examples

### Scene Detection Options
```bash
# Default scene detection (threshold: 15)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# More sensitive (more frames captured, threshold: 5)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5

# Less sensitive (fewer frames, threshold: 30)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30
```

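FFmpeg's scene-change score is normalised to 0.0-1.0, so a plausible reading of the 0-100 `--scene-threshold` is a simple rescale into an FFmpeg `select` filter (an assumption - the tool's actual mapping isn't shown in this README):

```python
def scene_select_filter(threshold: int) -> str:
    """Build an FFmpeg -vf expression selecting frames at scene changes."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be in 0..100")
    score = threshold / 100.0  # FFmpeg's scene score lives in [0, 1]
    return f"select='gt(scene,{score})'"

print(scene_select_filter(15))  # select='gt(scene,0.15)'
```

Under that assumption, the equivalent raw command would look something like `ffmpeg -i meeting.mkv -vf "select='gt(scene,0.15)'" -vsync vfr frames/frame_%05d.jpg`.
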

### Fixed Interval Extraction (alternative to scene detection)
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3
```

### Frame Quality Options
```bash
# Default quality (80)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Lower quality for smaller files (60)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60
```

### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Iterate on scene threshold (reuse whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

# Re-run whisper only
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
```

### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/
```

### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose
```

## Output Files

```
output/
└── 20241019_143022-meeting/
    ├── manifest.json            # Processing configuration
    ├── meeting_enhanced.txt     # Enhanced transcript for AI
    ├── meeting.json             # Whisper/WhisperX transcript
    └── frames/                  # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
```

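The frame filenames above encode a sequence number and a timestamp. If you post-process frames yourself, they can be parsed back out (pattern inferred from the listing, not guaranteed by the tool):

```python
import re

# Parse names like frame_00002_10.00s.jpg into (index, seconds).
FRAME_RE = re.compile(r"frame_(\d+)_([\d.]+)s\.jpg$")

def parse_frame_name(name: str):
    m = FRAME_RE.search(name)
    if not m:
        raise ValueError(f"not a frame filename: {name}")
    return int(m.group(1)), float(m.group(2))

print(parse_frame_name("frame_00002_10.00s.jpg"))  # (2, 10.0)
```
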
### Manifest File

Each processing run creates a `manifest.json` that tracks:
- Video information (name, path)
- Processing timestamp
- Configuration used (Whisper model, vision settings, etc.)
- Output file locations

Example manifest:
```json
{
  "video": {
    "name": "meeting.mkv",
    "path": "/full/path/to/meeting.mkv"
  },
  "processed_at": "2024-10-19T14:30:22",
  "configuration": {
    "whisper": {"enabled": true, "model": "base"},
    "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"}
  }
}
```

### Caching Behavior

The tool automatically reuses the most recent output directory for the same video:
- **First run**: Creates a new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Subsequent runs**: Reuses the same directory and cached results
- **Cached items**: Whisper transcript, extracted frames, analysis results
- **Force new run**: Use `--no-cache` to create a fresh directory

**Fine-grained cache control:**
- `--no-cache`: Force complete reprocessing
- `--skip-cache-frames`: Re-extract frames only
- `--skip-cache-whisper`: Re-run transcription only
- `--skip-cache-analysis`: Re-run analysis only

This allows you to iterate on scene detection thresholds without re-running Whisper!

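The directory-reuse rule can be sketched as follows (the `YYYYMMDD_HHMMSS-video` naming is taken from above; the actual lookup lives in `output_manager.py`/`cache_manager.py` and may differ):

```python
import tempfile
from pathlib import Path
from typing import Optional

def latest_output_dir(output_root: Path, video_stem: str) -> Optional[Path]:
    """Return the most recent timestamped dir for this video, if any."""
    # Dirs look like 20241019_143022-meeting; the timestamp prefix sorts
    # lexicographically, so max() by name picks the newest run.
    candidates = [d for d in output_root.glob(f"*-{video_stem}") if d.is_dir()]
    return max(candidates, default=None, key=lambda d: d.name)

# Example with a temporary layout:
root = Path(tempfile.mkdtemp())
(root / "20241019_143022-meeting").mkdir()
(root / "20241020_090000-meeting").mkdir()
print(latest_output_dir(root, "meeting").name)  # 20241020_090000-meeting
```
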
## Workflow for Meeting Analysis

### Complete Workflow (One Command!)

```bash
# Process everything in one step with scene detection
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# With speaker diarization
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
```

### Typical Iterative Workflow

```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Adjust scene threshold (keeps cached whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis

# Try different frame quality
python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis
```

### Example Prompt for Claude

```
Please summarize this meeting transcript. Pay special attention to:
```

## Command Line Options

```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--interval INTERVAL] [--scene-detection]
                          [--scene-threshold SCENE_THRESHOLD]
                          [--embed-images] [--embed-quality EMBED_QUALITY]
                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
                          [--skip-cache-analysis] [--no-deduplicate]
                          [--extract-only] [--format {detailed,compact}]
                          [--verbose] video

Main Options:
  video                  Path to video file
  --run-whisper          Run Whisper transcription before processing
  --whisper-model        Whisper model: tiny, base, small, medium, large (default: medium)
  --diarize              Use WhisperX with speaker diarization
  --embed-images         Embed frame references for LLM analysis (recommended)
  --embed-quality        JPEG quality for frames (default: 80)

Frame Extraction:
  --scene-detection      Use FFmpeg scene detection (recommended)
  --scene-threshold      Detection sensitivity 0-100 (default: 15, lower = more sensitive)
  --interval             Extract a frame every N seconds (alternative to scene detection)

Caching:
  --no-cache             Force complete reprocessing
  --skip-cache-frames    Re-extract frames only
  --skip-cache-whisper   Re-run transcription only
  --skip-cache-analysis  Re-run analysis only

Other:
  --transcript, -t       Path to existing Whisper transcript (JSON or TXT)
  --output, -o           Output file for enhanced transcript
  --output-dir           Directory for output files (default: output/)
  --frames-dir           Directory to save extracted frames (default: frames/)
  --ocr-engine           OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
  --no-deduplicate       Disable text deduplication
  --extract-only         Only extract frames and OCR, skip transcript merging
  --format               Output format: detailed or compact (default: detailed)
  --verbose, -v          Enable verbose logging
```

## Tips for Best Results

### Vision vs OCR: When to Use Each

**Use Vision Models (`--use-vision`) when:**
- ✅ Analyzing dashboards (Grafana, GCP Console, monitoring tools)
- ✅ Code walkthroughs or debugging sessions
- ✅ Complex layouts with mixed content
- ✅ You need contextual understanding, not just text extraction
- ✅ Working with charts, graphs, or visualizations
- ⚠️ Trade-off: slower (requires GPU/CPU for the local model)

**Use OCR when:**
- ✅ Simple text extraction from slides or documents
- ✅ You need maximum speed
- ✅ Limited computational resources
- ✅ Presentations with mostly text
- ⚠️ Trade-off: less context-aware, may miss visual relationships

### Context Hints for Vision Analysis
|
||||
- **`--vision-context meeting`**: General purpose (default)
|
||||
- **`--vision-context code`**: Optimized for code screenshots, preserves formatting
|
||||
- **`--vision-context dashboard`**: Extracts metrics, trends, panel names
|
||||
- **`--vision-context console`**: Captures commands, output, error messages
|
||||
|
||||
**Customizing Prompts:**
|
||||
Prompts are stored as editable text files in `meetus/prompts/`:
|
||||
- `meeting.txt` - General meeting analysis
|
||||
- `code.txt` - Code screenshot analysis
|
||||
- `dashboard.txt` - Dashboard/monitoring analysis
|
||||
- `console.txt` - Terminal/console analysis
|
||||
|
||||
Just edit these files to customize how the vision model analyzes your frames!
|
||||
|
||||
### Scene Detection vs Interval
- **Scene detection** (`--scene-detection`): Recommended. Captures frames when content changes. More efficient.
- **Interval extraction** (`--interval N`): Alternative for continuous content. Captures every N seconds.

### Vision Model Selection
- **`llava:7b`**: Faster, lower memory (~4GB RAM), good quality
- **`llava:13b`**: Better quality, slower, needs ~8GB RAM (default)
- **`bakllava`**: Alternative with different strengths

### Scene Detection Threshold
- Lower values (5-10): More sensitive, captures more frames
- Default (15): Good balance for most meetings
- Higher values (20-30): Less sensitive, fewer frames

### Whisper vs WhisperX
- **Whisper** (`--run-whisper`): Standard transcription, fast
- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification, requires a HuggingFace token

### Frame Quality
- The default quality (80) works well for most cases
- Use `--embed-quality 60` for smaller files if storage is a concern

### Deduplication
- Enabled by default - removes similar consecutive frames

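As a rough sketch, consecutive-frame deduplication can be as simple as comparing content hashes (illustrative only; the tool may use a smarter similarity measure than exact equality):

```python
import hashlib

def dedupe_consecutive(frames):
    """Drop frames whose bytes are identical to the previously kept frame."""
    kept, last = [], None
    for data in frames:
        digest = hashlib.sha256(data).hexdigest()
        if digest != last:  # only exact consecutive repeats are dropped here
            kept.append(data)
            last = digest
    return kept

frames = [b"slide-1", b"slide-1", b"slide-2", b"slide-1"]
print(len(dedupe_consecutive(frames)))  # 3
```

Note that the non-consecutive repeat of `slide-1` is kept, since it represents the screen returning to earlier content.
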
## Troubleshooting

### Vision Model Issues

**"ollama package not installed"**
```bash
pip install ollama
```

**"Ollama not found" or connection errors**
```bash
# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
```

**Vision analysis is slow**
- Use a lighter model: `--vision-model llava:7b`
- Reduce the frame count: `--scene-detection` or `--interval 10`
- Check whether Ollama is using the GPU (much faster)

**Poor vision analysis results**
- Try a different context hint: `--vision-context code` or `--vision-context dashboard`
- Use a larger model: `--vision-model llava:13b`
- Ensure frames are clear (check the video resolution)

### OCR Issues

**"pytesseract not installed"**
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr  # Don't forget the system package!
```

**Poor OCR quality**
- **Solution**: Switch to vision analysis with `--use-vision`
- Or try a different OCR engine: `--ocr-engine easyocr`
- Check that the video resolution is sufficient
- Use `--no-deduplicate` to keep more frames

### Frame Extraction Issues

**"No frames extracted"**
- Check that the video file is valid: `ffmpeg -i video.mkv`
- Try a lower scene threshold: `--scene-threshold 5`
- Try interval extraction: `--interval 3`
- Check disk space in the output directory

**Scene detection not working**
- Ensure FFmpeg is installed
- Falls back to interval extraction automatically
- Try a manual interval: `--interval 5`

### Whisper/WhisperX Issues

**WhisperX diarization not working**
- Ensure you have a HuggingFace token set
- The token needs access to the pyannote models
- Fall back to standard Whisper without `--diarize`

### Cache Issues

**Cache not being used**
- Ensure you're using the same video filename
- Check that the output directory contains cached files
- Use `--verbose` to see what's being cached/loaded

**Want to re-run specific steps**
- `--skip-cache-frames`: Re-extract frames
- `--skip-cache-whisper`: Re-run transcription
- `--skip-cache-analysis`: Re-run analysis
- `--no-cache`: Force complete reprocessing

## Experimental Features

### OCR and Vision Analysis

OCR (`--ocr-engine`) and vision analysis (`--use-vision`) options are available but experimental. The recommended approach is to use `--embed-images`, which embeds frame references directly in the transcript and lets your LLM analyze the images.

```bash
# Experimental: OCR extraction
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract

# Experimental: Vision model analysis
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Experimental: Hybrid OpenCV + OCR
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
```

## Project Structure

```
meetus/
│   ├── workflow.py            # Processing orchestrator
│   ├── output_manager.py      # Output directory & manifest management
│   ├── cache_manager.py       # Caching logic
│   ├── frame_extractor.py     # Video frame extraction (FFmpeg scene detection)
│   ├── vision_processor.py    # Vision model analysis (experimental)
│   ├── ocr_processor.py       # OCR processing (experimental)
│   └── transcript_merger.py   # Transcript merging
├── process_meeting.py         # Main CLI script
├── requirements.txt           # Python dependencies
├── output/                    # Timestamped output directories
│   ├── .gitkeep
│   └── YYYYMMDD_HHMMSS-video/ # Auto-generated per video
├── samples/                   # Sample videos (gitignored)
└── README.md                  # This file
```

The code is modular and easy to extend - each module has a single responsibility.

## License

For personal use.