
Meeting Processor

Extract screen content from meeting recordings and merge with Whisper transcripts for better AI summarization.

Overview

This tool enhances meeting transcripts by combining:

  • Audio transcription (from Whisper)
  • Screen content analysis (Vision models or OCR)

Vision Analysis vs OCR

  • Vision Models (recommended): use a local LLaVA model via Ollama to understand context - great for dashboards, code, and consoles
  • OCR: traditional text extraction - faster but less context-aware

The result is a rich, timestamped transcript that provides full context for AI summarization.
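The merge itself boils down to interleaving the two streams by timestamp. The sketch below is illustrative (the tuple shapes and function name are assumptions, not the tool's actual data model):

```python
def merge_by_timestamp(segments, frames):
    """Interleave Whisper segments with screen-content notes, chronologically.

    segments: list of (start_seconds, text) from the audio transcript.
    frames:   list of (timestamp_seconds, screen_description).
    """
    events = [(t, "AUDIO", text) for t, text in segments]
    events += [(t, "SCREEN", desc) for t, desc in frames]
    events.sort(key=lambda e: e[0])  # one timeline ordered by timestamp
    return [f"[{t:7.2f}s {kind}] {body}" for t, kind, body in events]
```

Feeding the result to an LLM gives it both what was said and what was on screen at that moment.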

Installation

1. System Dependencies

Ollama (required for vision analysis):

# Install from https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b
# or, for a lighter model:
ollama pull llava:7b

FFmpeg (for scene detection):

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

Tesseract OCR (optional, if not using vision):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Arch Linux
sudo pacman -S tesseract

2. Python Dependencies

pip install -r requirements.txt

3. Whisper (for audio transcription)

pip install openai-whisper

4. Optional: Install Alternative OCR Engines

If you prefer OCR over vision analysis:

# EasyOCR (better for rotated/handwritten text)
pip install easyocr

# PaddleOCR (better for code/terminal screens)
pip install paddleocr

Quick Start

python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

This will:

  1. Run Whisper transcription (audio → text)
  2. Extract frames every 5 seconds
  3. Use LLaVA vision model to analyze frames with context
  4. Merge audio + screen content
  5. Save everything to output/ folder
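The five steps correspond to a pipeline roughly like the following. This is a sketch with each stage injected as a callable for illustration; the real orchestration lives in meetus/workflow.py:

```python
def process_meeting(video_path, transcribe, extract, analyze, merge, interval=5):
    """Sketch of the five steps above; each stage is passed in as a callable."""
    transcript = transcribe(video_path)           # 1. Whisper: audio -> text
    frames = extract(video_path, interval)        # 2. one frame every N seconds
    notes = [analyze(frame) for frame in frames]  # 3. LLaVA (or OCR) per frame
    return merge(transcript, notes)               # 4./5. merge; caller saves to output/
```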

Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:

# Uses cached transcript, frames, and analysis
python process_meeting.py samples/meeting.mkv --use-vision

# Force reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache

Traditional OCR (Faster, Less Context-Aware)

python process_meeting.py samples/meeting.mkv --run-whisper

Usage Examples

Vision Analysis with Context Hints

# For code-heavy meetings
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context code

# For dashboard/monitoring meetings (Grafana, GCP, etc.)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context dashboard

# For console/terminal sessions
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-context console

Different Vision Models

# Lighter/faster model (7B parameters)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:7b

# Default model (13B parameters, better quality)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Alternative models
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model bakllava

Extract frames at different intervals

# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --interval 3

Use scene detection (smarter, fewer frames)

python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --scene-detection
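Under the hood, FFmpeg scene detection can be driven with its select filter. A sketch of building such a command (the 0.4 threshold is an assumption for illustration, not necessarily what this tool uses):

```python
def ffmpeg_scene_cmd(video, out_dir, threshold=0.4):
    """Build an FFmpeg command that saves one JPEG per detected scene change."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"select='gt(scene,{threshold})'",  # keep frames whose scene-change score exceeds the threshold
        "-vsync", "vfr",                           # emit exactly one output image per kept frame
        f"{out_dir}/frame_%05d.jpg",
    ]
```

Run it with `subprocess.run(ffmpeg_scene_cmd("samples/meeting.mkv", "frames"), check=True)`.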

Traditional OCR (if you prefer)

# Tesseract (default)
python process_meeting.py samples/meeting.mkv --run-whisper

# EasyOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine easyocr

# PaddleOCR
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine paddleocr

Caching Examples

# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Second run - uses cached transcript and frames, only re-merges
python process_meeting.py samples/meeting.mkv

# Switch from OCR to vision using existing frames
python process_meeting.py samples/meeting.mkv --use-vision

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --no-cache

Custom output location

python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --output-dir my_outputs/

Enable verbose logging

# Show detailed debug information
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --verbose

Output Files

Each video gets its own timestamped output directory:

output/
└── 20241019_143022-meeting/
    ├── manifest.json                    # Processing configuration
    ├── meeting_enhanced.txt             # Enhanced transcript for AI
    ├── meeting.json                     # Whisper transcript
    ├── meeting_vision.json              # Vision analysis results
    └── frames/                          # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
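The timestamped directory name follows the `<YYYYMMDD_HHMMSS>-<video stem>` convention shown above; a sketch of deriving it (`output_dir_for` is an illustrative name, not the tool's API):

```python
from datetime import datetime
from pathlib import Path

def output_dir_for(video_path, root="output", now=None):
    """Build output/<YYYYMMDD_HHMMSS>-<video stem>/ for a given video."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"{stamp}-{Path(video_path).stem}"
```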

Manifest File

Each processing run creates a manifest.json that tracks:

  • Video information (name, path)
  • Processing timestamp
  • Configuration used (Whisper model, vision settings, etc.)
  • Output file locations

Example manifest:

{
  "video": {
    "name": "meeting.mkv",
    "path": "/full/path/to/meeting.mkv"
  },
  "processed_at": "2024-10-19T14:30:22",
  "configuration": {
    "whisper": {"enabled": true, "model": "base"},
    "analysis": {"method": "vision", "vision_model": "llava:13b", "vision_context": "code"}
  }
}

Caching Behavior

The tool automatically reuses the most recent output directory for the same video:

  • First run: Creates new timestamped directory (e.g., 20241019_143022-meeting/)
  • Subsequent runs: Reuses the same directory and cached results
  • Cached items: Whisper transcript, extracted frames, analysis results
  • Force new run: Use --no-cache to create a fresh directory

This means you can instantly switch between OCR and vision analysis without re-extracting frames!
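Reusing the most recent directory amounts to a lookup like this (a sketch; the real logic lives in meetus/cache_manager.py and meetus/output_manager.py):

```python
from pathlib import Path

def latest_output_dir(video_path, root="output"):
    """Return the newest timestamped output directory for this video, or None."""
    stem = Path(video_path).stem
    # Timestamp prefixes (YYYYMMDD_HHMMSS) sort chronologically as strings.
    candidates = sorted(Path(root).glob(f"*-{stem}"))
    return candidates[-1] if candidates else None
```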

Workflow for Meeting Analysis

Complete Workflow (One Command!)

# Process everything in one step with vision analysis
python process_meeting.py samples/alo-intro1.mkv --run-whisper --use-vision --scene-detection

# Output will be in output/<timestamp>-alo-intro1/alo-intro1_enhanced.txt

Typical Iterative Workflow

# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision

# Review results, then re-run with different context if needed
python process_meeting.py samples/meeting.mkv --use-vision --vision-context code

# Or switch to a different vision model
python process_meeting.py samples/meeting.mkv --use-vision --vision-model llava:7b

# All use cached frames and transcript!

Traditional Workflow (Separate Steps)

# 1. Extract audio and transcribe with Whisper (optional, if not using --run-whisper)
whisper samples/alo-intro1.mkv --model base --output_format json --output_dir output

# 2. Process video to extract screen content with vision
python process_meeting.py samples/alo-intro1.mkv \
    --transcript output/alo-intro1.json \
    --use-vision \
    --scene-detection

# 3. Use the enhanced transcript with AI
# Copy the content from output/<timestamp>-alo-intro1/alo-intro1_enhanced.txt and paste it into Claude or your LLM

Example Prompt for Claude

Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented

[Paste enhanced transcript here]

Command Reference

usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
                          [--scene-detection] [--use-vision]
                          [--vision-model VISION_MODEL]
                          [--vision-context {meeting,code,dashboard,console}]
                          [--ocr-engine {tesseract,easyocr,paddleocr}]
                          [--no-deduplicate] [--no-cache] [--extract-only]
                          [--format {detailed,compact}] [--verbose]
                          video

Options:
  video                 Path to video file
  --transcript, -t      Path to Whisper transcript (JSON or TXT)
  --run-whisper         Run Whisper transcription before processing
  --whisper-model       Whisper model: tiny, base, small, medium, large (default: base)
  --output, -o          Output file for enhanced transcript
  --output-dir          Directory for output files (default: output/)
  --frames-dir          Directory to save extracted frames (default: frames/)
  --interval            Extract frame every N seconds (default: 5)
  --scene-detection     Use scene detection instead of interval extraction
  --use-vision          Analyze frames with a vision model (LLaVA via Ollama) instead of OCR
  --vision-model        Vision model to use (default: llava:13b)
  --vision-context      Context hint: meeting, code, dashboard, console (default: meeting)
  --ocr-engine          OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
  --no-deduplicate      Disable text deduplication
  --no-cache            Ignore cached results and start a fresh output directory
  --extract-only        Only extract frames and OCR, skip transcript merging
  --format              Output format: detailed or compact (default: detailed)
  --verbose, -v         Enable verbose logging (DEBUG level)

Tips for Best Results

Vision vs OCR: When to Use Each

Use Vision Models (--use-vision) when:

  • Analyzing dashboards (Grafana, GCP Console, monitoring tools)
  • Code walkthroughs or debugging sessions
  • Complex layouts with mixed content
  • Need contextual understanding, not just text extraction
  • Working with charts, graphs, or visualizations
  • ⚠️ Trade-off: slower, since the local model needs significant GPU or CPU time

Use OCR when:

  • Simple text extraction from slides or documents
  • Need maximum speed
  • Limited computational resources
  • Presentations with mostly text
  • ⚠️ Trade-off: Less context-aware, may miss visual relationships

Context Hints for Vision Analysis

  • --vision-context meeting: General purpose (default)
  • --vision-context code: Optimized for code screenshots, preserves formatting
  • --vision-context dashboard: Extracts metrics, trends, panel names
  • --vision-context console: Captures commands, output, error messages

Customizing Prompts: Prompts are stored as editable text files in meetus/prompts/:

  • meeting.txt - General meeting analysis
  • code.txt - Code screenshot analysis
  • dashboard.txt - Dashboard/monitoring analysis
  • console.txt - Terminal/console analysis

Just edit these files to customize how the vision model analyzes your frames!
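Loading a context prompt is then just a file read with a fallback to the general prompt. A sketch (the actual loading happens inside meetus/vision_processor.py; `load_prompt` is an illustrative name):

```python
from pathlib import Path

def load_prompt(context="meeting", prompt_dir=Path("meetus/prompts")):
    """Read the prompt for a context hint, falling back to the default meeting prompt."""
    path = prompt_dir / f"{context}.txt"
    if not path.exists():
        path = prompt_dir / "meeting.txt"  # unknown hint: use the general prompt
    return path.read_text(encoding="utf-8")
```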

Scene Detection vs Interval

  • Scene detection: Better for presentations with distinct slides. More efficient.
  • Interval extraction: Better for continuous screen sharing (coding, browsing). More thorough.

Vision Model Selection

  • llava:7b: Faster, lower memory (~4GB RAM), good quality
  • llava:13b: Better quality, slower, needs ~8GB RAM (default)
  • bakllava: Alternative with different strengths
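Whichever model you pick, a per-frame analysis call looks roughly like this sketch using the ollama Python client. The function names are illustrative (not this tool's API), and the request shape follows the client's chat API with an `images` field:

```python
def build_vision_request(frame_path, prompt, model="llava:13b"):
    """Assemble a chat request in the shape the ollama Python client expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [frame_path]}],
    }

def analyze_frame(frame_path, prompt, model="llava:13b"):
    import ollama  # pip install ollama; requires a running Ollama server
    response = ollama.chat(**build_vision_request(frame_path, prompt, model))
    return response["message"]["content"]
```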

Deduplication

  • Enabled by default - removes similar consecutive frames
  • Disable with --no-deduplicate if slides/screens change subtly
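Consecutive-frame deduplication can be approximated by comparing extracted text with the standard library's difflib; a sketch (the tool's actual heuristic and threshold may differ):

```python
from difflib import SequenceMatcher

def deduplicate(texts, threshold=0.9):
    """Drop a frame's text when it is near-identical to the previous kept one."""
    kept = []
    for text in texts:
        if kept and SequenceMatcher(None, kept[-1], text).ratio() >= threshold:
            continue  # too similar to the previous frame: skip it
        kept.append(text)
    return kept
```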

Troubleshooting

Vision Model Issues

"ollama package not installed"

pip install ollama

"Ollama not found" or connection errors

# Install Ollama first: https://ollama.ai/download
# Then pull a vision model:
ollama pull llava:13b

Vision analysis is slow

  • Use lighter model: --vision-model llava:7b
  • Reduce frame count: --scene-detection or --interval 10
  • Check if Ollama is using GPU (much faster)

Poor vision analysis results

  • Try different context hint: --vision-context code or --vision-context dashboard
  • Use larger model: --vision-model llava:13b
  • Ensure frames are clear (check video resolution)

OCR Issues

"pytesseract not installed"

pip install pytesseract
sudo apt-get install tesseract-ocr  # Don't forget system package!

Poor OCR quality

  • Solution: Switch to vision analysis with --use-vision
  • Or try different OCR engine: --ocr-engine easyocr
  • Check if video resolution is sufficient
  • Use --no-deduplicate to keep more frames

General Issues

"No frames extracted"

  • Check video file is valid: ffmpeg -i video.mkv
  • Try lower interval: --interval 3
  • Check disk space in frames directory

Scene detection not working

  • The tool falls back to interval extraction automatically
  • Ensure FFmpeg is installed
  • Try manual interval: --interval 5

Cache not being used

  • Ensure you're using the same video filename
  • Check that output directory contains cached files
  • Use --verbose to see what's being cached/loaded

Project Structure

meetus/
├── meetus/                     # Main package
│   ├── __init__.py
│   ├── workflow.py             # Processing orchestrator
│   ├── output_manager.py       # Output directory & manifest management
│   ├── cache_manager.py        # Caching logic
│   ├── frame_extractor.py      # Video frame extraction
│   ├── vision_processor.py     # Vision model analysis (Ollama/LLaVA)
│   ├── ocr_processor.py        # OCR processing
│   ├── transcript_merger.py    # Transcript merging
│   └── prompts/                # Vision analysis prompts (editable!)
│       ├── meeting.txt         # General meeting analysis
│       ├── code.txt            # Code screenshot analysis
│       ├── dashboard.txt       # Dashboard/monitoring analysis
│       └── console.txt         # Terminal/console analysis
├── process_meeting.py          # Main CLI script (thin wrapper)
├── requirements.txt            # Python dependencies
├── output/                     # Timestamped output directories
│   ├── .gitkeep
│   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
├── samples/                    # Sample videos (gitignored)
└── README.md                   # This file

The code is modular and easy to extend - each module has a single responsibility.

License

For personal use.
