
# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.

## Overview

This tool enhances meeting transcripts by combining:
- **Audio transcription** (Whisper or WhisperX with speaker diarization)
- **Screen content extraction** via FFmpeg scene detection
- **Frame embedding** for direct LLM analysis

The result is a rich, timestamped transcript with embedded screen frames that provides full context for AI summarization.

## Installation

### 1. System Dependencies

**FFmpeg** (required for scene detection and frame extraction):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

### 2. Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Whisper or WhisperX (for audio transcription)

**Standard Whisper:**
```bash
pip install openai-whisper
```

**WhisperX** (recommended - includes speaker diarization):
```bash
pip install whisperx
```

Speaker diarization requires a Hugging Face token with access to the pyannote models.

## Quick Start

### Recommended: Embed Frames with Scene Detection

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection
```

This will:
1. Run Whisper transcription (audio → text)
2. Extract frames at scene changes (smarter than fixed intervals)
3. Embed frame references in the transcript for LLM analysis
4. Save everything to `output/` folder
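
The merge in step 3 amounts to interleaving speech and frames by timestamp. The following is a minimal sketch of that idea, not the tool's actual `transcript_merger.py`. It assumes Whisper-style segments (dicts with `start` and `text`, as in Whisper's JSON output) and a hypothetical list of `(timestamp, path)` frame tuples; the name `merge_by_time` and the output line format are illustrative only.

```python
def merge_by_time(segments, frames):
    """Interleave transcript segments and frame references in time order.

    segments: list of dicts with "start" (seconds) and "text" keys.
    frames:   list of (timestamp_seconds, path) tuples.
    The function name and line format are illustrative assumptions.
    """
    events = [(seg["start"], 1, seg["text"].strip()) for seg in segments]
    events += [(ts, 0, f"[FRAME: {path}]") for ts, path in frames]
    # Sort by time; on ties the frame (rank 0) precedes the speech (rank 1).
    events.sort(key=lambda e: (e[0], e[1]))
    return [f"[{int(t) // 60:02d}:{int(t) % 60:02d}] {body}" for t, _, body in events]


segments = [
    {"start": 0.0, "text": " Welcome everyone."},
    {"start": 7.5, "text": " Here is the dashboard."},
]
frames = [(5.0, "frames/frame_00001_5.00s.jpg")]
for line in merge_by_time(segments, frames):
    print(line)
```

With this input, the frame reference lands between the two spoken segments, which is what gives the LLM visual context at the right moment in the conversation.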

### With Speaker Diarization (WhisperX)

```bash
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
```

This uses WhisperX to identify different speakers in the transcript.

### Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:
```bash
# Uses cached transcript and frames
python process_meeting.py samples/meeting.mkv --embed-images

# Skip only specific cached items
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-analysis

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
```

## Usage Examples

### Scene Detection Options
```bash
# Default scene detection (threshold: 15)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# More sensitive (more frames captured, threshold: 5)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 5

# Less sensitive (fewer frames, threshold: 30)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --scene-threshold 30
```

### Fixed Interval Extraction (alternative to scene detection)
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --interval 3
```

### Frame Quality Options
```bash
# Default quality (80)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Lower quality for smaller files (60)
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --embed-quality 60
```

### Caching Examples
```bash
# First run - processes everything
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Iterate on scene threshold (reuses the cached Whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

# Re-run Whisper only
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --no-cache
```

### Custom Output Location
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --output-dir my_outputs/
```

### Verbose Logging
```bash
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection --verbose
```

## Output Files

Each video gets its own timestamped output directory:

```
output/
└── 20241019_143022-meeting/
    ├── manifest.json                    # Processing configuration
    ├── meeting_enhanced.txt             # Enhanced transcript for AI
    ├── meeting.json                     # Whisper/WhisperX transcript
    └── frames/                          # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
```
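
If you post-process the `frames/` folder yourself, the index and timestamp can be recovered from each file name. This is a small sketch that assumes the `frame_<index>_<seconds>s.jpg` pattern inferred from the example names above; `parse_frame_name` is a hypothetical helper, not part of the package.

```python
import re
from pathlib import Path

# Pattern inferred from the example names above: frame_<index>_<seconds>s.jpg
FRAME_RE = re.compile(r"^frame_(\d+)_(\d+(?:\.\d+)?)s\.jpg$")


def parse_frame_name(path):
    """Return (index, timestamp_seconds) for a frame file, or None if no match."""
    match = FRAME_RE.match(Path(path).name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


print(parse_frame_name("frames/frame_00002_10.00s.jpg"))  # (2, 10.0)
```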

### Caching Behavior

The tool automatically reuses the most recent output directory for the same video:
- **First run**: Creates new timestamped directory (e.g., `20241019_143022-meeting/`)
- **Subsequent runs**: Reuses the same directory and cached results
- **Cached items**: Whisper transcript, extracted frames, analysis results

**Fine-grained cache control:**
- `--no-cache`: Force complete reprocessing
- `--skip-cache-frames`: Re-extract frames only
- `--skip-cache-whisper`: Re-run transcription only
- `--skip-cache-analysis`: Re-run analysis only

This allows you to iterate on scene detection thresholds without re-running Whisper!
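
The directory lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `cache_manager.py`: it assumes directories are named `YYYYMMDD_HHMMSS-<video stem>`, so sorting by name is the same as sorting by time, and `latest_output_dir` is a hypothetical helper name.

```python
from pathlib import Path


def latest_output_dir(output_root, video_path):
    """Return the newest `<timestamp>-<video stem>` directory, or None.

    Assumes names follow the YYYYMMDD_HHMMSS-<stem> pattern shown above,
    so lexicographic order of the names equals chronological order.
    """
    stem = Path(video_path).stem
    root = Path(output_root)
    if not root.is_dir():
        return None
    candidates = [
        d for d in root.iterdir()
        # Everything after the first "-" must equal the video's stem.
        if d.is_dir() and d.name.partition("-")[2] == stem
    ]
    return max(candidates, key=lambda d: d.name, default=None)


cached = latest_output_dir("output", "samples/meeting.mkv")
print(cached if cached else "no cached run found")
```

Because the match is on the file stem, renaming the video would make an earlier run invisible to a lookup like this one.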

## Workflow for Meeting Analysis

### Complete Workflow (One Command!)

```bash
# Process everything in one step with scene detection
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# With speaker diarization
python process_meeting.py samples/meeting.mkv --run-whisper --diarize --embed-images --scene-detection
```

### Typical Iterative Workflow

```bash
# First run - full processing
python process_meeting.py samples/meeting.mkv --run-whisper --embed-images --scene-detection

# Adjust scene threshold (keeps cached Whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --skip-cache-frames --skip-cache-analysis

# Try different frame quality
python process_meeting.py samples/meeting.mkv --embed-images --embed-quality 60 --skip-cache-frames --skip-cache-analysis
```

### Example Prompt for Claude

```
Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented

[Paste enhanced transcript here]
```

## Command Reference

```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--interval INTERVAL] [--scene-detection]
                          [--scene-threshold SCENE_THRESHOLD]
                          [--embed-images] [--embed-quality EMBED_QUALITY]
                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
                          [--skip-cache-analysis] [--no-deduplicate]
                          [--extract-only] [--format {detailed,compact}]
                          [--verbose] video

Main Options:
  video                   Path to video file
  --run-whisper           Run Whisper transcription before processing
  --whisper-model         Whisper model: tiny, base, small, medium, large (default: medium)
  --diarize               Use WhisperX with speaker diarization
  --embed-images          Embed frame references for LLM analysis (recommended)
  --embed-quality         JPEG quality for frames (default: 80)

Frame Extraction:
  --scene-detection       Use FFmpeg scene detection (recommended)
  --scene-threshold       Detection sensitivity 0-100 (default: 15, lower=more sensitive)
  --interval              Extract frame every N seconds (alternative to scene detection)

Caching:
  --no-cache              Force complete reprocessing
  --skip-cache-frames     Re-extract frames only
  --skip-cache-whisper    Re-run transcription only
  --skip-cache-analysis   Re-run analysis only

Other:
  --transcript, -t        Path to existing Whisper transcript (JSON or TXT)
  --output, -o            Output file for enhanced transcript
  --output-dir            Directory for output files (default: output/)
  --verbose, -v           Enable verbose logging
```

## Tips for Best Results

### Scene Detection vs Interval
- **Scene detection** (`--scene-detection`): Recommended. Captures a frame only when the screen content changes, so it usually produces fewer redundant frames than fixed intervals.
- **Interval extraction** (`--interval N`): Alternative for continuously changing content. Captures a frame every N seconds.

### Scene Detection Threshold
- Lower values (5-10): More sensitive, captures more frames
- Default (15): Good balance for most meetings
- Higher values (20-30): Less sensitive, fewer frames
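
To see what the threshold controls, here is a rough sketch of an FFmpeg invocation that uses the `select` filter's `scene` score. It is not taken from `frame_extractor.py`. The linear mapping from the tool's 0-100 threshold to FFmpeg's 0-1 score is an assumption, as are the helper name `scene_detect_command` and the output file pattern.

```python
import shlex


def scene_detect_command(video, out_dir, threshold=15.0):
    """Build an FFmpeg command that writes one JPEG per detected scene change.

    Assumption: the tool's 0-100 threshold maps linearly onto FFmpeg's 0-1
    scene score; the real mapping in frame_extractor.py may differ.
    """
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    score = threshold / 100.0
    return [
        "ffmpeg", "-i", str(video),
        # Keep only frames whose scene-change score exceeds the cutoff;
        # showinfo logs each kept frame's timestamp to stderr.
        "-vf", f"select='gt(scene,{score:.2f})',showinfo",
        "-vsync", "vfr",  # newer FFmpeg builds prefer -fps_mode vfr
        f"{out_dir}/frame_%05d.jpg",
    ]


cmd = scene_detect_command("samples/meeting.mkv", "frames", threshold=5)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to actually execute it
```

A lower threshold lowers the score a frame must exceed, so more frames pass the filter. That matches the "lower = more sensitive" behavior described above.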

### Whisper vs WhisperX
- **Whisper** (`--run-whisper`): Standard transcription, fast
- **WhisperX** (`--run-whisper --diarize`): Adds speaker identification; requires a Hugging Face token

### Frame Quality
- Default quality (80) works well for most cases
- Use `--embed-quality 60` for smaller files if storage is a concern

### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides/screens change subtly
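
The following sketch shows why subtle changes can be lost to deduplication. It is a toy model, not the tool's actual similarity test: frames are represented as flat lists of grayscale values, and both the mean-absolute-difference metric and the `min_diff` default are assumptions chosen for illustration.

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def deduplicate(frames, min_diff=2.0):
    """Drop frames that are nearly identical to the previously kept frame.

    frames: list of (name, pixels) where pixels is a flat list of 0-255
    grayscale values. The metric and the min_diff default are assumptions
    for illustration, not the tool's actual similarity test.
    """
    kept = []
    for name, pixels in frames:
        if not kept or mean_abs_diff(kept[-1][1], pixels) > min_diff:
            kept.append((name, pixels))
    return [name for name, _ in kept]


frames = [
    ("slide1_a", [10, 10, 200, 200]),
    ("slide1_b", [10, 11, 200, 199]),   # tiny change: dropped by default
    ("slide2",   [90, 90, 30, 30]),
]
print(deduplicate(frames))              # ['slide1_a', 'slide2']
print(deduplicate(frames, min_diff=0))  # ['slide1_a', 'slide1_b', 'slide2']
```

A small edit on a slide (a new bullet, a changed number) looks like `slide1_b` here, which is the case where `--no-deduplicate` is worth trying.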

## Troubleshooting

### Frame Extraction Issues

**"No frames extracted"**
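- Check that the video file is readable: `ffmpeg -i video.mkv` (it prints the stream info; the "At least one output file must be specified" message is expected)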
- Try lower scene threshold: `--scene-threshold 5`
- Try interval extraction: `--interval 3`
- Check disk space in output directory

**Scene detection not working**
- Ensure FFmpeg is installed and on your `PATH`: `ffmpeg -version`
- If scene detection fails, the tool falls back to interval extraction automatically
- Try a manual interval: `--interval 5`

### Whisper/WhisperX Issues

**WhisperX diarization not working**
- Ensure a Hugging Face token is set
- The token needs access to the pyannote models
- Otherwise, fall back to standard Whisper by omitting `--diarize`

### Cache Issues

**Cache not being used**
- Ensure you're using the same video filename
- Check that output directory contains cached files
- Use `--verbose` to see what's being cached/loaded

**Want to re-run specific steps**
- `--skip-cache-frames`: Re-extract frames
- `--skip-cache-whisper`: Re-run transcription
- `--skip-cache-analysis`: Re-run analysis
- `--no-cache`: Force complete reprocessing

## Experimental Features

### OCR and Vision Analysis

The OCR (`--ocr-engine`) and vision analysis (`--use-vision`) options are available but experimental. The recommended approach is `--embed-images`, which embeds frame references directly in the transcript so that your LLM can analyze the images itself.

```bash
# Experimental: OCR extraction
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract

# Experimental: Vision model analysis
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Experimental: Hybrid OpenCV + OCR
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid
```

## Project Structure

```
meetus/
├── meetus/                     # Main package
│   ├── __init__.py
│   ├── workflow.py             # Processing orchestrator
│   ├── output_manager.py       # Output directory & manifest management
│   ├── cache_manager.py        # Caching logic
│   ├── frame_extractor.py      # Video frame extraction (FFmpeg scene detection)
│   ├── vision_processor.py     # Vision model analysis (experimental)
│   ├── ocr_processor.py        # OCR processing (experimental)
│   └── transcript_merger.py    # Transcript merging
├── process_meeting.py          # Main CLI script
├── requirements.txt            # Python dependencies
├── output/                     # Timestamped output directories
│   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
├── samples/                    # Sample videos (gitignored)
└── README.md                   # This file
```

## License

For personal use.