Meeting Processor

Extract screen content from meeting recordings and merge with Whisper/WhisperX transcripts for better AI summarization.

Overview

This tool enhances meeting transcripts by combining:

  • Audio transcription (Whisper or WhisperX with speaker diarization)
  • Screen content extraction via FFmpeg scene detection
  • Frame embedding for direct LLM analysis

The result is a rich, timestamped transcript with embedded screen-frame references that provides full context for AI summarization.

Installation

1. System Dependencies

FFmpeg (required for scene detection and frame extraction):

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

2. Python Dependencies

pip install -r requirements.txt

3. Whisper or WhisperX (for audio transcription)

Standard Whisper:

pip install openai-whisper

WhisperX (recommended - includes speaker diarization):

pip install whisperx

For speaker diarization, you'll need a HuggingFace token with access to pyannote models.

Quick Start

python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize

This will:

  1. Run WhisperX transcription with speaker diarization
  2. Extract frames at scene changes (threshold 10 = moderately sensitive)
  3. Create an enhanced transcript with frame file references
  4. Save everything to output/ folder

The --embed-images flag adds frame file references to the transcript (e.g., Frame: frames/video_00257.jpg). This keeps the transcript small; the images themselves stay in the frames/ folder where an LLM can access them.
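Conceptually, merging pairs each transcript segment with the most recent frame captured at or before the segment's start time. A minimal sketch of that idea (function and field names here are illustrative, not the tool's actual internals):

```python
from bisect import bisect_right

def merge_frames(segments, frames):
    """Attach to each segment the most recent frame at or before its start.
    Illustrative only; the real tool's merging logic may differ."""
    frames = sorted(frames, key=lambda f: f["time"])
    times = [f["time"] for f in frames]
    merged = []
    for seg in segments:
        i = bisect_right(times, seg["start"]) - 1
        frame = frames[i]["path"] if i >= 0 else None
        merged.append({**seg, "frame": frame})
    return merged

segments = [
    {"start": 0.0, "speaker": "SPEAKER_00", "text": "Let's look at the dashboard."},
    {"start": 12.5, "speaker": "SPEAKER_01", "text": "Latency is down 30%."},
]
frames = [
    {"time": 5.0, "path": "frames/frame_00001_5.00s.jpg"},
    {"time": 10.0, "path": "frames/frame_00002_10.00s.jpg"},
]
merged = merge_frames(segments, frames)
```

The second segment picks up the 10.00s frame; the first has no earlier frame, so it carries none.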

Re-run with Cached Results

Already ran it once? Re-run instantly using cached results:

# Uses cached transcript and frames
python process_meeting.py samples/meeting.mkv --embed-images

# Skip only specific cached items
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-frames
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache

Usage Examples

Scene Detection Options

# Default threshold (15)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize

# More sensitive (more frames, threshold: 5)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --diarize

# Less sensitive (fewer frames, threshold: 30)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 30 --diarize

Fixed Interval Extraction (alternative to scene detection)

# Every 10 seconds
python process_meeting.py samples/meeting.mkv --embed-images --interval 10 --diarize

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --embed-images --interval 3 --diarize

Caching Examples

# First run - processes everything
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize

# Iterate on scene threshold (reuse whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

# Re-run whisper only
python process_meeting.py samples/meeting.mkv --embed-images --skip-cache-whisper

# Force complete reprocessing
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --no-cache

Custom output location

python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --output-dir my_outputs/

Enable verbose logging

python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --diarize --verbose

Output Files

Each video gets its own timestamped output directory:

output/
└── 20241019_143022-meeting/
    ├── manifest.json                    # Processing configuration
    ├── meeting_enhanced.txt             # Enhanced transcript for AI
    ├── meeting.json                     # Whisper/WhisperX transcript
    └── frames/                          # Extracted video frames
        ├── frame_00001_5.00s.jpg
        ├── frame_00002_10.00s.jpg
        └── ...
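manifest.json records the settings a run was produced with, which is what lets caching detect when options change between runs. A plausible shape, built here in Python (the key names are illustrative; the tool's actual manifest may differ):

```python
import json

# Illustrative manifest contents; the real tool's keys may differ.
manifest = {
    "video": "samples/meeting.mkv",
    "whisper_model": "medium",
    "diarize": True,
    "scene_detection": True,
    "scene_threshold": 10,
    "embed_images": True,
}
text = json.dumps(manifest, indent=2)
```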

Caching Behavior

The tool automatically reuses the most recent output directory for the same video:

  • First run: Creates new timestamped directory (e.g., 20241019_143022-meeting/)
  • Subsequent runs: Reuses the same directory and cached results
  • Cached items: Whisper transcript, extracted frames, analysis results

Fine-grained cache control:

  • --no-cache: Force complete reprocessing
  • --skip-cache-frames: Re-extract frames only
  • --skip-cache-whisper: Re-run transcription only
  • --skip-cache-analysis: Re-run analysis only

This allows you to iterate on scene detection thresholds without re-running Whisper!

Workflow for Meeting Analysis

Complete Workflow (One Command!)

python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize

Typical Iterative Workflow

# First run - full processing
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 10 --diarize

# Adjust scene threshold (keeps cached whisper transcript)
python process_meeting.py samples/meeting.mkv --embed-images --scene-detection --scene-threshold 5 --skip-cache-frames --skip-cache-analysis

Example Prompt for Claude

Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented

[Paste enhanced transcript here]

Command Reference

usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--run-whisper]
                          [--whisper-model {tiny,base,small,medium,large}]
                          [--diarize] [--output OUTPUT] [--output-dir OUTPUT_DIR]
                          [--interval INTERVAL] [--scene-detection]
                          [--scene-threshold SCENE_THRESHOLD]
                          [--embed-images] [--embed-quality EMBED_QUALITY]
                          [--no-cache] [--skip-cache-frames] [--skip-cache-whisper]
                          [--skip-cache-analysis] [--no-deduplicate]
                          [--extract-only] [--format {detailed,compact}]
                          [--verbose] video

Main Options:
  video                   Path to video file
  --diarize               Use WhisperX with speaker diarization
  --embed-images          Add frame file references to transcript (recommended)

Frame Extraction:
  --scene-detection       Use FFmpeg scene detection (recommended)
  --scene-threshold       Detection sensitivity 0-100 (default: 15, lower=more sensitive)
  --interval              Extract frame every N seconds (alternative to scene detection)

Caching:
  --no-cache              Force complete reprocessing
  --skip-cache-frames     Re-extract frames only
  --skip-cache-whisper    Re-run transcription only
  --skip-cache-analysis   Re-run analysis only

Other:
  --run-whisper           Run Whisper (without diarization)
  --whisper-model         Whisper model: tiny, base, small, medium, large (default: medium)
  --transcript, -t        Path to existing Whisper transcript (JSON or TXT)
  --output, -o            Output file for enhanced transcript
  --output-dir            Directory for output files (default: output/)
  --verbose, -v           Enable verbose logging

Tips for Best Results

Scene Detection vs Interval

  • Scene detection (--scene-detection): Recommended. Captures frames only when on-screen content changes, so it yields fewer, more meaningful frames.
  • Interval extraction (--interval N): Alternative for continuous content. Captures every N seconds.

Scene Detection Threshold

  • Lower values (5-10): More sensitive, captures more frames
  • Default (15): Good balance for most meetings
  • Higher values (20-30): Less sensitive, fewer frames
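FFmpeg's own scene-change score runs 0.0-1.0, so a 0-100 threshold presumably maps by dividing by 100 (an assumption about this tool's internals, not confirmed from its code). The underlying detection command then looks roughly like this sketch, which builds the FFmpeg invocation without running it:

```python
def scene_detect_cmd(video: str, threshold: int):
    """Build an FFmpeg command that logs timestamps of scene changes.
    The 0-100 -> 0.0-1.0 mapping is an assumption about this tool."""
    score = threshold / 100.0
    return [
        "ffmpeg", "-i", video,
        # select passes frames whose scene score exceeds the cutoff;
        # showinfo prints their timestamps to stderr.
        "-vf", f"select='gt(scene,{score})',showinfo",
        "-f", "null", "-",
    ]
```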

Whisper vs WhisperX

  • Whisper (--run-whisper): Standard transcription, fast
  • WhisperX (--run-whisper --diarize): Adds speaker identification, requires HuggingFace token

Deduplication

  • Enabled by default - removes similar consecutive frames
  • Disable with --no-deduplicate if slides/screens change subtly
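Deduplication can be approximated by comparing each frame against the last kept one and dropping it when the pixels barely differ. A stdlib-only sketch over raw grayscale pixel lists (the tool's actual comparison method may differ):

```python
def deduplicate(frames, max_diff=0.02):
    """Keep a frame only if its mean absolute pixel difference from the
    last kept frame exceeds max_diff (pixels 0-255, diff normalized to 0-1).
    Illustrative only; the real tool may compare frames differently."""
    kept = []
    for frame in frames:
        if not kept:
            kept.append(frame)
            continue
        diff = sum(abs(a - b) for a, b in zip(frame, kept[-1])) / (255 * len(frame))
        if diff > max_diff:
            kept.append(frame)
    return kept

a = [0] * 100    # black frame
b = [2] * 100    # nearly identical to a
c = [200] * 100  # very different
kept = deduplicate([a, b, c])  # b is dropped as a near-duplicate
```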

Troubleshooting

Frame Extraction Issues

"No frames extracted"

  • Check video file is valid: ffmpeg -i video.mkv
  • Try lower scene threshold: --scene-threshold 5
  • Try interval extraction: --interval 3
  • Check disk space in output directory

Scene detection not working

  • Ensure FFmpeg is installed
  • Falls back to interval extraction automatically
  • Try manual interval: --interval 5

Whisper/WhisperX Issues

WhisperX diarization not working

  • Ensure you have a HuggingFace token set
  • Token needs access to pyannote models
  • Fall back to standard Whisper without --diarize
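A quick sanity check that a token is visible to the process before running diarization (HF_TOKEN is a common convention, but your setup may pass the token another way, e.g. via huggingface-cli login):

```python
import os

# HF_TOKEN is an assumed variable name; adjust to match your setup.
token = os.environ.get("HF_TOKEN")
if not token:
    print("No HF_TOKEN in the environment; diarization will likely fail.")
```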

Cache Issues

Cache not being used

  • Ensure you're using the same video filename
  • Check that output directory contains cached files
  • Use --verbose to see what's being cached/loaded

Want to re-run specific steps

  • --skip-cache-frames: Re-extract frames
  • --skip-cache-whisper: Re-run transcription
  • --skip-cache-analysis: Re-run analysis
  • --no-cache: Force complete reprocessing

Experimental Features

OCR and Vision Analysis

OCR (--ocr-engine) and Vision analysis (--use-vision) options are available but experimental. The recommended approach is to use --embed-images which embeds frame references directly in the transcript, letting your LLM analyze the images.

# Experimental: OCR extraction
python process_meeting.py samples/meeting.mkv --run-whisper --ocr-engine tesseract

# Experimental: Vision model analysis
python process_meeting.py samples/meeting.mkv --run-whisper --use-vision --vision-model llava:13b

# Experimental: Hybrid OpenCV + OCR
python process_meeting.py samples/meeting.mkv --run-whisper --use-hybrid

Project Structure

meetus/
├── meetus/                     # Main package
│   ├── __init__.py
│   ├── workflow.py             # Processing orchestrator
│   ├── output_manager.py       # Output directory & manifest management
│   ├── cache_manager.py        # Caching logic
│   ├── frame_extractor.py      # Video frame extraction (FFmpeg scene detection)
│   ├── vision_processor.py     # Vision model analysis (experimental)
│   ├── ocr_processor.py        # OCR processing (experimental)
│   └── transcript_merger.py    # Transcript merging
├── process_meeting.py          # Main CLI script
├── requirements.txt            # Python dependencies
├── output/                     # Timestamped output directories
│   └── YYYYMMDD_HHMMSS-video/  # Auto-generated per video
├── samples/                    # Sample videos (gitignored)
└── README.md                   # This file

License

For personal use.
