init commit
.gitignore (vendored, Normal file, 2 lines)
@@ -0,0 +1,2 @@
samples/*
!samples/.gitkeep
MARIAN.md (Normal file, 50 lines)
@@ -0,0 +1,50 @@
I have several recorded meetings in which the interlocutor is sharing his/her screen. I've had good success using OpenAI's Whisper to transcribe the audio and asking Claude to summarize it, providing caveats and observations (no images involved; transcript local, summarization online). I'd like to include something (local) to extract information from what is shown on the screen, to provide more context for the summarization.

Great workflow! Adding visual context from screen shares will definitely improve the summarization quality. Here are some local options for extracting information from screen recordings:

## OCR & Screen Analysis Tools

**For extracting text from frames:**
- **Tesseract OCR** - Open source, highly accurate, supports 100+ languages
  - Use with Python: `pytesseract` wrapper
  - Sample every N seconds of video to extract text
- **EasyOCR** - Python library, good for detecting text in various orientations
  - More robust for presentations/slides than Tesseract
- **PaddleOCR** - Fast, accurate, works well with code/terminal screens

**For video processing:**
- **FFmpeg** - Extract frames at intervals (e.g., every 5-10 seconds or on scene changes)
- **OpenCV** - Python library for video manipulation and frame extraction
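The "sample every N seconds" idea boils down to simple frame arithmetic; a minimal stdlib sketch (the helper name is illustrative, not a library API):

```python
def sample_frame_indices(duration_s: float, fps: float, interval_s: float) -> list[int]:
    """Return the frame indices to extract, one every interval_s seconds."""
    step = max(int(fps * interval_s), 1)  # frames between samples (at least 1)
    total = int(duration_s * fps)         # total frames in the video
    return list(range(0, total, step))

# A 60s screen share at 30 fps, sampled every 5s -> 12 frames
indices = sample_frame_indices(60, 30, 5)
print(indices[:3], len(indices))  # [0, 150, 300] 12
```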

## Suggested Workflow

Rough approach:

1. Use FFmpeg to detect scene changes or sample frames
2. Run OCR on extracted frames
3. Deduplicate similar text (consecutive slides often similar)
4. Timestamp each extracted text segment
5. Combine with Whisper transcript, aligned by timestamp
6. Send enhanced transcript to Claude
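The merge-and-format steps above can be sketched with nothing but the standard library; the segment dict shape here is an assumption for illustration, not Whisper's exact output:

```python
# Sketch of the last steps: merge OCR snippets and transcript segments by timestamp.
audio = [
    {"t": 5.0, "kind": "audio", "text": "Let me show you the dashboard..."},
    {"t": 12.0, "kind": "audio", "text": "As you can see, we exceeded targets..."},
]
screen = [
    {"t": 7.0, "kind": "screen", "text": "Q4 Revenue: $2.3M, Growth: 45%"},
]

def fmt(t: float) -> str:
    """Format seconds as MM:SS."""
    return f"{int(t // 60):02d}:{int(t % 60):02d}"

# Combine both streams, aligned by timestamp
merged = sorted(audio + screen, key=lambda s: s["t"])

# Render one text block to hand to Claude
for seg in merged:
    label = "Speaker" if seg["kind"] == "audio" else "SCREEN"
    print(f"[{fmt(seg['t'])}] {label}: {seg['text']}")
```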

## Tips for Better Results

- **Scene detection**: Extract frames only when screen content changes significantly (saves processing time)
- **Deduplication**: Use similarity algorithms to avoid duplicate text from static slides
- **Screenshot types**: Different OCR tools excel at different content:
  - Tesseract: Documents, slides
  - EasyOCR: Handwriting, rotated text
  - PaddleOCR: Code, mixed languages
- **Format for Claude**: Structure as:

  ```
  [00:05] Speaker: "Let me show you the dashboard..."
  [SCREEN: Dashboard showing Q4 Revenue: $2.3M, Growth: 45%]
  [00:12] Speaker: "As you can see, we exceeded targets..."
  ```

Would you like specific code examples for any of these approaches?
README.md (Normal file, 239 lines)
@@ -0,0 +1,239 @@
# Meeting Processor

Extract screen content from meeting recordings and merge with Whisper transcripts for better Claude summarization.

## Overview

This tool enhances meeting transcripts by combining:
- **Audio transcription** (from Whisper)
- **Screen content** (OCR from screen shares)

The result is a rich, timestamped transcript that provides full context for AI summarization.

## Installation

### 1. System Dependencies

**Tesseract OCR** (recommended):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Arch Linux
sudo pacman -S tesseract
```

**FFmpeg** (for scene detection):
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

### 2. Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Optional: Install Alternative OCR Engines

```bash
# EasyOCR (better for rotated/handwritten text)
pip install easyocr

# PaddleOCR (better for code/terminal screens)
pip install paddleocr
```

## Quick Start

### Basic Usage (Screen Content Only)

```bash
python process_meeting.py samples/meeting.mkv
```

This will:
1. Extract frames every 5 seconds
2. Run OCR to extract screen text
3. Save enhanced transcript to `meeting_enhanced.txt`

### With Whisper Transcript

First, generate a Whisper transcript:
```bash
whisper samples/meeting.mkv --model base --output_format json
```

Then process with screen content:
```bash
python process_meeting.py samples/meeting.mkv --transcript samples/meeting.json
```

## Usage Examples

### Extract frames at different intervals
```bash
# Every 10 seconds
python process_meeting.py samples/meeting.mkv --interval 10

# Every 3 seconds (more detailed)
python process_meeting.py samples/meeting.mkv --interval 3
```

### Use scene detection (smarter, fewer frames)
```bash
python process_meeting.py samples/meeting.mkv --scene-detection
```

### Use different OCR engines
```bash
# EasyOCR (good for varied layouts)
python process_meeting.py samples/meeting.mkv --ocr-engine easyocr

# PaddleOCR (good for code/terminal)
python process_meeting.py samples/meeting.mkv --ocr-engine paddleocr
```

### Extract frames only (no merging)
```bash
python process_meeting.py samples/meeting.mkv --extract-only
```

### Custom output location
```bash
python process_meeting.py samples/meeting.mkv --output my_meeting.txt --frames-dir my_frames/
```

### Enable verbose logging
```bash
# Show detailed debug information
python process_meeting.py samples/meeting.mkv --verbose

# Short form
python process_meeting.py samples/meeting.mkv -v
```

## Output Files

After processing, you'll get:

- **`<video>_enhanced.txt`** - Enhanced transcript ready for Claude
- **`<video>_ocr.json`** - Raw OCR data with timestamps
- **`frames/`** - Extracted video frames (JPG files)
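The OCR JSON is a plain list of segments with `timestamp`, `text`, and `frame_path` keys, so it can be post-processed directly; a small self-contained sketch (the sample data is made up):

```python
import json

# Each entry in <video>_ocr.json is {'timestamp': ..., 'text': ..., 'frame_path': ...}.
# Tiny stand-in file so the example runs on its own:
sample = [
    {"timestamp": 12.5, "text": "Q4 Revenue: $2.3M", "frame_path": "frames/frame_00002_12.50s.jpg"},
    {"timestamp": 95.0, "text": "Roadmap 2025", "frame_path": "frames/frame_00019_95.00s.jpg"},
]
with open("meeting_ocr.json", "w", encoding="utf-8") as f:
    json.dump(sample, f)

with open("meeting_ocr.json", encoding="utf-8") as f:
    segments = json.load(f)

# e.g. keep only the screen text captured in the first minute
first_minute = [s["text"] for s in segments if s["timestamp"] < 60]
print(first_minute)  # ['Q4 Revenue: $2.3M']
```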

## Workflow for Meeting Analysis

### Complete Workflow

```bash
# 1. Extract audio and transcribe with Whisper
whisper samples/alo-intro1.mkv --model base --output_format json

# 2. Process video to extract screen content
python process_meeting.py samples/alo-intro1.mkv \
    --transcript samples/alo-intro1.json \
    --scene-detection

# 3. Use the enhanced transcript with Claude
# Copy the content from alo-intro1_enhanced.txt and paste into Claude
```

### Example Prompt for Claude

```
Please summarize this meeting transcript. Pay special attention to:
1. Key decisions made
2. Action items
3. Technical details shown on screen
4. Any metrics or data presented

[Paste enhanced transcript here]
```

## Command Reference

```
usage: process_meeting.py [-h] [--transcript TRANSCRIPT] [--output OUTPUT]
                          [--frames-dir FRAMES_DIR] [--interval INTERVAL]
                          [--scene-detection]
                          [--ocr-engine {tesseract,easyocr,paddleocr}]
                          [--no-deduplicate] [--extract-only]
                          [--format {detailed,compact}] [--verbose]
                          video

Options:
  video              Path to video file
  --transcript, -t   Path to Whisper transcript (JSON or TXT)
  --output, -o       Output file for enhanced transcript
  --frames-dir       Directory to save extracted frames (default: frames/)
  --interval         Extract frame every N seconds (default: 5)
  --scene-detection  Use scene detection instead of interval extraction
  --ocr-engine       OCR engine: tesseract, easyocr, paddleocr (default: tesseract)
  --no-deduplicate   Disable text deduplication
  --extract-only     Only extract frames and OCR, skip transcript merging
  --format           Output format: detailed or compact (default: detailed)
  --verbose, -v      Enable verbose logging (DEBUG level)
```

## Tips for Best Results

### Scene Detection vs Interval
- **Scene detection**: Better for presentations with distinct slides. More efficient.
- **Interval extraction**: Better for continuous screen sharing (coding, browsing). More thorough.

### OCR Engine Selection
- **Tesseract**: Best for clean slides, documents, presentations. Fast and lightweight.
- **EasyOCR**: Better for handwriting, rotated text, or varied fonts.
- **PaddleOCR**: Excellent for code, terminal outputs, and mixed languages.

### Deduplication
- Enabled by default - removes similar consecutive frames
- Disable with `--no-deduplicate` if slides change subtly
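Under the hood, deduplication compares consecutive OCR texts with `difflib.SequenceMatcher` and skips a frame when the similarity ratio exceeds the threshold (0.85 by default); in miniature:

```python
from difflib import SequenceMatcher

texts = [
    "Q4 Revenue: $2.3M  Growth: 45%",
    "Q4 Revenue: $2.3M  Growth: 45%",   # static slide, OCR'd twice
    "Roadmap 2025: ship v2, expand EU",
]

kept, prev = [], ""
for text in texts:
    # Skip the frame if it's nearly identical to the previous one
    if SequenceMatcher(None, prev, text).ratio() > 0.85:
        continue
    kept.append(text)
    prev = text

print(len(kept))  # 2
```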

## Troubleshooting

### "pytesseract not installed"
```bash
pip install pytesseract
sudo apt-get install tesseract-ocr  # Don't forget the system package!
```

### "No frames extracted"
- Check that the video file is valid: `ffmpeg -i video.mkv`
- Try a lower interval: `--interval 3`
- Check disk space in the frames directory

### Poor OCR quality
- Try a different OCR engine
- Check whether the video resolution is sufficient
- Use `--no-deduplicate` to keep more frames

### Scene detection not working
- The tool falls back to interval extraction automatically
- Ensure FFmpeg is installed
- Try a manual interval: `--interval 5`

## Project Structure

```
meetus/
├── meetus/                    # Main package
│   ├── __init__.py
│   ├── frame_extractor.py     # Video frame extraction
│   ├── ocr_processor.py       # OCR processing
│   └── transcript_merger.py   # Transcript merging
├── process_meeting.py         # Main CLI script
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```

## License

For personal use.
meetus/__init__.py (Normal file, 0 lines)

meetus/frame_extractor.py (Normal file, 119 lines)
@@ -0,0 +1,119 @@
"""
Extract frames from video files for OCR processing.
Supports both regular interval sampling and scene change detection.
"""
import cv2
from pathlib import Path
from typing import List, Tuple
import subprocess
import logging


logger = logging.getLogger(__name__)


class FrameExtractor:
    """Extract frames from video files."""

    def __init__(self, video_path: str, output_dir: str = "frames"):
        """
        Initialize frame extractor.

        Args:
            video_path: Path to video file
            output_dir: Directory to save extracted frames
        """
        self.video_path = video_path
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def extract_by_interval(self, interval_seconds: int = 5) -> List[Tuple[str, float]]:
        """
        Extract frames at regular intervals.

        Args:
            interval_seconds: Seconds between frame extractions

        Returns:
            List of (frame_path, timestamp) tuples
        """
        cap = cv2.VideoCapture(self.video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        # Guard against unreadable fps (0) to avoid modulo-by-zero below
        frame_interval = max(int(fps * interval_seconds), 1)

        frames_info = []
        frame_count = 0
        saved_count = 0

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            if frame_count % frame_interval == 0:
                timestamp = frame_count / fps if fps > 0 else 0.0
                frame_filename = f"frame_{saved_count:05d}_{timestamp:.2f}s.jpg"
                frame_path = self.output_dir / frame_filename

                cv2.imwrite(str(frame_path), frame)
                frames_info.append((str(frame_path), timestamp))
                saved_count += 1

            frame_count += 1

        cap.release()
        logger.info(f"Extracted {saved_count} frames at {interval_seconds}s intervals")
        return frames_info

    def extract_scene_changes(self, threshold: float = 30.0) -> List[Tuple[str, float]]:
        """
        Extract frames only on scene changes using FFmpeg.
        More efficient than interval-based extraction.

        Args:
            threshold: Scene change detection threshold (0-100, lower = more sensitive)

        Returns:
            List of (frame_path, timestamp) tuples
        """
        video_name = Path(self.video_path).stem
        output_pattern = self.output_dir / f"{video_name}_%05d.jpg"

        # Use FFmpeg's scene detection filter
        cmd = [
            'ffmpeg',
            '-i', self.video_path,
            '-vf', f'select=gt(scene\\,{threshold / 100}),showinfo',
            '-vsync', 'vfr',
            '-frame_pts', '1',
            str(output_pattern),
            '-loglevel', 'info'
        ]

        try:
            subprocess.run(cmd, capture_output=True, text=True, check=True)

            # Collect the extracted frames
            frames_info = []
            for img in sorted(self.output_dir.glob(f"{video_name}_*.jpg")):
                # Extract timestamp from filename or use FFprobe
                frames_info.append((str(img), 0.0))  # Timestamp extraction can be enhanced

            logger.info(f"Extracted {len(frames_info)} frames at scene changes")
            return frames_info

        except subprocess.CalledProcessError as e:
            logger.error(f"FFmpeg error: {e.stderr}")
            # Fall back to interval extraction
            logger.warning("Falling back to interval extraction...")
            return self.extract_by_interval()

    def get_video_duration(self) -> float:
        """Get video duration in seconds."""
        cap = cv2.VideoCapture(self.video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        duration = frame_count / fps if fps > 0 else 0
        cap.release()
        return duration
meetus/ocr_processor.py (Normal file, 143 lines)
@@ -0,0 +1,143 @@
"""
OCR processing for extracted video frames.
Supports multiple OCR engines and text deduplication.
"""
from typing import List, Tuple, Dict
from difflib import SequenceMatcher
import re
import logging


logger = logging.getLogger(__name__)


class OCRProcessor:
    """Process frames with OCR to extract text."""

    def __init__(self, engine: str = "tesseract", lang: str = "eng"):
        """
        Initialize OCR processor.

        Args:
            engine: OCR engine to use ('tesseract', 'easyocr', 'paddleocr')
            lang: Language code for OCR
        """
        self.engine = engine.lower()
        self.lang = lang
        self._ocr_engine = None
        self._init_engine()

    def _init_engine(self):
        """Initialize the selected OCR engine."""
        if self.engine == "tesseract":
            try:
                import pytesseract
                self._ocr_engine = pytesseract
            except ImportError:
                raise ImportError("pytesseract not installed. Run: pip install pytesseract")

        elif self.engine == "easyocr":
            try:
                import easyocr
                self._ocr_engine = easyocr.Reader([self.lang])
            except ImportError:
                raise ImportError("easyocr not installed. Run: pip install easyocr")

        elif self.engine == "paddleocr":
            try:
                from paddleocr import PaddleOCR
                self._ocr_engine = PaddleOCR(lang=self.lang, use_angle_cls=True, show_log=False)
            except ImportError:
                raise ImportError("paddleocr not installed. Run: pip install paddleocr")

        else:
            raise ValueError(f"Unknown OCR engine: {self.engine}")

    def extract_text(self, image_path: str) -> str:
        """
        Extract text from a single image.

        Args:
            image_path: Path to image file

        Returns:
            Extracted text
        """
        if self.engine == "tesseract":
            from PIL import Image
            image = Image.open(image_path)
            text = self._ocr_engine.image_to_string(image)

        elif self.engine == "easyocr":
            result = self._ocr_engine.readtext(image_path, detail=0)
            text = "\n".join(result)

        elif self.engine == "paddleocr":
            result = self._ocr_engine.ocr(image_path, cls=True)
            if result and result[0]:
                text = "\n".join([line[1][0] for line in result[0]])
            else:
                text = ""

        else:
            # _init_engine already validates the engine, but keep 'text' defined
            raise ValueError(f"Unknown OCR engine: {self.engine}")

        return self._clean_text(text)

    def _clean_text(self, text: str) -> str:
        """Clean up OCR output."""
        # Remove excessive whitespace
        text = re.sub(r'\n\s*\n', '\n', text)
        text = re.sub(r' +', ' ', text)
        return text.strip()

    def process_frames(
        self,
        frames_info: List[Tuple[str, float]],
        deduplicate: bool = True,
        similarity_threshold: float = 0.85
    ) -> List[Dict]:
        """
        Process multiple frames and extract text.

        Args:
            frames_info: List of (frame_path, timestamp) tuples
            deduplicate: Whether to remove similar consecutive texts
            similarity_threshold: Threshold for considering texts as duplicates (0-1)

        Returns:
            List of dicts with 'timestamp', 'text', and 'frame_path'
        """
        results = []
        prev_text = ""

        for frame_path, timestamp in frames_info:
            logger.debug(f"Processing frame at {timestamp:.2f}s...")
            text = self.extract_text(frame_path)

            if not text:
                continue

            # Deduplicate similar consecutive frames
            if deduplicate:
                similarity = self._text_similarity(prev_text, text)
                if similarity > similarity_threshold:
                    logger.debug(f"Skipping duplicate frame at {timestamp:.2f}s (similarity: {similarity:.2f})")
                    continue

            results.append({
                'timestamp': timestamp,
                'text': text,
                'frame_path': frame_path
            })

            prev_text = text

        logger.info(f"Extracted text from {len(results)} frames (deduplication: {deduplicate})")
        return results

    def _text_similarity(self, text1: str, text2: str) -> float:
        """
        Calculate similarity between two texts.

        Returns:
            Similarity score between 0 and 1
        """
        return SequenceMatcher(None, text1, text2).ratio()
meetus/transcript_merger.py (Normal file, 173 lines)
@@ -0,0 +1,173 @@
"""
Merge Whisper transcripts with OCR screen content.
Creates a unified, timestamped transcript for Claude summarization.
"""
from typing import List, Dict
import json
from pathlib import Path
import logging


logger = logging.getLogger(__name__)


class TranscriptMerger:
    """Merge audio transcripts with screen OCR text."""

    def load_whisper_transcript(self, transcript_path: str) -> List[Dict]:
        """
        Load Whisper transcript from file.

        Supports both JSON format (with timestamps) and plain text.

        Args:
            transcript_path: Path to transcript file

        Returns:
            List of dicts with 'timestamp' (optional) and 'text'
        """
        path = Path(transcript_path)

        if path.suffix == '.json':
            with open(path, 'r', encoding='utf-8') as f:
                data = json.load(f)

            # Handle different Whisper output formats
            if isinstance(data, dict) and 'segments' in data:
                # Standard Whisper JSON format
                return [
                    {
                        'timestamp': seg.get('start', 0),
                        'text': seg['text'].strip(),
                        'type': 'audio'
                    }
                    for seg in data['segments']
                ]
            elif isinstance(data, list):
                # List of segments
                return [
                    {
                        'timestamp': seg.get('start', seg.get('timestamp', 0)),
                        'text': seg['text'].strip(),
                        'type': 'audio'
                    }
                    for seg in data
                ]
            else:
                raise ValueError(f"Unrecognized transcript JSON format: {transcript_path}")

        else:
            # Plain text file - no timestamps
            with open(path, 'r', encoding='utf-8') as f:
                text = f.read().strip()

            return [{
                'timestamp': 0,
                'text': text,
                'type': 'audio'
            }]

    def merge_transcripts(
        self,
        audio_segments: List[Dict],
        screen_segments: List[Dict]
    ) -> List[Dict]:
        """
        Merge audio and screen transcripts by timestamp.

        Args:
            audio_segments: List of audio transcript segments
            screen_segments: List of screen OCR segments

        Returns:
            Merged list sorted by timestamp
        """
        # Mark segment types
        for seg in audio_segments:
            seg['type'] = 'audio'
        for seg in screen_segments:
            seg['type'] = 'screen'

        # Combine and sort by timestamp
        all_segments = audio_segments + screen_segments
        all_segments.sort(key=lambda x: x['timestamp'])

        return all_segments

    def format_for_claude(
        self,
        merged_segments: List[Dict],
        format_style: str = "detailed"
    ) -> str:
        """
        Format merged transcript for Claude processing.

        Args:
            merged_segments: Merged transcript segments
            format_style: 'detailed' or 'compact'

        Returns:
            Formatted transcript string
        """
        if format_style == "detailed":
            return self._format_detailed(merged_segments)
        else:
            return self._format_compact(merged_segments)

    def _format_detailed(self, segments: List[Dict]) -> str:
        """Format with clear visual separation between audio and screen content."""
        lines = []
        lines.append("=" * 80)
        lines.append("ENHANCED MEETING TRANSCRIPT")
        lines.append("Audio transcript + Screen content")
        lines.append("=" * 80)
        lines.append("")

        for seg in segments:
            timestamp = self._format_timestamp(seg['timestamp'])

            if seg['type'] == 'audio':
                lines.append(f"[{timestamp}] SPEAKER:")
                lines.append(f"  {seg['text']}")
                lines.append("")

            else:  # screen
                lines.append(f"[{timestamp}] SCREEN CONTENT:")
                # Indent screen text for visibility
                screen_text = seg['text'].replace('\n', '\n  | ')
                lines.append(f"  | {screen_text}")
                lines.append("")

        return "\n".join(lines)

    def _format_compact(self, segments: List[Dict]) -> str:
        """Compact format for shorter transcripts."""
        lines = []

        for seg in segments:
            timestamp = self._format_timestamp(seg['timestamp'])
            prefix = "SPEAKER" if seg['type'] == 'audio' else "SCREEN"
            text = seg['text'].replace('\n', ' ')[:200]  # Truncate long screen text
            lines.append(f"[{timestamp}] {prefix}: {text}")

        return "\n".join(lines)

    def _format_timestamp(self, seconds: float) -> str:
        """Format timestamp as MM:SS."""
        minutes = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{minutes:02d}:{secs:02d}"

    def save_transcript(self, formatted_text: str, output_path: str):
        """
        Save formatted transcript to file.

        Args:
            formatted_text: Formatted transcript
            output_path: Output file path
        """
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(formatted_text)

        logger.info(f"Saved enhanced transcript to: {output_path}")
meetus/utils/__init__.py (Normal file, 0 lines)

process_meeting.py (Normal file, 229 lines)
@@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""
Process meeting recordings to extract audio + screen content.
Combines Whisper transcripts with OCR from screen shares.
"""
import argparse
from pathlib import Path
import sys
import json
import logging

from meetus.frame_extractor import FrameExtractor
from meetus.ocr_processor import OCRProcessor
from meetus.transcript_merger import TranscriptMerger

logger = logging.getLogger(__name__)


def setup_logging(verbose: bool = False):
    """
    Configure logging for the application.

    Args:
        verbose: If True, set DEBUG level, otherwise INFO
    """
    level = logging.DEBUG if verbose else logging.INFO

    # Configure root logger
    logging.basicConfig(
        level=level,
        format='%(asctime)s - %(levelname)s - %(message)s',
        datefmt='%H:%M:%S'
    )

    # Suppress verbose output from libraries
    logging.getLogger('PIL').setLevel(logging.WARNING)
    logging.getLogger('easyocr').setLevel(logging.WARNING)
    logging.getLogger('paddleocr').setLevel(logging.WARNING)


def main():
    parser = argparse.ArgumentParser(
        description="Extract screen content from meeting recordings and merge with transcripts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Process video and extract frames only
  python process_meeting.py samples/meeting.mkv --extract-only

  # Process video with Whisper transcript
  python process_meeting.py samples/meeting.mkv --transcript meeting.json

  # Use scene detection instead of interval
  python process_meeting.py samples/meeting.mkv --scene-detection

  # Use different OCR engine
  python process_meeting.py samples/meeting.mkv --ocr-engine easyocr
"""
    )

    parser.add_argument(
        'video',
        help='Path to video file'
    )

    parser.add_argument(
        '--transcript', '-t',
        help='Path to Whisper transcript (JSON or TXT)',
        default=None
    )

    parser.add_argument(
        '--output', '-o',
        help='Output file for enhanced transcript (default: <video>_enhanced.txt)',
        default=None
    )

    parser.add_argument(
        '--frames-dir',
        help='Directory to save extracted frames (default: frames/)',
        default='frames'
    )

    parser.add_argument(
        '--interval',
        type=int,
        help='Extract frame every N seconds (default: 5)',
        default=5
    )

    parser.add_argument(
        '--scene-detection',
        action='store_true',
        help='Use scene detection instead of interval extraction'
    )

    parser.add_argument(
        '--ocr-engine',
        choices=['tesseract', 'easyocr', 'paddleocr'],
        help='OCR engine to use (default: tesseract)',
        default='tesseract'
    )

    parser.add_argument(
        '--no-deduplicate',
        action='store_true',
        help='Disable text deduplication'
    )

    parser.add_argument(
        '--extract-only',
        action='store_true',
        help='Only extract frames and OCR, skip transcript merging'
    )

    parser.add_argument(
        '--format',
        choices=['detailed', 'compact'],
        help='Output format style (default: detailed)',
        default='detailed'
    )

    parser.add_argument(
        '--verbose', '-v',
        action='store_true',
        help='Enable verbose logging (DEBUG level)'
    )

    args = parser.parse_args()

    # Set up logging
    setup_logging(args.verbose)

    # Validate video path
    video_path = Path(args.video)
    if not video_path.exists():
        logger.error(f"Video file not found: {args.video}")
        sys.exit(1)

    # Set default output path
    if args.output is None:
        args.output = video_path.stem + '_enhanced.txt'

    logger.info("=" * 80)
    logger.info("MEETING PROCESSOR")
    logger.info("=" * 80)
    logger.info(f"Video: {video_path.name}")
    logger.info(f"OCR Engine: {args.ocr_engine}")
    logger.info(f"Frame extraction: {'Scene detection' if args.scene_detection else f'Every {args.interval}s'}")
    logger.info("=" * 80)

    # Step 1: Extract frames
    logger.info("Step 1: Extracting frames from video...")
    extractor = FrameExtractor(str(video_path), args.frames_dir)

    if args.scene_detection:
        frames_info = extractor.extract_scene_changes()
    else:
        frames_info = extractor.extract_by_interval(args.interval)

    if not frames_info:
        logger.error("No frames extracted")
        sys.exit(1)

    logger.info(f"✓ Extracted {len(frames_info)} frames")

    # Step 2: Run OCR on frames
    logger.info("Step 2: Running OCR on extracted frames...")
    try:
        ocr = OCRProcessor(engine=args.ocr_engine)
        screen_segments = ocr.process_frames(
            frames_info,
            deduplicate=not args.no_deduplicate
        )
        logger.info(f"✓ Processed {len(screen_segments)} frames with text content")

    except ImportError as e:
        logger.error(f"{e}")
        logger.error(f"To install {args.ocr_engine}:")
        logger.error(f"  pip install {args.ocr_engine}")
        sys.exit(1)

    # Save OCR results as JSON
    ocr_output = video_path.stem + '_ocr.json'
    with open(ocr_output, 'w', encoding='utf-8') as f:
        json.dump(screen_segments, f, indent=2, ensure_ascii=False)
    logger.info(f"✓ Saved OCR results to: {ocr_output}")

    if args.extract_only:
        logger.info("Done! (extract-only mode)")
        return

    # Step 3: Merge with transcript (if provided)
    merger = TranscriptMerger()

    if args.transcript:
        logger.info("Step 3: Merging with Whisper transcript...")
        transcript_path = Path(args.transcript)

        if not transcript_path.exists():
            logger.warning(f"Transcript not found: {args.transcript}")
            logger.info("Proceeding with screen content only...")
            audio_segments = []
        else:
            audio_segments = merger.load_whisper_transcript(str(transcript_path))
            logger.info(f"✓ Loaded {len(audio_segments)} audio segments")
    else:
        logger.info("No transcript provided, using screen content only...")
        audio_segments = []

    # Merge and format
    merged = merger.merge_transcripts(audio_segments, screen_segments)
    formatted = merger.format_for_claude(merged, format_style=args.format)

    # Save output
    merger.save_transcript(formatted, args.output)

    logger.info("=" * 80)
    logger.info("✓ PROCESSING COMPLETE!")
    logger.info("=" * 80)
    logger.info(f"Enhanced transcript: {args.output}")
    logger.info(f"OCR data: {ocr_output}")
    logger.info(f"Frames: {args.frames_dir}/")
    logger.info("")
    logger.info("You can now use the enhanced transcript with Claude for summarization!")


if __name__ == '__main__':
    main()
requirements.txt (Normal file, 14 lines)
@@ -0,0 +1,14 @@
# Core dependencies
opencv-python>=4.8.0
Pillow>=10.0.0

# OCR engines (install at least one)
# Tesseract (recommended, lightweight)
pytesseract>=0.3.10

# Alternative OCR engines (optional, install as needed)
# easyocr>=1.7.0
# paddleocr>=2.7.0

# For Whisper transcription (if not already installed)
# openai-whisper>=20230918