# Media & Artifact Storage

## Overview

MPR stores everything on **S3-compatible** object storage. Locally that's MinIO; in any
cloud target (AWS, GCS via HMAC, Cloudflare R2, etc.) it's the provider's S3 API. The
code in `core/storage/` uses boto3 throughout — only the endpoint URL and credentials
change between environments.

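The endpoint switch can be sketched as a small helper. This is a minimal illustration
using the env-var names from this doc; `s3_client_kwargs` is hypothetical, not the
actual `core/storage` API:

```python
import os

def s3_client_kwargs(env=None):
    """Build boto3 client kwargs from the environment (illustrative helper).

    With S3_ENDPOINT_URL unset, boto3 talks to real AWS S3; setting it points
    the same code at MinIO, GCS, R2, or any other S3-compatible endpoint.
    """
    env = os.environ if env is None else env
    kwargs = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    endpoint = env.get("S3_ENDPOINT_URL")
    if endpoint:  # MinIO / GCS / R2; omit the key entirely for AWS
        kwargs["endpoint_url"] = endpoint
    return kwargs

# boto3.client("s3", **s3_client_kwargs()) then works unchanged everywhere.
```
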
## What goes where

### S3 Buckets

| Bucket | Env var | Contents | Producer | Consumer |
|---|---|---|---|---|
| `mpr-media-in` | `S3_BUCKET_IN` | Source video files (chunks the user uploaded or device-recorded) | user / chunker UI | `extract_frames` stage, `core/api/detect/sources.py` |
| `mpr-media-out` | `S3_BUCKET_OUT` | Per-job artifacts: extracted frame caches, debug overlays, transcoded/trimmed output | pipeline stages, `core/api/detect/replay.py` overlay endpoints | UI panels (frame strip, overlay viewer) |

Both buckets live behind the same S3 client (`core/storage/`). DB rows store relative
keys (e.g. `chunks/2025-04-15/match-01.mp4`); the bucket is implicit. Locally, MinIO
serves these keys via its S3 API on port 9000; on AWS the same keys resolve against
real S3 with only the endpoint changed.

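The relative-key convention can be illustrated with a small helper. `chunk_key` is
hypothetical, built from the example key above — only the `chunks/<date>/<file>` shape
comes from this doc:

```python
from datetime import date
from pathlib import PurePosixPath

def chunk_key(day: date, filename: str) -> str:
    """Build a bucket-relative key like 'chunks/2025-04-15/match-01.mp4'.

    Only the relative key is stored in the DB row; which bucket it lives in
    (mpr-media-in here) is implied by the code path that uses it.
    """
    return str(PurePosixPath("chunks") / day.isoformat() / filename)
```

A row would then store `chunks/2025-04-15/match-01.mp4`, never a full `s3://` URL.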
## Local development (MinIO)

### Configuration

```bash
S3_ENDPOINT_URL=http://minio:9000
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
```

### How It Works

In the Tilt setup, MinIO runs as a k8s Deployment with port-forwards for `9000` (S3 API)
and `9001` (web console). A `minio-init` job creates the buckets and sets public read
access on first start, and Nginx proxies `/media/in/` and `/media/out/` to the MinIO
buckets. Upload files via the MinIO Console (http://localhost:9001) or the `mc` CLI.

### Upload Files to MinIO

```bash
# Using mc CLI
mc alias set local http://localhost:9000 minioadmin minioadmin
mc cp video.mp4 local/mpr-media-in/

# Using aws CLI with endpoint override
aws --endpoint-url http://localhost:9000 s3 cp video.mp4 s3://mpr-media-in/
```

## Cloud (AWS S3 / GCS / others)

### AWS Production (S3)

```bash
# No S3_ENDPOINT_URL set, so boto3 uses real AWS S3
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
```

### Upload Files to S3

```bash
aws s3 cp video.mp4 s3://mpr-media-in/
aws s3 sync /local/media/ s3://mpr-media-in/
```

### GCP Production (GCS via S3 compatibility)

GCS exposes an S3-compatible API. The same `core/storage/s3.py` boto3 code works
with no changes — only the endpoint and credentials differ.

### GCS HMAC Keys

Generate under **Cloud Storage → Settings → Interoperability** in the GCP console.
These act as `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`.

### Configuration

```bash
# GCS via HMAC
S3_ENDPOINT_URL=https://storage.googleapis.com
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_ACCESS_KEY_ID=<GCS HMAC access key>
AWS_SECRET_ACCESS_KEY=<GCS HMAC secret>

# Executor
MPR_EXECUTOR=gcp
GCP_PROJECT_ID=my-project
GCP_REGION=us-central1
CLOUD_RUN_JOB=mpr-transcode
CALLBACK_URL=https://mpr.mcrn.ar/api
CALLBACK_API_KEY=<secret>
```

### Upload Files to GCS

```bash
gcloud storage cp video.mp4 gs://mpr-media-in/

# Or with the aws CLI via compat endpoint
aws --endpoint-url https://storage.googleapis.com s3 cp video.mp4 s3://mpr-media-in/
```

## Database vs. object storage

Heavy artifacts (frames, masks, overlays) live in MinIO/S3. The `Checkpoint` and
`StageOutput` tables in Postgres (see `02-data-model.svg`) hold structured outputs
(detections, stats, references to S3 keys) — never blobs. Frame caches keyed by
`timeline_id` are written by the first run of `extract_frames` and reused by every
later replay on the same timeline.

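The reuse rule can be illustrated with a hypothetical key layout — the real scheme is
not documented here; only the fact that the cache is keyed by `timeline_id` comes from
this doc:

```python
def frame_cache_prefix(timeline_id: str) -> str:
    """Hypothetical mpr-media-out prefix for a timeline's frame cache.

    Because the prefix depends only on timeline_id, a replay of the same
    timeline computes the same prefix and finds the frames already there.
    """
    return f"frames/{timeline_id}/"

def cache_hit(existing_keys, timeline_id: str) -> bool:
    # A replay can skip extract_frames when any cached frame already exists.
    prefix = frame_cache_prefix(timeline_id)
    return any(key.startswith(prefix) for key in existing_keys)
```
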
### Cloud Run Job Handler

`core/task/gcp_handler.py` is the Cloud Run Job entrypoint. It reads the job payload
from `MPR_JOB_PAYLOAD` (injected by `GCPExecutor`), uses `core/storage` for all
GCS access (S3 compat), and POSTs the completion callback to the API.

Set the Cloud Run Job command to: `python -m core.task.gcp_handler`

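The payload handoff described above can be sketched as follows, assuming the payload is
serialized as JSON — the field names and helper are illustrative, not the real handler:

```python
import json
import os

def load_job_payload(env=None) -> dict:
    """Read the payload GCPExecutor injects via the MPR_JOB_PAYLOAD env var.

    Assumes a JSON-encoded payload; the real handler in
    core/task/gcp_handler.py may validate or structure it differently.
    """
    env = os.environ if env is None else env
    return json.loads(env["MPR_JOB_PAYLOAD"])
```
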
## Storage Module

`core/storage/` exposes the small set of helpers callers need:

```python
from core.storage import (
    get_s3_client,      # boto3 client (MinIO or AWS)
    list_objects,       # list bucket contents, filter by extension
    download_file,      # download S3 object to local path
    download_to_temp,   # download to temp file (caller cleans up)
    upload_file,        # upload local file to S3
    get_presigned_url,  # generate presigned URL
    BUCKET_IN,          # input bucket name
    BUCKET_OUT,         # output bucket name
)
```

## API Endpoints

### Scan Media (REST)

```http
POST /api/assets/scan
```

Lists objects in `S3_BUCKET_IN` and registers any new media files.

### Scan Media (GraphQL)

```graphql
mutation { scanMediaFolder { found registered skipped files } }
```

## Job Flow with S3

### Local Mode (Celery)

1. Celery task receives `source_key` and `output_key`
2. Downloads source from `S3_BUCKET_IN` to temp file
3. Runs FFmpeg locally
4. Uploads result to `S3_BUCKET_OUT`
5. Cleans up temp files

### Lambda Mode (AWS)

1. Step Functions invokes Lambda with S3 keys
2. Lambda downloads source from `S3_BUCKET_IN` to `/tmp`
3. Runs FFmpeg in container
4. Uploads result to `S3_BUCKET_OUT`
5. Calls back to API with result

### Cloud Run Job Mode (GCP)

1. `GCPExecutor` triggers Cloud Run Job with payload in `MPR_JOB_PAYLOAD`
2. `core/task/gcp_handler.py` downloads source from `S3_BUCKET_IN` (GCS S3 compat)
3. Runs FFmpeg in container
4. Uploads result to `S3_BUCKET_OUT` (GCS S3 compat)
5. Calls back to API with result

All three paths use the same S3-compatible bucket names and key structure.

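The shared download → transcode → upload skeleton behind all three modes can be sketched
like this. `download`, `upload`, and `run` are injected stand-ins for the `core/storage`
helpers and `subprocess.run`; the real signatures and FFmpeg flags may differ:

```python
import os
import subprocess
import tempfile

def run_transcode_job(source_key, output_key, download, upload, run=subprocess.run):
    """Skeleton shared by the Celery, Lambda, and Cloud Run paths.

    download(key, path) fetches from the input bucket; upload(path, key)
    writes to the output bucket. Only how the job is *launched* differs
    between executors.
    """
    src = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False).name
    dst = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False).name
    try:
        download(source_key, src)                          # fetch source
        run(["ffmpeg", "-y", "-i", src, dst], check=True)  # transcode
        upload(dst, output_key)                            # store result
    finally:
        os.unlink(src)                                     # clean up temp files
        os.unlink(dst)
```
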
## Supported File Types

**Video:** `.mp4`, `.mkv`, `.avi`, `.mov`, `.webm`, `.flv`, `.wmv`, `.m4v`

**Audio:** `.mp3`, `.wav`, `.flac`, `.aac`, `.ogg`, `.m4a`

Anything else (multipart, lifecycle, versioning) is the bucket's responsibility, not
the application's.

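The two lists above translate directly into an extension filter. The helper name is
illustrative — the doc only says filtering by extension happens behind `list_objects`:

```python
from pathlib import PurePosixPath

VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv", ".m4v"}
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".aac", ".ogg", ".m4a"}

def is_supported_media(key: str) -> bool:
    """True if an S3 key's extension is in the supported video/audio sets."""
    return PurePosixPath(key).suffix.lower() in (VIDEO_EXTS | AUDIO_EXTS)
```
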