mediaproc/docs/architecture/04-media-storage.md
2026-03-13 01:07:02 -03:00


Media Storage Architecture

Overview

MPR uses S3-compatible storage everywhere: MinIO locally, AWS S3 (or GCS via its S3-compatible API) in production. The same boto3 code and S3 keys work in every environment - the only differences are the S3_ENDPOINT_URL env var and the credentials.

Storage Strategy

S3 Buckets

Bucket         Env Var        Purpose
mpr-media-in   S3_BUCKET_IN   Source media files
mpr-media-out  S3_BUCKET_OUT  Transcoded/trimmed output

S3 Keys as File Paths

  • Database: Stores S3 object keys (e.g., video1.mp4, subfolder/video3.mp4)
  • Local dev: MinIO serves these via S3 API on port 9000
  • AWS: Real S3, same keys, different endpoint

Why S3 Everywhere?

  1. Identical code paths - no branching between local and cloud
  2. Seamless executor switching - Celery, Lambda, and Cloud Run Jobs all use boto3
  3. Cloud-native - ready for production without refactoring
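
The endpoint switch behind these points can be sketched as a small client factory. This is a hedged sketch, not the project's actual helper (which lives in core/storage): the pure kwargs builder is separated out so the env-var logic is visible, and credentials are left to boto3's standard AWS_* resolution.

```python
import os

def s3_client_kwargs() -> dict:
    """Build boto3 client kwargs from the environment.

    If S3_ENDPOINT_URL is set (MinIO, GCS), point the client at it;
    if unset, boto3 defaults to real AWS S3. Credentials come from the
    standard AWS_* env vars, which boto3 resolves on its own.
    """
    kwargs = {}
    endpoint = os.environ.get("S3_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    return kwargs

def get_s3_client():
    # Imported lazily so the kwargs helper stays usable without boto3 installed.
    import boto3
    return boto3.client("s3", **s3_client_kwargs())
```

Because the branch lives entirely in configuration, callers never check which backend they are talking to.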

Local Development (MinIO)

Configuration

S3_ENDPOINT_URL=http://minio:9000
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin

How It Works

  • MinIO runs as a Docker container (port 9000 API, port 9001 console)
  • minio-init container creates buckets and sets public read access on startup
  • Nginx proxies /media/in/ and /media/out/ to MinIO buckets
  • Upload files via MinIO Console (http://localhost:9001) or mc CLI

Upload Files to MinIO

# Using mc CLI
mc alias set local http://localhost:9000 minioadmin minioadmin
mc cp video.mp4 local/mpr-media-in/

# Using aws CLI with endpoint override
aws --endpoint-url http://localhost:9000 s3 cp video.mp4 s3://mpr-media-in/

AWS Production (S3)

Configuration

# No S3_ENDPOINT_URL = uses real AWS S3
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=<real-key>
AWS_SECRET_ACCESS_KEY=<real-secret>

Upload Files to S3

aws s3 cp video.mp4 s3://mpr-media-in/
aws s3 sync /local/media/ s3://mpr-media-in/

GCP Production (GCS via S3 compatibility)

GCS exposes an S3-compatible API. The same core/storage/s3.py boto3 code works with no changes — only the endpoint and credentials differ.

GCS HMAC Keys

Generate under Cloud Storage → Settings → Interoperability in the GCP console. These act as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.

Configuration

S3_ENDPOINT_URL=https://storage.googleapis.com
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_ACCESS_KEY_ID=<GCS HMAC access key>
AWS_SECRET_ACCESS_KEY=<GCS HMAC secret>

# Executor
MPR_EXECUTOR=gcp
GCP_PROJECT_ID=my-project
GCP_REGION=us-central1
CLOUD_RUN_JOB=mpr-transcode
CALLBACK_URL=https://mpr.mcrn.ar/api
CALLBACK_API_KEY=<secret>

Upload Files to GCS

gcloud storage cp video.mp4 gs://mpr-media-in/

# Or with the aws CLI via compat endpoint
aws --endpoint-url https://storage.googleapis.com s3 cp video.mp4 s3://mpr-media-in/

Cloud Run Job Handler

core/task/gcp_handler.py is the Cloud Run Job entrypoint. It reads the job payload from MPR_JOB_PAYLOAD (injected by GCPExecutor), uses core/storage for all GCS access (S3 compat), and POSTs the completion callback to the API.

Set the Cloud Run Job command to: python -m core.task.gcp_handler
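
The MPR_JOB_PAYLOAD handoff described above can be sketched as follows. The payload's field names are hypothetical - the real shape is defined by GCPExecutor and core/task/gcp_handler.py; only the env-var mechanics are shown here.

```python
import json
import os

def read_job_payload() -> dict:
    """Parse the JSON job payload injected by GCPExecutor.

    Cloud Run Jobs have no request body, so the executor passes the
    payload through the MPR_JOB_PAYLOAD environment variable.
    """
    raw = os.environ.get("MPR_JOB_PAYLOAD")
    if not raw:
        raise RuntimeError(
            "MPR_JOB_PAYLOAD not set - was this job started by GCPExecutor?"
        )
    return json.loads(raw)
```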

Storage Module

The core/storage/ package provides all S3 operations:

from core.storage import (
    get_s3_client,     # boto3 client (MinIO or AWS)
    list_objects,      # List bucket contents, filter by extension
    download_file,     # Download S3 object to local path
    download_to_temp,  # Download to temp file (caller cleans up)
    upload_file,       # Upload local file to S3
    get_presigned_url, # Generate presigned URL
    BUCKET_IN,         # Input bucket name
    BUCKET_OUT,        # Output bucket name
)
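
As an illustration of the temp-file contract, here is a minimal sketch of what a download_to_temp-style helper might look like. It is not the real implementation: the client is passed in explicitly (the package presumably uses get_s3_client internally), and only boto3's download_fileobj call is assumed.

```python
import tempfile

def download_to_temp(client, bucket: str, key: str) -> str:
    """Download an S3 object to a named temp file; the caller deletes it.

    The key's extension is preserved so tools like FFmpeg can sniff the
    container format from the filename.
    """
    suffix = "." + key.rsplit(".", 1)[-1] if "." in key else ""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        client.download_fileobj(bucket, key, tmp)
    finally:
        tmp.close()
    return tmp.name
```

Because `client` only needs boto3's download_fileobj signature, the same helper works against MinIO, AWS S3, or GCS.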

API Endpoints

Scan Media (REST)

POST /api/assets/scan

Lists objects in S3_BUCKET_IN, registers new media files.

Scan Media (GraphQL)

mutation { scanMediaFolder { found registered skipped files } }
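
Both endpoints report the same found/registered/skipped/files counters, so the shared scan logic reduces to a set difference. A minimal sketch with the I/O injected (helper and parameter names are hypothetical; the real listing is done via list_objects and registration goes through the database):

```python
def scan_media(list_keys, known) -> dict:
    """Compare bucket contents against already-registered keys.

    list_keys: callable returning the media keys currently in S3_BUCKET_IN.
    known: set of S3 keys already registered in the database.
    """
    found = list(list_keys())
    new = [k for k in found if k not in known]
    return {
        "found": len(found),
        "registered": len(new),
        "skipped": len(found) - len(new),
        "files": new,
    }
```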

Job Flow with S3

Local Mode (Celery)

  1. Celery task receives source_key and output_key
  2. Downloads source from S3_BUCKET_IN to temp file
  3. Runs FFmpeg locally
  4. Uploads result to S3_BUCKET_OUT
  5. Cleans up temp files

Lambda Mode (AWS)

  1. Step Functions invokes Lambda with S3 keys
  2. Lambda downloads source from S3_BUCKET_IN to /tmp
  3. Runs FFmpeg in container
  4. Uploads result to S3_BUCKET_OUT
  5. Calls back to API with result

Cloud Run Job Mode (GCP)

  1. GCPExecutor triggers Cloud Run Job with payload in MPR_JOB_PAYLOAD
  2. core/task/gcp_handler.py downloads source from S3_BUCKET_IN (GCS S3 compat)
  3. Runs FFmpeg in container
  4. Uploads result to S3_BUCKET_OUT (GCS S3 compat)
  5. Calls back to API with result

All three paths use the same S3-compatible bucket names and key structure.
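
The three flows share one download-process-upload skeleton. A hedged sketch of that shared shape, with the FFmpeg step injected as a callable so the structure stays testable (function and parameter names are hypothetical, not the project's actual API):

```python
import os
import tempfile

def process_media(client, source_key, output_key, transcode,
                  bucket_in="mpr-media-in", bucket_out="mpr-media-out"):
    """Skeleton common to the Celery, Lambda, and Cloud Run Job paths.

    transcode(src_path, dst_path) stands in for the FFmpeg invocation:
    it reads the local input file and writes the local output file.
    """
    src = tempfile.NamedTemporaryFile(delete=False); src.close()
    dst = tempfile.NamedTemporaryFile(delete=False); dst.close()
    try:
        client.download_file(bucket_in, source_key, src.name)   # step 2
        transcode(src.name, dst.name)                           # step 3
        client.upload_file(dst.name, bucket_out, output_key)    # step 4
    finally:
        os.unlink(src.name)                                     # step 5
        os.unlink(dst.name)
```

Only the surrounding plumbing differs per executor: Celery cleans up and returns, while Lambda and Cloud Run Jobs additionally POST a completion callback to the API.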

Supported File Types

Video: .mp4, .mkv, .avi, .mov, .webm, .flv, .wmv, .m4v
Audio: .mp3, .wav, .flac, .aac, .ogg, .m4a
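
These lists are what the scan step filters against. A minimal sketch of such a check (the real filtering is done by list_objects in core/storage; the constant and function names here are hypothetical):

```python
import os.path

VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv", ".m4v"}
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".aac", ".ogg", ".m4a"}

def is_supported(key: str) -> bool:
    """True if the S3 key's extension is a supported media type (case-insensitive)."""
    return os.path.splitext(key)[1].lower() in VIDEO_EXTS | AUDIO_EXTS
```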