# Media & Artifact Storage
## Overview
MPR stores everything on **S3-compatible** object storage. Locally that's MinIO; in any
cloud target (AWS, GCS via HMAC, Cloudflare R2, etc.) it's the provider's S3 API. The
code in `core/storage/` uses boto3 throughout — only the endpoint URL and credentials
change between environments.
## What goes where
| Bucket / prefix | Contents | Producer | Consumer |
|---|---|---|---|
| `mpr-media-in` | Source video files (chunks the user uploaded or device-recorded) | user / chunker UI | `extract_frames` stage, `core/api/detect/sources.py` |
| `mpr-media-out` | Per-job artifacts: extracted frame caches, debug overlays | pipeline stages, `core/api/detect/replay.py` overlay endpoints | UI panels (frame strip, overlay viewer) |
Both buckets live behind the same S3 client (`core/storage/`). DB rows store relative
keys (e.g. `chunks/2025-04-15/match-01.mp4`); the bucket is implicit.
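The key-resolution rule above can be sketched as follows (helper name and `output` flag are hypothetical; the real code lives in `core/storage/`):

```python
# DB rows store only the relative key; the bucket is implied by the
# caller's context (source media vs. produced artifacts).
BUCKET_IN = "mpr-media-in"
BUCKET_OUT = "mpr-media-out"


def object_ref(relative_key: str, *, output: bool = False) -> tuple[str, str]:
    """Resolve a DB-stored relative key into a (bucket, key) pair."""
    bucket = BUCKET_OUT if output else BUCKET_IN
    return bucket, relative_key


# A source chunk recorded in the DB resolves against the input bucket:
bucket, key = object_ref("chunks/2025-04-15/match-01.mp4")
# -> ("mpr-media-in", "chunks/2025-04-15/match-01.mp4")
```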
## Local development (MinIO)
```bash
S3_ENDPOINT_URL=http://minio:9000
S3_BUCKET_IN=mpr-media-in
S3_BUCKET_OUT=mpr-media-out
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
```
In the Tilt setup, MinIO runs as a k8s Deployment with port-forwards for `9000` (S3 API)
and `9001` (web console). A `minio-init` job creates the buckets on first start.
## Cloud (AWS S3 / GCS / others)
```bash
# AWS S3 — no endpoint URL needed
S3_BUCKET_IN=...
S3_BUCKET_OUT=...
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
# GCS via HMAC
S3_ENDPOINT_URL=https://storage.googleapis.com
AWS_ACCESS_KEY_ID=<gcs hmac access>
AWS_SECRET_ACCESS_KEY=<gcs hmac secret>
```
## Database vs. object storage
Heavy artifacts (frames, masks, overlays) live in MinIO/S3. The `Checkpoint` and
`StageOutput` tables in Postgres (see `02-data-model.svg`) hold structured outputs
(detections, stats, references to S3 keys) — never blobs. Frame caches keyed by
`timeline_id` are written by the first run of `extract_frames` and reused by every
later replay on the same timeline.
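The reuse rule can be expressed as a small predicate (names hypothetical — a sketch of the logic, not the pipeline's actual code): a replay only extracts frames when no cache exists under the timeline's prefix.

```python
def frame_cache_prefix(timeline_id: str) -> str:
    """Assumed key layout: one cache prefix per timeline."""
    return f"frames/{timeline_id}/"


def needs_extraction(existing_keys: set[str], timeline_id: str) -> bool:
    """True when no cached frames exist for this timeline yet."""
    prefix = frame_cache_prefix(timeline_id)
    return not any(key.startswith(prefix) for key in existing_keys)
```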
## Storage module
`core/storage/` exposes the small set of helpers callers need:
```python
from core.storage import (
    get_s3_client,
    list_objects,
    download_file,
    download_to_temp,
    upload_file,
    get_presigned_url,
    BUCKET_IN,
    BUCKET_OUT,
)
```
Anything else (multipart, lifecycle, versioning) is the bucket's responsibility, not
the application's.