added docs

This commit is contained in:
buenosairesam
2025-12-31 12:45:28 -03:00
parent 040fccc58d
commit b526bde98e
19 changed files with 1382 additions and 276 deletions


@@ -2,6 +2,8 @@
This is the story of building a distributed system monitoring platform. Not a tutorial with sanitized examples, but an explanation of the actual decisions made, the trade-offs considered, and the code that resulted.
![System Architecture Overview](images/01-architecture-overview.svg)
## The Problem
I have multiple development machines. A workstation, a laptop, sometimes a remote VM. Each one occasionally runs out of disk space, hits memory limits, or has a runaway process eating CPU. The pattern was always the same: something breaks, I SSH in, run `htop`, realize the problem, fix it.
@@ -29,6 +31,8 @@ service MetricsService {
The collector is the client. It streams metrics. The aggregator is the server. It receives them. When the stream ends (collector shuts down, network drops), the aggregator replies with a single `StreamAck`.
![gRPC Streaming Pattern](images/02-grpc-streaming.svg)
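A minimal sketch of what the client-streaming RPC could look like. `MetricsService` and `StreamAck` appear in the actual proto above; the rpc name and the `Metric` message are assumptions for illustration:

```proto
// Client-streaming: the collector opens the stream and pushes Metric
// messages; the aggregator answers exactly once, when the stream closes.
service MetricsService {
  rpc StreamMetrics(stream Metric) returns (StreamAck);
}
```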
### Why This Storage Tier Approach
Metrics have different access patterns at different ages:
@@ -45,6 +49,8 @@ Storing everything in one place forces a choice between fast reads (keep it all
The aggregator writes to both on every batch. Redis for the live dashboard. TimescaleDB for history.
![Storage Tiers](images/03-storage-tiers.svg)
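A sketch of the dual-write step, assuming duck-typed clients (the function and key names here are hypothetical, not the project's real code):

```python
import json


def write_batch(batch, redis_client, timescale_conn):
    """Write one metrics batch to both storage tiers.

    redis_client needs a set(key, value) method; timescale_conn needs
    executemany(sql, rows). Both are assumptions about the interface.
    """
    # Hot tier: overwrite the latest value per (host, metric) so the
    # dashboard can read it with a single key lookup.
    for m in batch:
        key = f"metrics:latest:{m['host']}:{m['name']}"
        redis_client.set(key, json.dumps(m))

    # Cold tier: append every point to the TimescaleDB hypertable.
    timescale_conn.executemany(
        "INSERT INTO metrics (time, host, name, value) VALUES (%s, %s, %s, %s)",
        [(m["ts"], m["host"], m["name"], m["value"]) for m in batch],
    )
```

Both writes happen per batch, so a dashboard read never waits on the historical insert path.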
### Why Event-Driven for Alerts
The alerts service needs to evaluate every metric against threshold rules. Two options:
@@ -70,6 +76,8 @@ class EventSubscriber(ABC):
Currently backed by Redis Pub/Sub (`shared/events/redis_pubsub.py`). The abstraction means switching to Kafka or RabbitMQ later requires implementing a new backend, not changing any service code.
![Event-Driven Architecture](images/04-event-driven.svg)
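To illustrate why swapping backends touches no service code, here is a sketch of the abstraction with a hypothetical in-memory backend standing in for `redis_pubsub.py` (the method signatures are assumptions; the real interface lives in `shared/events/`):

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable


class EventSubscriber(ABC):
    """Backend-agnostic subscription interface (sketch)."""

    @abstractmethod
    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        """Register handler for every event published on channel."""


class InMemorySubscriber(EventSubscriber):
    """Hypothetical in-memory backend; a Redis or Kafka backend would
    implement the same subscribe() and push events from the broker."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, channel, handler):
        self._handlers[channel].append(handler)

    def publish(self, channel, event):
        # In a real backend this is driven by the broker, not called locally.
        for handler in self._handlers[channel]:
            handler(event)
```

The alerts service only ever sees `EventSubscriber`, so a Kafka backend is a new subclass, not a service change.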
## Phase 1: MVP - Getting Streaming to Work
The goal was simple: run a collector, see metrics appear in the aggregator's logs.