This is the story of building a distributed system monitoring platform. Not a tutorial with sanitized examples, but an explanation of the actual decisions made, the trade-offs considered, and the code that resulted.

## The Problem
I have multiple development machines. A workstation, a laptop, sometimes a remote VM. Each one occasionally runs out of disk space, hits memory limits, or has a runaway process eating CPU. The pattern was always the same: something breaks, I SSH in, run `htop`, realize the problem, fix it.
The collector is the client: it streams metrics. The aggregator is the server: it receives them. When the stream ends (the collector shuts down, or the network drops), the collector gets a `StreamAck` response back from the aggregator.
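For reference, the client-streaming shape described above can be sketched in protobuf. Only `MetricsService` and `StreamAck` come from the actual service definition; the RPC name, message names, and fields here are illustrative assumptions, not the project's real schema.

```protobuf
syntax = "proto3";

service MetricsService {
  // The collector streams batches; the aggregator replies exactly once,
  // after the stream closes, with a StreamAck.
  rpc StreamMetrics(stream MetricsBatch) returns (StreamAck);
}

message MetricsBatch {
  string host = 1;
  int64 timestamp_unix_ms = 2;
  double cpu_percent = 3;
  double memory_percent = 4;
}

message StreamAck {
  int64 batches_received = 1;
}
```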

### Why This Storage Tier Approach
Metrics have different access patterns at different ages: the most recent samples back a live dashboard and are read constantly, while older data is queried only occasionally, for historical analysis.
The aggregator writes to both tiers on every batch: Redis for the live dashboard, TimescaleDB for history.
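A sketch of that dual write. The clients are duck-typed stand-ins (a redis-py-style `set()` and a DB-API-style `execute()`); the function name, key scheme, and table columns are illustrative assumptions, not the project's actual code.

```python
import json


def write_batch(redis_client, pg_cursor, batch):
    """Write one metrics batch to both storage tiers.

    redis_client: redis-py-like object with set(key, value).
    pg_cursor:    DB-API-like cursor with execute(sql, params).
    """
    for metric in batch:
        # Hot tier: overwrite the latest sample per host for the live dashboard.
        redis_client.set(f"latest:{metric['host']}", json.dumps(metric))
        # Cold tier: append every sample for history (a TimescaleDB hypertable
        # in the real system).
        pg_cursor.execute(
            "INSERT INTO metrics (time, host, cpu_percent) VALUES (%s, %s, %s)",
            (metric["time"], metric["host"], metric["cpu_percent"]),
        )
```

Keeping the write in one function means a failed TimescaleDB insert is visible in the same place the Redis write happens, which matters once you start thinking about partial-write behavior.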

### Why Event-Driven for Alerts
The alerts service needs to evaluate every metric against threshold rules. Two options: poll the stored metrics on an interval, or react to each metric as it is published. Event-driven won: alerts fire as soon as a threshold is crossed, and nothing hammers the database with polling queries.
The event bus is currently backed by Redis Pub/Sub (`shared/events/redis_pubsub.py`). Because services depend only on the `EventSubscriber` abstraction, switching to Kafka or RabbitMQ later means implementing a new backend, not changing any service code.
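A minimal sketch of what that abstraction might look like, with an in-memory backend standing in for the Redis one. Only the `EventSubscriber` name appears in the source; `EventPublisher`, `InMemoryBus`, and the method signatures are illustrative assumptions.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable


class EventPublisher(ABC):
    @abstractmethod
    def publish(self, topic: str, event: dict) -> None: ...


class EventSubscriber(ABC):
    @abstractmethod
    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None: ...


class InMemoryBus(EventPublisher, EventSubscriber):
    """Dev/test backend; the real one wraps Redis Pub/Sub."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        # Synchronous fan-out: every handler registered for the topic
        # sees every event, in subscription order.
        for handler in self._handlers[topic]:
            handler(event)
```

A service takes an `EventSubscriber` in its constructor and never imports a concrete backend, which is exactly what makes the Kafka/RabbitMQ swap a backend-only change.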

## Phase 1: MVP - Getting Streaming to Work
The goal was simple: run a collector, see metrics appear in the aggregator's logs.
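As a taste of what an MVP collector can gather before reaching for anything fancy, here is a stdlib-only sketch of a single sample; the field names are assumptions, and the real collector's set of metrics may differ.

```python
import os
import shutil
import socket
import time


def collect_sample(path="/"):
    """Take one point-in-time sample using only the standard library."""
    disk = shutil.disk_usage(path)
    sample = {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "disk_used_percent": 100.0 * disk.used / disk.total,
    }
    # Load average is Unix-only; skip it on platforms that lack it.
    if hasattr(os, "getloadavg"):
        sample["load_1m"] = os.getloadavg()[0]
    return sample
```

Stream dicts like this through the gRPC client and print them in the aggregator's handler, and the "metrics appear in the logs" milestone is done.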