From 174bc15368776725fab46fc99a6363db9df957ab Mon Sep 17 00:00:00 2001
From: buenosairesam
Date: Thu, 22 Jan 2026 06:02:01 -0300
Subject: [PATCH] add readme

---
 README.md | 214 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 214 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..71f06c9
--- /dev/null
+++ b/README.md
@@ -0,0 +1,214 @@
+# sysmonstm
+
+A real-time distributed system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard.
+
+## Overview
+
+sysmonstm demonstrates production microservices patterns (gRPC streaming, FastAPI, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.
+
+```
+┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
+│   Collector     │     │   Collector     │     │   Collector     │
+│   (Machine 1)   │     │   (Machine 2)   │     │   (Machine N)   │
+└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
+         │                       │                       │
+         │                gRPC Streaming                 │
+         └───────────────────────┼───────────────────────┘
+                                 ▼
+                    ┌────────────────────────┐
+                    │       Aggregator       │
+                    │  (gRPC Server + Redis  │
+                    │     + TimescaleDB)     │
+                    └────────────┬───────────┘
+                                 │
+              ┌──────────────────┼──────────────────┐
+              │                  │                  │
+              ▼                  ▼                  ▼
+     ┌────────────────┐   ┌──────────────┐   ┌──────────────┐
+     │    Gateway     │   │    Alerts    │   │  Event Stream│
+     │ (FastAPI + WS) │   │   Service    │   │ (Redis PubSub│
+     └────────┬───────┘   └──────────────┘   └──────────────┘
+              │
+              │ WebSocket
+              ▼
+     ┌────────────────┐
+     │    Browser     │
+     │   Dashboard    │
+     └────────────────┘
+```
+
+## Features
+
+- **Real-time streaming**: Collectors stream metrics via gRPC to central aggregator
+- **Multi-machine support**: Monitor any number of machines from a single dashboard
+- **Live dashboard**: WebSocket-powered updates with real-time graphs
+- **Tiered storage**: Redis for hot data, TimescaleDB for historical analysis
+- **Threshold alerts**: Configurable rules for CPU, memory, disk usage
+- **Event-driven**: Decoupled services via Redis Pub/Sub
+
+## Quick Start
+
+```bash
+# Start the full stack
+docker compose up
+
+# Open dashboard
+open http://localhost:8000
+```
+
+Metrics appear within seconds. The collector runs locally by default.
+
+### Monitor Additional Machines
+
+Run the collector on any machine you want to monitor:
+
+```bash
+# On a remote machine, point to your aggregator
+COLLECTOR_AGGREGATOR_URL=your-server:50051 \
+COLLECTOR_MACHINE_ID=my-laptop \
+python services/collector/main.py
+```
+
+## Architecture
+
+### Services
+
+| Service | Port | Description |
+|---------|------|-------------|
+| **Collector** | - | gRPC client that streams system metrics (CPU, memory, disk, network) |
+| **Aggregator** | 50051 | gRPC server that receives metrics, stores them, publishes events |
+| **Gateway** | 8000 | FastAPI server with REST API and WebSocket for dashboard |
+| **Alerts** | - | Subscribes to events, evaluates threshold rules, triggers notifications |
+
+### Infrastructure
+
+| Component | Purpose |
+|-----------|---------|
+| **Redis** | Current state cache, event pub/sub |
+| **TimescaleDB** | Historical metrics with automatic downsampling |
+
+### Key Patterns
+
+- **gRPC Streaming**: Collectors stream metrics continuously to the aggregator
+- **Event-Driven**: Services communicate via Redis Pub/Sub for decoupling
+- **Tiered Storage**: Hot data in Redis, historical in TimescaleDB
+- **Graceful Degradation**: System continues partially if storage fails
+
+## Project Structure
+
+```
+sysmonstm/
+├── proto/
+│   └── metrics.proto       # gRPC service definitions
+├── services/
+│   ├── collector/          # Metrics collection (psutil)
+│   ├── aggregator/         # Central gRPC server
+│   ├── gateway/            # FastAPI + WebSocket
+│   └── alerts/             # Threshold evaluation
+├── shared/
+│   ├── config.py           # Pydantic settings
+│   ├── logging.py          # Structured JSON logging
+│   └── events/             # Event pub/sub abstraction
+├── web/
+│   ├── static/             # CSS, JS
+│   └── templates/          # Dashboard HTML
+├── scripts/
+│   └── init-db.sql         # TimescaleDB schema
+├── docs/                   # Architecture diagrams & explainers
+├── docker-compose.yml
+└── Tiltfile                # Local Kubernetes dev
+```
+
+## Configuration
+
+All services use environment variables with sensible defaults:
+
+```bash
+# Collector
+COLLECTOR_MACHINE_ID=my-machine          # Machine identifier
+COLLECTOR_AGGREGATOR_URL=localhost:50051
+COLLECTOR_COLLECTION_INTERVAL=5          # Seconds between collections
+
+# Common
+REDIS_URL=redis://localhost:6379
+TIMESCALE_URL=postgresql://monitor:monitor@localhost:5432/monitor
+LOG_LEVEL=INFO
+LOG_FORMAT=json
+```
+
+## Metrics Collected
+
+- CPU: Overall percentage, per-core usage
+- Memory: Percentage, used/available bytes
+- Disk: Percentage, used bytes, read/write throughput
+- Network: Bytes sent/received per second, connection count
+- System: Process count, load averages (1m, 5m, 15m)
+
+## Development
+
+### Local Development with Hot Reload
+
+```bash
+# Use the override file for volume mounts
+docker compose -f docker-compose.yml -f docker-compose.override.yml up
+```
+
+### Kubernetes Development with Tilt
+
+```bash
+tilt up
+```
+
+### Running Services Individually
+
+```bash
+# Install dependencies
+python -m venv .venv
+source .venv/bin/activate
+pip install -r services/collector/requirements.txt
+
+# Generate protobuf code
+python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/metrics.proto
+
+# Run a service
+python services/collector/main.py
+```
+
+## API Endpoints
+
+### REST (Gateway)
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /` | Dashboard UI |
+| `GET /api/machines` | List all monitored machines |
+| `GET /api/machines/{id}/metrics` | Current metrics for a machine |
+| `GET /api/machines/{id}/history` | Historical metrics |
+| `GET /health` | Health check |
+| `GET /ready` | Readiness check (includes dependencies) |
+
+### WebSocket
+
+Connect to `ws://localhost:8000/ws` for real-time metric updates.
+
+## Documentation
+
+Detailed documentation is available in the `docs/` folder:
+
+- [Architecture Diagrams](docs/architecture/) - System overview, data flow, deployment
+- [Building sysmonstm](docs/explainer/sysmonstm-from-start-to-finish.md) - Deep dive into implementation decisions
+- [Domain Applications](docs/explainer/other-applications.md) - How these patterns apply to payment processing and other domains
+
+## Tech Stack
+
+- **Python 3.11+** with async/await throughout
+- **gRPC** for inter-service communication
+- **FastAPI** for REST API and WebSocket
+- **Redis** for caching and pub/sub
+- **TimescaleDB** for time-series storage
+- **psutil** for system metrics collection
+- **Docker Compose** for orchestration
+
+## License
+
+MIT
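
Note on the Configuration section of the README added by this patch: it lists three `COLLECTOR_*` environment variables, and the project tree says `shared/config.py` handles them with Pydantic settings. That file is not part of this patch, so the following is only a minimal standard-library sketch of the same idea, not the project's actual code. The names `CollectorSettings` and `load_collector_settings` are hypothetical, and treating the values shown in the README (`my-machine`, `localhost:50051`, `5`) as the defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CollectorSettings:
    """Hypothetical stand-in for the collector's settings object."""

    machine_id: str
    aggregator_url: str
    collection_interval: int  # seconds between collections


def load_collector_settings(env: Mapping[str, str] = os.environ) -> CollectorSettings:
    """Read the COLLECTOR_* variables, falling back to the README's example values.

    The fallbacks are assumed defaults; the real ones live in shared/config.py.
    """
    interval = int(env.get("COLLECTOR_COLLECTION_INTERVAL", "5"))
    if interval <= 0:
        raise ValueError("COLLECTOR_COLLECTION_INTERVAL must be a positive integer")
    return CollectorSettings(
        machine_id=env.get("COLLECTOR_MACHINE_ID", "my-machine"),
        aggregator_url=env.get("COLLECTOR_AGGREGATOR_URL", "localhost:50051"),
        collection_interval=interval,
    )


if __name__ == "__main__":
    # Prints the settings resolved from the current process environment.
    print(load_collector_settings())
```

Passing the environment as a mapping keeps the loader easy to exercise without touching the real process environment, for example `load_collector_settings({"COLLECTOR_MACHINE_ID": "my-laptop"})`.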