# sysmonstm

A real-time distributed system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard.

## Overview

sysmonstm demonstrates production microservices patterns (gRPC streaming, FastAPI, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    Collector    │   │    Collector    │   │    Collector    │
│   (Machine 1)   │   │   (Machine 2)   │   │   (Machine N)   │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         │  gRPC Streaming     │                     │
         └─────────────────────┼─────────────────────┘
                               ▼
                  ┌────────────────────────┐
                  │       Aggregator       │
                  │  (gRPC Server + Redis  │
                  │     + TimescaleDB)     │
                  └────────────┬───────────┘
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
             ▼                 ▼                 ▼
    ┌────────────────┐  ┌──────────────┐  ┌───────────────┐
    │    Gateway     │  │    Alerts    │  │ Event Stream  │
    │ (FastAPI + WS) │  │   Service    │  │ (Redis PubSub)│
    └────────┬───────┘  └──────────────┘  └───────────────┘
             │
             │ WebSocket
             ▼
    ┌────────────────┐
    │    Browser     │
    │   Dashboard    │
    └────────────────┘
```

## Features

- **Real-time streaming**: Collectors stream metrics via gRPC to a central aggregator
- **Multi-machine support**: Monitor any number of machines from a single dashboard
- **Live dashboard**: WebSocket-powered updates with real-time graphs
- **Tiered storage**: Redis for hot data, TimescaleDB for historical analysis
- **Threshold alerts**: Configurable rules for CPU, memory, and disk usage
- **Event-driven**: Decoupled services via Redis Pub/Sub

## Quick Start

```bash
# Start the full stack
docker compose up

# Open dashboard
open http://localhost:8000
```

Metrics appear within seconds. The collector runs locally by default.
### Monitor Additional Machines

Run the collector on any machine you want to monitor:

```bash
# On a remote machine, point to your aggregator
COLLECTOR_AGGREGATOR_URL=your-server:50051 \
COLLECTOR_MACHINE_ID=my-laptop \
python services/collector/main.py
```

## Architecture

### Services

| Service | Port | Description |
|---------|------|-------------|
| **Collector** | - | gRPC client that streams system metrics (CPU, memory, disk, network) |
| **Aggregator** | 50051 | gRPC server that receives metrics, stores them, publishes events |
| **Gateway** | 8000 | FastAPI server with REST API and WebSocket for dashboard |
| **Alerts** | - | Subscribes to events, evaluates threshold rules, triggers notifications |

### Infrastructure

| Component | Purpose |
|-----------|---------|
| **Redis** | Current state cache, event pub/sub |
| **TimescaleDB** | Historical metrics with automatic downsampling |

### Key Patterns

- **gRPC Streaming**: Collectors stream metrics continuously to the aggregator
- **Event-Driven**: Services communicate via Redis Pub/Sub for decoupling
- **Tiered Storage**: Hot data in Redis, historical in TimescaleDB
- **Graceful Degradation**: System continues partially if storage fails

## Project Structure

```
sysmonstm/
├── proto/
│   └── metrics.proto       # gRPC service definitions
├── services/
│   ├── collector/          # Metrics collection (psutil)
│   ├── aggregator/         # Central gRPC server
│   ├── gateway/            # FastAPI + WebSocket
│   └── alerts/             # Threshold evaluation
├── shared/
│   ├── config.py           # Pydantic settings
│   ├── logging.py          # Structured JSON logging
│   └── events/             # Event pub/sub abstraction
├── web/
│   ├── static/             # CSS, JS
│   └── templates/          # Dashboard HTML
├── scripts/
│   └── init-db.sql         # TimescaleDB schema
├── docs/                   # Architecture diagrams & explainers
├── docker-compose.yml
└── Tiltfile                # Local Kubernetes dev
```

## Configuration

All services use environment variables with sensible defaults:

```bash
# Collector
COLLECTOR_MACHINE_ID=my-machine                 # Machine identifier
COLLECTOR_AGGREGATOR_URL=localhost:50051
COLLECTOR_COLLECTION_INTERVAL=5                 # Seconds between collections

# Common
REDIS_URL=redis://localhost:6379
TIMESCALE_URL=postgresql://monitor:monitor@localhost:5432/monitor
LOG_LEVEL=INFO
LOG_FORMAT=json
```

## Metrics Collected

- CPU: Overall percentage, per-core usage
- Memory: Percentage, used/available bytes
- Disk: Percentage, used bytes, read/write throughput
- Network: Bytes sent/received per second, connection count
- System: Process count, load averages (1m, 5m, 15m)

## Development

### Local Development with Hot Reload

```bash
# Use the override file for volume mounts
docker compose -f docker-compose.yml -f docker-compose.override.yml up
```

### Kubernetes Development with Tilt

```bash
tilt up
```

### Running Services Individually

```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r services/collector/requirements.txt

# Generate protobuf code
python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/metrics.proto

# Run a service
python services/collector/main.py
```

## API Endpoints

### REST (Gateway)

| Endpoint | Description |
|----------|-------------|
| `GET /` | Dashboard UI |
| `GET /api/machines` | List all monitored machines |
| `GET /api/machines/{id}/metrics` | Current metrics for a machine |
| `GET /api/machines/{id}/history` | Historical metrics |
| `GET /health` | Health check |
| `GET /ready` | Readiness check (includes dependencies) |

### WebSocket

Connect to `ws://localhost:8000/ws` for real-time metric updates.
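A dashboard client consuming that WebSocket mainly needs to merge each JSON update into per-machine state. The message schema below is hypothetical (the README doesn't document the gateway's actual payload format), but the dispatch pattern is the same either way — a minimal sketch:

```python
import json


def handle_update(raw, state):
    """Merge one WebSocket metric update into per-machine dashboard state.

    `raw` is a JSON string; the "machine_id"/"metrics" field names are
    assumptions — check the gateway's actual message schema.
    """
    msg = json.loads(raw)
    machine = msg["machine_id"]
    # Merge, so partial updates don't wipe previously seen metrics.
    state.setdefault(machine, {}).update(msg["metrics"])
    return state


state = {}
handle_update('{"machine_id": "web-1", "metrics": {"cpu_percent": 42.5}}', state)
handle_update('{"machine_id": "web-1", "metrics": {"mem_percent": 61.0}}', state)
print(state)
# → {'web-1': {'cpu_percent': 42.5, 'mem_percent': 61.0}}
```

The same merge logic works whether the frames arrive via `websockets`, a browser `WebSocket`, or a test harness feeding strings directly.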
## Documentation

Detailed documentation is available in the `docs/` folder:

- [Architecture Diagrams](docs/architecture/) - System overview, data flow, deployment
- [Building sysmonstm](docs/explainer/sysmonstm-from-start-to-finish.md) - Deep dive into implementation decisions
- [Domain Applications](docs/explainer/other-applications.md) - How these patterns apply to payment processing and other domains

## Tech Stack

- **Python 3.11+** with async/await throughout
- **gRPC** for inter-service communication
- **FastAPI** for REST API and WebSocket
- **Redis** for caching and pub/sub
- **TimescaleDB** for time-series storage
- **psutil** for system metrics collection
- **Docker Compose** for orchestration

## License

MIT