# sysmonstm

A real-time distributed system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard.

## Overview

sysmonstm demonstrates production microservices patterns (gRPC streaming, FastAPI, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Collector    │     │    Collector    │     │    Collector    │
│   (Machine 1)   │     │   (Machine 2)   │     │   (Machine N)   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │                gRPC Streaming                 │
         └───────────────────────┼───────────────────────┘
                                 ▼
                    ┌────────────────────────┐
                    │       Aggregator       │
                    │  (gRPC Server + Redis  │
                    │     + TimescaleDB)     │
                    └────────────┬───────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
     ┌────────────────┐  ┌──────────────┐   ┌──────────────┐
     │    Gateway     │  │    Alerts    │   │ Event Stream │
     │ (FastAPI + WS) │  │   Service    │   │(Redis PubSub)│
     └────────┬───────┘  └──────────────┘   └──────────────┘
              │
              │ WebSocket
              ▼
     ┌────────────────┐
     │    Browser     │
     │   Dashboard    │
     └────────────────┘
```

## Features

- **Real-time streaming**: Collectors stream metrics via gRPC to a central aggregator
- **Multi-machine support**: Monitor any number of machines from a single dashboard
- **Live dashboard**: WebSocket-powered updates with real-time graphs
- **Tiered storage**: Redis for hot data, TimescaleDB for historical analysis
- **Threshold alerts**: Configurable rules for CPU, memory, and disk usage
- **Event-driven**: Decoupled services via Redis Pub/Sub

## Quick Start

```bash
# Start the full stack
docker compose up

# Open the dashboard (macOS; use xdg-open on Linux)
open http://localhost:8000
```

Metrics appear within seconds. The collector runs locally by default.

### Monitor Additional Machines

Run the collector on any machine you want to monitor:

```bash
# On a remote machine, point to your aggregator
COLLECTOR_AGGREGATOR_URL=your-server:50051 \
COLLECTOR_MACHINE_ID=my-laptop \
python services/collector/main.py
```

## Architecture

### Services

| Service | Port | Description |
|---------|------|-------------|
| **Collector** | - | gRPC client that streams system metrics (CPU, memory, disk, network) |
| **Aggregator** | 50051 | gRPC server that receives metrics, stores them, publishes events |
| **Gateway** | 8000 | FastAPI server with REST API and WebSocket for dashboard |
| **Alerts** | - | Subscribes to events, evaluates threshold rules, triggers notifications |

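The threshold evaluation done by the Alerts service can be pictured with a small sketch. The rule shape, the metric names (`cpu_percent`, `memory_percent`, `disk_percent`), and the limits below are all assumptions made for illustration; they are not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule shape: fire when a metric rises above a limit."""

    metric: str  # assumed metric name, e.g. "cpu_percent"
    limit: float  # threshold in the metric's own unit
    severity: str = "warning"


def evaluate(rules, sample):
    """Return the rules whose metric is present in `sample` and above its limit."""
    return [r for r in rules if sample.get(r.metric, float("-inf")) > r.limit]


# Illustrative rules and one incoming sample (values are made up).
rules = [
    ThresholdRule("cpu_percent", 90.0, "critical"),
    ThresholdRule("memory_percent", 85.0),
    ThresholdRule("disk_percent", 80.0),
]
sample = {"cpu_percent": 97.2, "memory_percent": 41.0, "disk_percent": 80.0}

fired = evaluate(rules, sample)
print([f"{r.severity}: {r.metric}" for r in fired])
```

The comparison is strict, so a disk reading exactly at the limit does not fire. Whether the real service uses strict or inclusive comparisons is not stated here.
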
### Infrastructure

| Component | Purpose |
|-----------|---------|
| **Redis** | Current state cache, event pub/sub |
| **TimescaleDB** | Historical metrics with automatic downsampling |

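To show what the pub/sub role in the table means in practice, here is a minimal in-memory stand-in. It is not the project's `shared/events/` abstraction and does not talk to Redis; the class name, the topic name `metrics.received`, and the event fields are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]


class InMemoryEventBus:
    """In-process stand-in for a pub/sub backend (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict[str, Any]) -> int:
        """Deliver `event` to every handler on `topic`; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryEventBus()
gateway_inbox: list[dict[str, Any]] = []
alerts_inbox: list[dict[str, Any]] = []

# Two independent consumers; the publisher knows nothing about either.
bus.subscribe("metrics.received", gateway_inbox.append)
bus.subscribe("metrics.received", alerts_inbox.append)

delivered = bus.publish("metrics.received", {"machine_id": "my-laptop", "cpu_percent": 12.5})
print(delivered, gateway_inbox == alerts_inbox)
```

The publisher only names a topic, so consumers can be added or removed without touching the code that emits events.
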
### Key Patterns

- **gRPC Streaming**: Collectors stream metrics continuously to the aggregator
- **Event-Driven**: Services communicate via Redis Pub/Sub for decoupling
- **Tiered Storage**: Hot data in Redis, historical data in TimescaleDB
- **Graceful Degradation**: The system keeps running with reduced functionality if a storage backend fails

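The graceful degradation pattern can be sketched as writing to each storage tier independently and logging failures instead of propagating them. The function name, the sink names, and the sample fields are assumptions; the real aggregator may handle failures differently.

```python
import logging

logger = logging.getLogger("aggregator-sketch")


def store_with_degradation(sample, sinks):
    """Write `sample` to every sink; log failures and return the names that succeeded."""
    succeeded = []
    for name, write in sinks.items():
        try:
            write(sample)
            succeeded.append(name)
        except Exception as exc:  # one failing tier must not block the others
            logger.warning("sink %s unavailable, continuing: %s", name, exc)
    return succeeded


hot_cache = {}


def write_cache(sample):
    # Stand-in for the hot tier: keep only the latest sample per machine.
    hot_cache[sample["machine_id"]] = sample


def write_history(sample):
    # Stand-in for the historical tier, simulated as being down.
    raise ConnectionError("database unreachable")


sample = {"machine_id": "my-laptop", "cpu_percent": 12.5}
stored_in = store_with_degradation(sample, {"cache": write_cache, "history": write_history})
print(stored_in)
```

In this sketch the live view stays current from the cache while history has a gap for the duration of the outage.
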
## Project Structure

```
sysmonstm/
├── proto/
│   └── metrics.proto       # gRPC service definitions
├── services/
│   ├── collector/          # Metrics collection (psutil)
│   ├── aggregator/         # Central gRPC server
│   ├── gateway/            # FastAPI + WebSocket
│   └── alerts/             # Threshold evaluation
├── shared/
│   ├── config.py           # Pydantic settings
│   ├── logging.py          # Structured JSON logging
│   └── events/             # Event pub/sub abstraction
├── web/
│   ├── static/             # CSS, JS
│   └── templates/          # Dashboard HTML
├── scripts/
│   └── init-db.sql         # TimescaleDB schema
├── docs/                   # Architecture diagrams & explainers
├── docker-compose.yml
└── Tiltfile                # Local Kubernetes dev
```

## Configuration

All services use environment variables with sensible defaults:

```bash
# Collector
COLLECTOR_MACHINE_ID=my-machine          # Machine identifier
COLLECTOR_AGGREGATOR_URL=localhost:50051
COLLECTOR_COLLECTION_INTERVAL=5          # Seconds between collections

# Common
REDIS_URL=redis://localhost:6379
TIMESCALE_URL=postgresql://monitor:monitor@localhost:5432/monitor
LOG_LEVEL=INFO
LOG_FORMAT=json
```

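As a rough illustration of how the collector variables might be read, here is a standard-library sketch. The project itself uses Pydantic settings (`shared/config.py`), so the `CollectorSettings` class below is hypothetical, and it assumes the values listed above are the defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectorSettings:
    """Hypothetical settings holder; not the project's Pydantic model."""

    machine_id: str
    aggregator_url: str
    collection_interval: int  # seconds between collections

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        return cls(
            machine_id=env.get("COLLECTOR_MACHINE_ID", "my-machine"),
            aggregator_url=env.get("COLLECTOR_AGGREGATOR_URL", "localhost:50051"),
            collection_interval=int(env.get("COLLECTOR_COLLECTION_INTERVAL", "5")),
        )


defaults = CollectorSettings.from_env({})
remote = CollectorSettings.from_env(
    {"COLLECTOR_AGGREGATOR_URL": "your-server:50051", "COLLECTOR_MACHINE_ID": "my-laptop"}
)
print(defaults)
print(remote)
```

Passing a mapping instead of reading `os.environ` directly keeps the sketch easy to exercise without touching the real environment.
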
## Metrics Collected

- CPU: Overall percentage, per-core usage
- Memory: Percentage, used/available bytes
- Disk: Percentage, used bytes, read/write throughput
- Network: Bytes sent/received per second, connection count
- System: Process count, load averages (1m, 5m, 15m)

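The per-second network and disk throughput figures are presumably derived values, since psutil's `net_io_counters()` and `disk_io_counters()` report cumulative totals rather than rates. A minimal sketch of that derivation follows; the helper name and the 5-second interval are assumptions, and the collector may compute rates differently.

```python
def per_second(previous: int, current: int, elapsed_seconds: float) -> float:
    """Convert two cumulative counter readings into an average rate.

    A counter that went backwards (reset or wrap) is reported as 0.0
    instead of a negative rate.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return max(current - previous, 0) / elapsed_seconds


# Two made-up readings of a bytes-sent counter taken 5 seconds apart.
rate = per_second(previous=1_000_000, current=1_050_000, elapsed_seconds=5.0)
print(rate)  # 10000.0 bytes per second
```
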
## Development

### Local Development with Hot Reload

```bash
# Use the override file for volume mounts
docker compose -f docker-compose.yml -f docker-compose.override.yml up
```

### Kubernetes Development with Tilt

```bash
tilt up
```

### Running Services Individually

```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r services/collector/requirements.txt

# Generate protobuf code
python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/metrics.proto

# Run a service
python services/collector/main.py
```

## API Endpoints

### REST (Gateway)

| Endpoint | Description |
|----------|-------------|
| `GET /` | Dashboard UI |
| `GET /api/machines` | List all monitored machines |
| `GET /api/machines/{id}/metrics` | Current metrics for a machine |
| `GET /api/machines/{id}/history` | Historical metrics |
| `GET /health` | Health check |
| `GET /ready` | Readiness check (includes dependencies) |

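A small client sketch for the per-machine endpoints above, using only the standard library. The helper names are hypothetical, the base URL is the Gateway address from the Quick Start, and `fetch_json` assumes the endpoints return JSON; the response schema is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # Gateway address from the Quick Start


def machine_url(machine_id: str, resource: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a machine's "metrics" or "history" endpoint."""
    if resource not in ("metrics", "history"):
        raise ValueError(f"unknown resource: {resource!r}")
    # Percent-encode the id so spaces or slashes cannot alter the path.
    return f"{base_url.rstrip('/')}/api/machines/{quote(machine_id, safe='')}/{resource}"


def fetch_json(url: str, timeout: float = 5.0):
    """Fetch and decode a JSON body (assumed format; requires a running Gateway)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(machine_url("my-laptop", "metrics"))
```

With the stack running, `fetch_json(machine_url("my-laptop", "metrics"))` would return whatever the Gateway sends for that machine.
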
### WebSocket

Connect to `ws://localhost:8000/ws` for real-time metric updates.

## Documentation

Detailed documentation is available in the `docs/` folder:

- [Architecture Diagrams](docs/architecture/) - System overview, data flow, deployment
- [Building sysmonstm](docs/explainer/sysmonstm-from-start-to-finish.md) - Deep dive into implementation decisions
- [Domain Applications](docs/explainer/other-applications.md) - How these patterns apply to payment processing and other domains

## Tech Stack

- **Python 3.11+** with async/await throughout
- **gRPC** for inter-service communication
- **FastAPI** for REST API and WebSocket
- **Redis** for caching and pub/sub
- **TimescaleDB** for time-series storage
- **psutil** for system metrics collection
- **Docker Compose** for orchestration

## License

MIT