add readme

buenosairesam
2026-01-22 06:02:01 -03:00
parent 79ccae8a6e
commit 174bc15368

README.md Normal file

@@ -0,0 +1,214 @@
# sysmonstm
A real-time distributed system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard.
## Overview
sysmonstm demonstrates production microservices patterns (gRPC streaming, FastAPI, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Collector    │     │    Collector    │     │    Collector    │
│   (Machine 1)   │     │   (Machine 2)   │     │   (Machine N)   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │                gRPC Streaming                 │
         └───────────────────────┼───────────────────────┘
                    ┌────────────────────────┐
                    │       Aggregator       │
                    │  (gRPC Server + Redis  │
                    │     + TimescaleDB)     │
                    └────────────┬───────────┘
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
     ┌────────────────┐   ┌──────────────┐   ┌──────────────┐
     │    Gateway     │   │    Alerts    │   │ Event Stream │
     │ (FastAPI + WS) │   │   Service    │   │(Redis PubSub)│
     └────────┬───────┘   └──────────────┘   └──────────────┘
              │ WebSocket
     ┌────────────────┐
     │    Browser     │
     │   Dashboard    │
     └────────────────┘
```
## Features
- **Real-time streaming**: Collectors stream metrics via gRPC to central aggregator
- **Multi-machine support**: Monitor any number of machines from a single dashboard
- **Live dashboard**: WebSocket-powered updates with real-time graphs
- **Tiered storage**: Redis for hot data, TimescaleDB for historical analysis
- **Threshold alerts**: Configurable rules for CPU, memory, disk usage
- **Event-driven**: Decoupled services via Redis Pub/Sub
## Quick Start
```bash
# Start the full stack
docker compose up
# Open dashboard (`open` is macOS-only; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8000
```
Metrics appear within seconds. The collector runs locally by default.
### Monitor Additional Machines
Run the collector on any machine you want to monitor:
```bash
# On a remote machine, point to your aggregator
COLLECTOR_AGGREGATOR_URL=your-server:50051 \
COLLECTOR_MACHINE_ID=my-laptop \
python services/collector/main.py
```
## Architecture
### Services
| Service | Port | Description |
|---------|------|-------------|
| **Collector** | - | gRPC client that streams system metrics (CPU, memory, disk, network) |
| **Aggregator** | 50051 | gRPC server that receives metrics, stores them, publishes events |
| **Gateway** | 8000 | FastAPI server with REST API and WebSocket for dashboard |
| **Alerts** | - | Subscribes to events, evaluates threshold rules, triggers notifications |
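
The Alerts row above mentions threshold rules. As a rough illustration of what evaluating such rules can look like, here is a minimal sketch. The `ThresholdRule` class, the `evaluate` function, and the metric key names (`cpu_percent`, `memory_percent`, `disk_percent`) are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule shape: fire when `metric` exceeds `limit` (percent)."""
    metric: str            # key in the metrics sample; names are assumptions
    limit: float
    severity: str = "warning"


def evaluate(rules, sample):
    """Return the rules violated by one metrics sample (a plain dict)."""
    violated = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Skip metrics that are missing from the sample instead of raising.
        if value is not None and value > rule.limit:
            violated.append(rule)
    return violated


rules = [
    ThresholdRule("cpu_percent", 90.0, "critical"),
    ThresholdRule("memory_percent", 85.0),
    ThresholdRule("disk_percent", 95.0, "critical"),
]
sample = {"machine_id": "my-laptop", "cpu_percent": 97.2, "memory_percent": 61.0}
alerts = evaluate(rules, sample)
```

In this sketch a missing metric (here `disk_percent`) is treated as "no violation"; a real service might prefer to flag stale or incomplete samples separately.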
### Infrastructure
| Component | Purpose |
|-----------|---------|
| **Redis** | Current state cache, event pub/sub |
| **TimescaleDB** | Historical metrics with automatic downsampling |
### Key Patterns
- **gRPC Streaming**: Collectors stream metrics continuously to the aggregator
- **Event-Driven**: Services communicate via Redis Pub/Sub for decoupling
- **Tiered Storage**: Hot data in Redis, historical in TimescaleDB
- **Graceful Degradation**: The system keeps running with reduced functionality if a storage backend fails
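
To make the tiered storage and graceful degradation patterns above more concrete, the following sketch writes each sample to every configured store and keeps going when one of them fails. `InMemoryStore` and `store_sample` are hypothetical stand-ins for illustration only; the actual aggregator talks to Redis and TimescaleDB and may handle failures differently.

```python
import logging

logger = logging.getLogger("aggregator.sketch")


class InMemoryStore:
    """Stand-in for a real backend (Redis or TimescaleDB in the project)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.rows = []

    def write(self, sample):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        self.rows.append(sample)


def store_sample(sample, stores):
    """Try every store; return the names of those that accepted the sample."""
    succeeded = []
    for store in stores:
        try:
            store.write(sample)
            succeeded.append(store.name)
        except Exception as exc:
            # One failing backend must not block writes to the others.
            logger.warning("write to %s failed: %s", store.name, exc)
    return succeeded


hot = InMemoryStore("hot")
cold = InMemoryStore("cold", healthy=False)
result = store_sample({"machine_id": "my-laptop", "cpu_percent": 12.5}, [hot, cold])
```

Here the "cold" store is down, yet the sample still reaches the "hot" store, so live views could keep updating while history is temporarily incomplete.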
## Project Structure
```
sysmonstm/
├── proto/
│ └── metrics.proto # gRPC service definitions
├── services/
│ ├── collector/ # Metrics collection (psutil)
│ ├── aggregator/ # Central gRPC server
│ ├── gateway/ # FastAPI + WebSocket
│ └── alerts/ # Threshold evaluation
├── shared/
│ ├── config.py # Pydantic settings
│ ├── logging.py # Structured JSON logging
│ └── events/ # Event pub/sub abstraction
├── web/
│ ├── static/ # CSS, JS
│ └── templates/ # Dashboard HTML
├── scripts/
│ └── init-db.sql # TimescaleDB schema
├── docs/ # Architecture diagrams & explainers
├── docker-compose.yml
└── Tiltfile # Local Kubernetes dev
```
## Configuration
All services use environment variables with sensible defaults:
```bash
# Collector
COLLECTOR_MACHINE_ID=my-machine # Machine identifier
COLLECTOR_AGGREGATOR_URL=localhost:50051
COLLECTOR_COLLECTION_INTERVAL=5 # Seconds between collections
# Common
REDIS_URL=redis://localhost:6379
TIMESCALE_URL=postgresql://monitor:monitor@localhost:5432/monitor
LOG_LEVEL=INFO
LOG_FORMAT=json
```
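
For readers unfamiliar with environment-based configuration, the sketch below shows one way to read the collector variables listed above using only the standard library. The project itself uses Pydantic settings in `shared/config.py`; `CollectorConfig` and `load_collector_config` are hypothetical names, and the fallback values simply mirror the ones shown above rather than confirmed defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectorConfig:
    machine_id: str
    aggregator_url: str
    collection_interval: int  # seconds between collections


def load_collector_config(env=None):
    """Build a config from a mapping of environment variables."""
    env = os.environ if env is None else env
    return CollectorConfig(
        # Fallbacks mirror the example values above (assumed, not verified).
        machine_id=env.get("COLLECTOR_MACHINE_ID", "my-machine"),
        aggregator_url=env.get("COLLECTOR_AGGREGATOR_URL", "localhost:50051"),
        collection_interval=int(env.get("COLLECTOR_COLLECTION_INTERVAL", "5")),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
config = load_collector_config({"COLLECTOR_MACHINE_ID": "my-laptop"})
```

Accepting the mapping as a parameter makes the loader easy to test without touching the real process environment.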
## Metrics Collected
- CPU: Overall percentage, per-core usage
- Memory: Percentage, used/available bytes
- Disk: Percentage, used bytes, read/write throughput
- Network: Bytes sent/received per second, connection count
- System: Process count, load averages (1m, 5m, 15m)
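
Per-second figures such as network throughput are typically derived from cumulative counters (for example, `psutil.net_io_counters()` reports total `bytes_sent` and `bytes_recv`). The sketch below shows that calculation in isolation; `CounterSnapshot` and `per_second_rates` are hypothetical names, and whether the collector computes rates exactly this way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSnapshot:
    timestamp: float  # seconds, e.g. from time.monotonic()
    bytes_sent: int   # cumulative counter
    bytes_recv: int   # cumulative counter


def per_second_rates(previous, current):
    """Return (sent_per_sec, recv_per_sec) between two snapshots."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        return 0.0, 0.0
    # Counters can reset (reboot, interface restart); clamp deltas at zero.
    sent = max(current.bytes_sent - previous.bytes_sent, 0) / elapsed
    recv = max(current.bytes_recv - previous.bytes_recv, 0) / elapsed
    return sent, recv


prev = CounterSnapshot(timestamp=100.0, bytes_sent=1_000, bytes_recv=5_000)
curr = CounterSnapshot(timestamp=105.0, bytes_sent=6_000, bytes_recv=25_000)
rates = per_second_rates(prev, curr)
```

With a 5-second interval, 5,000 bytes sent and 20,000 bytes received work out to 1,000 and 4,000 bytes per second respectively.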
## Development
### Local Development with Hot Reload
```bash
# Use the override file for volume mounts
docker compose -f docker-compose.yml -f docker-compose.override.yml up
```
### Kubernetes Development with Tilt
```bash
tilt up
```
### Running Services Individually
```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r services/collector/requirements.txt
# Generate protobuf code
python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/metrics.proto
# Run a service
python services/collector/main.py
```
## API Endpoints
### REST (Gateway)
| Endpoint | Description |
|----------|-------------|
| `GET /` | Dashboard UI |
| `GET /api/machines` | List all monitored machines |
| `GET /api/machines/{id}/metrics` | Current metrics for a machine |
| `GET /api/machines/{id}/history` | Historical metrics |
| `GET /health` | Health check |
| `GET /ready` | Readiness check (includes dependencies) |
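
As a rough usage example for the REST endpoints above, the following standard-library sketch builds the URL for a machine's current metrics and fetches it. The helper names `metrics_url` and `fetch_json` are hypothetical, and the assumption that the gateway responds with JSON is not confirmed by this README.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def metrics_url(machine_id, base_url="http://localhost:8000"):
    """Build the URL for GET /api/machines/{id}/metrics."""
    # Percent-encode the id so spaces or slashes cannot alter the path.
    return f"{base_url.rstrip('/')}/api/machines/{quote(machine_id, safe='')}/metrics"


def fetch_json(url, timeout=5.0):
    """Fetch a URL and parse the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = metrics_url("my-laptop")
```

Calling `fetch_json(url)` requires the stack to be running; the shape of the returned data is not documented here, so inspect the response before relying on specific fields.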
### WebSocket
Connect to `ws://localhost:8000/ws` for real-time metric updates.
## Documentation
Detailed documentation is available in the `docs/` folder:
- [Architecture Diagrams](docs/architecture/) - System overview, data flow, deployment
- [Building sysmonstm](docs/explainer/sysmonstm-from-start-to-finish.md) - Deep dive into implementation decisions
- [Domain Applications](docs/explainer/other-applications.md) - How these patterns apply to payment processing and other domains
## Tech Stack
- **Python 3.11+** with async/await throughout
- **gRPC** for inter-service communication
- **FastAPI** for REST API and WebSocket
- **Redis** for caching and pub/sub
- **TimescaleDB** for time-series storage
- **psutil** for system metrics collection
- **Docker Compose** for orchestration
## License
MIT