add readme
# sysmonstm

A real-time distributed system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard.

## Overview

sysmonstm demonstrates production microservices patterns (gRPC streaming, FastAPI, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Collector    │     │    Collector    │     │    Collector    │
│   (Machine 1)   │     │   (Machine 2)   │     │   (Machine N)   │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         │                gRPC Streaming                 │
         └───────────────────────┼───────────────────────┘
                                 ▼
                    ┌────────────────────────┐
                    │       Aggregator       │
                    │  (gRPC Server + Redis  │
                    │     + TimescaleDB)     │
                    └────────────┬───────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
     ┌────────────────┐  ┌──────────────┐  ┌────────────────┐
     │    Gateway     │  │    Alerts    │  │  Event Stream  │
     │ (FastAPI + WS) │  │   Service    │  │ (Redis PubSub) │
     └────────┬───────┘  └──────────────┘  └────────────────┘
              │
              │ WebSocket
              ▼
     ┌────────────────┐
     │    Browser     │
     │   Dashboard    │
     └────────────────┘
```

## Features

- **Real-time streaming**: Collectors stream metrics via gRPC to a central aggregator
- **Multi-machine support**: Monitor any number of machines from a single dashboard
- **Live dashboard**: WebSocket-powered updates with real-time graphs
- **Tiered storage**: Redis for hot data, TimescaleDB for historical analysis
- **Threshold alerts**: Configurable rules for CPU, memory, and disk usage
- **Event-driven**: Decoupled services via Redis Pub/Sub

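As an illustration of the threshold alerts listed above, here is a minimal sketch of rule evaluation. It is not the alerts service's actual code: `ThresholdRule`, `evaluate`, the metric names, and the limits are all hypothetical, and a sample is assumed to be a plain dict of percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: fires when `metric` rises above `limit` percent."""
    metric: str               # assumed names: "cpu", "memory", "disk"
    limit: float              # threshold in percent
    severity: str = "warning"


def evaluate(rules, sample):
    """Return the rules violated by one metrics sample (dict of percentages)."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped rather than treated as errors.
        if value is not None and value > rule.limit:
            fired.append(rule)
    return fired


rules = [
    ThresholdRule("cpu", 90.0, "critical"),
    ThresholdRule("memory", 80.0),
    ThresholdRule("disk", 85.0),
]
sample = {"cpu": 95.2, "memory": 41.0, "disk": 85.0}

for rule in evaluate(rules, sample):
    print(f"[{rule.severity}] {rule.metric} above {rule.limit}%")
```

The comparison is strict, so a value exactly at the limit does not fire; whether the real service uses `>` or `>=` is not stated in this README.
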
## Quick Start

```bash
# Start the full stack
docker compose up

# Open the dashboard (macOS; on Linux use xdg-open or visit the URL in a browser)
open http://localhost:8000
```

Metrics appear within seconds. The collector runs locally by default.

### Monitor Additional Machines

Run the collector on any machine you want to monitor:

```bash
# On a remote machine, point to your aggregator
COLLECTOR_AGGREGATOR_URL=your-server:50051 \
COLLECTOR_MACHINE_ID=my-laptop \
python services/collector/main.py
```

## Architecture

### Services

| Service | Port | Description |
|---------|------|-------------|
| **Collector** | - | gRPC client that streams system metrics (CPU, memory, disk, network) |
| **Aggregator** | 50051 | gRPC server that receives metrics, stores them, publishes events |
| **Gateway** | 8000 | FastAPI server with REST API and WebSocket for dashboard |
| **Alerts** | - | Subscribes to events, evaluates threshold rules, triggers notifications |

### Infrastructure

| Component | Purpose |
|-----------|---------|
| **Redis** | Current state cache, event pub/sub |
| **TimescaleDB** | Historical metrics with automatic downsampling |

### Key Patterns

- **gRPC Streaming**: Collectors stream metrics continuously to the aggregator
- **Event-Driven**: Services communicate via Redis Pub/Sub for decoupling
- **Tiered Storage**: Hot data in Redis, historical in TimescaleDB
- **Graceful Degradation**: System continues partially if storage fails

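The tiered storage and graceful degradation patterns above can be sketched together. This is an assumption-laden illustration, not the aggregator's implementation: `InMemoryStore` stands in for Redis and TimescaleDB, and `store_sample` is a hypothetical helper.

```python
import logging

logger = logging.getLogger("aggregator-sketch")


class InMemoryStore:
    """Stand-in for Redis or TimescaleDB; healthy=False simulates an outage."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.rows = []

    def write(self, sample):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        self.rows.append(sample)


def store_sample(sample, hot, historical):
    """Write to both tiers; a failure in one tier does not block the other."""
    written = []
    for store in (hot, historical):
        try:
            store.write(sample)
            written.append(store.name)
        except ConnectionError as exc:
            # Degrade: log and carry on so the remaining tier still gets the data.
            logger.warning("skipping %s: %s", store.name, exc)
    return written


hot = InMemoryStore("hot")
historical = InMemoryStore("historical", healthy=False)
print(store_sample({"machine_id": "my-laptop", "cpu": 12.5}, hot, historical))
```

With the historical tier down, the sample still reaches the hot tier, which is the kind of partial operation the last bullet describes.
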
## Project Structure

```
sysmonstm/
├── proto/
│   └── metrics.proto       # gRPC service definitions
├── services/
│   ├── collector/          # Metrics collection (psutil)
│   ├── aggregator/         # Central gRPC server
│   ├── gateway/            # FastAPI + WebSocket
│   └── alerts/             # Threshold evaluation
├── shared/
│   ├── config.py           # Pydantic settings
│   ├── logging.py          # Structured JSON logging
│   └── events/             # Event pub/sub abstraction
├── web/
│   ├── static/             # CSS, JS
│   └── templates/          # Dashboard HTML
├── scripts/
│   └── init-db.sql         # TimescaleDB schema
├── docs/                   # Architecture diagrams & explainers
├── docker-compose.yml
└── Tiltfile                # Local Kubernetes dev
```

## Configuration

All services use environment variables with sensible defaults:

```bash
# Collector
COLLECTOR_MACHINE_ID=my-machine          # Machine identifier
COLLECTOR_AGGREGATOR_URL=localhost:50051
COLLECTOR_COLLECTION_INTERVAL=5          # Seconds between collections

# Common
REDIS_URL=redis://localhost:6379
TIMESCALE_URL=postgresql://monitor:monitor@localhost:5432/monitor
LOG_LEVEL=INFO
LOG_FORMAT=json
```

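For readers unfamiliar with the pattern, the sketch below shows one way environment variables with fallbacks can be loaded. The project itself uses Pydantic settings in `shared/config.py`; this standard-library version is only an approximation. `CollectorSettings` and `load_collector_settings` are hypothetical names, and treating the values listed above as the defaults is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectorSettings:
    """Hypothetical settings container mirroring the COLLECTOR_* variables."""
    machine_id: str
    aggregator_url: str
    collection_interval: int  # seconds between collections


def load_collector_settings(env=None):
    """Read COLLECTOR_* variables, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return CollectorSettings(
        machine_id=env.get("COLLECTOR_MACHINE_ID", "my-machine"),
        aggregator_url=env.get("COLLECTOR_AGGREGATOR_URL", "localhost:50051"),
        collection_interval=int(env.get("COLLECTOR_COLLECTION_INTERVAL", "5")),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_collector_settings({"COLLECTOR_MACHINE_ID": "my-laptop"})
print(settings)
```
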
## Metrics Collected

- CPU: Overall percentage, per-core usage
- Memory: Percentage, used/available bytes
- Disk: Percentage, used bytes, read/write throughput
- Network: Bytes sent/received per second, connection count
- System: Process count, load averages (1m, 5m, 15m)

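Per-second figures such as network bytes sent/received are typically derived from cumulative counters (psutil's `net_io_counters()` reports totals, not rates). This README does not say how the collector computes them, so the following is only a sketch of the usual delta-over-interval approach; `bytes_per_second` is a hypothetical helper.

```python
def bytes_per_second(previous, current, interval_seconds):
    """Convert two cumulative byte counters into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    # A negative delta suggests the counter was reset (for example after a
    # reboot); report 0 instead of a negative rate.
    return max(delta, 0) / interval_seconds


# Two readings taken 5 seconds apart.
print(bytes_per_second(1_000_000, 1_050_000, 5))
```
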
## Development

### Local Development with Hot Reload

```bash
# Use the override file for volume mounts
docker compose -f docker-compose.yml -f docker-compose.override.yml up
```

### Kubernetes Development with Tilt

```bash
tilt up
```

### Running Services Individually

```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r services/collector/requirements.txt

# Generate protobuf code
python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/metrics.proto

# Run a service
python services/collector/main.py
```

## API Endpoints

### REST (Gateway)

| Endpoint | Description |
|----------|-------------|
| `GET /` | Dashboard UI |
| `GET /api/machines` | List all monitored machines |
| `GET /api/machines/{id}/metrics` | Current metrics for a machine |
| `GET /api/machines/{id}/history` | Historical metrics |
| `GET /health` | Health check |
| `GET /ready` | Readiness check (includes dependencies) |

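A small client sketch for the per-machine endpoints in the table above, using only the standard library. The base URL matches the Quick Start; `machine_url` and `get_json` are hypothetical helpers, and the assumption that the gateway returns JSON is not confirmed by this README.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # gateway address from the Quick Start


def machine_url(machine_id, resource, base_url=BASE_URL):
    """Build the URL for a machine's "metrics" or "history" endpoint."""
    if resource not in ("metrics", "history"):
        raise ValueError(f"unknown resource: {resource!r}")
    # Percent-encode the id so spaces or slashes cannot alter the path.
    return urljoin(base_url, f"/api/machines/{quote(machine_id, safe='')}/{resource}")


def get_json(url, timeout=5.0):
    """Fetch a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(machine_url("my-laptop", "metrics"))
```

With the stack running, `get_json(machine_url("my-laptop", "metrics"))` would fetch the current metrics for that machine; the shape of the returned data depends on the gateway.
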
### WebSocket

Connect to `ws://localhost:8000/ws` for real-time metric updates.

## Documentation

Detailed documentation is available in the `docs/` folder:

- [Architecture Diagrams](docs/architecture/) - System overview, data flow, deployment
- [Building sysmonstm](docs/explainer/sysmonstm-from-start-to-finish.md) - Deep dive into implementation decisions
- [Domain Applications](docs/explainer/other-applications.md) - How these patterns apply to payment processing and other domains

## Tech Stack

- **Python 3.11+** with async/await throughout
- **gRPC** for inter-service communication
- **FastAPI** for REST API and WebSocket
- **Redis** for caching and pub/sub
- **TimescaleDB** for time-series storage
- **psutil** for system metrics collection
- **Docker Compose** for orchestration

## License

MIT