# sysmonstm

A real-time distributed system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard.

## Overview

sysmonstm demonstrates production microservices patterns (gRPC streaming, FastAPI, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│    Collector    │   │    Collector    │   │    Collector    │
│   (Machine 1)   │   │   (Machine 2)   │   │   (Machine N)   │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         │  gRPC Streaming     │                     │
         └─────────────────────┼─────────────────────┘
                               ▼
                  ┌────────────────────────┐
                  │       Aggregator       │
                  │  (gRPC Server + Redis  │
                  │     + TimescaleDB)     │
                  └────────────┬───────────┘
                               │
             ┌─────────────────┼─────────────────┐
             │                 │                 │
             ▼                 ▼                 ▼
    ┌────────────────┐  ┌──────────────┐  ┌───────────────┐
    │    Gateway     │  │    Alerts    │  │ Event Stream  │
    │ (FastAPI + WS) │  │   Service    │  │ (Redis PubSub)│
    └────────┬───────┘  └──────────────┘  └───────────────┘
             │
             │ WebSocket
             ▼
    ┌────────────────┐
    │    Browser     │
    │   Dashboard    │
    └────────────────┘
```

## Features

- **Real-time streaming**: Collectors stream metrics via gRPC to a central aggregator
- **Multi-machine support**: Monitor any number of machines from a single dashboard
- **Live dashboard**: WebSocket-powered updates with real-time graphs
- **Tiered storage**: Redis for hot data, TimescaleDB for historical analysis
- **Threshold alerts**: Configurable rules for CPU, memory, and disk usage
- **Event-driven**: Decoupled services via Redis Pub/Sub

## Quick Start

```bash
# Start the full stack
docker compose up

# Open dashboard
open http://localhost:8000
```

Metrics appear within seconds. The collector runs locally by default.
### Monitor Additional Machines

Run the collector on any machine you want to monitor:

```bash
# On a remote machine, point to your aggregator
COLLECTOR_AGGREGATOR_URL=your-server:50051 \
COLLECTOR_MACHINE_ID=my-laptop \
python services/collector/main.py
```

## Architecture

### Services

| Service | Port | Description |
|---------|------|-------------|
| **Collector** | - | gRPC client that streams system metrics (CPU, memory, disk, network) |
| **Aggregator** | 50051 | gRPC server that receives metrics, stores them, publishes events |
| **Gateway** | 8000 | FastAPI server with REST API and WebSocket for dashboard |
| **Alerts** | - | Subscribes to events, evaluates threshold rules, triggers notifications |

### Infrastructure

| Component | Purpose |
|-----------|---------|
| **Redis** | Current state cache, event pub/sub |
| **TimescaleDB** | Historical metrics with automatic downsampling |

### Key Patterns

- **gRPC Streaming**: Collectors stream metrics continuously to the aggregator
- **Event-Driven**: Services communicate via Redis Pub/Sub for decoupling
- **Tiered Storage**: Hot data in Redis, historical in TimescaleDB
- **Graceful Degradation**: System continues partially if storage fails

## Project Structure

```
sysmonstm/
├── proto/
│   └── metrics.proto       # gRPC service definitions
├── services/
│   ├── collector/          # Metrics collection (psutil)
│   ├── aggregator/         # Central gRPC server
│   ├── gateway/            # FastAPI + WebSocket
│   └── alerts/             # Threshold evaluation
├── shared/
│   ├── config.py           # Pydantic settings
│   ├── logging.py          # Structured JSON logging
│   └── events/             # Event pub/sub abstraction
├── web/
│   ├── static/             # CSS, JS
│   └── templates/          # Dashboard HTML
├── scripts/
│   └── init-db.sql         # TimescaleDB schema
├── docs/                   # Architecture diagrams & explainers
├── docker-compose.yml
└── Tiltfile                # Local Kubernetes dev
```

## Configuration

All services use environment variables with sensible defaults:

```bash
# Collector
COLLECTOR_MACHINE_ID=my-machine                 # Machine identifier
COLLECTOR_AGGREGATOR_URL=localhost:50051
COLLECTOR_COLLECTION_INTERVAL=5                 # Seconds between collections

# Common
REDIS_URL=redis://localhost:6379
TIMESCALE_URL=postgresql://monitor:monitor@localhost:5432/monitor
LOG_LEVEL=INFO
LOG_FORMAT=json
```

## Metrics Collected

- CPU: Overall percentage, per-core usage
- Memory: Percentage, used/available bytes
- Disk: Percentage, used bytes, read/write throughput
- Network: Bytes sent/received per second, connection count
- System: Process count, load averages (1m, 5m, 15m)

## Development

### Local Development with Hot Reload

```bash
# Use the override file for volume mounts
docker compose -f docker-compose.yml -f docker-compose.override.yml up
```

### Kubernetes Development with Tilt

```bash
tilt up
```

### Running Services Individually

```bash
# Install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r services/collector/requirements.txt

# Generate protobuf code
python -m grpc_tools.protoc -I proto --python_out=. --grpc_python_out=. proto/metrics.proto

# Run a service
python services/collector/main.py
```

## API Endpoints

### REST (Gateway)

| Endpoint | Description |
|----------|-------------|
| `GET /` | Dashboard UI |
| `GET /api/machines` | List all monitored machines |
| `GET /api/machines/{id}/metrics` | Current metrics for a machine |
| `GET /api/machines/{id}/history` | Historical metrics |
| `GET /health` | Health check |
| `GET /ready` | Readiness check (includes dependencies) |

### WebSocket

Connect to `ws://localhost:8000/ws` for real-time metric updates.
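A dashboard client consuming that WebSocket mainly needs to merge each JSON update into per-machine state. The message schema below is hypothetical (the README doesn't document the gateway's actual payload format), but the dispatch pattern is the same either way — a minimal sketch:

```python
import json


def handle_update(raw, state):
    """Merge one WebSocket metric update into per-machine dashboard state.

    `raw` is a JSON string; the "machine_id"/"metrics" field names are
    assumptions — check the gateway's actual message schema.
    """
    msg = json.loads(raw)
    machine = msg["machine_id"]
    # Merge, so partial updates don't wipe previously seen metrics.
    state.setdefault(machine, {}).update(msg["metrics"])
    return state


state = {}
handle_update('{"machine_id": "web-1", "metrics": {"cpu_percent": 42.5}}', state)
handle_update('{"machine_id": "web-1", "metrics": {"mem_percent": 61.0}}', state)
print(state)
# → {'web-1': {'cpu_percent': 42.5, 'mem_percent': 61.0}}
```

The same merge logic works whether the frames arrive via `websockets`, a browser `WebSocket`, or a test harness feeding strings directly.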
## Documentation

Detailed documentation is available in the `docs/` folder:

- [Architecture Diagrams](docs/architecture/) - System overview, data flow, deployment
- [Building sysmonstm](docs/explainer/sysmonstm-from-start-to-finish.md) - Deep dive into implementation decisions
- [Domain Applications](docs/explainer/other-applications.md) - How these patterns apply to payment processing and other domains

## Tech Stack

- **Python 3.11+** with async/await throughout
- **gRPC** for inter-service communication
- **FastAPI** for REST API and WebSocket
- **Redis** for caching and pub/sub
- **TimescaleDB** for time-series storage
- **psutil** for system metrics collection
- **Docker Compose** for orchestration

## License

MIT