System Monitoring Platform

Documentation

System Overview

High-level architecture showing all services, data stores, and communication patterns.

Key Components

  • Collector: Runs on each monitored machine, streams metrics via gRPC
  • Aggregator: Central gRPC server, receives streams, normalizes data
  • Gateway: FastAPI service, WebSocket for browser, REST for queries
  • Alerts: Subscribes to events, evaluates thresholds, triggers actions
  • Edge: Lightweight WebSocket relay on AWS, serves public dashboard at sysmonstm.mcrn.ar
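
As a rough illustration of the "evaluates thresholds" step in the Alerts component, the sketch below checks one metrics sample against a list of rules. The `Threshold` class, the `evaluate` function, and the metric names are hypothetical and chosen only for this example; the source does not describe how the Alerts service represents its rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert rule: fire when `metric` rises above `limit`."""
    metric: str
    limit: float


def evaluate(sample: dict[str, float], rules: list[Threshold]) -> list[str]:
    """Return one message per rule whose metric is present and above its limit."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.limit:
            alerts.append(f"{rule.metric}={value} exceeds {rule.limit}")
    return alerts


# Assumed metric names, for illustration only.
rules = [Threshold("cpu_percent", 90.0), Threshold("mem_percent", 85.0)]
alerts = evaluate({"cpu_percent": 97.5, "mem_percent": 40.0}, rules)
```

Only the CPU rule fires here, since the memory reading is below its limit. A real service would presumably add hold-down periods and per-machine state, which this sketch leaves out.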

Data Flow Pipeline

How metrics flow from collection through storage with different retention tiers.

Storage Tiers

Tier                Resolution  Retention  Use Case
Hot (Redis)         5s          5 min      Current state, live dashboard
Raw (TimescaleDB)   5s          24h        Recent detailed analysis
1-min Aggregates    1m          7d         Week view, trends
1-hour Aggregates   1h          90d        Long-term analysis
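
One way to read the tiers is as a lookup: a query asks for some lookback window and is served from the finest-resolution tier that still retains that window. The sketch below encodes the resolution and retention values listed above. The tier identifiers and the `pick_tier` function are hypothetical, and whether the gateway actually routes queries this way is an assumption.

```python
from datetime import timedelta

# (tier id, resolution, retention), finest resolution first.
# Values come from the Storage Tiers list; the ids are made up for this sketch.
TIERS = [
    ("hot", timedelta(seconds=5), timedelta(minutes=5)),
    ("raw", timedelta(seconds=5), timedelta(hours=24)),
    ("agg_1m", timedelta(minutes=1), timedelta(days=7)),
    ("agg_1h", timedelta(hours=1), timedelta(days=90)),
]


def pick_tier(lookback: timedelta) -> str:
    """Return the finest-resolution tier whose retention covers `lookback`."""
    for name, _resolution, retention in TIERS:
        if lookback <= retention:
            return name
    raise ValueError(f"no tier retains data for {lookback}")


live_view = pick_tier(timedelta(minutes=2))
week_view = pick_tier(timedelta(days=3))
```

With these values a two-minute window resolves to the hot tier and a three-day window to the 1-minute aggregates. Anything older than 90 days raises an error because no tier retains it.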

Deployment Architecture

Deployment options from local development to AWS production.

Environments

  • Local: Docker Compose with aggregator, gateway, Redis, TimescaleDB, alerts
  • Edge (AWS): Lightweight WebSocket relay at sysmonstm.mcrn.ar, receives forwarded metrics from local gateway
  • Collectors: Run on remote machines, stream to local aggregator via gRPC

gRPC Service Definitions

Protocol Buffer service and message definitions.

Services

  • MetricsService: Client-side streaming for metrics ingestion
  • ControlService: Bidirectional streaming for collector control
  • ConfigService: Server-side streaming for config updates
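
The difference between the streaming shapes can be sketched with plain asyncio generators, without the gRPC library. This is an analogy for the call shapes only, not the project's actual service code: every function name and message field below is hypothetical, and the real definitions live in the project's Protocol Buffer files.

```python
import asyncio
from collections.abc import AsyncIterator


async def collector_stream() -> AsyncIterator[dict]:
    """Stand-in for a collector: yields metric messages one at a time."""
    for cpu in (10.0, 20.0, 30.0):
        yield {"machine": "host-a", "cpu_percent": cpu}
        await asyncio.sleep(0)  # hand control back to the event loop


async def stream_metrics(requests: AsyncIterator[dict]) -> dict:
    """Client-streaming shape: many requests in, one summary response out."""
    received = 0
    async for _message in requests:
        received += 1
    return {"received": received}


async def watch_config(versions: list[int]) -> AsyncIterator[dict]:
    """Server-streaming shape: one request in, many responses out."""
    for version in versions:
        yield {"config_version": version}


async def collect(stream: AsyncIterator[dict]) -> list[dict]:
    """Drain a response stream into a list so it can be inspected."""
    return [item async for item in stream]


summary = asyncio.run(stream_metrics(collector_stream()))
updates = asyncio.run(collect(watch_config([1, 2])))
```

A bidirectional call combines the two shapes: both sides hold an open stream and read and write independently.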

Key Design Decisions

Domain Mapping

  • Machine = Payment Processor
  • Metrics Stream = Transaction Stream
  • Thresholds = Fraud Detection
  • Aggregator = Payment Hub

gRPC Patterns

  • Client streaming (metrics)
  • Server streaming (config)
  • Bidirectional (control)
  • Health checking

Event-Driven

  • Redis Pub/Sub (current)
  • Abstraction layer that allows a later switch to Kafka
  • Decoupled alert processing
  • Real-time WebSocket push
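
A minimal sketch of what a broker abstraction could look like, assuming a simple publish/subscribe interface. The `EventBus` protocol, the `InMemoryBus` class, and the topic name are hypothetical. An in-memory backend stands in for Redis Pub/Sub so the example runs without a broker.

```python
from collections import defaultdict
from typing import Callable, Protocol

Handler = Callable[[dict], None]


class EventBus(Protocol):
    """Hypothetical interface that a Redis Pub/Sub or Kafka backend could both satisfy."""

    def publish(self, topic: str, event: dict) -> None: ...

    def subscribe(self, topic: str, handler: Handler) -> None: ...


class InMemoryBus:
    """In-process stand-in for a real broker; delivers events synchronously."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers[topic]:
            handler(event)


def wire_alerts(bus: EventBus, sink: list[dict]) -> None:
    """Alert-side code depends only on the interface, not on the backend."""
    bus.subscribe("metrics", sink.append)


bus = InMemoryBus()
received: list[dict] = []
wire_alerts(bus, received)
bus.publish("metrics", {"machine": "host-a", "cpu_percent": 97.5})
bus.publish("other", {"ignored": True})  # no subscriber, so nothing is delivered
```

Because `wire_alerts` only sees `EventBus`, swapping the backend would not require changes to the alert-processing code, which is the decoupling the list above describes.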

Resilience

  • Collectors are independent
  • Graceful degradation
  • Retry with backoff
  • Health checks on every service
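
Retry with backoff can be sketched as below, assuming exponential backoff with a cap and `ConnectionError` as the retryable failure. The function name, the default delays, and the choice of exception are assumptions for illustration; the source does not state the backoff policy the collectors use.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each ConnectionError up to `max_delay`."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Demo: a hypothetical connection that fails twice, then succeeds.
calls = {"count": 0}


def flaky_connect() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("aggregator unreachable")
    return "connected"


delays: list[float] = []
result = retry_with_backoff(flaky_connect, sleep=delays.append)
```

The `sleep` parameter is injectable so the demo records the delays instead of actually waiting. Production code would usually add random jitter so that many collectors do not reconnect at the same instant.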

Technology Stack

Core

  • Python 3.11+
  • FastAPI
  • gRPC / protobuf
  • asyncio

Data

  • TimescaleDB
  • Redis
  • Redis Pub/Sub

Infrastructure

  • Docker
  • Docker Compose

CI/CD

  • Woodpecker CI
  • Container Registry