sysmonstm/CLAUDE.md
2025-12-29 14:40:06 -03:00

Distributed System Monitoring Platform

Project Overview

A real-time system monitoring platform that streams metrics from multiple machines to a central hub with live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

Primary Goal: Interview demonstration project for Python Microservices Engineer position
Secondary Goal: Actually useful tool for managing multi-machine development environment
Time Investment: Phased approach - MVP in a weekend, polish over 2-3 weeks

Why This Project

Interview Alignment:

  • Demonstrates gRPC-based microservices architecture (core requirement)
  • Shows streaming patterns (server-side and bidirectional)
  • Real-time data aggregation and processing
  • Alert/threshold monitoring (maps to fraud detection)
  • Event-driven patterns
  • Multiple data sources requiring normalization (maps to multiple payment processors)

Personal Utility:

  • Monitors existing multi-machine dev setup
  • Dashboard stays open, provides real value
  • Solves actual pain point
  • Will continue running post-interview

Domain Mapping for Interview:

  • Machine = Payment Processor
  • Metrics Stream = Transaction Stream
  • Resource Thresholds = Fraud/Limit Detection
  • Alert System = Risk Management
  • Aggregation Service = Payment Processing Hub

Technical Stack

Core Technologies (Must Use - From JD)

  • Python 3.11+ - Primary language
  • FastAPI - Web gateway, REST endpoints, WebSocket streaming
  • gRPC - Inter-service communication, metric streaming
  • PostgreSQL/TimescaleDB - Time-series historical data
  • Redis - Current state, caching, alert rules
  • Docker Compose - Orchestration

Supporting Technologies

  • Protocol Buffers - gRPC message definitions
  • WebSockets - Browser streaming
  • htmx + Alpine.js - Lightweight reactive frontend (avoid heavy SPA)
  • Chart.js or Apache ECharts - Real-time graphs
  • asyncio - Async patterns throughout

Development Tools

  • grpcio & grpcio-tools - Python gRPC
  • psutil - System metrics collection
  • uvicorn - FastAPI server
  • pytest - Testing
  • docker-compose - Local orchestration

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         Browser                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Dashboard (htmx + Alpine.js + WebSockets)           │  │
│  └──────────────────────────────────────────────────────┘  │
└────────────────────────┬────────────────────────────────────┘
                         │ WebSocket
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Web Gateway Service                       │
│                    (FastAPI + WebSockets)                    │
│  - Serves dashboard                                          │
│  - Streams updates to browser                                │
│  - REST API for historical queries                           │
└────────────────────────┬────────────────────────────────────┘
                         │ gRPC
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors               │
│  - Normalizes data from different sources                    │
│  - Enriches with machine context                             │
│  - Publishes to event stream                                 │
│  - Checks alert thresholds                                   │
└─────┬───────────────────────────────────┬───────────────────┘
      │                                   │
      │ Stores                            │ Publishes events
      ▼                                   ▼
┌──────────────┐                   ┌────────────────┐
│  TimescaleDB │                   │  Event Stream  │
│  (historical)│                   │  (Redis Pub/Sub│
└──────────────┘                   │   or RabbitMQ) │
                                   └────────┬───────┘
┌──────────────┐                            │
│    Redis     │                            │ Subscribes
│  (current    │◄───────────────────────────┘
│   state)     │                            │
└──────────────┘                            ▼
                                   ┌────────────────┐
      ▲                            │ Alert Service  │
      │                            │  - Processes   │
      │                            │    events      │
      │ gRPC Streaming             │  - Triggers    │
      │                            │    actions     │
┌─────┴────────────────────────────┴────────────────┘
│
│  Multiple Collector Services (one per machine)
│  ┌───────────────────────────────────────┐
│  │  Metrics Collector (gRPC Client)      │
│  │  - Gathers system metrics (psutil)    │
│  │  - Streams to Aggregator via gRPC     │
│  │  - CPU, Memory, Disk, Network         │
│  │  - Process list                       │
│  │  - Docker container stats (optional)  │
│  └───────────────────────────────────────┘
│
└──► Machine 1, Machine 2, Machine 3, ...

Implementation Phases

Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)

Goal: Prove the gRPC streaming works end-to-end

Deliverables:

  1. Metrics Collector Service (gRPC client)

    • Collects CPU, memory, disk on localhost
    • Streams to aggregator every 5 seconds
  2. Aggregator Service (gRPC server)

    • Receives metric stream
    • Stores current state in Redis
    • Logs to console
  3. Proto definitions for metric messages

  4. Docker Compose setup

Success Criteria:

  • Run collector, see metrics flowing to aggregator
  • Redis contains current state
  • Can query Redis manually for latest metrics

Phase 2: Web Dashboard (1 week)

Goal: Make it visible and useful

Deliverables:

  1. Web Gateway Service (FastAPI)

    • WebSocket endpoint for streaming
    • REST endpoints for current/historical data
  2. Dashboard UI

    • Real-time CPU/Memory graphs per machine
    • Current state table
    • Simple, clean design
  3. WebSocket bridge (Gateway ↔ Aggregator)

  4. TimescaleDB integration

    • Store historical metrics
    • Query endpoints for time ranges

Success Criteria:

  • Open dashboard, see live graphs updating
  • Graphs show last hour of data
  • Multiple machines displayed separately
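The WebSocket fan-out in the gateway can be as small as a connection registry that broadcasts each aggregator update to every open socket. A minimal sketch, assuming a FastAPI-style socket object with `send_text` (the `ConnectionManager` name and shape are assumptions, not from the source):

```python
import asyncio

class ConnectionManager:
    """Track open WebSockets and fan each update out to all of them."""

    def __init__(self) -> None:
        self.active: set = set()

    async def connect(self, ws) -> None:
        # With FastAPI you would `await ws.accept()` before registering.
        self.active.add(ws)

    def disconnect(self, ws) -> None:
        self.active.discard(ws)

    async def broadcast(self, message: str) -> None:
        dead = []
        for ws in self.active:
            try:
                await ws.send_text(message)
            except Exception:
                dead.append(ws)   # client closed mid-send; drop it below
        for ws in dead:
            self.disconnect(ws)
```

One manager instance shared by the WebSocket endpoint and the gRPC subscriber task keeps the bridge decoupled: the subscriber never needs to know how many browsers are watching.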

Phase 3: Alerts & Intelligence (1 week)

Goal: Add decision-making layer (interview focus)

Deliverables:

  1. Alert Service

    • Subscribes to event stream
    • Evaluates threshold rules
    • Triggers notifications
  2. Configuration Service (gRPC)

    • Dynamic threshold management
    • Alert rule CRUD
    • Stored in PostgreSQL
  3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)

  4. Enhanced dashboard

    • Alert indicators
    • Alert history
    • Threshold configuration UI

Success Criteria:

  • Set CPU threshold at 80%
  • Generate load (stress-ng)
  • See alert trigger in dashboard
  • Alert logged to database
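The threshold rules at the heart of Phase 3 can be captured in a small dataclass plus one evaluation function. This is a hypothetical shape; the field names (`operator`, wildcard `machine_id`) are assumptions about how the Configuration Service might model rules:

```python
from dataclasses import dataclass

@dataclass
class ThresholdRule:
    machine_id: str        # "*" matches any machine
    metric_type: str       # e.g. "CPU_PERCENT"
    operator: str          # "gt" or "lt"
    threshold: float

    def matches(self, machine_id: str, metric_type: str) -> bool:
        return (self.machine_id in ("*", machine_id)
                and self.metric_type == metric_type)

    def is_breached(self, value: float) -> bool:
        if self.operator == "gt":
            return value > self.threshold
        return value < self.threshold

def evaluate(rules, machine_id: str, metric_type: str, value: float):
    """Return every rule breached by a single metric reading."""
    return [r for r in rules
            if r.matches(machine_id, metric_type) and r.is_breached(value)]
```

Keeping evaluation pure (no I/O) is what makes the fraud-detection parallel easy to demo: the same function works whether the input is a CPU reading or a transaction amount.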

Phase 4: Interview Polish (Final week)

Goal: Demo-ready, production patterns visible

Deliverables:

  1. Observability

    • OpenTelemetry tracing (optional)
    • Structured logging
    • Health check endpoints
  2. "Synthetic Transactions"

    • Simulate business operations through system
    • Track end-to-end latency
    • Maps directly to payment processing demo
  3. Documentation

    • Architecture diagram
    • Service interaction flows
    • Deployment guide
  4. Demo script

    • Story to walk through
    • Key talking points
    • Domain mapping explanations

Success Criteria:

  • Can deploy entire stack with one command
  • Can explain every service's role
  • Can map architecture to payment processing
  • Demo runs smoothly without hiccups
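The "synthetic transactions" latency tracking could be as simple as recording start/finish timestamps per probe and reporting percentiles. A sketch under assumed names (`LatencyTracker` is not from the source), using only the standard library:

```python
import statistics

class LatencyTracker:
    """Collect end-to-end latencies and report percentiles."""

    def __init__(self) -> None:
        self.samples_ms: list[float] = []

    def record(self, started: float, finished: float) -> None:
        # started/finished are time.perf_counter() readings in seconds.
        self.samples_ms.append((finished - started) * 1000.0)

    def percentile(self, pct: int) -> float:
        """Linear-interpolated percentile over recorded samples."""
        cuts = statistics.quantiles(self.samples_ms, n=100, method="inclusive")
        return cuts[pct - 1]
```

Quoting a p95 for a probe pushed through collector → aggregator → dashboard gives a concrete number to anchor the payment-processing analogy in the demo.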

Key Technical Patterns to Demonstrate

1. gRPC Streaming Patterns

Client-Side Streaming:

// Collector pushes metrics up to the aggregator
service IngestService {
  rpc PushMetrics(stream Metric) returns (PushAck) {}
}

Server-Side Streaming:

// Aggregator streams live metrics to subscribers such as the gateway
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}

Bidirectional Streaming:

// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}

2. Service Communication Patterns

  • Synchronous (gRPC): Query current state, configuration
  • Asynchronous (Events): Metric updates, alerts, audit logs
  • Streaming (gRPC + WebSocket): Real-time data flow

3. Data Storage Patterns

  • Hot data (Redis): Current state, recent metrics (last 5 minutes)
  • Warm data (TimescaleDB): Historical metrics (last 30 days)
  • Cold data (Optional): Archive to S3-compatible storage

4. Error Handling & Resilience

  • gRPC retry logic with exponential backoff
  • Circuit breaker pattern for service calls
  • Graceful degradation (continue if one collector fails)
  • Dead letter queue for failed events
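The retry-with-backoff item can be sketched in a few lines: capped exponential delays with full jitter, re-raising the last error once retries are exhausted. Function names are assumptions; in the real services the `except` would catch `grpc.RpcError` for retryable status codes only, not every exception.

```python
import random
import time

def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Yield capped exponential delays with full jitter."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retry(fn, retries: int = 5, base: float = 0.5):
    """Call fn, sleeping between failed attempts; re-raise the last error."""
    last_exc = None
    for delay in backoff_delays(retries, base=base):
        try:
            return fn()
        except Exception as exc:   # real code: catch grpc.RpcError only
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

Jitter matters here for the same reason it matters in payment retries: without it, every collector that lost the aggregator reconnects in lockstep and creates a thundering herd.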

Proto Definitions (Starting Point)

syntax = "proto3";

package monitoring;

service MetricsService {
  // Collector pushes metrics to the aggregator (client-side streaming)
  rpc PushMetrics(stream Metric) returns (PushAck) {}
  // Subscribers such as the gateway receive live updates (server-side streaming)
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message PushAck {
  int64 metrics_received = 1;
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}

Project Structure

system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md

Interview Talking Points

Domain Mapping to Payments

What you say:

  • "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
  • "Each machine streaming metrics is like a payment processor streaming transactions"
  • "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
  • "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
  • "The event stream for audit trails maps directly to payment audit logs"

Technical Decisions to Highlight

gRPC vs REST:

  • "I use gRPC between services for efficiency and strong typing"
  • "FastAPI gateway exposes REST/WebSocket for browser clients"
  • "This pattern is common - internal gRPC, external REST"

Streaming vs Polling:

  • "Server-side streaming reduces network overhead"
  • "Bidirectional streaming allows dynamic configuration updates"
  • "WebSocket to browser maintains single connection"

State Management:

  • "Redis for hot data - current state, needs fast access"
  • "TimescaleDB for historical analysis - optimized for time-series"
  • "This tiered storage approach scales to payment transaction volumes"

Resilience:

  • "Each collector is independent - one failing doesn't affect others"
  • "Circuit breaker prevents cascade failures"
  • "Event stream decouples alert processing from metric ingestion"

What NOT to Say

  • Don't call it a "toy project" or "learning exercise"
  • Don't apologize for running locally vs AWS
  • Don't over-explain obvious things
  • Don't claim it's production-ready when it's not

What TO Say

  • "I built this to solve a real problem I have"
  • "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
  • "I focused on the architectural patterns since those transfer directly"
  • "I'd keep developing this - it's genuinely useful"

Development Guidelines

Code Quality Standards

  • Type hints throughout (Python 3.11+ syntax)
  • Async/await patterns consistently
  • Structured logging (JSON format)
  • Error handling at all boundaries
  • Unit tests for business logic
  • Integration tests for service interactions

Docker Best Practices

  • Multi-stage builds
  • Non-root users
  • Health checks
  • Resource limits
  • Volume mounts for development

Configuration Management

  • Environment variables for all config
  • Sensible defaults
  • Config validation on startup
  • No secrets in code
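These four guidelines fit in one small standard-library sketch: read from the environment, apply sensible defaults, and fail fast on invalid values. The variable names (`REDIS_URL`, `GRPC_PORT`, `COLLECT_INTERVAL_SECONDS`) are assumptions for illustration.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    redis_url: str
    grpc_port: int
    collect_interval_seconds: int

def load_settings(env=os.environ) -> Settings:
    """Build config from environment variables, validating on startup."""
    settings = Settings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        grpc_port=int(env.get("GRPC_PORT", "50051")),
        collect_interval_seconds=int(env.get("COLLECT_INTERVAL_SECONDS", "5")),
    )
    if not 0 < settings.grpc_port < 65536:
        raise ValueError(f"GRPC_PORT out of range: {settings.grpc_port}")
    if settings.collect_interval_seconds < 1:
        raise ValueError("COLLECT_INTERVAL_SECONDS must be >= 1")
    return settings
```

Validating at startup rather than on first use means a misconfigured container dies immediately under Docker Compose, where the failure is obvious, instead of mid-demo.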

AWS Mapping (For Interview Discussion)

What you have → What it becomes:

  • PostgreSQL → Aurora PostgreSQL
  • Redis → ElastiCache
  • Docker Containers → ECS/Fargate or Lambda
  • RabbitMQ/Redis Pub/Sub → SQS/SNS
  • Docker Compose → CloudFormation/Terraform
  • Local networking → VPC, Security Groups

Key point: "The architecture and patterns are production-ready; the infrastructure is local for development convenience"

Common Pitfalls to Avoid

  1. Over-engineering Phase 1 - Resist adding features, just get streaming working
  2. Ugly UI - Don't waste time on design, htmx + basic CSS is fine
  3. Perfect metrics - Mock data is OK early on, real psutil data comes later
  4. Complete coverage - Better to have 3 services working perfectly than 10 half-done
  5. AWS deployment - Local is fine, AWS costs money and adds complexity

Success Metrics

For Yourself:

  • Actually use the dashboard daily
  • Catches a real issue before you notice
  • Runs stable for 1+ week without intervention

For Interview:

  • Can demo end-to-end in 5 minutes
  • Can explain every service interaction
  • Can map to payment domain fluently
  • Shows understanding of production patterns

Next Steps

  1. Set up project structure
  2. Define proto messages
  3. Build Phase 1 MVP
  4. Iterate based on what feels useful
  5. Polish for demo when interview approaches

Questions to Ask Yourself During Development

  • "Would I actually use this feature?"
  • "How does this map to payments?"
  • "Can I explain why I built it this way?"
  • "What would break if X service failed?"
  • "How would this scale to 1000 machines?"

Final Note

This project works because it's:

  1. Real - You'll use it
  2. Focused - Shows specific patterns they care about
  3. Mappable - Clear connection to their domain
  4. Yours - Not a tutorial copy, demonstrates your thinking

Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.

Good luck! 🚀