sysmonstm/CLAUDE.md
2025-12-29 14:40:06 -03:00

Distributed System Monitoring Platform

Project Overview

A real-time system monitoring platform that streams metrics from multiple machines to a central hub with live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

Primary Goal: Interview demonstration project for Python Microservices Engineer position
Secondary Goal: Actually useful tool for managing multi-machine development environment
Time Investment: Phased approach - MVP in a weekend, polish over 2-3 weeks

Why This Project

Interview Alignment:

  • Demonstrates gRPC-based microservices architecture (core requirement)
  • Shows streaming patterns (server-side and bidirectional)
  • Real-time data aggregation and processing
  • Alert/threshold monitoring (maps to fraud detection)
  • Event-driven patterns
  • Multiple data sources requiring normalization (maps to multiple payment processors)

Personal Utility:

  • Monitors existing multi-machine dev setup
  • Dashboard stays open, provides real value
  • Solves actual pain point
  • Will continue running post-interview

Domain Mapping for Interview:

  • Machine = Payment Processor
  • Metrics Stream = Transaction Stream
  • Resource Thresholds = Fraud/Limit Detection
  • Alert System = Risk Management
  • Aggregation Service = Payment Processing Hub

Technical Stack

Core Technologies (Must Use - From JD)

  • Python 3.11+ - Primary language
  • FastAPI - Web gateway, REST endpoints, WebSocket streaming
  • gRPC - Inter-service communication, metric streaming
  • PostgreSQL/TimescaleDB - Time-series historical data
  • Redis - Current state, caching, alert rules
  • Docker Compose - Orchestration

Supporting Technologies

  • Protocol Buffers - gRPC message definitions
  • WebSockets - Browser streaming
  • htmx + Alpine.js - Lightweight reactive frontend (avoid heavy SPA)
  • Chart.js or Apache ECharts - Real-time graphs
  • asyncio - Async patterns throughout

Development Tools

  • grpcio & grpcio-tools - Python gRPC
  • psutil - System metrics collection
  • uvicorn - FastAPI server
  • pytest - Testing
  • docker-compose - Local orchestration

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         Browser                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Dashboard (htmx + Alpine.js + WebSockets)           │  │
│  └──────────────────────────────────────────────────────┘  │
└────────────────────────┬────────────────────────────────────┘
                         │ WebSocket
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    Web Gateway Service                       │
│                    (FastAPI + WebSockets)                    │
│  - Serves dashboard                                          │
│  - Streams updates to browser                                │
│  - REST API for historical queries                           │
└────────────────────────┬────────────────────────────────────┘
                         │ gRPC
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                   Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors               │
│  - Normalizes data from different sources                    │
│  - Enriches with machine context                             │
│  - Publishes to event stream                                 │
│  - Checks alert thresholds                                   │
└─────┬───────────────────────────────────┬───────────────────┘
      │                                   │
      │ Stores                            │ Publishes events
      ▼                                   ▼
┌──────────────┐                   ┌────────────────┐
│  TimescaleDB │                   │  Event Stream  │
│  (historical)│                   │  (Redis Pub/Sub│
└──────────────┘                   │   or RabbitMQ) │
                                   └────────┬───────┘
┌──────────────┐                            │
│    Redis     │                            │ Subscribes
│  (current    │◄───────────────────────────┘
│   state)     │                            │
└──────────────┘                            ▼
                                   ┌────────────────┐
      ▲                            │ Alert Service  │
      │                            │  - Processes   │
      │                            │    events      │
      │ gRPC Streaming             │  - Triggers    │
      │                            │    actions     │
┌─────┴────────────────────────────┴────────────────┘
│
│  Multiple Collector Services (one per machine)
│  ┌───────────────────────────────────────┐
│  │  Metrics Collector (gRPC Client)      │
│  │  - Gathers system metrics (psutil)    │
│  │  - Streams to Aggregator via gRPC     │
│  │  - CPU, Memory, Disk, Network         │
│  │  - Process list                       │
│  │  - Docker container stats (optional)  │
│  └───────────────────────────────────────┘
│
└──► Machine 1, Machine 2, Machine 3, ...

Implementation Phases

Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)

Goal: Prove the gRPC streaming works end-to-end

Deliverables:

  1. Metrics Collector Service (gRPC client)

    • Collects CPU, memory, disk on localhost
    • Streams to aggregator every 5 seconds
  2. Aggregator Service (gRPC server)

    • Receives metric stream
    • Stores current state in Redis
    • Logs to console
  3. Proto definitions for metric messages

  4. Docker Compose setup

Success Criteria:

  • Run collector, see metrics flowing to aggregator
  • Redis contains current state
  • Can query Redis manually for latest metrics

Phase 2: Web Dashboard (1 week)

Goal: Make it visible and useful

Deliverables:

  1. Web Gateway Service (FastAPI)

    • WebSocket endpoint for streaming
    • REST endpoints for current/historical data
  2. Dashboard UI

    • Real-time CPU/Memory graphs per machine
    • Current state table
    • Simple, clean design
  3. WebSocket bridge (Gateway ↔ Aggregator)

  4. TimescaleDB integration

    • Store historical metrics
    • Query endpoints for time ranges

Success Criteria:

  • Open dashboard, see live graphs updating
  • Graphs show last hour of data
  • Multiple machines displayed separately
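The WebSocket fan-out in the gateway can be as small as a connection registry that broadcasts each aggregator update to every open socket. A minimal sketch, assuming a FastAPI-style socket object with `send_text` (the `ConnectionManager` name and shape are assumptions, not from the source):

```python
import asyncio

class ConnectionManager:
    """Track open WebSockets and fan each update out to all of them."""

    def __init__(self) -> None:
        self.active: set = set()

    async def connect(self, ws) -> None:
        # With FastAPI you would `await ws.accept()` before registering.
        self.active.add(ws)

    def disconnect(self, ws) -> None:
        self.active.discard(ws)

    async def broadcast(self, message: str) -> None:
        dead = []
        for ws in self.active:
            try:
                await ws.send_text(message)
            except Exception:
                dead.append(ws)   # client closed mid-send; drop it below
        for ws in dead:
            self.disconnect(ws)
```

One manager instance shared by the WebSocket endpoint and the gRPC subscriber task keeps the bridge decoupled: the subscriber never needs to know how many browsers are watching.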

Phase 3: Alerts & Intelligence (1 week)

Goal: Add decision-making layer (interview focus)

Deliverables:

  1. Alert Service

    • Subscribes to event stream
    • Evaluates threshold rules
    • Triggers notifications
  2. Configuration Service (gRPC)

    • Dynamic threshold management
    • Alert rule CRUD
    • Stored in PostgreSQL
  3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)

  4. Enhanced dashboard

    • Alert indicators
    • Alert history
    • Threshold configuration UI

Success Criteria:

  • Set CPU threshold at 80%
  • Generate load (stress-ng)
  • See alert trigger in dashboard
  • Alert logged to database
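The threshold rules at the heart of Phase 3 can be captured in a small dataclass plus one evaluation function. This is a hypothetical shape; the field names (`operator`, wildcard `machine_id`) are assumptions about how the Configuration Service might model rules:

```python
from dataclasses import dataclass

@dataclass
class ThresholdRule:
    machine_id: str        # "*" matches any machine
    metric_type: str       # e.g. "CPU_PERCENT"
    operator: str          # "gt" or "lt"
    threshold: float

    def matches(self, machine_id: str, metric_type: str) -> bool:
        return (self.machine_id in ("*", machine_id)
                and self.metric_type == metric_type)

    def is_breached(self, value: float) -> bool:
        if self.operator == "gt":
            return value > self.threshold
        return value < self.threshold

def evaluate(rules, machine_id: str, metric_type: str, value: float):
    """Return every rule breached by a single metric reading."""
    return [r for r in rules
            if r.matches(machine_id, metric_type) and r.is_breached(value)]
```

Keeping evaluation pure (no I/O) is what makes the fraud-detection parallel easy to demo: the same function works whether the input is a CPU reading or a transaction amount.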

Phase 4: Interview Polish (Final week)

Goal: Demo-ready, production patterns visible

Deliverables:

  1. Observability

    • OpenTelemetry tracing (optional)
    • Structured logging
    • Health check endpoints
  2. "Synthetic Transactions"

    • Simulate business operations through system
    • Track end-to-end latency
    • Maps directly to payment processing demo
  3. Documentation

    • Architecture diagram
    • Service interaction flows
    • Deployment guide
  4. Demo script

    • Story to walk through
    • Key talking points
    • Domain mapping explanations

Success Criteria:

  • Can deploy entire stack with one command
  • Can explain every service's role
  • Can map architecture to payment processing
  • Demo runs smoothly without hiccups
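The "synthetic transactions" latency tracking could be as simple as recording start/finish timestamps per probe and reporting percentiles. A sketch under assumed names (`LatencyTracker` is not from the source), using only the standard library:

```python
import statistics

class LatencyTracker:
    """Collect end-to-end latencies and report percentiles."""

    def __init__(self) -> None:
        self.samples_ms: list[float] = []

    def record(self, started: float, finished: float) -> None:
        # started/finished are time.perf_counter() readings in seconds.
        self.samples_ms.append((finished - started) * 1000.0)

    def percentile(self, pct: int) -> float:
        """Linear-interpolated percentile over recorded samples."""
        cuts = statistics.quantiles(self.samples_ms, n=100, method="inclusive")
        return cuts[pct - 1]
```

Quoting a p95 for a probe pushed through collector → aggregator → dashboard gives a concrete number to anchor the payment-processing analogy in the demo.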

Key Technical Patterns to Demonstrate

1. gRPC Streaming Patterns

Client-Side Streaming:

// Collector pushes metrics up to the aggregator
service IngestService {
  rpc PushMetrics(stream Metric) returns (PushAck) {}
}

Server-Side Streaming:

// Aggregator streams live metrics to subscribers such as the gateway
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}

Bidirectional Streaming:

// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}

2. Service Communication Patterns

  • Synchronous (gRPC): Query current state, configuration
  • Asynchronous (Events): Metric updates, alerts, audit logs
  • Streaming (gRPC + WebSocket): Real-time data flow

3. Data Storage Patterns

  • Hot data (Redis): Current state, recent metrics (last 5 minutes)
  • Warm data (TimescaleDB): Historical metrics (last 30 days)
  • Cold data (Optional): Archive to S3-compatible storage

4. Error Handling & Resilience

  • gRPC retry logic with exponential backoff
  • Circuit breaker pattern for service calls
  • Graceful degradation (continue if one collector fails)
  • Dead letter queue for failed events
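The retry-with-backoff item can be sketched in a few lines: capped exponential delays with full jitter, re-raising the last error once retries are exhausted. Function names are assumptions; in the real services the `except` would catch `grpc.RpcError` for retryable status codes only, not every exception.

```python
import random
import time

def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0):
    """Yield capped exponential delays with full jitter."""
    for attempt in range(retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retry(fn, retries: int = 5, base: float = 0.5):
    """Call fn, sleeping between failed attempts; re-raise the last error."""
    last_exc = None
    for delay in backoff_delays(retries, base=base):
        try:
            return fn()
        except Exception as exc:   # real code: catch grpc.RpcError only
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

Jitter matters here for the same reason it matters in payment retries: without it, every collector that lost the aggregator reconnects in lockstep and creates a thundering herd.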

Proto Definitions (Starting Point)

syntax = "proto3";

package monitoring;

service MetricsService {
  // Collector pushes metrics to the aggregator (client-side streaming)
  rpc PushMetrics(stream Metric) returns (PushAck) {}
  // Subscribers such as the gateway receive live updates (server-side streaming)
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message PushAck {
  int64 metrics_received = 1;
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}

Project Structure

system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md

Interview Talking Points

Domain Mapping to Payments

What you say:

  • "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
  • "Each machine streaming metrics is like a payment processor streaming transactions"
  • "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
  • "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
  • "The event stream for audit trails maps directly to payment audit logs"

Technical Decisions to Highlight

gRPC vs REST:

  • "I use gRPC between services for efficiency and strong typing"
  • "FastAPI gateway exposes REST/WebSocket for browser clients"
  • "This pattern is common - internal gRPC, external REST"

Streaming vs Polling:

  • "Server-side streaming reduces network overhead"
  • "Bidirectional streaming allows dynamic configuration updates"
  • "WebSocket to browser maintains single connection"

State Management:

  • "Redis for hot data - current state, needs fast access"
  • "TimescaleDB for historical analysis - optimized for time-series"
  • "This tiered storage approach scales to payment transaction volumes"

Resilience:

  • "Each collector is independent - one failing doesn't affect others"
  • "Circuit breaker prevents cascade failures"
  • "Event stream decouples alert processing from metric ingestion"

What NOT to Say

  • Don't call it a "toy project" or "learning exercise"
  • Don't apologize for running locally vs AWS
  • Don't over-explain obvious things
  • Don't claim it's production-ready when it's not

What TO Say

  • "I built this to solve a real problem I have"
  • "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
  • "I focused on the architectural patterns since those transfer directly"
  • "I'd keep developing this - it's genuinely useful"

Development Guidelines

Code Quality Standards

  • Type hints throughout (Python 3.11+ syntax)
  • Async/await patterns consistently
  • Structured logging (JSON format)
  • Error handling at all boundaries
  • Unit tests for business logic
  • Integration tests for service interactions

Docker Best Practices

  • Multi-stage builds
  • Non-root users
  • Health checks
  • Resource limits
  • Volume mounts for development

Configuration Management

  • Environment variables for all config
  • Sensible defaults
  • Config validation on startup
  • No secrets in code
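These four guidelines fit in one small standard-library sketch: read from the environment, apply sensible defaults, and fail fast on invalid values. The variable names (`REDIS_URL`, `GRPC_PORT`, `COLLECT_INTERVAL_SECONDS`) are assumptions for illustration.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    redis_url: str
    grpc_port: int
    collect_interval_seconds: int

def load_settings(env=os.environ) -> Settings:
    """Build config from environment variables, validating on startup."""
    settings = Settings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        grpc_port=int(env.get("GRPC_PORT", "50051")),
        collect_interval_seconds=int(env.get("COLLECT_INTERVAL_SECONDS", "5")),
    )
    if not 0 < settings.grpc_port < 65536:
        raise ValueError(f"GRPC_PORT out of range: {settings.grpc_port}")
    if settings.collect_interval_seconds < 1:
        raise ValueError("COLLECT_INTERVAL_SECONDS must be >= 1")
    return settings
```

Validating at startup rather than on first use means a misconfigured container dies immediately under Docker Compose, where the failure is obvious, instead of mid-demo.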

AWS Mapping (For Interview Discussion)

What you have → What it becomes:

  • PostgreSQL → Aurora PostgreSQL
  • Redis → ElastiCache
  • Docker Containers → ECS/Fargate or Lambda
  • RabbitMQ/Redis Pub/Sub → SQS/SNS
  • Docker Compose → CloudFormation/Terraform
  • Local networking → VPC, Security Groups

Key point: "The architecture and patterns are production-ready; the infrastructure is local for development convenience"

Common Pitfalls to Avoid

  1. Over-engineering Phase 1 - Resist adding features, just get streaming working
  2. Ugly UI - Don't waste time on design, htmx + basic CSS is fine
  3. Perfect metrics - Mock data is OK early on, real psutil data comes later
  4. Complete coverage - Better to have 3 services working perfectly than 10 half-done
  5. AWS deployment - Local is fine, AWS costs money and adds complexity

Success Metrics

For Yourself:

  • Actually use the dashboard daily
  • Catches a real issue before you notice
  • Runs stable for 1+ week without intervention

For Interview:

  • Can demo end-to-end in 5 minutes
  • Can explain every service interaction
  • Can map to payment domain fluently
  • Shows understanding of production patterns

Next Steps

  1. Set up project structure
  2. Define proto messages
  3. Build Phase 1 MVP
  4. Iterate based on what feels useful
  5. Polish for demo when interview approaches

Questions to Ask Yourself During Development

  • "Would I actually use this feature?"
  • "How does this map to payments?"
  • "Can I explain why I built it this way?"
  • "What would break if X service failed?"
  • "How would this scale to 1000 machines?"

Final Note

This project works because it's:

  1. Real - You'll use it
  2. Focused - Shows specific patterns they care about
  3. Mappable - Clear connection to their domain
  4. Yours - Not a tutorial copy, demonstrates your thinking

Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.

Good luck! 🚀