Distributed System Monitoring Platform
Project Overview
A real-time system monitoring platform that streams metrics from multiple machines to a central hub with live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.
Primary Goal: Interview demonstration project for a Python Microservices Engineer position
Secondary Goal: Actually useful tool for managing multi-machine development environment
Time Investment: Phased approach - MVP in a weekend, polish over 2-3 weeks
Why This Project
Interview Alignment:
- Demonstrates gRPC-based microservices architecture (core requirement)
- Shows streaming patterns (server-side and bidirectional)
- Real-time data aggregation and processing
- Alert/threshold monitoring (maps to fraud detection)
- Event-driven patterns
- Multiple data sources requiring normalization (maps to multiple payment processors)
Personal Utility:
- Monitors existing multi-machine dev setup
- Dashboard stays open, provides real value
- Solves actual pain point
- Will continue running post-interview
Domain Mapping for Interview:
- Machine = Payment Processor
- Metrics Stream = Transaction Stream
- Resource Thresholds = Fraud/Limit Detection
- Alert System = Risk Management
- Aggregation Service = Payment Processing Hub
Technical Stack
Core Technologies (Must Use - From JD)
- Python 3.11+ - Primary language
- FastAPI - Web gateway, REST endpoints, WebSocket streaming
- gRPC - Inter-service communication, metric streaming
- PostgreSQL/TimescaleDB - Time-series historical data
- Redis - Current state, caching, alert rules
- Docker Compose - Orchestration
Supporting Technologies
- Protocol Buffers - gRPC message definitions
- WebSockets - Browser streaming
- htmx + Alpine.js - Lightweight reactive frontend (avoid heavy SPA)
- Chart.js or Apache ECharts - Real-time graphs
- asyncio - Async patterns throughout
Development Tools
- grpcio & grpcio-tools - Python gRPC
- psutil - System metrics collection
- uvicorn - FastAPI server
- pytest - Testing
- docker-compose - Local orchestration
Architecture
┌─────────────────────────────────────────────────────────┐
│                         Browser                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │    Dashboard (htmx + Alpine.js + WebSockets)     │   │
│  └──────────────────────────────────────────────────┘   │
└────────────────────────────┬────────────────────────────┘
                             │ WebSocket
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   Web Gateway Service                   │
│                 (FastAPI + WebSockets)                  │
│  - Serves dashboard                                     │
│  - Streams updates to browser                           │
│  - REST API for historical queries                      │
└────────────────────────────┬────────────────────────────┘
                             │ gRPC
                             ▼
┌─────────────────────────────────────────────────────────┐
│                Aggregator Service (gRPC)                │
│  - Receives metric streams from all collectors          │
│  - Normalizes data from different sources               │
│  - Enriches with machine context                        │
│  - Publishes to event stream                            │
│  - Checks alert thresholds                              │
└───────┬─────────────────────────────────┬───────────────┘
        │ Stores                          │ Publishes events
        ▼                                 ▼
┌──────────────┐                 ┌────────────────┐
│ TimescaleDB  │                 │  Event Stream  │
│ (historical) │                 │ (Redis Pub/Sub │
└──────────────┘                 │  or RabbitMQ)  │
                                 └───┬────────┬───┘
┌──────────────┐                     │        │ Subscribes
│    Redis     │◄────────────────────┘        ▼
│  (current    │                     ┌────────────────┐
│   state)     │                     │ Alert Service  │
└──────────────┘                     │ - Processes    │
        ▲                            │   events       │
        │ gRPC Streaming             │ - Triggers     │
        │ (to Aggregator)            │   actions      │
        │                            └────────────────┘
        │
        │  Multiple Collector Services (one per machine)
        │  ┌───────────────────────────────────────┐
        ├──┤ Metrics Collector (gRPC Client)       │
        │  │ - Gathers system metrics (psutil)     │
        │  │ - Streams to Aggregator via gRPC      │
        │  │ - CPU, Memory, Disk, Network          │
        │  │ - Process list                        │
        │  │ - Docker container stats (optional)   │
        │  └───────────────────────────────────────┘
        │
        └──► Machine 1, Machine 2, Machine 3, ...
Implementation Phases
Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)
Goal: Prove the gRPC streaming works end-to-end
Deliverables:
- Metrics Collector Service (gRPC client)
- Collects CPU, memory, disk on localhost
- Streams to aggregator every 5 seconds
- Aggregator Service (gRPC server)
- Receives metric stream
- Stores current state in Redis
- Logs to console
- Proto definitions for metric messages
- Docker Compose setup
Success Criteria:
- Run collector, see metrics flowing to aggregator
- Redis contains current state
- Can query Redis manually for latest metrics
Phase 2: Web Dashboard (1 week)
Goal: Make it visible and useful
Deliverables:
- Web Gateway Service (FastAPI)
- WebSocket endpoint for streaming
- REST endpoints for current/historical data
- Dashboard UI
- Real-time CPU/Memory graphs per machine
- Current state table
- Simple, clean design
- WebSocket bridge (Gateway ↔ Aggregator)
- TimescaleDB integration
- Store historical metrics
- Query endpoints for time ranges
Success Criteria:
- Open dashboard, see live graphs updating
- Graphs show last hour of data
- Multiple machines displayed separately
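The gateway's WebSocket fan-out is the one tricky piece of Phase 2: a single aggregator stream must feed many browser connections. A framework-free sketch of that hub using per-client asyncio queues (the `BroadcastHub` name and drop-slow-clients policy are illustrative choices, not a FastAPI API; each FastAPI WebSocket handler would call `subscribe()` and forward items to its socket):

```python
import asyncio

class BroadcastHub:
    """Fan one metric stream out to every connected dashboard client."""

    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=100)
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    async def publish(self, update: dict) -> None:
        for q in list(self._subscribers):
            try:
                q.put_nowait(update)
            except asyncio.QueueFull:
                # Slow client: drop it rather than stall the whole stream.
                self.unsubscribe(q)

async def demo() -> list[dict]:
    hub = BroadcastHub()
    client = hub.subscribe()
    await hub.publish({"machine_id": "machine-1", "cpu": 17.5})
    return [await client.get()]

received = asyncio.run(demo())
print(received)
```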
Phase 3: Alerts & Intelligence (1 week)
Goal: Add decision-making layer (interview focus)
Deliverables:
- Alert Service
- Subscribes to event stream
- Evaluates threshold rules
- Triggers notifications
- Configuration Service (gRPC)
- Dynamic threshold management
- Alert rule CRUD
- Stored in PostgreSQL
- Event Stream implementation (Redis Pub/Sub or RabbitMQ)
- Enhanced dashboard
- Alert indicators
- Alert history
- Threshold configuration UI
Success Criteria:
- Set CPU threshold at 80%
- Generate load (stress-ng)
- See alert trigger in dashboard
- Alert logged to database
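The heart of the Alert Service is a pure function that checks one event against one rule, which makes it easy to unit-test before any event-stream wiring exists. A minimal sketch (the `AlertRule` fields and the `"*"` wildcard convention are assumptions, not a fixed schema):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    machine_id: str            # "*" matches any machine
    metric_type: str
    threshold: float
    direction: str = "above"   # "above" or "below"

def evaluate(rule: AlertRule, event: dict) -> bool:
    """Return True if a metric event breaches the rule."""
    if rule.machine_id not in ("*", event["machine_id"]):
        return False
    if rule.metric_type != event["type"]:
        return False
    value = event["value"]
    return value > rule.threshold if rule.direction == "above" else value < rule.threshold

cpu_rule = AlertRule(machine_id="*", metric_type="CPU_PERCENT", threshold=80.0)
hot = {"machine_id": "machine-2", "type": "CPU_PERCENT", "value": 93.0}
cool = {"machine_id": "machine-2", "type": "CPU_PERCENT", "value": 41.0}
print(evaluate(cpu_rule, hot), evaluate(cpu_rule, cool))  # → True False
```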
Phase 4: Interview Polish (Final week)
Goal: Demo-ready, production patterns visible
Deliverables:
- Observability
- OpenTelemetry tracing (optional)
- Structured logging
- Health check endpoints
- "Synthetic Transactions"
- Simulate business operations through system
- Track end-to-end latency
- Maps directly to payment processing demo
- Documentation
- Architecture diagram
- Service interaction flows
- Deployment guide
- Demo script
- Story to walk through
- Key talking points
- Domain mapping explanations
Success Criteria:
- Can deploy entire stack with one command
- Can explain every service's role
- Can map architecture to payment processing
- Demo runs smoothly without hiccups
Key Technical Patterns to Demonstrate
1. gRPC Streaming Patterns
Server-Side Streaming:
// Collector streams metrics to aggregator
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
Bidirectional Streaming:
// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
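Stripped of the gRPC machinery, the bidirectional pattern is an async generator that consumes an inbound stream and yields responses as they are ready. A framework-free sketch (the `set_interval` command is a hypothetical example; with `grpc.aio`, a similar async-generator body would back `ManageStream`):

```python
import asyncio
from typing import AsyncIterator

async def manage_stream(commands: AsyncIterator[dict]) -> AsyncIterator[dict]:
    """Consume commands as they arrive, emit one response per command."""
    interval = 5  # mutable collector state the stream can reconfigure
    async for cmd in commands:
        if cmd.get("action") == "set_interval":
            interval = cmd["seconds"]
            yield {"ok": True, "interval": interval}
        else:
            yield {"ok": False, "error": f"unknown action {cmd.get('action')!r}"}

async def demo() -> list[dict]:
    async def inbound():
        yield {"action": "set_interval", "seconds": 10}
        yield {"action": "reboot"}
    return [resp async for resp in manage_stream(inbound())]

responses = asyncio.run(demo())
print(responses)
```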
2. Service Communication Patterns
- Synchronous (gRPC): Query current state, configuration
- Asynchronous (Events): Metric updates, alerts, audit logs
- Streaming (gRPC + WebSocket): Real-time data flow
3. Data Storage Patterns
- Hot data (Redis): Current state, recent metrics (last 5 minutes)
- Warm data (TimescaleDB): Historical metrics (last 30 days)
- Cold data (Optional): Archive to S3-compatible storage
4. Error Handling & Resilience
- gRPC retry logic with exponential backoff
- Circuit breaker pattern for service calls
- Graceful degradation (continue if one collector fails)
- Dead letter queue for failed events
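For the retry logic, capped exponential backoff with full jitter is a reasonable concrete choice: the jitter keeps a fleet of restarting collectors from reconnecting in lockstep. A sketch of the delay schedule (parameter defaults are illustrative; the collector would sleep through these delays between reconnect attempts):

```python
import random

def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   max_delay: float = 30.0, attempts: int = 5,
                   jitter: bool = True) -> list[float]:
    """Delays for retrying a failed gRPC call: exponential growth, capped,
    with full jitter so clients don't retry in lockstep."""
    delays = []
    for attempt in range(attempts):
        delay = min(base * (factor ** attempt), max_delay)
        delays.append(random.uniform(0, delay) if jitter else delay)
    return delays

print(backoff_delays(jitter=False))  # → [0.5, 1.0, 2.0, 4.0, 8.0]
```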
Proto Definitions (Starting Point)
syntax = "proto3";
package monitoring;
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

// Minimal request shape so GetCurrentState compiles; adjust fields as needed.
message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
Project Structure
system-monitor/
├── docker-compose.yml
├── proto/
│ └── metrics.proto
├── services/
│ ├── collector/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ ├── main.py
│ │ └── metrics.py
│ ├── aggregator/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ ├── main.py
│ │ └── storage.py
│ ├── gateway/
│ │ ├── Dockerfile
│ │ ├── requirements.txt
│ │ ├── main.py
│ │ └── websocket.py
│ └── alerts/
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── main.py
│ └── rules.py
├── web/
│ ├── static/
│ │ ├── css/
│ │ └── js/
│ └── templates/
│ └── dashboard.html
└── README.md
Interview Talking Points
Domain Mapping to Payments
What you say:
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
- "Each machine streaming metrics is like a payment processor streaming transactions"
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
- "The event stream for audit trails maps directly to payment audit logs"
Technical Decisions to Highlight
gRPC vs REST:
- "I use gRPC between services for efficiency and strong typing"
- "FastAPI gateway exposes REST/WebSocket for browser clients"
- "This pattern is common - internal gRPC, external REST"
Streaming vs Polling:
- "Server-side streaming reduces network overhead"
- "Bidirectional streaming allows dynamic configuration updates"
- "WebSocket to browser maintains single connection"
State Management:
- "Redis for hot data - current state, needs fast access"
- "TimescaleDB for historical analysis - optimized for time-series"
- "This tiered storage approach scales to payment transaction volumes"
Resilience:
- "Each collector is independent - one failing doesn't affect others"
- "Circuit breaker prevents cascade failures"
- "Event stream decouples alert processing from metric ingestion"
What NOT to Say
- Don't call it a "toy project" or "learning exercise"
- Don't apologize for running locally vs AWS
- Don't over-explain obvious things
- Don't claim it's production-ready when it's not
What TO Say
- "I built this to solve a real problem I have"
- "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
- "I focused on the architectural patterns since those transfer directly"
- "I'd keep developing this - it's genuinely useful"
Development Guidelines
Code Quality Standards
- Type hints throughout (Python 3.11+ syntax)
- Async/await patterns consistently
- Structured logging (JSON format)
- Error handling at all boundaries
- Unit tests for business logic
- Integration tests for service interactions
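Structured JSON logging needs nothing beyond the stdlib: a custom Formatter that emits one JSON object per line. A sketch (the `context` attribute passed via `extra` is a project convention, not a logging-module feature; structlog or python-json-logger are common library alternatives):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line, stdlib-only."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields attached via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("aggregator")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("metric received", extra={"context": {"machine_id": "machine-1", "cpu": 42.0}})

# The formatter alone can also be exercised directly:
record = logging.LogRecord("aggregator", logging.INFO, __file__, 0,
                           "metric received", None, None)
line = JsonFormatter().format(record)
print(line)
```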
Docker Best Practices
- Multi-stage builds
- Non-root users
- Health checks
- Resource limits
- Volume mounts for development
Configuration Management
- Environment variables for all config
- Sensible defaults
- Config validation on startup
- No secrets in code
AWS Mapping (For Interview Discussion)
What you have → What it becomes:
- PostgreSQL → Aurora PostgreSQL
- Redis → ElastiCache
- Docker Containers → ECS/Fargate or Lambda
- RabbitMQ/Redis Pub/Sub → SQS/SNS
- Docker Compose → CloudFormation/Terraform
- Local networking → VPC, Security Groups
Key point: "The architecture and patterns are production-ready; the infrastructure is local for development convenience"
Common Pitfalls to Avoid
- Over-engineering Phase 1 - Resist adding features, just get streaming working
- Ugly UI - Don't waste time on design, htmx + basic CSS is fine
- Perfect metrics - Mock data is OK early on, real psutil data comes later
- Complete coverage - Better to have 3 services working perfectly than 10 half-done
- AWS deployment - Local is fine, AWS costs money and adds complexity
Success Metrics
For Yourself:
- Actually use the dashboard daily
- Catches a real issue before you notice it
- Runs stably for 1+ week without intervention
For Interview:
- Can demo end-to-end in 5 minutes
- Can explain every service interaction
- Can map to payment domain fluently
- Shows understanding of production patterns
Next Steps
- Set up project structure
- Define proto messages
- Build Phase 1 MVP
- Iterate based on what feels useful
- Polish for demo when interview approaches
Resources
- gRPC Python docs: https://grpc.io/docs/languages/python/
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
- TimescaleDB: https://docs.timescale.com/
- htmx: https://htmx.org/
Questions to Ask Yourself During Development
- "Would I actually use this feature?"
- "How does this map to payments?"
- "Can I explain why I built it this way?"
- "What would break if X service failed?"
- "How would this scale to 1000 machines?"
Final Note
This project works because it's:
- Real - You'll use it
- Focused - Shows specific patterns they care about
- Mappable - Clear connection to their domain
- Yours - Not a tutorial copy, demonstrates your thinking
Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.
Good luck! 🚀