# Distributed System Monitoring Platform

## Project Overview

A real-time system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

**Primary Goal:** Interview demonstration project for a Python Microservices Engineer position

**Secondary Goal:** A genuinely useful tool for managing a multi-machine development environment

**Time Investment:** Phased approach: MVP in a weekend, polish over 2-3 weeks

## Why This Project

**Interview Alignment:**
- Demonstrates gRPC-based microservices architecture (core requirement)
- Shows streaming patterns (server-side and bidirectional)
- Real-time data aggregation and processing
- Alert/threshold monitoring (maps to fraud detection)
- Event-driven patterns
- Multiple data sources requiring normalization (maps to multiple payment processors)

**Personal Utility:**
- Monitors the existing multi-machine dev setup
- Dashboard stays open, provides real value
- Solves an actual pain point
- Will continue running post-interview

**Domain Mapping for Interview:**
- Machine = Payment Processor
- Metrics Stream = Transaction Stream
- Resource Thresholds = Fraud/Limit Detection
- Alert System = Risk Management
- Aggregation Service = Payment Processing Hub

## Technical Stack

### Core Technologies (Must Use - From JD)
- **Python 3.11+** - Primary language
- **FastAPI** - Web gateway, REST endpoints, WebSocket streaming
- **gRPC** - Inter-service communication, metric streaming
- **PostgreSQL/TimescaleDB** - Time-series historical data
- **Redis** - Current state, caching, alert rules
- **Docker Compose** - Orchestration

### Supporting Technologies
- **Protocol Buffers** - gRPC message definitions
- **WebSockets** - Browser streaming
- **htmx + Alpine.js** - Lightweight reactive frontend (avoids a heavy SPA)
- **Chart.js or Apache ECharts** - Real-time graphs
- **asyncio** - Async patterns throughout

### Development Tools
- **grpcio & grpcio-tools** - Python gRPC
- **psutil** - System metrics collection
- **uvicorn** - FastAPI server
- **pytest** - Testing
- **docker-compose** - Local orchestration

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                          Browser                            │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Dashboard (htmx + Alpine.js + WebSockets)           │   │
│  └──────────────────────────────────────────────────────┘   │
└───────────────────────────┬─────────────────────────────────┘
                            │ WebSocket
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    Web Gateway Service                      │
│                  (FastAPI + WebSockets)                     │
│  - Serves dashboard                                         │
│  - Streams updates to browser                               │
│  - REST API for historical queries                          │
└───────────────────────────┬─────────────────────────────────┘
                            │ gRPC
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors              │
│  - Normalizes data from different sources                   │
│  - Enriches with machine context                            │
│  - Publishes to event stream                                │
│  - Checks alert thresholds                                  │
└──────┬───────────────┬───────────────────┬──────────────────┘
       │ Stores        │ Stores            │ Publishes events
       ▼               ▼                   ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ TimescaleDB  │ │    Redis     │ │   Event Stream   │
│ (historical) │ │  (current    │ │  (Redis Pub/Sub  │
└──────────────┘ │   state)     │ │   or RabbitMQ)   │
                 └──────────────┘ └────────┬─────────┘
                                           │ Subscribes
                                           ▼
                                  ┌──────────────────┐
                                  │  Alert Service   │
                                  │  - Processes     │
                                  │    events        │
                                  │  - Triggers      │
                                  │    actions       │
                                  └──────────────────┘

          ▲ gRPC Streaming (to Aggregator)
          │
┌─────────┴─────────────────────────────────┐
│  Multiple Collector Services              │
│  (one per machine)                        │
│  - Gathers system metrics (psutil)        │
│  - Streams to Aggregator via gRPC         │
│  - CPU, Memory, Disk, Network             │
│  - Process list                           │
│  - Docker container stats (optional)      │
└───────────────────────────────────────────┘
   Machine 1, Machine 2, Machine 3, ...
```

## Implementation Phases

### Phase 1: MVP - Core Streaming (Weekend, 8-12 hours)

**Goal:** Prove the gRPC streaming works end-to-end

**Deliverables:**
1. Metrics Collector Service (gRPC client)
   - Collects CPU, memory, disk on localhost
   - Streams to aggregator every 5 seconds

2. Aggregator Service (gRPC server)
   - Receives metric stream
   - Stores current state in Redis
   - Logs to console

3. Proto definitions for metric messages

4. Docker Compose setup

**Success Criteria:**
- Run collector, see metrics flowing to aggregator
- Redis contains current state
- Can query Redis manually for latest metrics

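A minimal sketch of the Phase 1 collection loop, assuming `psutil` is installed. The function names and the 5-second default mirror the deliverables above; the gRPC send is left as a placeholder, since the point of the phase is first getting real readings on a timer.

```python
# Sketch of the Phase 1 collector loop (names are illustrative, not final).
import time

import psutil


def collect_snapshot(machine_id: str) -> dict:
    """Gather the Phase 1 metric set (CPU, memory, disk) for one machine."""
    return {
        "machine_id": machine_id,
        "timestamp": int(time.time()),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }


def run_collector(machine_id: str, interval_seconds: int = 5) -> None:
    """Emit snapshots forever; in the real service each snapshot would be
    converted to a Metric proto and streamed to the aggregator over gRPC."""
    while True:
        snapshot = collect_snapshot(machine_id)
        print(snapshot)  # placeholder for the gRPC send
        time.sleep(interval_seconds)
```

Keeping `collect_snapshot` pure (no I/O beyond the psutil reads) makes it easy to unit-test before any gRPC wiring exists.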
### Phase 2: Web Dashboard (1 week)

**Goal:** Make it visible and useful

**Deliverables:**
1. Web Gateway Service (FastAPI)
   - WebSocket endpoint for streaming
   - REST endpoints for current/historical data

2. Dashboard UI
   - Real-time CPU/memory graphs per machine
   - Current state table
   - Simple, clean design

3. WebSocket bridge (Gateway ↔ Aggregator)

4. TimescaleDB integration
   - Store historical metrics
   - Query endpoints for time ranges

**Success Criteria:**
- Open dashboard, see live graphs updating
- Graphs show the last hour of data
- Multiple machines displayed separately

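The gateway's core job in this phase is fanning one upstream metric feed out to every connected dashboard. A dependency-free asyncio sketch of that fan-out (class and method names are illustrative; in the real gateway each queue would drain into a FastAPI WebSocket's `send_text`):

```python
# Hypothetical fan-out hub for the Web Gateway: one publisher, N subscribers.
import asyncio
import json


class Broadcaster:
    def __init__(self) -> None:
        self._queues: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        """Register a new dashboard connection; returns its private queue."""
        q: asyncio.Queue = asyncio.Queue()
        self._queues.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._queues.discard(q)

    async def publish(self, metric: dict) -> None:
        """Push one metric update to every connected dashboard."""
        payload = json.dumps(metric)
        for q in self._queues:
            await q.put(payload)


async def demo() -> list[str]:
    # Two "browsers" subscribe, one metric arrives, both receive it.
    hub = Broadcaster()
    a, b = hub.subscribe(), hub.subscribe()
    await hub.publish({"machine_id": "dev-box-1", "cpu": 42.0})
    return [await a.get(), await b.get()]
```

A per-connection queue (rather than writing to sockets directly from the publisher) keeps one slow browser from stalling updates to the others.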
### Phase 3: Alerts & Intelligence (1 week)

**Goal:** Add a decision-making layer (interview focus)

**Deliverables:**
1. Alert Service
   - Subscribes to the event stream
   - Evaluates threshold rules
   - Triggers notifications

2. Configuration Service (gRPC)
   - Dynamic threshold management
   - Alert rule CRUD
   - Stored in PostgreSQL

3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)

4. Enhanced dashboard
   - Alert indicators
   - Alert history
   - Threshold configuration UI

**Success Criteria:**
- Set a CPU threshold at 80%
- Generate load (stress-ng)
- See the alert trigger in the dashboard
- Alert logged to the database

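The heart of the Alert Service is a small, testable rule check. A sketch assuming rules are simple per-metric thresholds (the field names are illustrative, not a final schema):

```python
# Minimal threshold-rule evaluation for the Alert Service (illustrative schema).
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    machine_id: str   # "*" matches any machine
    metric_type: str  # e.g. "CPU_PERCENT"
    threshold: float  # fire when the observed value exceeds this


def evaluate(rule: AlertRule, event: dict) -> bool:
    """Return True if this metric event should trigger the rule."""
    if rule.machine_id not in ("*", event["machine_id"]):
        return False
    return event["type"] == rule.metric_type and event["value"] > rule.threshold
```

Keeping evaluation a pure function of (rule, event) means the Phase 3 success criterion ("set CPU threshold at 80%, see alert fire") is unit-testable without the event stream running.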
### Phase 4: Interview Polish (Final week)

**Goal:** Demo-ready, production patterns visible

**Deliverables:**
1. Observability
   - OpenTelemetry tracing (optional)
   - Structured logging
   - Health check endpoints

2. "Synthetic transactions"
   - Simulate business operations through the system
   - Track end-to-end latency
   - Maps directly to the payment processing demo

3. Documentation
   - Architecture diagram
   - Service interaction flows
   - Deployment guide

4. Demo script
   - Story to walk through
   - Key talking points
   - Domain mapping explanations

**Success Criteria:**
- Can deploy the entire stack with one command
- Can explain every service's role
- Can map the architecture to payment processing
- Demo runs smoothly without hiccups

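For the structured-logging deliverable, the stdlib alone is enough to emit JSON lines; a minimal sketch (a library such as structlog is an equally valid choice, and the field set here is illustrative):

```python
# JSON-structured logging with only the stdlib logging module.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # One JSON object per line: easy to grep locally, easy to ship
        # to a log aggregator later without reformatting.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })
```

Attach it per service with a `StreamHandler`, e.g. `handler.setFormatter(JsonFormatter())`, so every service in the compose stack logs in the same machine-readable shape.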
## Key Technical Patterns to Demonstrate

### 1. gRPC Streaming Patterns

**Server-Side Streaming:**
```protobuf
// Collector streams metrics to aggregator
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
```

**Bidirectional Streaming:**
```protobuf
// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
```

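In Python, the server side of a streaming RPC is just a method that yields messages. A dependency-free sketch of the `StreamMetrics` handler shape (in the real service this class would subclass the `metrics_pb2_grpc.MetricsServiceServicer` generated by grpcio-tools and yield `Metric` protos rather than the plain dicts used here):

```python
# Shape of a gRPC server-side streaming handler: a plain generator method.
class MetricsServicer:
    """Stand-in for the generated MetricsServiceServicer base class."""

    def StreamMetrics(self, request, context):
        # `request` carries machine_id and interval_seconds per the proto.
        for value in (12.5, 40.0, 87.5):  # stand-in for live psutil readings
            yield {
                "machine_id": request.machine_id,
                "type": "CPU_PERCENT",
                "value": value,
            }
            # a real handler would sleep request.interval_seconds between yields
```

Because the handler is an ordinary generator, it can be unit-tested by passing a stub request object and listing the output, with no gRPC server running.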
### 2. Service Communication Patterns

- **Synchronous (gRPC):** Query current state, configuration
- **Asynchronous (Events):** Metric updates, alerts, audit logs
- **Streaming (gRPC + WebSocket):** Real-time data flow

### 3. Data Storage Patterns

- **Hot data (Redis):** Current state, recent metrics (last 5 minutes)
- **Warm data (TimescaleDB):** Historical metrics (last 30 days)
- **Cold data (Optional):** Archive to S3-compatible storage

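A sketch of the hot-tier write, assuming a redis-py style client; the key layout is illustrative. Setting a TTL that matches the 5-minute hot window means a machine that stops reporting simply ages out of "current state" instead of showing stale numbers:

```python
# Hot-tier write path for the aggregator (key naming is illustrative).
import json

HOT_TTL_SECONDS = 300  # the "last 5 minutes" hot window from above


def store_current(r, metric: dict) -> None:
    """Write the latest value for one (machine, metric) pair to Redis.

    `r` is any client exposing redis-py's set(key, value, ex=...) signature.
    """
    key = f"current:{metric['machine_id']}:{metric['type']}"
    r.set(key, json.dumps(metric), ex=HOT_TTL_SECONDS)
```

Taking the client as a parameter keeps the function testable with a stub and lets the real service pass in an actual `redis.Redis` instance.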
### 4. Error Handling & Resilience

- gRPC retry logic with exponential backoff
- Circuit breaker pattern for service calls
- Graceful degradation (continue if one collector fails)
- Dead letter queue for failed events

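The retry bullet can be sketched as a small helper (gRPC can also do this declaratively via a retry policy in the channel's service config; this is the hand-rolled version, with illustrative attempt and delay values):

```python
# Retry with exponential backoff and a little jitter.
import random
import time


def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5):
    """Call fn(); on exception, retry with doubling delay, re-raising at the end."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # 0.5s, 1s, 2s, ... plus jitter so collectors don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter matters in this architecture: without it, every collector that lost the aggregator would hammer it again at the same instant on reconnect.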
## Proto Definitions (Starting Point)

```protobuf
syntax = "proto3";

package monitoring;

service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
```

## Project Structure

```
system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md
```

## Interview Talking Points

### Domain Mapping to Payments

**What you say:**
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
- "Each machine streaming metrics is like a payment processor streaming transactions"
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, and bank APIs"
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
- "The event stream for audit trails maps directly to payment audit logs"

### Technical Decisions to Highlight

**gRPC vs REST:**
- "I use gRPC between services for efficiency and strong typing"
- "The FastAPI gateway exposes REST/WebSocket for browser clients"
- "This pattern is common - internal gRPC, external REST"

**Streaming vs Polling:**
- "Server-side streaming reduces network overhead"
- "Bidirectional streaming allows dynamic configuration updates"
- "A WebSocket to the browser maintains a single connection"

**State Management:**
- "Redis for hot data - current state that needs fast access"
- "TimescaleDB for historical analysis - optimized for time-series"
- "This tiered storage approach scales to payment transaction volumes"

**Resilience:**
- "Each collector is independent - one failing doesn't affect the others"
- "The circuit breaker prevents cascade failures"
- "The event stream decouples alert processing from metric ingestion"

### What NOT to Say

- Don't call it a "toy project" or "learning exercise"
- Don't apologize for running locally vs AWS
- Don't over-explain obvious things
- Don't claim it's production-ready when it's not

### What TO Say

- "I built this to solve a real problem I have"
- "Locally it uses PostgreSQL/Redis; in production these become Aurora/ElastiCache"
- "I focused on the architectural patterns since those transfer directly"
- "I'd keep developing this - it's genuinely useful"

## Development Guidelines

### Code Quality Standards
- Type hints throughout (Python 3.11+ syntax)
- Async/await patterns used consistently
- Structured logging (JSON format)
- Error handling at all boundaries
- Unit tests for business logic
- Integration tests for service interactions

### Docker Best Practices
- Multi-stage builds
- Non-root users
- Health checks
- Resource limits
- Volume mounts for development

### Configuration Management
- Environment variables for all config
- Sensible defaults
- Config validation on startup
- No secrets in code

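A sketch of env-var configuration with startup validation (the variable names and defaults are illustrative); at startup each service would call `load_settings(dict(os.environ))` and fail fast on bad values rather than limping along misconfigured:

```python
# Environment-driven config with validation on startup.
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    redis_url: str
    metrics_interval: int


def load_settings(env: dict[str, str]) -> Settings:
    """Build validated settings from an environment mapping."""
    interval = int(env.get("METRICS_INTERVAL", "5"))  # sensible default
    if interval <= 0:
        raise ValueError("METRICS_INTERVAL must be positive")
    return Settings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        metrics_interval=interval,
    )
```

Taking the environment as a plain dict (instead of reading `os.environ` inside the function) keeps validation unit-testable.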
## AWS Mapping (For Interview Discussion)

**What you have → What it becomes:**
- PostgreSQL → Aurora PostgreSQL
- Redis → ElastiCache
- Docker containers → ECS/Fargate or Lambda
- RabbitMQ/Redis Pub/Sub → SQS/SNS
- Docker Compose → CloudFormation/Terraform
- Local networking → VPC, Security Groups

**Key point:** "The architecture and patterns are production-ready; the infrastructure is local for development convenience"

## Common Pitfalls to Avoid

1. **Over-engineering Phase 1** - Resist adding features; just get streaming working
2. **Ugly UI** - Don't waste time on design; htmx + basic CSS is fine
3. **Perfect metrics** - Mock data is OK early on; real psutil data comes later
4. **Complete coverage** - Better to have 3 services working perfectly than 10 half-done
5. **AWS deployment** - Local is fine; AWS costs money and adds complexity

## Success Metrics

**For Yourself:**
- [ ] Actually use the dashboard daily
- [ ] It catches a real issue before you notice it
- [ ] Runs stably for 1+ week without intervention

**For Interview:**
- [ ] Can demo end-to-end in 5 minutes
- [ ] Can explain every service interaction
- [ ] Can map to the payment domain fluently
- [ ] Shows understanding of production patterns

## Next Steps

1. Set up the project structure
2. Define proto messages
3. Build the Phase 1 MVP
4. Iterate based on what feels useful
5. Polish for the demo as the interview approaches

## Resources

- gRPC Python docs: https://grpc.io/docs/languages/python/
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
- TimescaleDB: https://docs.timescale.com/
- htmx: https://htmx.org/

## Questions to Ask Yourself During Development

- "Would I actually use this feature?"
- "How does this map to payments?"
- "Can I explain why I built it this way?"
- "What would break if X service failed?"
- "How would this scale to 1000 machines?"

---

## Final Note

This project works because it's:
1. **Real** - You'll use it
2. **Focused** - Shows the specific patterns they care about
3. **Mappable** - Clear connection to their domain
4. **Yours** - Not a tutorial copy; it demonstrates your thinking

Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.

Good luck! 🚀