# Distributed System Monitoring Platform

## Project Overview

A real-time system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

**Primary Goal:** Interview demonstration project for Python Microservices Engineer position

**Secondary Goal:** Actually useful tool for managing multi-machine development environment

**Time Investment:** Phased approach - MVP in a weekend, polish over 2-3 weeks

## Why This Project

**Interview Alignment:**
- Demonstrates gRPC-based microservices architecture (core requirement)
- Shows streaming patterns (server-side and bidirectional)
- Real-time data aggregation and processing
- Alert/threshold monitoring (maps to fraud detection)
- Event-driven patterns
- Multiple data sources requiring normalization (maps to multiple payment processors)

**Personal Utility:**
- Monitors existing multi-machine dev setup
- Dashboard stays open, provides real value
- Solves actual pain point
- Will continue running post-interview

**Domain Mapping for Interview:**
- Machine = Payment Processor
- Metrics Stream = Transaction Stream
- Resource Thresholds = Fraud/Limit Detection
- Alert System = Risk Management
- Aggregation Service = Payment Processing Hub

## Technical Stack

### Core Technologies (Must Use - From JD)
- **Python 3.11+** - Primary language
- **FastAPI** - Web gateway, REST endpoints, WebSocket streaming
- **gRPC** - Inter-service communication, metric streaming
- **PostgreSQL/TimescaleDB** - Time-series historical data
- **Redis** - Current state, caching, alert rules
- **Docker Compose** - Orchestration

### Supporting Technologies
- **Protocol Buffers** - gRPC message definitions
- **WebSockets** - Browser streaming
- **htmx + Alpine.js** - Lightweight reactive frontend (avoid heavy SPA)
- **Chart.js or Apache ECharts** - Real-time graphs
- **asyncio** - Async patterns throughout

### Development Tools
- **grpcio & grpcio-tools** - Python gRPC
- **psutil** - System metrics collection
- **uvicorn** - FastAPI server
- **pytest** - Testing
- **docker-compose** - Local orchestration

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                           Browser                           │
│  ┌───────────────────────────────────────────────────────┐  │
│  │      Dashboard (htmx + Alpine.js + WebSockets)        │  │
│  └───────────────────────────────────────────────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
                               │ WebSocket
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     Web Gateway Service                     │
│                   (FastAPI + WebSockets)                    │
│  - Serves dashboard                                         │
│  - Streams updates to browser                               │
│  - REST API for historical queries                          │
└──────────────────────────────┬──────────────────────────────┘
                               │ gRPC
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors              │
│  - Normalizes data from different sources                   │
│  - Enriches with machine context                            │
│  - Publishes to event stream                                │
│  - Checks alert thresholds                                  │
└──────┬─────────────────────┬────────────────────────┬───────┘
       │ Stores              │ Stores                 │ Publishes events
       ▼                     ▼                        ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────────┐
│ TimescaleDB  │      │    Redis     │      │   Event Stream   │
│ (historical) │      │   (current   │      │  (Redis Pub/Sub  │
└──────────────┘      │    state)    │      │   or RabbitMQ)   │
                      └──────────────┘      └────────┬─────────┘
                                                     │ Subscribes
                                                     ▼
                                            ┌──────────────────┐
                                            │  Alert Service   │
                                            │  - Processes     │
                                            │    events        │
                                            │  - Triggers      │
                                            │    actions       │
                                            └──────────────────┘

          ▲ gRPC Streaming (to Aggregator)
          │
Multiple Collector Services (one per machine)
┌───────────────────────────────────────┐
│  Metrics Collector (gRPC Client)      │
│  - Gathers system metrics (psutil)    │
│  - Streams to Aggregator via gRPC     │
│  - CPU, Memory, Disk, Network         │
│  - Process list                       │
│  - Docker container stats (optional)  │
└───────────────────────────────────────┘
  └──► Machine 1, Machine 2, Machine 3, ...
```

## Implementation Phases

### Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)

**Goal:** Prove the gRPC streaming works end-to-end

**Deliverables:**
1. Metrics Collector Service (gRPC client)
   - Collects CPU, memory, disk on localhost
   - Streams to aggregator every 5 seconds
2. Aggregator Service (gRPC server)
   - Receives metric stream
   - Stores current state in Redis
   - Logs to console
3. Proto definitions for metric messages
4. Docker Compose setup

**Success Criteria:**
- Run collector, see metrics flowing to aggregator
- Redis contains current state
- Can query Redis manually for latest metrics

### Phase 2: Web Dashboard (1 week)

**Goal:** Make it visible and useful

**Deliverables:**
1. Web Gateway Service (FastAPI)
   - WebSocket endpoint for streaming
   - REST endpoints for current/historical data
2. Dashboard UI
   - Real-time CPU/Memory graphs per machine
   - Current state table
   - Simple, clean design
3. WebSocket bridge (Gateway ↔ Aggregator)
4. TimescaleDB integration
   - Store historical metrics
   - Query endpoints for time ranges

**Success Criteria:**
- Open dashboard, see live graphs updating
- Graphs show last hour of data
- Multiple machines displayed separately

### Phase 3: Alerts & Intelligence (1 week)

**Goal:** Add decision-making layer (interview focus)

**Deliverables:**
1. Alert Service
   - Subscribes to event stream
   - Evaluates threshold rules
   - Triggers notifications
2. Configuration Service (gRPC)
   - Dynamic threshold management
   - Alert rule CRUD
   - Stored in PostgreSQL
3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)
4. Enhanced dashboard
   - Alert indicators
   - Alert history
   - Threshold configuration UI

**Success Criteria:**
- Set CPU threshold at 80%
- Generate load (stress-ng)
- See alert trigger in dashboard
- Alert logged to database

### Phase 4: Interview Polish (Final week)

**Goal:** Demo-ready, production patterns visible

**Deliverables:**
1. Observability
   - OpenTelemetry tracing (optional)
   - Structured logging
   - Health check endpoints
2. "Synthetic Transactions"
   - Simulate business operations through system
   - Track end-to-end latency
   - Maps directly to payment processing demo
3. Documentation
   - Architecture diagram
   - Service interaction flows
   - Deployment guide
4. Demo script
   - Story to walk through
   - Key talking points
   - Domain mapping explanations

**Success Criteria:**
- Can deploy entire stack with one command
- Can explain every service's role
- Can map architecture to payment processing
- Demo runs smoothly without hiccups

## Key Technical Patterns to Demonstrate

### 1. gRPC Streaming Patterns

**Server-Side Streaming:**
```protobuf
// Aggregator streams metrics out to a subscriber (e.g., the gateway)
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
```

**Bidirectional Streaming:**
```protobuf
// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
```

### 2. Service Communication Patterns
- **Synchronous (gRPC):** Query current state, configuration
- **Asynchronous (Events):** Metric updates, alerts, audit logs
- **Streaming (gRPC + WebSocket):** Real-time data flow

### 3. Data Storage Patterns
- **Hot data (Redis):** Current state, recent metrics (last 5 minutes)
- **Warm data (TimescaleDB):** Historical metrics (last 30 days)
- **Cold data (Optional):** Archive to S3-compatible storage
### 4. Error Handling & Resilience
- gRPC retry logic with exponential backoff
- Circuit breaker pattern for service calls
- Graceful degradation (continue if one collector fails)
- Dead letter queue for failed events

## Proto Definitions (Starting Point)

```protobuf
syntax = "proto3";

package monitoring;

service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
```

## Project Structure

```
system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md
```

## Interview Talking Points

### Domain Mapping to Payments

**What you say:**
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
- "Each machine streaming metrics is like a payment processor streaming transactions"
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
- "The event stream for audit trails maps directly to payment audit logs"

### Technical Decisions to Highlight

**gRPC vs REST:**
- "I use gRPC between services for efficiency and strong typing"
- "FastAPI gateway exposes REST/WebSocket for browser clients"
- "This pattern is common - internal gRPC, external REST"

**Streaming vs Polling:**
- "Server-side streaming reduces network overhead"
- "Bidirectional streaming allows dynamic configuration updates"
- "WebSocket to browser maintains single connection"

**State Management:**
- "Redis for hot data - current state, needs fast access"
- "TimescaleDB for historical analysis - optimized for time-series"
- "This tiered storage approach scales to payment transaction volumes"

**Resilience:**
- "Each collector is independent - one failing doesn't affect others"
- "Circuit breaker prevents cascade failures"
- "Event stream decouples alert processing from metric ingestion"

### What NOT to Say
- Don't call it a "toy project" or "learning exercise"
- Don't apologize for running locally vs AWS
- Don't over-explain obvious things
- Don't claim it's production-ready when it's not

### What TO Say
- "I built this to solve a real problem I have"
- "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
- "I focused on the architectural patterns since those transfer directly"
- "I'd keep developing this - it's genuinely useful"

## Development Guidelines

### Code Quality Standards
- Type hints throughout (Python 3.11+ syntax)
- Async/await patterns consistently
- Structured logging (JSON format)
- Error handling at all boundaries
- Unit tests for business logic
- Integration tests for service interactions

### Docker Best Practices
- Multi-stage builds
- Non-root users
- Health checks
- Resource limits
- Volume mounts for development

### Configuration Management
- Environment variables for all config
- Sensible defaults
- Config validation on startup
- No secrets in code
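The "structured logging (JSON format)" standard can be met with the standard library alone. This is an illustrative sketch: the field names and the `machine_id` context key are assumptions, not an established project convention.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (field names assumed)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Context passed via `extra=` lands as attributes on the record:
            "machine_id": getattr(record, "machine_id", None),
        }
        return json.dumps(payload)

# Each service would configure its logger the same way at startup.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("collector")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("metrics batch sent", extra={"machine_id": "dev-box-1"})
```

One JSON object per line keeps the output greppable locally and ingestible by log pipelines later without changing call sites.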
## AWS Mapping (For Interview Discussion)

**What you have → What it becomes:**
- PostgreSQL → Aurora PostgreSQL
- Redis → ElastiCache
- Docker Containers → ECS/Fargate or Lambda
- RabbitMQ/Redis Pub/Sub → SQS/SNS
- Docker Compose → CloudFormation/Terraform
- Local networking → VPC, Security Groups

**Key point:** "The architecture and patterns are production-ready, the infrastructure is local for development convenience"

## Common Pitfalls to Avoid

1. **Over-engineering Phase 1** - Resist adding features, just get streaming working
2. **Ugly UI** - Don't waste time on design, htmx + basic CSS is fine
3. **Perfect metrics** - Mock data is OK early on, real psutil data comes later
4. **Complete coverage** - Better to have 3 services working perfectly than 10 half-done
5. **AWS deployment** - Local is fine, AWS costs money and adds complexity

## Success Metrics

**For Yourself:**
- [ ] Actually use the dashboard daily
- [ ] Catches a real issue before you notice
- [ ] Runs stable for 1+ week without intervention

**For Interview:**
- [ ] Can demo end-to-end in 5 minutes
- [ ] Can explain every service interaction
- [ ] Can map to payment domain fluently
- [ ] Shows understanding of production patterns

## Next Steps

1. Set up project structure
2. Define proto messages
3. Build Phase 1 MVP
4. Iterate based on what feels useful
5. Polish for demo when interview approaches

## Resources

- gRPC Python docs: https://grpc.io/docs/languages/python/
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
- TimescaleDB: https://docs.timescale.com/
- htmx: https://htmx.org/

## Questions to Ask Yourself During Development

- "Would I actually use this feature?"
- "How does this map to payments?"
- "Can I explain why I built it this way?"
- "What would break if X service failed?"
- "How would this scale to 1000 machines?"

---

## Final Note

This project works because it's:

1. **Real** - You'll use it
2. **Focused** - Shows specific patterns they care about
3. **Mappable** - Clear connection to their domain
4. **Yours** - Not a tutorial copy, demonstrates your thinking

Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code. Good luck! 🚀