first claude draft

Commit 116d4032e2 by buenosairesam, 2025-12-29 14:40:06 -03:00
69 changed files with 5020 additions and 0 deletions

CLAUDE.md (492 lines)
# Distributed System Monitoring Platform
## Project Overview
A real-time system monitoring platform that streams metrics from multiple machines to a central hub with live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.
**Primary Goal:** Interview demonstration project for Python Microservices Engineer position
**Secondary Goal:** Actually useful tool for managing multi-machine development environment
**Time Investment:** Phased approach - MVP in a weekend, polish over 2-3 weeks
## Why This Project
**Interview Alignment:**
- Demonstrates gRPC-based microservices architecture (core requirement)
- Shows streaming patterns (server-side and bidirectional)
- Real-time data aggregation and processing
- Alert/threshold monitoring (maps to fraud detection)
- Event-driven patterns
- Multiple data sources requiring normalization (maps to multiple payment processors)
**Personal Utility:**
- Monitors existing multi-machine dev setup
- Dashboard stays open, provides real value
- Solves actual pain point
- Will continue running post-interview
**Domain Mapping for Interview:**
- Machine = Payment Processor
- Metrics Stream = Transaction Stream
- Resource Thresholds = Fraud/Limit Detection
- Alert System = Risk Management
- Aggregation Service = Payment Processing Hub
## Technical Stack
### Core Technologies (Must Use - From JD)
- **Python 3.11+** - Primary language
- **FastAPI** - Web gateway, REST endpoints, WebSocket streaming
- **gRPC** - Inter-service communication, metric streaming
- **PostgreSQL/TimescaleDB** - Time-series historical data
- **Redis** - Current state, caching, alert rules
- **Docker Compose** - Orchestration
### Supporting Technologies
- **Protocol Buffers** - gRPC message definitions
- **WebSockets** - Browser streaming
- **htmx + Alpine.js** - Lightweight reactive frontend (avoid heavy SPA)
- **Chart.js or Apache ECharts** - Real-time graphs
- **asyncio** - Async patterns throughout
### Development Tools
- **grpcio & grpcio-tools** - Python gRPC
- **psutil** - System metrics collection
- **uvicorn** - FastAPI server
- **pytest** - Testing
- **docker-compose** - Local orchestration
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                           Browser                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │      Dashboard (htmx + Alpine.js + WebSockets)       │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────┘
                               │ WebSocket
┌──────────────────────────────┴──────────────────────────────┐
│                     Web Gateway Service                     │
│                   (FastAPI + WebSockets)                    │
│  - Serves dashboard                                         │
│  - Streams updates to browser                               │
│  - REST API for historical queries                          │
└──────────────────────────────┬──────────────────────────────┘
                               │ gRPC
┌──────────────────────────────┴──────────────────────────────┐
│                  Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors              │
│  - Normalizes data from different sources                   │
│  - Enriches with machine context                            │
│  - Publishes to event stream                                │
│  - Checks alert thresholds                                  │
└───────┬──────────────────┬──────────────────┬───────────────┘
        │ Stores           │ Stores           │ Publishes events
        ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌────────────────┐
│ TimescaleDB  │   │    Redis     │   │  Event Stream  │
│ (historical) │   │  (current    │   │ (Redis Pub/Sub │
└──────────────┘   │   state)     │   │  or RabbitMQ)  │
                   └──────────────┘   └────────┬───────┘
                                               │ Subscribes
                                               ▼
                                      ┌────────────────┐
                                      │ Alert Service  │
                                      │ - Processes    │
                                      │   events       │
                                      │ - Triggers     │
                                      │   actions      │
                                      └────────────────┘

        ▲ gRPC Streaming (into the Aggregator)
        │
┌───────┴─────────────────────────────────────────────────────┐
│        Multiple Collector Services (one per machine)        │
│  ┌───────────────────────────────────────┐                  │
│  │ Metrics Collector (gRPC Client)       │                  │
│  │ - Gathers system metrics (psutil)     │                  │
│  │ - Streams to Aggregator via gRPC      │                  │
│  │ - CPU, Memory, Disk, Network          │                  │
│  │ - Process list                        │                  │
│  │ - Docker container stats (optional)   │                  │
│  └───────────────────────────────────────┘                  │
│        Machine 1, Machine 2, Machine 3, ...                 │
└─────────────────────────────────────────────────────────────┘
```
## Implementation Phases
### Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)
**Goal:** Prove the gRPC streaming works end-to-end
**Deliverables:**
1. Metrics Collector Service (gRPC client)
- Collects CPU, memory, disk on localhost
- Streams to aggregator every 5 seconds
2. Aggregator Service (gRPC server)
- Receives metric stream
- Stores current state in Redis
- Logs to console
3. Proto definitions for metric messages
4. Docker Compose setup
**Success Criteria:**
- Run collector, see metrics flowing to aggregator
- Redis contains current state
- Can query Redis manually for latest metrics
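The collector's core loop can be sketched as a generator that yields one sample per interval. This is a minimal sketch: `sample_metrics()` stands in for the real psutil calls, and the stub call in the comment assumes stubs generated from `metrics.proto`.

```python
import time
from typing import Iterator

def sample_metrics(machine_id: str) -> dict:
    # Stand-in for the real psutil calls, e.g. psutil.cpu_percent()
    # and psutil.virtual_memory().percent.
    return {"machine_id": machine_id,
            "timestamp": int(time.time()),
            "cpu_percent": 12.5,
            "memory_percent": 40.2}

def metric_stream(machine_id: str, interval_s: float,
                  max_samples: int) -> Iterator[dict]:
    """Yield one sample per interval; the real collector wraps each dict
    in a protobuf Metric and feeds the iterator to the gRPC stub."""
    for _ in range(max_samples):
        yield sample_metrics(machine_id)
        time.sleep(interval_s)

# In the real service (hypothetical stub, generated from metrics.proto):
#   stub.StreamMetrics(metric_stream("dev-box-1", 5.0, max_samples))
samples = list(metric_stream("dev-box-1", 0.0, 3))
```

Keeping sampling behind a plain generator makes the MVP testable without gRPC running at all.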
### Phase 2: Web Dashboard (1 week)
**Goal:** Make it visible and useful
**Deliverables:**
1. Web Gateway Service (FastAPI)
- WebSocket endpoint for streaming
- REST endpoints for current/historical data
2. Dashboard UI
- Real-time CPU/Memory graphs per machine
- Current state table
- Simple, clean design
3. WebSocket bridge (Gateway ↔ Aggregator)
4. TimescaleDB integration
- Store historical metrics
- Query endpoints for time ranges
**Success Criteria:**
- Open dashboard, see live graphs updating
- Graphs show last hour of data
- Multiple machines displayed separately
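The gateway's WebSocket bridge boils down to a fan-out hub: each browser connection gets its own queue, and every aggregator update is pushed to all of them. A minimal asyncio sketch (in the real service each queue would drain into a FastAPI `websocket.send_json(...)` loop; the class and method names here are illustrative):

```python
import asyncio

class Broadcaster:
    """One queue per connected browser; publish() fans each update out."""
    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    async def publish(self, message: dict) -> None:
        # Per-subscriber queues decouple a slow browser from the rest.
        for q in self._subscribers:
            await q.put(message)

async def demo() -> tuple[dict, dict]:
    hub = Broadcaster()
    a, b = hub.subscribe(), hub.subscribe()
    await hub.publish({"machine_id": "dev-box-1", "cpu_percent": 12.5})
    return await a.get(), await b.get()

got_a, got_b = asyncio.run(demo())
```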
### Phase 3: Alerts & Intelligence (1 week)
**Goal:** Add decision-making layer (interview focus)
**Deliverables:**
1. Alert Service
- Subscribes to event stream
- Evaluates threshold rules
- Triggers notifications
2. Configuration Service (gRPC)
- Dynamic threshold management
- Alert rule CRUD
- Stored in PostgreSQL
3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)
4. Enhanced dashboard
- Alert indicators
- Alert history
- Threshold configuration UI
**Success Criteria:**
- Set CPU threshold at 80%
- Generate load (stress-ng)
- See alert trigger in dashboard
- Alert logged to database
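The heart of the Alert Service is a pure rule check, which is easy to unit-test independently of the event stream. A sketch using the `machine_id`/`type`/`value` field names from the Metric message (the `AlertRule` shape and severity field are assumptions, not yet in any schema):

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    metric_type: str               # e.g. "CPU_PERCENT"
    threshold: float               # e.g. 80.0
    machine_id: str | None = None  # None = rule applies to every machine

def evaluate(rule: AlertRule, metric: dict) -> dict | None:
    """Return an alert event if the metric breaches the rule, else None."""
    if rule.machine_id is not None and rule.machine_id != metric["machine_id"]:
        return None
    if metric["type"] != rule.metric_type or metric["value"] <= rule.threshold:
        return None
    return {"machine_id": metric["machine_id"],
            "metric_type": metric["type"],
            "value": metric["value"],
            "threshold": rule.threshold}

rule = AlertRule(metric_type="CPU_PERCENT", threshold=80.0)
alert = evaluate(rule, {"machine_id": "dev-box-1",
                        "type": "CPU_PERCENT", "value": 93.0})
```

The same shape maps to the interview story: swap "CPU_PERCENT over 80" for "transaction amount over limit" and nothing else changes.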
### Phase 4: Interview Polish (Final week)
**Goal:** Demo-ready, production patterns visible
**Deliverables:**
1. Observability
- OpenTelemetry tracing (optional)
- Structured logging
- Health check endpoints
2. "Synthetic Transactions"
- Simulate business operations through system
- Track end-to-end latency
- Maps directly to payment processing demo
3. Documentation
- Architecture diagram
- Service interaction flows
- Deployment guide
4. Demo script
- Story to walk through
- Key talking points
- Domain mapping explanations
**Success Criteria:**
- Can deploy entire stack with one command
- Can explain every service's role
- Can map architecture to payment processing
- Demo runs smoothly without hiccups
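The "synthetic transactions" idea can be sketched as a marker payload pushed through the pipeline with per-stage timing. The stage functions below are stand-ins for the real collector → aggregator → gateway hops:

```python
import time
from typing import Callable

def run_synthetic_transaction(
    stages: list[tuple[str, Callable[[dict], dict]]],
) -> tuple[dict, dict[str, float]]:
    """Push a marker payload through each pipeline stage and record
    per-stage latency in seconds."""
    payload: dict = {"id": "synthetic-1"}
    timings: dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

payload, timings = run_synthetic_transaction([
    ("ingest", lambda p: {**p, "ingested": True}),
    ("aggregate", lambda p: {**p, "aggregated": True}),
])
```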
## Key Technical Patterns to Demonstrate
### 1. gRPC Streaming Patterns
**Server-Side Streaming:**
```protobuf
// One request in, a stream of responses out: the client asks for a
// machine's metrics and the server streams them back continuously.
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
```
**Bidirectional Streaming:**
```protobuf
// Two-way streaming, e.g. pushing config commands to a running
// collector while it streams responses back.
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
```
### 2. Service Communication Patterns
- **Synchronous (gRPC):** Query current state, configuration
- **Asynchronous (Events):** Metric updates, alerts, audit logs
- **Streaming (gRPC + WebSocket):** Real-time data flow
### 3. Data Storage Patterns
- **Hot data (Redis):** Current state, recent metrics (last 5 minutes)
- **Warm data (TimescaleDB):** Historical metrics (last 30 days)
- **Cold data (Optional):** Archive to S3-compatible storage
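The hot/warm split can be prototyped in memory before wiring up Redis — the point is a per-machine window that evicts anything older than a cutoff. The class name and API below are illustrative:

```python
from collections import deque

class HotWindow:
    """In-memory stand-in for the Redis hot tier: keep only the last
    window_s seconds of points per machine; anything older belongs to
    the warm tier (TimescaleDB)."""
    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._points: dict[str, deque] = {}

    def add(self, machine_id: str, timestamp: float, value: float) -> None:
        dq = self._points.setdefault(machine_id, deque())
        dq.append((timestamp, value))
        # Evict points that fell out of the hot window.
        cutoff = timestamp - self.window_s
        while dq and dq[0][0] < cutoff:
            dq.popleft()

    def points(self, machine_id: str) -> list[tuple[float, float]]:
        return list(self._points.get(machine_id, ()))

hot = HotWindow(window_s=300)      # 5-minute hot window
hot.add("dev-box-1", 0.0, 10.0)    # will be evicted by the next add
hot.add("dev-box-1", 600.0, 20.0)  # only survivor
```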
### 4. Error Handling & Resilience
- gRPC retry logic with exponential backoff
- Circuit breaker pattern for service calls
- Graceful degradation (continue if one collector fails)
- Dead letter queue for failed events
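The retry delays are worth getting right: full jitter (a random delay in [0, capped delay]) avoids thundering-herd reconnects when many collectors drop at once. A sketch of the schedule, with illustrative defaults:

```python
import random

def backoff_schedule(attempts: int, base_s: float = 0.5,
                     cap_s: float = 30.0, jitter: bool = False) -> list[float]:
    """Exponential backoff delays, capped at cap_s; with jitter=True each
    delay is drawn uniformly from [0, capped delay] ("full jitter")."""
    delays = []
    for n in range(attempts):
        delay = min(cap_s, base_s * (2 ** n))
        if jitter:
            delay = random.uniform(0.0, delay)
        delays.append(delay)
    return delays

# A gRPC reconnect loop would sleep delays[n] before attempt n + 1.
```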
## Proto Definitions (Starting Point)
```protobuf
syntax = "proto3";

package monitoring;

service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
```
## Project Structure
```
system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md
```
## Interview Talking Points
### Domain Mapping to Payments
**What you say:**
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
- "Each machine streaming metrics is like a payment processor streaming transactions"
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
- "The event stream for audit trails maps directly to payment audit logs"
### Technical Decisions to Highlight
**gRPC vs REST:**
- "I use gRPC between services for efficiency and strong typing"
- "FastAPI gateway exposes REST/WebSocket for browser clients"
- "This pattern is common - internal gRPC, external REST"
**Streaming vs Polling:**
- "Server-side streaming reduces network overhead"
- "Bidirectional streaming allows dynamic configuration updates"
- "WebSocket to browser maintains single connection"
**State Management:**
- "Redis for hot data - current state, needs fast access"
- "TimescaleDB for historical analysis - optimized for time-series"
- "This tiered storage approach scales to payment transaction volumes"
**Resilience:**
- "Each collector is independent - one failing doesn't affect others"
- "Circuit breaker prevents cascade failures"
- "Event stream decouples alert processing from metric ingestion"
### What NOT to Say
- Don't call it a "toy project" or "learning exercise"
- Don't apologize for running locally vs AWS
- Don't over-explain obvious things
- Don't claim it's production-ready when it's not
### What TO Say
- "I built this to solve a real problem I have"
- "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
- "I focused on the architectural patterns since those transfer directly"
- "I'd keep developing this - it's genuinely useful"
## Development Guidelines
### Code Quality Standards
- Type hints throughout (Python 3.11+ syntax)
- Async/await patterns consistently
- Structured logging (JSON format)
- Error handling at all boundaries
- Unit tests for business logic
- Integration tests for service interactions
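Structured logging needs nothing beyond the stdlib — a formatter that emits one JSON object per record is enough to start. Field names below are a suggestion:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

formatter = JsonFormatter()
record = logging.LogRecord(name="aggregator", level=logging.WARNING,
                           pathname="main.py", lineno=0,
                           msg="cpu %s%%", args=(93,), exc_info=None)
line = formatter.format(record)
```

Attach it via `handler.setFormatter(JsonFormatter())` on each service's root handler.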
### Docker Best Practices
- Multi-stage builds
- Non-root users
- Health checks
- Resource limits
- Volume mounts for development
### Configuration Management
- Environment variables for all config
- Sensible defaults
- Config validation on startup
- No secrets in code
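A sketch of the config rules above — environment variables, defaults, fail-fast validation — using only the stdlib. The variable names (`GRPC_PORT`, `REDIS_URL`, `METRIC_INTERVAL_S`) and defaults are assumptions for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Settings:
    redis_url: str
    grpc_port: int
    metric_interval_s: float

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read config from environment variables with sensible defaults,
    failing fast so misconfiguration surfaces at startup."""
    port = int(env.get("GRPC_PORT", "50051"))
    if not 0 < port < 65536:
        raise ValueError(f"GRPC_PORT out of range: {port}")
    interval = float(env.get("METRIC_INTERVAL_S", "5"))
    if interval <= 0:
        raise ValueError(f"METRIC_INTERVAL_S must be positive: {interval}")
    return Settings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        grpc_port=port,
        metric_interval_s=interval,
    )

defaults = load_settings(env={})
```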
## AWS Mapping (For Interview Discussion)
**What you have → What it becomes:**
- PostgreSQL → Aurora PostgreSQL
- Redis → ElastiCache
- Docker Containers → ECS/Fargate or Lambda
- RabbitMQ/Redis Pub/Sub → SQS/SNS
- Docker Compose → CloudFormation/Terraform
- Local networking → VPC, Security Groups
**Key point:** "The architecture and patterns are production-ready, the infrastructure is local for development convenience"
## Common Pitfalls to Avoid
1. **Over-engineering Phase 1** - Resist adding features, just get streaming working
2. **Over-designing the UI** - Don't waste time on visual polish; htmx + basic CSS is fine
3. **Perfect metrics** - Mock data is OK early on, real psutil data comes later
4. **Complete coverage** - Better to have 3 services working perfectly than 10 half-done
5. **AWS deployment** - Local is fine, AWS costs money and adds complexity
## Success Metrics
**For Yourself:**
- [ ] You actually use the dashboard daily
- [ ] It catches a real issue before you notice
- [ ] It runs stably for 1+ week without intervention
**For Interview:**
- [ ] Can demo end-to-end in 5 minutes
- [ ] Can explain every service interaction
- [ ] Can map to payment domain fluently
- [ ] Shows understanding of production patterns
## Next Steps
1. Set up project structure
2. Define proto messages
3. Build Phase 1 MVP
4. Iterate based on what feels useful
5. Polish for demo when interview approaches
## Resources
- gRPC Python docs: https://grpc.io/docs/languages/python/
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
- TimescaleDB: https://docs.timescale.com/
- htmx: https://htmx.org/
## Questions to Ask Yourself During Development
- "Would I actually use this feature?"
- "How does this map to payments?"
- "Can I explain why I built it this way?"
- "What would break if X service failed?"
- "How would this scale to 1000 machines?"
---
## Final Note
This project works because it's:
1. **Real** - You'll use it
2. **Focused** - Shows specific patterns they care about
3. **Mappable** - Clear connection to their domain
4. **Yours** - Not a tutorial copy, demonstrates your thinking
Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.
Good luck! 🚀