first claude draft

Commit 116d4032e2 by buenosairesam, 2025-12-29 14:40:06 -03:00
69 changed files with 5020 additions and 0 deletions

CLAUDE.md (492 lines)
# Distributed System Monitoring Platform
## Project Overview
A real-time system monitoring platform that streams metrics from multiple machines to a central hub with live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.
**Primary Goal:** Interview demonstration project for Python Microservices Engineer position
**Secondary Goal:** Actually useful tool for managing multi-machine development environment
**Time Investment:** Phased approach - MVP in a weekend, polish over 2-3 weeks
## Why This Project
**Interview Alignment:**
- Demonstrates gRPC-based microservices architecture (core requirement)
- Shows streaming patterns (server-side and bidirectional)
- Real-time data aggregation and processing
- Alert/threshold monitoring (maps to fraud detection)
- Event-driven patterns
- Multiple data sources requiring normalization (maps to multiple payment processors)
**Personal Utility:**
- Monitors existing multi-machine dev setup
- Dashboard stays open, provides real value
- Solves actual pain point
- Will continue running post-interview
**Domain Mapping for Interview:**
- Machine = Payment Processor
- Metrics Stream = Transaction Stream
- Resource Thresholds = Fraud/Limit Detection
- Alert System = Risk Management
- Aggregation Service = Payment Processing Hub
## Technical Stack
### Core Technologies (Must Use - From JD)
- **Python 3.11+** - Primary language
- **FastAPI** - Web gateway, REST endpoints, WebSocket streaming
- **gRPC** - Inter-service communication, metric streaming
- **PostgreSQL/TimescaleDB** - Time-series historical data
- **Redis** - Current state, caching, alert rules
- **Docker Compose** - Orchestration
### Supporting Technologies
- **Protocol Buffers** - gRPC message definitions
- **WebSockets** - Browser streaming
- **htmx + Alpine.js** - Lightweight reactive frontend (avoid heavy SPA)
- **Chart.js or Apache ECharts** - Real-time graphs
- **asyncio** - Async patterns throughout
### Development Tools
- **grpcio & grpcio-tools** - Python gRPC
- **psutil** - System metrics collection
- **uvicorn** - FastAPI server
- **pytest** - Testing
- **docker-compose** - Local orchestration
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                           Browser                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │      Dashboard (htmx + Alpine.js + WebSockets)       │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────┬──────────────────────────────┘
                               │ WebSocket
┌──────────────────────────────┴──────────────────────────────┐
│                     Web Gateway Service                     │
│                   (FastAPI + WebSockets)                    │
│  - Serves dashboard                                         │
│  - Streams updates to browser                               │
│  - REST API for historical queries                          │
└──────────────────────────────┬──────────────────────────────┘
                               │ gRPC
┌──────────────────────────────┴──────────────────────────────┐
│                  Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors              │
│  - Normalizes data from different sources                   │
│  - Enriches with machine context                            │
│  - Publishes to event stream                                │
│  - Checks alert thresholds                                  │
└───────┬──────────────────┬──────────────────┬───────────────┘
        │ Stores           │ Stores           │ Publishes events
        ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌────────────────┐
│ TimescaleDB  │   │    Redis     │   │  Event Stream  │
│ (historical) │   │  (current    │   │ (Redis Pub/Sub │
└──────────────┘   │   state)     │   │  or RabbitMQ)  │
                   └──────────────┘   └────────┬───────┘
                                               │ Subscribes
                                               ▼
                                      ┌────────────────┐
                                      │ Alert Service  │
                                      │ - Processes    │
                                      │   events       │
                                      │ - Triggers     │
                                      │   actions      │
                                      └────────────────┘

        ▲ gRPC Streaming (into the Aggregator)
        │
┌───────┴─────────────────────────────────────────────────────┐
│        Multiple Collector Services (one per machine)        │
│  ┌───────────────────────────────────────┐                  │
│  │ Metrics Collector (gRPC Client)       │                  │
│  │ - Gathers system metrics (psutil)     │                  │
│  │ - Streams to Aggregator via gRPC      │                  │
│  │ - CPU, Memory, Disk, Network          │                  │
│  │ - Process list                        │                  │
│  │ - Docker container stats (optional)   │                  │
│  └───────────────────────────────────────┘                  │
│        Machine 1, Machine 2, Machine 3, ...                 │
└─────────────────────────────────────────────────────────────┘
```
## Implementation Phases
### Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)
**Goal:** Prove the gRPC streaming works end-to-end
**Deliverables:**
1. Metrics Collector Service (gRPC client)
- Collects CPU, memory, disk on localhost
- Streams to aggregator every 5 seconds
2. Aggregator Service (gRPC server)
- Receives metric stream
- Stores current state in Redis
- Logs to console
3. Proto definitions for metric messages
4. Docker Compose setup
**Success Criteria:**
- Run collector, see metrics flowing to aggregator
- Redis contains current state
- Can query Redis manually for latest metrics
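The collector's core loop can be sketched as a generator that yields one sample per interval. This is a minimal sketch: `sample_metrics()` stands in for the real psutil calls, and the stub call in the comment assumes stubs generated from `metrics.proto`.

```python
import time
from typing import Iterator

def sample_metrics(machine_id: str) -> dict:
    # Stand-in for the real psutil calls, e.g. psutil.cpu_percent()
    # and psutil.virtual_memory().percent.
    return {"machine_id": machine_id,
            "timestamp": int(time.time()),
            "cpu_percent": 12.5,
            "memory_percent": 40.2}

def metric_stream(machine_id: str, interval_s: float,
                  max_samples: int) -> Iterator[dict]:
    """Yield one sample per interval; the real collector wraps each dict
    in a protobuf Metric and feeds the iterator to the gRPC stub."""
    for _ in range(max_samples):
        yield sample_metrics(machine_id)
        time.sleep(interval_s)

# In the real service (hypothetical stub, generated from metrics.proto):
#   stub.StreamMetrics(metric_stream("dev-box-1", 5.0, max_samples))
samples = list(metric_stream("dev-box-1", 0.0, 3))
```

Keeping sampling behind a plain generator makes the MVP testable without gRPC running at all.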
### Phase 2: Web Dashboard (1 week)
**Goal:** Make it visible and useful
**Deliverables:**
1. Web Gateway Service (FastAPI)
- WebSocket endpoint for streaming
- REST endpoints for current/historical data
2. Dashboard UI
- Real-time CPU/Memory graphs per machine
- Current state table
- Simple, clean design
3. WebSocket bridge (Gateway ↔ Aggregator)
4. TimescaleDB integration
- Store historical metrics
- Query endpoints for time ranges
**Success Criteria:**
- Open dashboard, see live graphs updating
- Graphs show last hour of data
- Multiple machines displayed separately
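The gateway's WebSocket bridge boils down to a fan-out hub: each browser connection gets its own queue, and every aggregator update is pushed to all of them. A minimal asyncio sketch (in the real service each queue would drain into a FastAPI `websocket.send_json(...)` loop; the class and method names here are illustrative):

```python
import asyncio

class Broadcaster:
    """One queue per connected browser; publish() fans each update out."""
    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    async def publish(self, message: dict) -> None:
        # Per-subscriber queues decouple a slow browser from the rest.
        for q in self._subscribers:
            await q.put(message)

async def demo() -> tuple[dict, dict]:
    hub = Broadcaster()
    a, b = hub.subscribe(), hub.subscribe()
    await hub.publish({"machine_id": "dev-box-1", "cpu_percent": 12.5})
    return await a.get(), await b.get()

got_a, got_b = asyncio.run(demo())
```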
### Phase 3: Alerts & Intelligence (1 week)
**Goal:** Add decision-making layer (interview focus)
**Deliverables:**
1. Alert Service
- Subscribes to event stream
- Evaluates threshold rules
- Triggers notifications
2. Configuration Service (gRPC)
- Dynamic threshold management
- Alert rule CRUD
- Stored in PostgreSQL
3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)
4. Enhanced dashboard
- Alert indicators
- Alert history
- Threshold configuration UI
**Success Criteria:**
- Set CPU threshold at 80%
- Generate load (stress-ng)
- See alert trigger in dashboard
- Alert logged to database
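The heart of the Alert Service is a pure rule check, which is easy to unit-test independently of the event stream. A sketch using the `machine_id`/`type`/`value` field names from the Metric message (the `AlertRule` shape and severity field are assumptions, not yet in any schema):

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    metric_type: str               # e.g. "CPU_PERCENT"
    threshold: float               # e.g. 80.0
    machine_id: str | None = None  # None = rule applies to every machine

def evaluate(rule: AlertRule, metric: dict) -> dict | None:
    """Return an alert event if the metric breaches the rule, else None."""
    if rule.machine_id is not None and rule.machine_id != metric["machine_id"]:
        return None
    if metric["type"] != rule.metric_type or metric["value"] <= rule.threshold:
        return None
    return {"machine_id": metric["machine_id"],
            "metric_type": metric["type"],
            "value": metric["value"],
            "threshold": rule.threshold}

rule = AlertRule(metric_type="CPU_PERCENT", threshold=80.0)
alert = evaluate(rule, {"machine_id": "dev-box-1",
                        "type": "CPU_PERCENT", "value": 93.0})
```

The same shape maps to the interview story: swap "CPU_PERCENT over 80" for "transaction amount over limit" and nothing else changes.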
### Phase 4: Interview Polish (Final week)
**Goal:** Demo-ready, production patterns visible
**Deliverables:**
1. Observability
- OpenTelemetry tracing (optional)
- Structured logging
- Health check endpoints
2. "Synthetic Transactions"
- Simulate business operations through system
- Track end-to-end latency
- Maps directly to payment processing demo
3. Documentation
- Architecture diagram
- Service interaction flows
- Deployment guide
4. Demo script
- Story to walk through
- Key talking points
- Domain mapping explanations
**Success Criteria:**
- Can deploy entire stack with one command
- Can explain every service's role
- Can map architecture to payment processing
- Demo runs smoothly without hiccups
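The "synthetic transactions" idea can be sketched as a marker payload pushed through the pipeline with per-stage timing. The stage functions below are stand-ins for the real collector → aggregator → gateway hops:

```python
import time
from typing import Callable

def run_synthetic_transaction(
    stages: list[tuple[str, Callable[[dict], dict]]],
) -> tuple[dict, dict[str, float]]:
    """Push a marker payload through each pipeline stage and record
    per-stage latency in seconds."""
    payload: dict = {"id": "synthetic-1"}
    timings: dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings

payload, timings = run_synthetic_transaction([
    ("ingest", lambda p: {**p, "ingested": True}),
    ("aggregate", lambda p: {**p, "aggregated": True}),
])
```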
## Key Technical Patterns to Demonstrate
### 1. gRPC Streaming Patterns
**Server-Side Streaming:**
```protobuf
// One request in, a stream of responses out: the client asks for a
// machine's metrics and the server streams them back continuously.
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
```
**Bidirectional Streaming:**
```protobuf
// Two-way streaming, e.g. pushing config commands to a running
// collector while it streams responses back.
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
```
### 2. Service Communication Patterns
- **Synchronous (gRPC):** Query current state, configuration
- **Asynchronous (Events):** Metric updates, alerts, audit logs
- **Streaming (gRPC + WebSocket):** Real-time data flow
### 3. Data Storage Patterns
- **Hot data (Redis):** Current state, recent metrics (last 5 minutes)
- **Warm data (TimescaleDB):** Historical metrics (last 30 days)
- **Cold data (Optional):** Archive to S3-compatible storage
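The hot/warm split can be prototyped in memory before wiring up Redis — the point is a per-machine window that evicts anything older than a cutoff. The class name and API below are illustrative:

```python
from collections import deque

class HotWindow:
    """In-memory stand-in for the Redis hot tier: keep only the last
    window_s seconds of points per machine; anything older belongs to
    the warm tier (TimescaleDB)."""
    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._points: dict[str, deque] = {}

    def add(self, machine_id: str, timestamp: float, value: float) -> None:
        dq = self._points.setdefault(machine_id, deque())
        dq.append((timestamp, value))
        # Evict points that fell out of the hot window.
        cutoff = timestamp - self.window_s
        while dq and dq[0][0] < cutoff:
            dq.popleft()

    def points(self, machine_id: str) -> list[tuple[float, float]]:
        return list(self._points.get(machine_id, ()))

hot = HotWindow(window_s=300)      # 5-minute hot window
hot.add("dev-box-1", 0.0, 10.0)    # will be evicted by the next add
hot.add("dev-box-1", 600.0, 20.0)  # only survivor
```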
### 4. Error Handling & Resilience
- gRPC retry logic with exponential backoff
- Circuit breaker pattern for service calls
- Graceful degradation (continue if one collector fails)
- Dead letter queue for failed events
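The retry delays are worth getting right: full jitter (a random delay in [0, capped delay]) avoids thundering-herd reconnects when many collectors drop at once. A sketch of the schedule, with illustrative defaults:

```python
import random

def backoff_schedule(attempts: int, base_s: float = 0.5,
                     cap_s: float = 30.0, jitter: bool = False) -> list[float]:
    """Exponential backoff delays, capped at cap_s; with jitter=True each
    delay is drawn uniformly from [0, capped delay] ("full jitter")."""
    delays = []
    for n in range(attempts):
        delay = min(cap_s, base_s * (2 ** n))
        if jitter:
            delay = random.uniform(0.0, delay)
        delays.append(delay)
    return delays

# A gRPC reconnect loop would sleep delays[n] before attempt n + 1.
```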
## Proto Definitions (Starting Point)
```protobuf
syntax = "proto3";

package monitoring;

service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
```
## Project Structure
```
system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md
```
## Interview Talking Points
### Domain Mapping to Payments
**What you say:**
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
- "Each machine streaming metrics is like a payment processor streaming transactions"
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
- "The event stream for audit trails maps directly to payment audit logs"
### Technical Decisions to Highlight
**gRPC vs REST:**
- "I use gRPC between services for efficiency and strong typing"
- "FastAPI gateway exposes REST/WebSocket for browser clients"
- "This pattern is common - internal gRPC, external REST"
**Streaming vs Polling:**
- "Server-side streaming reduces network overhead"
- "Bidirectional streaming allows dynamic configuration updates"
- "WebSocket to browser maintains single connection"
**State Management:**
- "Redis for hot data - current state, needs fast access"
- "TimescaleDB for historical analysis - optimized for time-series"
- "This tiered storage approach scales to payment transaction volumes"
**Resilience:**
- "Each collector is independent - one failing doesn't affect others"
- "Circuit breaker prevents cascade failures"
- "Event stream decouples alert processing from metric ingestion"
### What NOT to Say
- Don't call it a "toy project" or "learning exercise"
- Don't apologize for running locally vs AWS
- Don't over-explain obvious things
- Don't claim it's production-ready when it's not
### What TO Say
- "I built this to solve a real problem I have"
- "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
- "I focused on the architectural patterns since those transfer directly"
- "I'd keep developing this - it's genuinely useful"
## Development Guidelines
### Code Quality Standards
- Type hints throughout (Python 3.11+ syntax)
- Async/await patterns consistently
- Structured logging (JSON format)
- Error handling at all boundaries
- Unit tests for business logic
- Integration tests for service interactions
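Structured logging needs nothing beyond the stdlib — a formatter that emits one JSON object per record is enough to start. Field names below are a suggestion:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

formatter = JsonFormatter()
record = logging.LogRecord(name="aggregator", level=logging.WARNING,
                           pathname="main.py", lineno=0,
                           msg="cpu %s%%", args=(93,), exc_info=None)
line = formatter.format(record)
```

Attach it via `handler.setFormatter(JsonFormatter())` on each service's root handler.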
### Docker Best Practices
- Multi-stage builds
- Non-root users
- Health checks
- Resource limits
- Volume mounts for development
### Configuration Management
- Environment variables for all config
- Sensible defaults
- Config validation on startup
- No secrets in code
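A sketch of the config rules above — environment variables, defaults, fail-fast validation — using only the stdlib. The variable names (`GRPC_PORT`, `REDIS_URL`, `METRIC_INTERVAL_S`) and defaults are assumptions for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class Settings:
    redis_url: str
    grpc_port: int
    metric_interval_s: float

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read config from environment variables with sensible defaults,
    failing fast so misconfiguration surfaces at startup."""
    port = int(env.get("GRPC_PORT", "50051"))
    if not 0 < port < 65536:
        raise ValueError(f"GRPC_PORT out of range: {port}")
    interval = float(env.get("METRIC_INTERVAL_S", "5"))
    if interval <= 0:
        raise ValueError(f"METRIC_INTERVAL_S must be positive: {interval}")
    return Settings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        grpc_port=port,
        metric_interval_s=interval,
    )

defaults = load_settings(env={})
```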
## AWS Mapping (For Interview Discussion)
**What you have → What it becomes:**
- PostgreSQL → Aurora PostgreSQL
- Redis → ElastiCache
- Docker Containers → ECS/Fargate or Lambda
- RabbitMQ/Redis Pub/Sub → SQS/SNS
- Docker Compose → CloudFormation/Terraform
- Local networking → VPC, Security Groups
**Key point:** "The architecture and patterns are production-ready, the infrastructure is local for development convenience"
## Common Pitfalls to Avoid
1. **Over-engineering Phase 1** - Resist adding features, just get streaming working
2. **Over-designing the UI** - Don't waste time on visual polish; htmx + basic CSS is fine
3. **Perfect metrics** - Mock data is OK early on, real psutil data comes later
4. **Complete coverage** - Better to have 3 services working perfectly than 10 half-done
5. **AWS deployment** - Local is fine, AWS costs money and adds complexity
## Success Metrics
**For Yourself:**
- [ ] You actually use the dashboard daily
- [ ] It catches a real issue before you notice
- [ ] It runs stably for 1+ week without intervention
**For Interview:**
- [ ] Can demo end-to-end in 5 minutes
- [ ] Can explain every service interaction
- [ ] Can map to payment domain fluently
- [ ] Shows understanding of production patterns
## Next Steps
1. Set up project structure
2. Define proto messages
3. Build Phase 1 MVP
4. Iterate based on what feels useful
5. Polish for demo when interview approaches
## Resources
- gRPC Python docs: https://grpc.io/docs/languages/python/
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
- TimescaleDB: https://docs.timescale.com/
- htmx: https://htmx.org/
## Questions to Ask Yourself During Development
- "Would I actually use this feature?"
- "How does this map to payments?"
- "Can I explain why I built it this way?"
- "What would break if X service failed?"
- "How would this scale to 1000 machines?"
---
## Final Note
This project works because it's:
1. **Real** - You'll use it
2. **Focused** - Shows specific patterns they care about
3. **Mappable** - Clear connection to their domain
4. **Yours** - Not a tutorial copy, demonstrates your thinking
Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.
Good luck! 🚀