# Distributed System Monitoring Platform

## Project Overview

A real-time system monitoring platform that streams metrics from multiple machines to a central hub with a live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.

**Primary Goal:** Interview demonstration project for a Python Microservices Engineer position

**Secondary Goal:** A genuinely useful tool for managing a multi-machine development environment

**Time Investment:** Phased approach: MVP in a weekend, polish over 2-3 weeks

## Why This Project

**Interview Alignment:**
- Demonstrates gRPC-based microservices architecture (core requirement)
- Shows streaming patterns (server-side and bidirectional)
- Real-time data aggregation and processing
- Alert/threshold monitoring (maps to fraud detection)
- Event-driven patterns
- Multiple data sources requiring normalization (maps to multiple payment processors)

**Personal Utility:**
- Monitors the existing multi-machine dev setup
- Dashboard stays open, provides real value
- Solves an actual pain point
- Will continue running post-interview

**Domain Mapping for Interview:**
- Machine = Payment Processor
- Metrics Stream = Transaction Stream
- Resource Thresholds = Fraud/Limit Detection
- Alert System = Risk Management
- Aggregation Service = Payment Processing Hub

## Technical Stack

### Core Technologies (Must Use - From JD)
- **Python 3.11+** - Primary language
- **FastAPI** - Web gateway, REST endpoints, WebSocket streaming
- **gRPC** - Inter-service communication, metric streaming
- **PostgreSQL/TimescaleDB** - Time-series historical data
- **Redis** - Current state, caching, alert rules
- **Docker Compose** - Orchestration

### Supporting Technologies
- **Protocol Buffers** - gRPC message definitions
- **WebSockets** - Browser streaming
- **htmx + Alpine.js** - Lightweight reactive frontend (avoids a heavy SPA)
- **Chart.js or Apache ECharts** - Real-time graphs
- **asyncio** - Async patterns throughout

### Development Tools
- **grpcio & grpcio-tools** - Python gRPC
- **psutil** - System metrics collection
- **uvicorn** - FastAPI server
- **pytest** - Testing
- **docker-compose** - Local orchestration

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                          Browser                            │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Dashboard (htmx + Alpine.js + WebSockets)           │   │
│  └──────────────────────────────────────────────────────┘   │
└───────────────────────────┬─────────────────────────────────┘
                            │ WebSocket
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    Web Gateway Service                      │
│                  (FastAPI + WebSockets)                     │
│  - Serves dashboard                                         │
│  - Streams updates to browser                               │
│  - REST API for historical queries                          │
└───────────────────────────┬─────────────────────────────────┘
                            │ gRPC
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                  Aggregator Service (gRPC)                  │
│  - Receives metric streams from all collectors              │
│  - Normalizes data from different sources                   │
│  - Enriches with machine context                            │
│  - Publishes to event stream                                │
│  - Checks alert thresholds                                  │
└──────┬───────────────┬───────────────────┬──────────────────┘
       │ Stores        │ Stores            │ Publishes events
       ▼               ▼                   ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ TimescaleDB  │ │    Redis     │ │   Event Stream   │
│ (historical) │ │  (current    │ │  (Redis Pub/Sub  │
└──────────────┘ │   state)     │ │   or RabbitMQ)   │
                 └──────────────┘ └────────┬─────────┘
                                           │ Subscribes
                                           ▼
                                  ┌──────────────────┐
                                  │  Alert Service   │
                                  │  - Processes     │
                                  │    events        │
                                  │  - Triggers      │
                                  │    actions       │
                                  └──────────────────┘

          ▲ gRPC Streaming (to Aggregator)
          │
┌─────────┴─────────────────────────────────┐
│  Multiple Collector Services              │
│  (one per machine)                        │
│  - Gathers system metrics (psutil)        │
│  - Streams to Aggregator via gRPC         │
│  - CPU, Memory, Disk, Network             │
│  - Process list                           │
│  - Docker container stats (optional)      │
└───────────────────────────────────────────┘
   Machine 1, Machine 2, Machine 3, ...
```

## Implementation Phases

### Phase 1: MVP - Core Streaming (Weekend, 8-12 hours)

**Goal:** Prove the gRPC streaming works end-to-end

**Deliverables:**
1. Metrics Collector Service (gRPC client)
   - Collects CPU, memory, disk on localhost
   - Streams to aggregator every 5 seconds

2. Aggregator Service (gRPC server)
   - Receives metric stream
   - Stores current state in Redis
   - Logs to console

3. Proto definitions for metric messages

4. Docker Compose setup

**Success Criteria:**
- Run collector, see metrics flowing to aggregator
- Redis contains current state
- Can query Redis manually for latest metrics

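A minimal sketch of the Phase 1 collection loop, assuming `psutil` is installed. The function names and the 5-second default mirror the deliverables above; the gRPC send is left as a placeholder, since the point of the phase is first getting real readings on a timer.

```python
# Sketch of the Phase 1 collector loop (names are illustrative, not final).
import time

import psutil


def collect_snapshot(machine_id: str) -> dict:
    """Gather the Phase 1 metric set (CPU, memory, disk) for one machine."""
    return {
        "machine_id": machine_id,
        "timestamp": int(time.time()),
        "cpu_percent": psutil.cpu_percent(interval=None),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }


def run_collector(machine_id: str, interval_seconds: int = 5) -> None:
    """Emit snapshots forever; in the real service each snapshot would be
    converted to a Metric proto and streamed to the aggregator over gRPC."""
    while True:
        snapshot = collect_snapshot(machine_id)
        print(snapshot)  # placeholder for the gRPC send
        time.sleep(interval_seconds)
```

Keeping `collect_snapshot` pure (no I/O beyond the psutil reads) makes it easy to unit-test before any gRPC wiring exists.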
### Phase 2: Web Dashboard (1 week)

**Goal:** Make it visible and useful

**Deliverables:**
1. Web Gateway Service (FastAPI)
   - WebSocket endpoint for streaming
   - REST endpoints for current/historical data

2. Dashboard UI
   - Real-time CPU/memory graphs per machine
   - Current state table
   - Simple, clean design

3. WebSocket bridge (Gateway ↔ Aggregator)

4. TimescaleDB integration
   - Store historical metrics
   - Query endpoints for time ranges

**Success Criteria:**
- Open dashboard, see live graphs updating
- Graphs show the last hour of data
- Multiple machines displayed separately

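The gateway's core job in this phase is fanning one upstream metric feed out to every connected dashboard. A dependency-free asyncio sketch of that fan-out (class and method names are illustrative; in the real gateway each queue would drain into a FastAPI WebSocket's `send_text`):

```python
# Hypothetical fan-out hub for the Web Gateway: one publisher, N subscribers.
import asyncio
import json


class Broadcaster:
    def __init__(self) -> None:
        self._queues: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        """Register a new dashboard connection; returns its private queue."""
        q: asyncio.Queue = asyncio.Queue()
        self._queues.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._queues.discard(q)

    async def publish(self, metric: dict) -> None:
        """Push one metric update to every connected dashboard."""
        payload = json.dumps(metric)
        for q in self._queues:
            await q.put(payload)


async def demo() -> list[str]:
    # Two "browsers" subscribe, one metric arrives, both receive it.
    hub = Broadcaster()
    a, b = hub.subscribe(), hub.subscribe()
    await hub.publish({"machine_id": "dev-box-1", "cpu": 42.0})
    return [await a.get(), await b.get()]
```

A per-connection queue (rather than writing to sockets directly from the publisher) keeps one slow browser from stalling updates to the others.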
### Phase 3: Alerts & Intelligence (1 week)

**Goal:** Add a decision-making layer (interview focus)

**Deliverables:**
1. Alert Service
   - Subscribes to the event stream
   - Evaluates threshold rules
   - Triggers notifications

2. Configuration Service (gRPC)
   - Dynamic threshold management
   - Alert rule CRUD
   - Stored in PostgreSQL

3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)

4. Enhanced dashboard
   - Alert indicators
   - Alert history
   - Threshold configuration UI

**Success Criteria:**
- Set a CPU threshold at 80%
- Generate load (stress-ng)
- See the alert trigger in the dashboard
- Alert logged to the database

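The heart of the Alert Service is a small, testable rule check. A sketch assuming rules are simple per-metric thresholds (the field names are illustrative, not a final schema):

```python
# Minimal threshold-rule evaluation for the Alert Service (illustrative schema).
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    machine_id: str   # "*" matches any machine
    metric_type: str  # e.g. "CPU_PERCENT"
    threshold: float  # fire when the observed value exceeds this


def evaluate(rule: AlertRule, event: dict) -> bool:
    """Return True if this metric event should trigger the rule."""
    if rule.machine_id not in ("*", event["machine_id"]):
        return False
    return event["type"] == rule.metric_type and event["value"] > rule.threshold
```

Keeping evaluation a pure function of (rule, event) means the Phase 3 success criterion ("set CPU threshold at 80%, see alert fire") is unit-testable without the event stream running.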
### Phase 4: Interview Polish (Final week)

**Goal:** Demo-ready, production patterns visible

**Deliverables:**
1. Observability
   - OpenTelemetry tracing (optional)
   - Structured logging
   - Health check endpoints

2. "Synthetic transactions"
   - Simulate business operations through the system
   - Track end-to-end latency
   - Maps directly to the payment processing demo

3. Documentation
   - Architecture diagram
   - Service interaction flows
   - Deployment guide

4. Demo script
   - Story to walk through
   - Key talking points
   - Domain mapping explanations

**Success Criteria:**
- Can deploy the entire stack with one command
- Can explain every service's role
- Can map the architecture to payment processing
- Demo runs smoothly without hiccups

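For the structured-logging deliverable, the stdlib alone is enough to emit JSON lines; a minimal sketch (a library such as structlog is an equally valid choice, and the field set here is illustrative):

```python
# JSON-structured logging with only the stdlib logging module.
import json
import logging


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # One JSON object per line: easy to grep locally, easy to ship
        # to a log aggregator later without reformatting.
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })
```

Attach it per service with a `StreamHandler`, e.g. `handler.setFormatter(JsonFormatter())`, so every service in the compose stack logs in the same machine-readable shape.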
## Key Technical Patterns to Demonstrate

### 1. gRPC Streaming Patterns

**Server-Side Streaming:**
```protobuf
// Collector streams metrics to aggregator
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
```

**Bidirectional Streaming:**
```protobuf
// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
```

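In Python, the server side of a streaming RPC is just a method that yields messages. A dependency-free sketch of the `StreamMetrics` handler shape (in the real service this class would subclass the `metrics_pb2_grpc.MetricsServiceServicer` generated by grpcio-tools and yield `Metric` protos rather than the plain dicts used here):

```python
# Shape of a gRPC server-side streaming handler: a plain generator method.
class MetricsServicer:
    """Stand-in for the generated MetricsServiceServicer base class."""

    def StreamMetrics(self, request, context):
        # `request` carries machine_id and interval_seconds per the proto.
        for value in (12.5, 40.0, 87.5):  # stand-in for live psutil readings
            yield {
                "machine_id": request.machine_id,
                "type": "CPU_PERCENT",
                "value": value,
            }
            # a real handler would sleep request.interval_seconds between yields
```

Because the handler is an ordinary generator, it can be unit-tested by passing a stub request object and listing the output, with no gRPC server running.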
### 2. Service Communication Patterns

- **Synchronous (gRPC):** Query current state, configuration
- **Asynchronous (Events):** Metric updates, alerts, audit logs
- **Streaming (gRPC + WebSocket):** Real-time data flow

### 3. Data Storage Patterns

- **Hot data (Redis):** Current state, recent metrics (last 5 minutes)
- **Warm data (TimescaleDB):** Historical metrics (last 30 days)
- **Cold data (Optional):** Archive to S3-compatible storage

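A sketch of the hot-tier write, assuming a redis-py style client; the key layout is illustrative. Setting a TTL that matches the 5-minute hot window means a machine that stops reporting simply ages out of "current state" instead of showing stale numbers:

```python
# Hot-tier write path for the aggregator (key naming is illustrative).
import json

HOT_TTL_SECONDS = 300  # the "last 5 minutes" hot window from above


def store_current(r, metric: dict) -> None:
    """Write the latest value for one (machine, metric) pair to Redis.

    `r` is any client exposing redis-py's set(key, value, ex=...) signature.
    """
    key = f"current:{metric['machine_id']}:{metric['type']}"
    r.set(key, json.dumps(metric), ex=HOT_TTL_SECONDS)
```

Taking the client as a parameter keeps the function testable with a stub and lets the real service pass in an actual `redis.Redis` instance.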
### 4. Error Handling & Resilience

- gRPC retry logic with exponential backoff
- Circuit breaker pattern for service calls
- Graceful degradation (continue if one collector fails)
- Dead letter queue for failed events

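The retry bullet can be sketched as a small helper (gRPC can also do this declaratively via a retry policy in the channel's service config; this is the hand-rolled version, with illustrative attempt and delay values):

```python
# Retry with exponential backoff and a little jitter.
import random
import time


def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 0.5):
    """Call fn(); on exception, retry with doubling delay, re-raising at the end."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # 0.5s, 1s, 2s, ... plus jitter so collectors don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter matters in this architecture: without it, every collector that lost the aggregator would hammer it again at the same instant on reconnect.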
## Proto Definitions (Starting Point)

```protobuf
syntax = "proto3";

package monitoring;

service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message StateRequest {
  string machine_id = 1;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
```

## Project Structure

```
system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md
```

## Interview Talking Points

### Domain Mapping to Payments

**What you say:**
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
- "Each machine streaming metrics is like a payment processor streaming transactions"
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, and bank APIs"
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
- "The event stream for audit trails maps directly to payment audit logs"

### Technical Decisions to Highlight

**gRPC vs REST:**
- "I use gRPC between services for efficiency and strong typing"
- "The FastAPI gateway exposes REST/WebSocket for browser clients"
- "This pattern is common - internal gRPC, external REST"

**Streaming vs Polling:**
- "Server-side streaming reduces network overhead"
- "Bidirectional streaming allows dynamic configuration updates"
- "A WebSocket to the browser maintains a single connection"

**State Management:**
- "Redis for hot data - current state that needs fast access"
- "TimescaleDB for historical analysis - optimized for time-series"
- "This tiered storage approach scales to payment transaction volumes"

**Resilience:**
- "Each collector is independent - one failing doesn't affect the others"
- "The circuit breaker prevents cascade failures"
- "The event stream decouples alert processing from metric ingestion"

### What NOT to Say

- Don't call it a "toy project" or "learning exercise"
- Don't apologize for running locally vs AWS
- Don't over-explain obvious things
- Don't claim it's production-ready when it's not

### What TO Say

- "I built this to solve a real problem I have"
- "Locally it uses PostgreSQL/Redis; in production these become Aurora/ElastiCache"
- "I focused on the architectural patterns since those transfer directly"
- "I'd keep developing this - it's genuinely useful"

## Development Guidelines

### Code Quality Standards
- Type hints throughout (Python 3.11+ syntax)
- Async/await patterns used consistently
- Structured logging (JSON format)
- Error handling at all boundaries
- Unit tests for business logic
- Integration tests for service interactions

### Docker Best Practices
- Multi-stage builds
- Non-root users
- Health checks
- Resource limits
- Volume mounts for development

### Configuration Management
- Environment variables for all config
- Sensible defaults
- Config validation on startup
- No secrets in code

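A sketch of env-var configuration with startup validation (the variable names and defaults are illustrative); at startup each service would call `load_settings(dict(os.environ))` and fail fast on bad values rather than limping along misconfigured:

```python
# Environment-driven config with validation on startup.
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    redis_url: str
    metrics_interval: int


def load_settings(env: dict[str, str]) -> Settings:
    """Build validated settings from an environment mapping."""
    interval = int(env.get("METRICS_INTERVAL", "5"))  # sensible default
    if interval <= 0:
        raise ValueError("METRICS_INTERVAL must be positive")
    return Settings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        metrics_interval=interval,
    )
```

Taking the environment as a plain dict (instead of reading `os.environ` inside the function) keeps validation unit-testable.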
## AWS Mapping (For Interview Discussion)

**What you have → What it becomes:**
- PostgreSQL → Aurora PostgreSQL
- Redis → ElastiCache
- Docker containers → ECS/Fargate or Lambda
- RabbitMQ/Redis Pub/Sub → SQS/SNS
- Docker Compose → CloudFormation/Terraform
- Local networking → VPC, Security Groups

**Key point:** "The architecture and patterns are production-ready; the infrastructure is local for development convenience"

## Common Pitfalls to Avoid

1. **Over-engineering Phase 1** - Resist adding features; just get streaming working
2. **Ugly UI** - Don't waste time on design; htmx + basic CSS is fine
3. **Perfect metrics** - Mock data is OK early on; real psutil data comes later
4. **Complete coverage** - Better to have 3 services working perfectly than 10 half-done
5. **AWS deployment** - Local is fine; AWS costs money and adds complexity

## Success Metrics

**For Yourself:**
- [ ] Actually use the dashboard daily
- [ ] It catches a real issue before you notice it
- [ ] Runs stably for 1+ week without intervention

**For Interview:**
- [ ] Can demo end-to-end in 5 minutes
- [ ] Can explain every service interaction
- [ ] Can map to the payment domain fluently
- [ ] Shows understanding of production patterns

## Next Steps

1. Set up the project structure
2. Define proto messages
3. Build the Phase 1 MVP
4. Iterate based on what feels useful
5. Polish for the demo as the interview approaches

## Resources

- gRPC Python docs: https://grpc.io/docs/languages/python/
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
- TimescaleDB: https://docs.timescale.com/
- htmx: https://htmx.org/

## Questions to Ask Yourself During Development

- "Would I actually use this feature?"
- "How does this map to payments?"
- "Can I explain why I built it this way?"
- "What would break if X service failed?"
- "How would this scale to 1000 machines?"

---

## Final Note

This project works because it's:
1. **Real** - You'll use it
2. **Focused** - Shows the specific patterns they care about
3. **Mappable** - Clear connection to their domain
4. **Yours** - Not a tutorial copy; it demonstrates your thinking

Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.

Good luck! 🚀