new three layer deployment
558
CLAUDE.md
@@ -4,489 +4,129 @@
|
||||
|
||||
A real-time system monitoring platform that streams metrics from multiple machines to a central hub with live web dashboard. Built to demonstrate production microservices patterns (gRPC, FastAPI, streaming, event-driven architecture) while solving a real problem: monitoring development infrastructure across multiple machines.
|
||||
|
||||
**Primary Goal:** Interview demonstration project for Python Microservices Engineer position
|
||||
**Secondary Goal:** Actually useful tool for managing multi-machine development environment
|
||||
**Time Investment:** Phased approach - MVP in a weekend, polish over 2-3 weeks
|
||||
**Primary Goal:** Portfolio project demonstrating real-time streaming architecture
|
||||
**Secondary Goal:** Actually useful tool for monitoring multi-machine development environment
|
||||
**Status:** Working MVP, deployed at sysmonstm.mcrn.ar
|
||||
|
||||
## Why This Project
|
||||
## Deployment Modes
|
||||
|
||||
**Interview Alignment:**
|
||||
- Demonstrates gRPC-based microservices architecture (core requirement)
|
||||
- Shows streaming patterns (server-side and bidirectional)
|
||||
- Real-time data aggregation and processing
|
||||
- Alert/threshold monitoring (maps to fraud detection)
|
||||
- Event-driven patterns
|
||||
- Multiple data sources requiring normalization (maps to multiple payment processors)
|
||||
### Production (3-tier)
|
||||
|
||||
**Personal Utility:**
|
||||
- Monitors existing multi-machine dev setup
|
||||
- Dashboard stays open, provides real value
|
||||
- Solves actual pain point
|
||||
- Will continue running post-interview
|
||||
```
|
||||
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Collector  │────▶│     Hub     │────▶│    Edge     │
│ (each host) │     │   (local)   │     │    (AWS)    │
└─────────────┘     └─────────────┘     └─────────────┘
|
||||
```
|
||||
|
||||
**Domain Mapping for Interview:**
|
||||
- Machine = Payment Processor
|
||||
- Metrics Stream = Transaction Stream
|
||||
- Resource Thresholds = Fraud/Limit Detection
|
||||
- Alert System = Risk Management
|
||||
- Aggregation Service = Payment Processing Hub
|
||||
- **Collector** (`ctrl/collector/`) - Lightweight agent on each monitored machine
|
||||
- **Hub** (`ctrl/hub/`) - Local aggregator, receives from collectors, forwards to edge
|
||||
- **Edge** (`ctrl/edge/`) - Cloud dashboard, public-facing
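
The three tiers can be read as a relay: each collector emits a JSON-style payload, the hub remembers the latest payload per machine and passes it upstream to the edge. The following is a minimal in-process sketch of that flow. The `Hub` class and its `forward` callback are hypothetical names chosen for illustration; the deployed components communicate over WebSockets rather than direct function calls.

```python
from typing import Callable


class Hub:
    """Hypothetical stand-in for the hub tier: keeps latest state, relays upstream."""

    def __init__(self, forward: Callable[[dict], None]) -> None:
        self.latest: dict[str, dict] = {}  # most recent payload per machine_id
        self.forward = forward             # assumed callback that delivers to the edge

    def receive(self, payload: dict) -> None:
        # Overwrite the previous reading for this machine, then relay it.
        self.latest[payload["machine_id"]] = payload
        self.forward(payload)


edge_inbox: list[dict] = []  # stands in for the edge tier
hub = Hub(forward=edge_inbox.append)
hub.receive({"type": "metrics", "machine_id": "mcrn", "cpu": 12.5})
hub.receive({"type": "metrics", "machine_id": "nfrt", "cpu": 40.0})
hub.receive({"type": "metrics", "machine_id": "mcrn", "cpu": 15.0})
```

In this sketch the edge sees every payload, while the hub's `latest` map holds only one entry per machine.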
|
||||
|
||||
### Development (Full Stack)
|
||||
|
||||
```bash
|
||||
docker compose up # Uses ctrl/dev/docker-compose.yml
|
||||
```
|
||||
|
||||
- Full gRPC-based microservices architecture
|
||||
- Services: aggregator, gateway, collector, alerts
|
||||
- Storage: Redis (hot), TimescaleDB (historical)
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
sms/
├── services/           # gRPC-based microservices (dev stack)
│   ├── collector/      # gRPC client, streams to aggregator
│   ├── aggregator/     # gRPC server, stores in Redis/TimescaleDB
│   ├── gateway/        # FastAPI, bridges gRPC to WebSocket
│   └── alerts/         # Event subscriber for threshold alerts
│
├── ctrl/               # Deployment configurations
│   ├── collector/      # Lightweight WebSocket collector
│   ├── hub/            # Local aggregator
│   ├── edge/           # Cloud dashboard
│   └── dev/            # Full stack docker-compose
│
├── proto/              # Protocol Buffer definitions
├── shared/             # Shared Python modules
├── web/                # Dashboard templates and static files
├── infra/              # Terraform for AWS deployment
└── k8s/                # Kubernetes manifests
|
||||
```
|
||||
|
||||
## Current Setup
|
||||
|
||||
**Machines being monitored:**
|
||||
- `mcrn` - Primary workstation (runs hub + collector)
|
||||
- `nfrt` - Secondary machine (runs collector only)
|
||||
|
||||
**Topology:**
|
||||
```
|
||||
mcrn nfrt AWS
|
||||
├── hub ◄─────────────────── collector edge (sysmonstm.mcrn.ar)
|
||||
│ │ ▲
|
||||
│ └────────────────────────────────────────────────┘
|
||||
└── collector
|
||||
```
|
||||
|
||||
## Technical Stack
|
||||
|
||||
### Core Technologies (Must Use - From JD)
|
||||
### Core Technologies
|
||||
- **Python 3.11+** - Primary language
|
||||
- **FastAPI** - Web gateway, REST endpoints, WebSocket streaming
|
||||
- **gRPC** - Inter-service communication, metric streaming
|
||||
- **PostgreSQL/TimescaleDB** - Time-series historical data
|
||||
- **Redis** - Current state, caching, alert rules
|
||||
- **Docker Compose** - Orchestration
|
||||
|
||||
### Supporting Technologies
|
||||
- **Protocol Buffers** - gRPC message definitions
|
||||
- **WebSockets** - Browser streaming
|
||||
- **htmx + Alpine.js** - Lightweight reactive frontend (avoid heavy SPA)
|
||||
- **Chart.js or Apache ECharts** - Real-time graphs
|
||||
- **asyncio** - Async patterns throughout
|
||||
|
||||
### Development Tools
|
||||
- **grpcio & grpcio-tools** - Python gRPC
|
||||
- **gRPC** - Inter-service communication (dev stack)
|
||||
- **WebSockets** - Production deployment communication
|
||||
- **psutil** - System metrics collection
|
||||
- **uvicorn** - FastAPI server
|
||||
- **pytest** - Testing
|
||||
- **docker-compose** - Local orchestration
|
||||
|
||||
## Architecture
|
||||
### Storage (Dev Stack Only)
|
||||
- **PostgreSQL/TimescaleDB** - Time-series historical data
|
||||
- **Redis** - Current state, caching, event pub/sub
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Browser │
|
||||
│ ┌──────────────────────────────────────────────────────┐ │
|
||||
│ │ Dashboard (htmx + Alpine.js + WebSockets) │ │
|
||||
│ └──────────────────────────────────────────────────────┘ │
|
||||
└────────────────────────┬────────────────────────────────────┘
|
||||
│ WebSocket
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Web Gateway Service │
|
||||
│ (FastAPI + WebSockets) │
|
||||
│ - Serves dashboard │
|
||||
│ - Streams updates to browser │
|
||||
│ - REST API for historical queries │
|
||||
└────────────────────────┬────────────────────────────────────┘
|
||||
│ gRPC
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ Aggregator Service (gRPC) │
|
||||
│ - Receives metric streams from all collectors │
|
||||
│ - Normalizes data from different sources │
|
||||
│ - Enriches with machine context │
|
||||
│ - Publishes to event stream │
|
||||
│ - Checks alert thresholds │
|
||||
└─────┬───────────────────────────────────┬───────────────────┘
|
||||
│ │
|
||||
│ Stores │ Publishes events
|
||||
▼ ▼
|
||||
┌──────────────┐ ┌────────────────┐
|
||||
│ TimescaleDB │ │ Event Stream │
|
||||
│ (historical)│ │ (Redis Pub/Sub│
|
||||
└──────────────┘ │ or RabbitMQ) │
|
||||
└────────┬───────┘
|
||||
┌──────────────┐ │
|
||||
│ Redis │ │ Subscribes
|
||||
│ (current │◄───────────────────────────┘
|
||||
│ state) │ │
|
||||
└──────────────┘ ▼
|
||||
┌────────────────┐
|
||||
▲ │ Alert Service │
|
||||
│ │ - Processes │
|
||||
│ │ events │
|
||||
│ gRPC Streaming │ - Triggers │
|
||||
│ │ actions │
|
||||
┌─────┴────────────────────────────┴────────────────┘
|
||||
│
|
||||
│ Multiple Collector Services (one per machine)
|
||||
│ ┌───────────────────────────────────────┐
|
||||
│ │ Metrics Collector (gRPC Client) │
|
||||
│ │ - Gathers system metrics (psutil) │
|
||||
│ │ - Streams to Aggregator via gRPC │
|
||||
│ │ - CPU, Memory, Disk, Network │
|
||||
│ │ - Process list │
|
||||
│ │ - Docker container stats (optional) │
|
||||
│ └───────────────────────────────────────┘
|
||||
│
|
||||
└──► Machine 1, Machine 2, Machine 3, ...
|
||||
```
|
||||
### Infrastructure
|
||||
- **Docker Compose** - Orchestration
|
||||
- **Woodpecker CI** - Build pipeline at ppl/pipelines/sysmonstm/
|
||||
- **Registry** - registry.mcrn.ar/sysmonstm/
|
||||
|
||||
## Implementation Phases
|
||||
## Images
|
||||
|
||||
### Phase 1: MVP - Core Streaming (Weekend - 8-12 hours)
|
||||
|
||||
**Goal:** Prove the gRPC streaming works end-to-end
|
||||
|
||||
**Deliverables:**
|
||||
1. Metrics Collector Service (gRPC client)
|
||||
- Collects CPU, memory, disk on localhost
|
||||
- Streams to aggregator every 5 seconds
|
||||
|
||||
2. Aggregator Service (gRPC server)
|
||||
- Receives metric stream
|
||||
- Stores current state in Redis
|
||||
- Logs to console
|
||||
|
||||
3. Proto definitions for metric messages
|
||||
|
||||
4. Docker Compose setup
|
||||
|
||||
**Success Criteria:**
|
||||
- Run collector, see metrics flowing to aggregator
|
||||
- Redis contains current state
|
||||
- Can query Redis manually for latest metrics
|
||||
|
||||
### Phase 2: Web Dashboard (1 week)
|
||||
|
||||
**Goal:** Make it visible and useful
|
||||
|
||||
**Deliverables:**
|
||||
1. Web Gateway Service (FastAPI)
|
||||
- WebSocket endpoint for streaming
|
||||
- REST endpoints for current/historical data
|
||||
|
||||
2. Dashboard UI
|
||||
- Real-time CPU/Memory graphs per machine
|
||||
- Current state table
|
||||
- Simple, clean design
|
||||
|
||||
3. WebSocket bridge (Gateway ↔ Aggregator)
|
||||
|
||||
4. TimescaleDB integration
|
||||
- Store historical metrics
|
||||
- Query endpoints for time ranges
|
||||
|
||||
**Success Criteria:**
|
||||
- Open dashboard, see live graphs updating
|
||||
- Graphs show last hour of data
|
||||
- Multiple machines displayed separately
|
||||
|
||||
### Phase 3: Alerts & Intelligence (1 week)
|
||||
|
||||
**Goal:** Add decision-making layer (interview focus)
|
||||
|
||||
**Deliverables:**
|
||||
1. Alert Service
|
||||
- Subscribes to event stream
|
||||
- Evaluates threshold rules
|
||||
- Triggers notifications
|
||||
|
||||
2. Configuration Service (gRPC)
|
||||
- Dynamic threshold management
|
||||
- Alert rule CRUD
|
||||
- Stored in PostgreSQL
|
||||
|
||||
3. Event Stream implementation (Redis Pub/Sub or RabbitMQ)
|
||||
|
||||
4. Enhanced dashboard
|
||||
- Alert indicators
|
||||
- Alert history
|
||||
- Threshold configuration UI
|
||||
|
||||
**Success Criteria:**
|
||||
- Set CPU threshold at 80%
|
||||
- Generate load (stress-ng)
|
||||
- See alert trigger in dashboard
|
||||
- Alert logged to database
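
The success criteria above amount to a threshold check on each incoming metrics payload. A minimal sketch of such a check follows; `ThresholdRule` and `evaluate` are hypothetical names, and the strict "greater than" comparison is an assumption rather than a documented rule of the alert service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    metric: str   # key in the metrics payload, e.g. "cpu"
    limit: float  # alert when the value is strictly above this


def evaluate(rules: list[ThresholdRule], metrics: dict) -> list[str]:
    """Return one alert message per rule that the payload violates."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # Missing metrics are skipped instead of being treated as violations.
        if value is not None and value > rule.limit:
            alerts.append(
                f"{metrics['machine_id']}: {rule.metric}={value} exceeds {rule.limit}"
            )
    return alerts


rules = [ThresholdRule(metric="cpu", limit=80.0)]
fired = evaluate(rules, {"machine_id": "mcrn", "cpu": 93.0})
```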
|
||||
|
||||
### Phase 4: Interview Polish (Final week)
|
||||
|
||||
**Goal:** Demo-ready, production patterns visible
|
||||
|
||||
**Deliverables:**
|
||||
1. Observability
|
||||
- OpenTelemetry tracing (optional)
|
||||
- Structured logging
|
||||
- Health check endpoints
|
||||
|
||||
2. "Synthetic Transactions"
|
||||
- Simulate business operations through system
|
||||
- Track end-to-end latency
|
||||
- Maps directly to payment processing demo
|
||||
|
||||
3. Documentation
|
||||
- Architecture diagram
|
||||
- Service interaction flows
|
||||
- Deployment guide
|
||||
|
||||
4. Demo script
|
||||
- Story to walk through
|
||||
- Key talking points
|
||||
- Domain mapping explanations
|
||||
|
||||
**Success Criteria:**
|
||||
- Can deploy entire stack with one command
|
||||
- Can explain every service's role
|
||||
- Can map architecture to payment processing
|
||||
- Demo runs smoothly without hiccups
|
||||
|
||||
## Key Technical Patterns to Demonstrate
|
||||
|
||||
### 1. gRPC Streaming Patterns
|
||||
|
||||
**Server-Side Streaming:**
|
||||
```protobuf
// Server streaming: the caller sends one MetricsRequest and receives a stream of Metric messages
service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
}
|
||||
```
|
||||
|
||||
**Bidirectional Streaming:**
|
||||
```protobuf
// Two-way communication between services
service ControlService {
  rpc ManageStream(stream Command) returns (stream Response) {}
}
|
||||
```
|
||||
|
||||
### 2. Service Communication Patterns
|
||||
|
||||
- **Synchronous (gRPC):** Query current state, configuration
|
||||
- **Asynchronous (Events):** Metric updates, alerts, audit logs
|
||||
- **Streaming (gRPC + WebSocket):** Real-time data flow
|
||||
|
||||
### 3. Data Storage Patterns
|
||||
|
||||
- **Hot data (Redis):** Current state, recent metrics (last 5 minutes)
|
||||
- **Warm data (TimescaleDB):** Historical metrics (last 30 days)
|
||||
- **Cold data (Optional):** Archive to S3-compatible storage
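
The tiering above reduces to a lookup by data age. A small sketch, assuming the 5-minute and 30-day windows listed here; the function name `storage_tier` and the returned labels are illustrative only.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(minutes=5)   # recent metrics kept in Redis
WARM_WINDOW = timedelta(days=30)    # historical metrics kept in TimescaleDB


def storage_tier(age: timedelta) -> str:
    """Pick the store that should hold a data point of the given age."""
    if age <= HOT_WINDOW:
        return "redis"
    if age <= WARM_WINDOW:
        return "timescaledb"
    return "archive"  # optional cold tier, e.g. S3-compatible storage
```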
|
||||
|
||||
### 4. Error Handling & Resilience
|
||||
|
||||
- gRPC retry logic with exponential backoff
|
||||
- Circuit breaker pattern for service calls
|
||||
- Graceful degradation (continue if one collector fails)
|
||||
- Dead letter queue for failed events
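
Retry with exponential backoff can be sketched as a generator of wait times. The parameter names and defaults below (`base`, `factor`, `cap`, `attempts`, `jitter`) are assumptions for illustration, not values used by the services.

```python
import random
from typing import Iterator


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 6,
    jitter: float = 0.0,
) -> Iterator[float]:
    """Yield one delay per retry attempt, growing geometrically up to `cap`."""
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        # Optional jitter spreads reconnects so clients do not retry in lockstep.
        yield delay * (1 + random.uniform(-jitter, jitter))


delays = list(backoff_delays())
```

A caller would sleep for each yielded delay between attempts and stop iterating once a call succeeds.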
|
||||
|
||||
## Proto Definitions (Starting Point)
|
||||
|
||||
```protobuf
|
||||
syntax = "proto3";

package monitoring;

service MetricsService {
  rpc StreamMetrics(MetricsRequest) returns (stream Metric) {}
  rpc GetCurrentState(StateRequest) returns (MachineState) {}
}

message MetricsRequest {
  string machine_id = 1;
  int32 interval_seconds = 2;
}

message Metric {
  string machine_id = 1;
  int64 timestamp = 2;
  MetricType type = 3;
  double value = 4;
  map<string, string> labels = 5;
}

enum MetricType {
  CPU_PERCENT = 0;
  MEMORY_PERCENT = 1;
  MEMORY_USED_GB = 2;
  DISK_PERCENT = 3;
  NETWORK_SENT_MBPS = 4;
  NETWORK_RECV_MBPS = 5;
}

message MachineState {
  string machine_id = 1;
  int64 last_seen = 2;
  repeated Metric current_metrics = 3;
  HealthStatus health = 4;
}

enum HealthStatus {
  HEALTHY = 0;
  WARNING = 1;
  CRITICAL = 2;
  UNKNOWN = 3;
}
|
||||
```
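
For unit tests that should not depend on generated stubs, the `Metric` message and `MetricType` enum can be mirrored in plain Python. This is a sketch only: with `grpcio-tools` the real classes would be generated from the `.proto` file, and the dataclass and enum below are hand-written stand-ins that copy the field names and numeric values from the definition above.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class MetricType(IntEnum):
    """Hand-written mirror of the proto enum; values match the .proto numbers."""
    CPU_PERCENT = 0
    MEMORY_PERCENT = 1
    MEMORY_USED_GB = 2
    DISK_PERCENT = 3
    NETWORK_SENT_MBPS = 4
    NETWORK_RECV_MBPS = 5


@dataclass
class Metric:
    """Hand-written mirror of the proto message, not the generated class."""
    machine_id: str
    timestamp: int
    type: MetricType
    value: float
    labels: dict[str, str] = field(default_factory=dict)


sample = Metric("mcrn", 1700000000, MetricType.CPU_PERCENT, 12.5)
```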
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
system-monitor/
├── docker-compose.yml
├── proto/
│   └── metrics.proto
├── services/
│   ├── collector/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── metrics.py
│   ├── aggregator/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── storage.py
│   ├── gateway/
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── main.py
│   │   └── websocket.py
│   └── alerts/
│       ├── Dockerfile
│       ├── requirements.txt
│       ├── main.py
│       └── rules.py
├── web/
│   ├── static/
│   │   ├── css/
│   │   └── js/
│   └── templates/
│       └── dashboard.html
└── README.md
|
||||
```
|
||||
|
||||
## Interview Talking Points
|
||||
|
||||
### Domain Mapping to Payments
|
||||
|
||||
**What you say:**
|
||||
- "I built this to monitor my dev machines, but the architecture directly maps to payment processing"
|
||||
- "Each machine streaming metrics is like a payment processor streaming transactions"
|
||||
- "The aggregator normalizes data from different sources - same as aggregating from Stripe, PayPal, bank APIs"
|
||||
- "Alert thresholds on resource usage are structurally identical to fraud detection thresholds"
|
||||
- "The event stream for audit trails maps directly to payment audit logs"
|
||||
|
||||
### Technical Decisions to Highlight
|
||||
|
||||
**gRPC vs REST:**
|
||||
- "I use gRPC between services for efficiency and strong typing"
|
||||
- "FastAPI gateway exposes REST/WebSocket for browser clients"
|
||||
- "This pattern is common - internal gRPC, external REST"
|
||||
|
||||
**Streaming vs Polling:**
|
||||
- "Server-side streaming reduces network overhead"
|
||||
- "Bidirectional streaming allows dynamic configuration updates"
|
||||
- "WebSocket to browser maintains single connection"
|
||||
|
||||
**State Management:**
|
||||
- "Redis for hot data - current state, needs fast access"
|
||||
- "TimescaleDB for historical analysis - optimized for time-series"
|
||||
- "This tiered storage approach scales to payment transaction volumes"
|
||||
|
||||
**Resilience:**
|
||||
- "Each collector is independent - one failing doesn't affect others"
|
||||
- "Circuit breaker prevents cascade failures"
|
||||
- "Event stream decouples alert processing from metric ingestion"
|
||||
|
||||
### What NOT to Say
|
||||
|
||||
- Don't call it a "toy project" or "learning exercise"
|
||||
- Don't apologize for running locally vs AWS
|
||||
- Don't over-explain obvious things
|
||||
- Don't claim it's production-ready when it's not
|
||||
|
||||
### What TO Say
|
||||
|
||||
- "I built this to solve a real problem I have"
|
||||
- "Locally it uses PostgreSQL/Redis, in production these become Aurora/ElastiCache"
|
||||
- "I focused on the architectural patterns since those transfer directly"
|
||||
- "I'd keep developing this - it's genuinely useful"
|
||||
| Image | Purpose |
|
||||
|-------|---------|
|
||||
| `collector` | Lightweight WebSocket collector for production |
|
||||
| `hub` | Local aggregator for production |
|
||||
| `edge` | Cloud dashboard for production |
|
||||
| `aggregator` | gRPC aggregator (dev stack) |
|
||||
| `gateway` | FastAPI gateway (dev stack) |
|
||||
| `collector-grpc` | gRPC collector (dev stack) |
|
||||
| `alerts` | Alert service (dev stack) |
|
||||
|
||||
## Development Guidelines
|
||||
|
||||
### Code Quality Standards
|
||||
### Code Quality
|
||||
- Type hints throughout (Python 3.11+ syntax)
|
||||
- Async/await patterns consistently
|
||||
- Structured logging (JSON format)
|
||||
- Error handling at all boundaries
|
||||
- Unit tests for business logic
|
||||
- Integration tests for service interactions
|
||||
- Logging (not print statements)
|
||||
- Error handling at boundaries
|
||||
|
||||
### Docker Best Practices
|
||||
- Multi-stage builds
|
||||
- Non-root users
|
||||
- Health checks
|
||||
- Resource limits
|
||||
- Volume mounts for development
|
||||
### Docker
|
||||
- Multi-stage builds for smaller images
|
||||
- `--network host` for collectors (accurate network metrics)
|
||||
|
||||
### Configuration Management
|
||||
### Configuration
|
||||
- Environment variables for all config
|
||||
- Sensible defaults
|
||||
- Config validation on startup
|
||||
- No secrets in code
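
Validation on startup can be as small as one function that reads the environment, applies defaults, and fails fast on bad values. The sketch below uses the collector's `HUB_URL` and `INTERVAL` variables; the `CollectorConfig` dataclass, the `load_config` helper, and the specific checks are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CollectorConfig:
    hub_url: str
    interval: int


def load_config(env: Optional[Mapping[str, str]] = None) -> CollectorConfig:
    """Read settings from the environment and reject invalid values early."""
    env = os.environ if env is None else env
    # `or` also covers variables that are set but empty.
    hub_url = env.get("HUB_URL") or "ws://localhost:8080/ws"
    if not hub_url.startswith(("ws://", "wss://")):
        raise ValueError(f"HUB_URL must be a WebSocket URL, got {hub_url!r}")
    interval = int(env.get("INTERVAL") or "5")
    if interval <= 0:
        raise ValueError("INTERVAL must be a positive number of seconds")
    return CollectorConfig(hub_url=hub_url, interval=interval)


defaults = load_config({})
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.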
|
||||
|
||||
## AWS Mapping (For Interview Discussion)
|
||||
## Interview/Portfolio Talking Points
|
||||
|
||||
**What you have → What it becomes:**
|
||||
- PostgreSQL → Aurora PostgreSQL
|
||||
- Redis → ElastiCache
|
||||
- Docker Containers → ECS/Fargate or Lambda
|
||||
- RabbitMQ/Redis Pub/Sub → SQS/SNS
|
||||
- Docker Compose → CloudFormation/Terraform
|
||||
- Local networking → VPC, Security Groups
|
||||
### Architecture Decisions
|
||||
- "3-tier for production: collector → hub → edge"
|
||||
- "Hub allows local aggregation and buffering before forwarding to cloud"
|
||||
- "Edge terminology shows awareness of edge computing patterns"
|
||||
- "Full gRPC stack for development demonstrates microservices patterns"
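
The buffering point can be made concrete with a bounded queue: while the edge is unreachable the hub keeps the most recent payloads and drops the oldest once the buffer is full. This is a sketch under assumptions. `ForwardBuffer`, its `maxlen` default, and the drop-oldest policy are illustrative and not taken from the hub's code.

```python
from collections import deque


class ForwardBuffer:
    """Hypothetical bounded buffer for payloads waiting to be sent to the edge."""

    def __init__(self, maxlen: int = 1000) -> None:
        # A deque with maxlen silently discards the oldest item when full.
        self._pending: deque = deque(maxlen=maxlen)

    def push(self, payload: dict) -> None:
        self._pending.append(payload)

    def drain(self) -> list[dict]:
        """Return everything buffered so far, oldest first, and empty the buffer."""
        items = list(self._pending)
        self._pending.clear()
        return items


buf = ForwardBuffer(maxlen=2)
for n in (1, 2, 3):
    buf.push({"n": n})
flushed = buf.drain()
```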
|
||||
|
||||
**Key point:** "The architecture and patterns are production-ready, the infrastructure is local for development convenience"
|
||||
|
||||
## Common Pitfalls to Avoid
|
||||
|
||||
1. **Over-engineering Phase 1** - Resist adding features, just get streaming working
|
||||
2. **Ugly UI** - Don't waste time on design, htmx + basic CSS is fine
|
||||
3. **Perfect metrics** - Mock data is OK early on, real psutil data comes later
|
||||
4. **Complete coverage** - Better to have 3 services working perfectly than 10 half-done
|
||||
5. **AWS deployment** - Local is fine, AWS costs money and adds complexity
|
||||
|
||||
## Success Metrics
|
||||
|
||||
**For Yourself:**
|
||||
- [ ] Actually use the dashboard daily
|
||||
- [ ] Catches a real issue before you notice
|
||||
- [ ] Runs stable for 1+ week without intervention
|
||||
|
||||
**For Interview:**
|
||||
- [ ] Can demo end-to-end in 5 minutes
|
||||
- [ ] Can explain every service interaction
|
||||
- [ ] Can map to payment domain fluently
|
||||
- [ ] Shows understanding of production patterns
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Set up project structure
|
||||
2. Define proto messages
|
||||
3. Build Phase 1 MVP
|
||||
4. Iterate based on what feels useful
|
||||
5. Polish for demo when interview approaches
|
||||
|
||||
## Resources
|
||||
|
||||
- gRPC Python docs: https://grpc.io/docs/languages/python/
|
||||
- FastAPI WebSockets: https://fastapi.tiangolo.com/advanced/websockets/
|
||||
- TimescaleDB: https://docs.timescale.com/
|
||||
- htmx: https://htmx.org/
|
||||
|
||||
## Questions to Ask Yourself During Development
|
||||
|
||||
- "Would I actually use this feature?"
|
||||
- "How does this map to payments?"
|
||||
- "Can I explain why I built it this way?"
|
||||
- "What would break if X service failed?"
|
||||
- "How would this scale to 1000 machines?"
|
||||
|
||||
---
|
||||
|
||||
## Final Note
|
||||
|
||||
This project works because it's:
|
||||
1. **Real** - You'll use it
|
||||
2. **Focused** - Shows specific patterns they care about
|
||||
3. **Mappable** - Clear connection to their domain
|
||||
4. **Yours** - Not a tutorial copy, demonstrates your thinking
|
||||
|
||||
Build it in phases, use it daily, and by interview time you'll have natural stories about trade-offs, failures, and learnings. That authenticity is more valuable than perfect code.
|
||||
|
||||
Good luck! 🚀
|
||||
### Trade-offs
|
||||
- Production vs Dev: simplicity/cost vs full architecture demo
|
||||
- WebSocket vs gRPC: browser compatibility vs efficiency
|
||||
- In-memory vs persistent: operational simplicity vs durability
|
||||
|
||||
82
ctrl/README.md
Normal file
@@ -0,0 +1,82 @@
|
||||
# Deployment Configurations
|
||||
|
||||
This directory contains deployment configurations for sysmonstm.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Collector  │────▶│     Hub     │────▶│    Edge     │────▶│   Browser   │
│   (mcrn)    │     │   (local)   │     │    (AWS)    │     │             │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
┌─────────────┐            │
│  Collector  │────────────┘
│   (nfrt)    │
└─────────────┘
|
||||
```
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
ctrl/
|
||||
├── collector/ # Lightweight agent for each monitored machine
|
||||
├── hub/ # Local aggregator (receives from collectors, forwards to edge)
|
||||
├── edge/ # Cloud dashboard (public-facing, receives from hub)
|
||||
└── dev/ # Full gRPC stack for development
|
||||
```
|
||||
|
||||
## Production Deployment (3-tier)
|
||||
|
||||
### 1. Edge (AWS)
|
||||
Public-facing dashboard that receives metrics from hub.
|
||||
|
||||
```bash
|
||||
cd ctrl/edge
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
### 2. Hub (Local Server)
|
||||
Runs on your local network, receives from collectors, forwards to edge.
|
||||
|
||||
```bash
|
||||
cd ctrl/hub
|
||||
EDGE_URL=wss://sysmonstm.mcrn.ar/ws EDGE_API_KEY=xxx docker compose up -d
|
||||
```
|
||||
|
||||
### 3. Collectors (Each Machine)
|
||||
Run on each machine you want to monitor.
|
||||
|
||||
```bash
|
||||
docker run -d --name sysmonstm-collector --network host \
|
||||
-e HUB_URL=ws://hub-machine:8080/ws \
|
||||
-e MACHINE_ID=$(hostname) \
|
||||
-e API_KEY=xxx \
|
||||
registry.mcrn.ar/sysmonstm/collector:latest
|
||||
```
|
||||
|
||||
## Development (Full Stack)
|
||||
|
||||
For local development with the complete gRPC-based architecture:
|
||||
|
||||
```bash
|
||||
# From repo root
|
||||
docker compose up
|
||||
```
|
||||
|
||||
This runs: aggregator, gateway, collector, alerts, redis, timescaledb
|
||||
|
||||
## Environment Variables
|
||||
|
||||
### Collector
|
||||
- `HUB_URL` - WebSocket URL of hub (default: ws://localhost:8080/ws)
|
||||
- `MACHINE_ID` - Identifier for this machine (default: hostname)
|
||||
- `API_KEY` - Authentication key
|
||||
- `INTERVAL` - Seconds between collections (default: 5)
|
||||
|
||||
### Hub
|
||||
- `API_KEY` - Key required from collectors
|
||||
- `EDGE_URL` - WebSocket URL of edge (optional, for forwarding)
|
||||
- `EDGE_API_KEY` - Key for authenticating to edge
|
||||
|
||||
### Edge
|
||||
- `API_KEY` - Key required from hub
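
The collector appends its key to the WebSocket URL as a `key` query parameter, so the receiving side has to extract and compare it. The sketch below shows one way to do that with a constant-time comparison. `key_from_url` and `is_authorized` are hypothetical helpers, and treating an empty expected key as "authentication disabled" is an assumption based on the empty defaults, not confirmed behaviour of the hub or edge.

```python
import hmac
from urllib.parse import parse_qs, urlsplit


def key_from_url(url: str) -> str:
    """Extract the `key` query parameter, or return an empty string."""
    values = parse_qs(urlsplit(url).query).get("key", [])
    return values[0] if values else ""


def is_authorized(presented: str, expected: str) -> bool:
    """Compare keys in constant time to avoid leaking matches via timing."""
    if not expected:
        return True  # assumption: an empty API_KEY disables the check
    return hmac.compare_digest(presented.encode(), expected.encode())


presented = key_from_url("ws://hub-machine:8080/ws?key=abc")
```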
|
||||
16
ctrl/collector/Dockerfile
Normal file
@@ -0,0 +1,16 @@
|
||||
FROM python:3.11-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
RUN pip install --no-cache-dir psutil websockets
|
||||
|
||||
COPY collector.py .
|
||||
|
||||
# Default environment variables
|
||||
ENV HUB_URL=ws://localhost:8080/ws
|
||||
ENV MACHINE_ID=""
|
||||
ENV API_KEY=""
|
||||
ENV INTERVAL=5
|
||||
ENV LOG_LEVEL=INFO
|
||||
|
||||
CMD ["python", "collector.py"]
|
||||
136
ctrl/collector/collector.py
Normal file
@@ -0,0 +1,136 @@
|
||||
#!/usr/bin/env python3
"""Lightweight WebSocket metrics collector for sysmonstm standalone deployment."""

import asyncio
import json
import logging
import os
import socket
import time

import psutil

# Configuration from environment
HUB_URL = os.environ.get("HUB_URL", "ws://localhost:8080/ws")
# The Dockerfile sets MACHINE_ID="" by default, so fall back to the hostname on empty values too
MACHINE_ID = os.environ.get("MACHINE_ID") or socket.gethostname()
API_KEY = os.environ.get("API_KEY", "")
INTERVAL = int(os.environ.get("INTERVAL", "5"))
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

# Logging setup
logging.basicConfig(
    level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger("collector")


def collect_metrics() -> dict:
    """Collect system metrics using psutil."""
    metrics = {
        "type": "metrics",
        "machine_id": MACHINE_ID,
        "hostname": socket.gethostname(),
        "timestamp": time.time(),
    }

    # CPU
    try:
        metrics["cpu"] = psutil.cpu_percent(interval=None)
    except Exception:
        pass

    # Memory
    try:
        mem = psutil.virtual_memory()
        metrics["memory"] = mem.percent
        metrics["memory_used_gb"] = round(mem.used / (1024**3), 2)
        metrics["memory_total_gb"] = round(mem.total / (1024**3), 2)
    except Exception:
        pass

    # Disk
    try:
        disk = psutil.disk_usage("/")
        metrics["disk"] = disk.percent
        metrics["disk_used_gb"] = round(disk.used / (1024**3), 2)
        metrics["disk_total_gb"] = round(disk.total / (1024**3), 2)
    except Exception:
        pass

    # Load average (Unix only)
    try:
        load1, load5, load15 = psutil.getloadavg()
        metrics["load_1m"] = round(load1, 2)
        metrics["load_5m"] = round(load5, 2)
        metrics["load_15m"] = round(load15, 2)
    except (AttributeError, OSError):
        pass

    # Network connections count
    try:
        metrics["connections"] = len(psutil.net_connections(kind="inet"))
    except (psutil.AccessDenied, PermissionError):
        pass

    # Process count
    try:
        metrics["processes"] = len(psutil.pids())
    except Exception:
        pass

    return metrics


async def run_collector():
    """Main collector loop with auto-reconnect."""
    import websockets

    # Build URL with API key if provided
    url = HUB_URL
    if API_KEY:
        separator = "&" if "?" in url else "?"
        url = f"{url}{separator}key={API_KEY}"

    # Prime CPU percent (first call always returns 0)
    psutil.cpu_percent(interval=None)

    while True:
        try:
            log.info(f"Connecting to {HUB_URL}...")
            async with websockets.connect(url) as ws:
                log.info(
                    f"Connected. Sending metrics every {INTERVAL}s as '{MACHINE_ID}'"
                )

                while True:
                    metrics = collect_metrics()
                    await ws.send(json.dumps(metrics))
                    log.debug(
                        f"Sent: cpu={metrics.get('cpu', '?')}% mem={metrics.get('memory', '?')}% disk={metrics.get('disk', '?')}%"
                    )
                    await asyncio.sleep(INTERVAL)

        except asyncio.CancelledError:
            log.info("Collector stopped")
            break
        except Exception as e:
            log.warning(f"Connection error: {e}. Reconnecting in 5s...")
            await asyncio.sleep(5)


def main():
    log.info("sysmonstm collector starting")
    log.info(f"  Hub: {HUB_URL}")
    log.info(f"  Machine: {MACHINE_ID}")
    log.info(f"  Interval: {INTERVAL}s")

    try:
        asyncio.run(run_collector())
    except KeyboardInterrupt:
        log.info("Stopped")


if __name__ == "__main__":
    main()
|
||||
154
ctrl/dev/docker-compose.yml
Normal file
@@ -0,0 +1,154 @@
|
||||
version: "3.8"

# This file works both locally and on EC2 for demo purposes.
# For local dev with hot-reload, use: docker compose -f docker-compose.yml -f docker-compose.override.yml up

x-common-env: &common-env
  REDIS_URL: redis://redis:6379
  TIMESCALE_URL: postgresql://monitor:monitor@timescaledb:5432/monitor
  EVENTS_BACKEND: redis_pubsub
  LOG_LEVEL: ${LOG_LEVEL:-INFO}
  LOG_FORMAT: json

x-healthcheck-defaults: &healthcheck-defaults
  interval: 10s
  timeout: 5s
  retries: 3
  start_period: 10s

services:
  # =============================================================================
  # Infrastructure
  # =============================================================================

  redis:
    image: redis:7-alpine
    ports:
      - "${REDIS_PORT:-6379}:6379"
    volumes:
      - redis-data:/data
    healthcheck:
      <<: *healthcheck-defaults
      test: ["CMD", "redis-cli", "ping"]
    deploy:
      resources:
        limits:
          memory: 128M

  timescaledb:
    image: timescale/timescaledb:latest-pg15
    environment:
      POSTGRES_USER: monitor
      POSTGRES_PASSWORD: monitor
      POSTGRES_DB: monitor
    ports:
      - "${TIMESCALE_PORT:-5432}:5432"
    volumes:
      - timescale-data:/var/lib/postgresql/data
      - ./scripts/init-db.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      <<: *healthcheck-defaults
      test: ["CMD-SHELL", "pg_isready -U monitor -d monitor"]
    deploy:
      resources:
        limits:
          memory: 512M

  # =============================================================================
  # Application Services
  # =============================================================================

  aggregator:
    build:
      context: .
      dockerfile: services/aggregator/Dockerfile
    environment:
      <<: *common-env
      GRPC_PORT: 50051
      SERVICE_NAME: aggregator
    ports:
      - "${AGGREGATOR_GRPC_PORT:-50051}:50051"
    depends_on:
      redis:
        condition: service_healthy
      timescaledb:
        condition: service_healthy
    healthcheck:
      <<: *healthcheck-defaults
      test: ["CMD", "/bin/grpc_health_probe", "-addr=:50051"]
    deploy:
      resources:
        limits:
          memory: 256M

  gateway:
    build:
      context: .
      dockerfile: services/gateway/Dockerfile
    environment:
      <<: *common-env
      HTTP_PORT: 8000
      AGGREGATOR_URL: aggregator:50051
      SERVICE_NAME: gateway
    ports:
      - "${GATEWAY_PORT:-8000}:8000"
    depends_on:
      - aggregator
      - redis
    healthcheck:
      <<: *healthcheck-defaults
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    deploy:
      resources:
        limits:
          memory: 256M

  alerts:
    build:
      context: .
      dockerfile: services/alerts/Dockerfile
    environment:
      <<: *common-env
      SERVICE_NAME: alerts
    depends_on:
      redis:
        condition: service_healthy
      timescaledb:
        condition: service_healthy
    healthcheck:
      <<: *healthcheck-defaults
      test: ["CMD", "python", "-c", "import sys; sys.exit(0)"]
    deploy:
      resources:
        limits:
          memory: 128M

  # Collector runs separately on each machine being monitored
  # For local testing, we run one instance
  collector:
    build:
      context: .
      dockerfile: services/collector/Dockerfile
    environment:
      <<: *common-env
      AGGREGATOR_URL: aggregator:50051
      MACHINE_ID: ${MACHINE_ID:-local-dev}
      COLLECTION_INTERVAL: ${COLLECTION_INTERVAL:-5}
      SERVICE_NAME: collector
    depends_on:
      - aggregator
    deploy:
      resources:
        limits:
          memory: 64M
    # For actual system metrics, you might need:
    # privileged: true
    # pid: host

volumes:
  redis-data:
  timescale-data:

networks:
  default:
    name: sysmonstm
|
||||
14
ctrl/edge/Dockerfile
Normal file
@@ -0,0 +1,14 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+RUN pip install --no-cache-dir fastapi uvicorn[standard] websockets
+
+COPY edge.py .
+
+ENV API_KEY=""
+ENV LOG_LEVEL=INFO
+
+EXPOSE 8080
+
+CMD ["uvicorn", "edge:app", "--host", "0.0.0.0", "--port", "8080"]
@@ -1,8 +1,11 @@
 services:
-  sysmonstm:
+  edge:
     build: .
-    container_name: sysmonstm
+    container_name: sysmonstm-edge
     restart: unless-stopped
+    environment:
+      - API_KEY=${API_KEY:-}
+      - LOG_LEVEL=${LOG_LEVEL:-INFO}
     ports:
       - "8080:8080"
     networks:
@@ -1,11 +1,26 @@
 """Minimal sysmonstm gateway - standalone mode without dependencies."""
 
-from fastapi import FastAPI, WebSocket, WebSocketDisconnect
-from fastapi.responses import HTMLResponse
-import json
 import asyncio
+import json
+import logging
+import os
 from datetime import datetime
 
+from fastapi import FastAPI, Query, WebSocket, WebSocketDisconnect
+from fastapi.responses import HTMLResponse
+
+# Configuration
+API_KEY = os.environ.get("API_KEY", "")
+LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
+
+# Logging setup
+logging.basicConfig(
+    level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
+    format="%(asctime)s [%(levelname)s] %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+)
+log = logging.getLogger("gateway")
+
 app = FastAPI(title="sysmonstm")
 
 # Store connected websockets
@@ -107,7 +122,8 @@ HTML = """
 let machines = {};
 
 function connect() {
-    const ws = new WebSocket(`wss://${location.host}/ws`);
+    const protocol = location.protocol === 'https:' ? 'wss:' : 'ws:';
+    const ws = new WebSocket(`${protocol}//${location.host}/ws`);
 
     ws.onopen = () => {
         statusDot.classList.add('ok');
@@ -130,7 +146,7 @@ HTML = """
 }
 
 function render() {
-    const ids = Object.keys(machines);
+    const ids = Object.keys(machines).sort();
     if (ids.length === 0) {
         machinesEl.innerHTML = '<div class="empty"><h2>No collectors connected</h2><p>Start a collector to see metrics</p></div>';
         return;
@@ -138,13 +154,16 @@ HTML = """
 
     machinesEl.innerHTML = ids.map(id => {
         const m = machines[id];
+        const ts = m.timestamp ? new Date(m.timestamp * 1000).toLocaleTimeString() : '-';
         return `
             <div class="machine">
                 <h3>${id}</h3>
                 <div class="metric"><span>CPU</span><span>${m.cpu?.toFixed(1) || '-'}%</span></div>
                 <div class="metric"><span>Memory</span><span>${m.memory?.toFixed(1) || '-'}%</span></div>
                 <div class="metric"><span>Disk</span><span>${m.disk?.toFixed(1) || '-'}%</span></div>
-                <div class="metric"><span>Updated</span><span>${new Date(m.timestamp).toLocaleTimeString()}</span></div>
+                <div class="metric"><span>Load (1m)</span><span>${m.load_1m?.toFixed(2) || '-'}</span></div>
+                <div class="metric"><span>Processes</span><span>${m.processes || '-'}</span></div>
+                <div class="metric"><span>Updated</span><span>${ts}</span></div>
             </div>
         `;
     }).join('');
@@ -156,43 +175,82 @@ HTML = """
|
||||
</html>
|
||||
"""
|
||||
|
||||
|
||||
@app.get("/", response_class=HTMLResponse)
|
||||
async def index():
|
||||
return HTML
|
||||
|
||||
|
||||
@app.get("/health")
|
||||
async def health():
|
||||
return {"status": "ok", "machines": len(machines)}
|
||||
|
||||
|
||||
@app.get("/api/machines")
|
||||
async def get_machines():
|
||||
return machines
|
||||
|
||||
|
||||
@app.websocket("/ws")
|
||||
async def websocket_endpoint(websocket: WebSocket):
|
||||
async def websocket_endpoint(websocket: WebSocket, key: str = Query(default="")):
|
||||
# API key validation for collectors (browsers don't need key)
|
||||
# Check if this looks like a collector (will send metrics) or browser (will receive)
|
||||
# We validate key only when metrics are received, allowing browsers to connect freely
|
||||
|
||||
await websocket.accept()
|
||||
connections.append(websocket)
|
||||
client = websocket.client.host if websocket.client else "unknown"
|
||||
log.info(f"WebSocket connected: {client}")
|
||||
|
||||
try:
|
||||
# Send current state
|
||||
# Send current state to new connection
|
||||
for machine_id, data in machines.items():
|
||||
await websocket.send_json({"type": "metrics", "machine_id": machine_id, **data})
|
||||
# Keep alive
|
||||
await websocket.send_json(
|
||||
{"type": "metrics", "machine_id": machine_id, **data}
|
||||
)
|
||||
|
||||
# Main loop
|
||||
while True:
|
||||
try:
|
||||
msg = await asyncio.wait_for(websocket.receive_text(), timeout=30)
|
||||
data = json.loads(msg)
|
||||
|
||||
if data.get("type") == "metrics":
|
||||
# Validate API key for metric submissions
|
||||
if API_KEY and key != API_KEY:
|
||||
log.warning(f"Invalid API key from {client}")
|
||||
await websocket.close(code=4001, reason="Invalid API key")
|
||||
return
|
||||
|
||||
machine_id = data.get("machine_id", "unknown")
|
||||
machines[machine_id] = {**data, "timestamp": datetime.utcnow().isoformat()}
|
||||
# Broadcast to all
|
||||
machines[machine_id] = data
|
||||
log.debug(f"Metrics from {machine_id}: cpu={data.get('cpu')}%")
|
||||
|
||||
# Broadcast to all connected clients
|
||||
for conn in connections:
|
||||
try:
|
||||
await conn.send_json({"type": "metrics", "machine_id": machine_id, **machines[machine_id]})
|
||||
except:
|
||||
await conn.send_json(
|
||||
{"type": "metrics", "machine_id": machine_id, **data}
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
except asyncio.TimeoutError:
|
||||
# Send ping to keep connection alive
|
||||
await websocket.send_json({"type": "ping"})
|
||||
|
||||
except WebSocketDisconnect:
|
||||
pass
|
||||
log.info(f"WebSocket disconnected: {client}")
|
||||
except Exception as e:
|
||||
log.error(f"WebSocket error: {e}")
|
||||
finally:
|
||||
if websocket in connections:
|
||||
connections.remove(websocket)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import uvicorn
|
||||
|
||||
log.info("Starting sysmonstm gateway")
|
||||
log.info(f" API key: {'configured' if API_KEY else 'not set (open)'}")
|
||||
uvicorn.run(app, host="0.0.0.0", port=8080)
|
||||
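Review note on the edge broadcast loop above: a failed `send_json` is swallowed with `pass`, so a dead socket stays in `connections` until its own handler reaches `finally`. One possible way to prune it at broadcast time is sketched below. The `broadcast` helper is hypothetical (it is not part of this commit) and assumes only that each connection exposes an async `send_json` method, as Starlette's `WebSocket` does.

```python
import asyncio


async def broadcast(connections: list, payload: dict) -> int:
    """Send payload to every connection and drop the ones that fail.

    Returns the number of connections that accepted the message.
    """
    delivered = 0
    # Iterate over a copy so removing from the live list is safe.
    for conn in list(connections):
        try:
            await conn.send_json(payload)
            delivered += 1
        except Exception:
            # The peer is gone or the socket is closed; stop tracking it.
            connections.remove(conn)
    return delivered
```

Inside the handler this would replace the inner `for conn in connections:` loop with `await broadcast(connections, {"type": "metrics", "machine_id": machine_id, **data})`.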
16
ctrl/hub/Dockerfile
Normal file
@@ -0,0 +1,16 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+RUN pip install --no-cache-dir fastapi uvicorn[standard] websockets
+
+COPY hub.py .
+
+ENV API_KEY=""
+ENV EDGE_URL=""
+ENV EDGE_API_KEY=""
+ENV LOG_LEVEL=INFO
+
+EXPOSE 8080
+
+CMD ["uvicorn", "hub:app", "--host", "0.0.0.0", "--port", "8080"]
12
ctrl/hub/docker-compose.yml
Normal file
@@ -0,0 +1,12 @@
+services:
+  hub:
+    build: .
+    container_name: sysmonstm-hub
+    restart: unless-stopped
+    environment:
+      - API_KEY=${API_KEY:-}
+      - EDGE_URL=${EDGE_URL:-}
+      - EDGE_API_KEY=${EDGE_API_KEY:-}
+      - LOG_LEVEL=${LOG_LEVEL:-INFO}
+    ports:
+      - "8080:8080"
151
ctrl/hub/hub.py
Normal file
@@ -0,0 +1,151 @@
+#!/usr/bin/env python3
+"""
+sysmonstm hub - Local aggregator that receives from collectors and forwards to edge.
+
+Runs on the local network, receives metrics from collectors via WebSocket,
+and forwards them to the cloud edge.
+"""
+
+import asyncio
+import json
+import logging
+import os
+
+from fastapi import FastAPI, Query, WebSocket, WebSocketDisconnect
+
+# Configuration
+API_KEY = os.environ.get("API_KEY", "")
+EDGE_URL = os.environ.get("EDGE_URL", "")  # e.g., wss://sysmonstm.mcrn.ar/ws
+EDGE_API_KEY = os.environ.get("EDGE_API_KEY", "")
+LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
+
+# Logging setup
+logging.basicConfig(
+    level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
+    format="%(asctime)s [%(levelname)s] %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+)
+log = logging.getLogger("hub")
+
+app = FastAPI(title="sysmonstm-hub")
+
+# State
+collector_connections: list[WebSocket] = []
+machines: dict = {}
+edge_ws = None
+
+
+async def connect_to_edge():
+    """Maintain persistent connection to edge and forward metrics."""
+    global edge_ws
+
+    if not EDGE_URL:
+        log.info("No EDGE_URL configured, running in local-only mode")
+        return
+
+    import websockets
+
+    url = EDGE_URL
+    if EDGE_API_KEY:
+        separator = "&" if "?" in url else "?"
+        url = f"{url}{separator}key={EDGE_API_KEY}"
+
+    while True:
+        try:
+            log.info(f"Connecting to edge: {EDGE_URL}")
+            async with websockets.connect(url) as ws:
+                edge_ws = ws
+                log.info("Connected to edge")
+
+                while True:
+                    try:
+                        msg = await asyncio.wait_for(ws.recv(), timeout=30)
+                        # Ignore messages from edge (pings, etc)
+                    except asyncio.TimeoutError:
+                        await ws.ping()
+
+        except asyncio.CancelledError:
+            break
+        except Exception as e:
+            edge_ws = None
+            log.warning(f"Edge connection error: {e}. Reconnecting in 5s...")
+            await asyncio.sleep(5)
+
+
+async def forward_to_edge(data: dict):
+    """Forward metrics to edge if connected."""
+    global edge_ws
+    if edge_ws:
+        try:
+            await edge_ws.send(json.dumps(data))
+            log.debug(f"Forwarded to edge: {data.get('machine_id')}")
+        except Exception as e:
+            log.warning(f"Failed to forward to edge: {e}")
+
+
+@app.on_event("startup")
+async def startup():
+    asyncio.create_task(connect_to_edge())
+
+
+@app.get("/health")
+async def health():
+    return {
+        "status": "ok",
+        "machines": len(machines),
+        "collectors": len(collector_connections),
+        "edge_connected": edge_ws is not None,
+    }
+
+
+@app.get("/api/machines")
+async def get_machines():
+    return machines
+
+
+@app.websocket("/ws")
+async def websocket_endpoint(websocket: WebSocket, key: str = Query(default="")):
+    # Validate API key
+    if API_KEY and key != API_KEY:
+        log.warning(f"Invalid API key from {websocket.client}")
+        await websocket.close(code=4001, reason="Invalid API key")
+        return
+
+    await websocket.accept()
+    collector_connections.append(websocket)
+    client = websocket.client.host if websocket.client else "unknown"
+    log.info(f"Collector connected: {client}")
+
+    try:
+        while True:
+            try:
+                msg = await asyncio.wait_for(websocket.receive_text(), timeout=30)
+                data = json.loads(msg)
+
+                if data.get("type") == "metrics":
+                    machine_id = data.get("machine_id", "unknown")
+                    machines[machine_id] = data
+                    log.debug(f"Metrics from {machine_id}: cpu={data.get('cpu')}%")
+
+                    # Forward to edge
+                    await forward_to_edge(data)
+
+            except asyncio.TimeoutError:
+                await websocket.send_json({"type": "ping"})
+
+    except WebSocketDisconnect:
+        log.info(f"Collector disconnected: {client}")
+    except Exception as e:
+        log.error(f"WebSocket error: {e}")
+    finally:
+        if websocket in collector_connections:
+            collector_connections.remove(websocket)
+
+
+if __name__ == "__main__":
+    import uvicorn
+
+    log.info("Starting sysmonstm hub")
+    log.info(f"  API key: {'configured' if API_KEY else 'not set (open)'}")
+    log.info(f"  Edge URL: {EDGE_URL or 'not configured (local only)'}")
+    uvicorn.run(app, host="0.0.0.0", port=8080)
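Review note on the hub's `/ws` contract: the handler reads the API key from a `key` query parameter and only acts on JSON messages whose `type` is `"metrics"`. The collector side is not part of this diff, so the following is only a sketch of what a compatible client might look like. The helper names (`build_ws_url`, `build_metrics_message`, `run_collector`) are hypothetical, the metric values are placeholders, and URL-encoding the key is an addition that `hub.py` itself does not do. The field names follow what the edge dashboard renders (`cpu`, `memory`, `disk`, `timestamp` in epoch seconds).

```python
import json
import time
from urllib.parse import urlencode


def build_ws_url(base_url: str, api_key: str = "") -> str:
    """Append the key as a query parameter, mirroring hub.py's separator logic."""
    if not api_key:
        return base_url
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{urlencode({'key': api_key})}"


def build_metrics_message(machine_id: str, cpu: float, memory: float, disk: float) -> str:
    """Serialize one sample in the shape the hub and edge handlers check for."""
    return json.dumps(
        {
            "type": "metrics",
            "machine_id": machine_id,
            "cpu": cpu,
            "memory": memory,
            "disk": disk,
            # Epoch seconds; the dashboard multiplies this by 1000 for Date().
            "timestamp": time.time(),
        }
    )


async def run_collector(url: str, machine_id: str, interval: float = 5.0) -> None:
    """Send placeholder samples until cancelled (needs the `websockets` package)."""
    import asyncio

    import websockets  # imported lazily so the helpers above stay stdlib-only

    async with websockets.connect(url) as ws:
        while True:
            await ws.send(build_metrics_message(machine_id, 0.0, 0.0, 0.0))
            await asyncio.sleep(interval)
```

A call such as `asyncio.run(run_collector(build_ws_url("ws://localhost:8080/ws", "secret"), "local-dev"))` would then feed the hub, which in turn forwards each message to the edge when `EDGE_URL` is set.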
@@ -1,6 +0,0 @@
-FROM python:3.11-slim
-WORKDIR /app
-RUN pip install --no-cache-dir fastapi uvicorn[standard] websockets
-COPY main.py .
-EXPOSE 8080
-CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
@@ -1,154 +0,0 @@
-version: "3.8"
-
-# This file works both locally and on EC2 for demo purposes.
-# For local dev with hot-reload, use: docker compose -f docker-compose.yml -f docker-compose.override.yml up
-
-x-common-env: &common-env
-  REDIS_URL: redis://redis:6379
-  TIMESCALE_URL: postgresql://monitor:monitor@timescaledb:5432/monitor
-  EVENTS_BACKEND: redis_pubsub
-  LOG_LEVEL: ${LOG_LEVEL:-INFO}
-  LOG_FORMAT: json
-
-x-healthcheck-defaults: &healthcheck-defaults
-  interval: 10s
-  timeout: 5s
-  retries: 3
-  start_period: 10s
-
-services:
-  # =============================================================================
-  # Infrastructure
-  # =============================================================================
-
-  redis:
-    image: redis:7-alpine
-    ports:
-      - "${REDIS_PORT:-6379}:6379"
-    volumes:
-      - redis-data:/data
-    healthcheck:
-      <<: *healthcheck-defaults
-      test: ["CMD", "redis-cli", "ping"]
-    deploy:
-      resources:
-        limits:
-          memory: 128M
-
-  timescaledb:
-    image: timescale/timescaledb:latest-pg15
-    environment:
-      POSTGRES_USER: monitor
-      POSTGRES_PASSWORD: monitor
-      POSTGRES_DB: monitor
-    ports:
-      - "${TIMESCALE_PORT:-5432}:5432"
-    volumes:
-      - timescale-data:/var/lib/postgresql/data
-      - ./scripts/init-db.sql:/docker-entrypoint-initdb.d/init.sql:ro
-    healthcheck:
-      <<: *healthcheck-defaults
-      test: ["CMD-SHELL", "pg_isready -U monitor -d monitor"]
-    deploy:
-      resources:
-        limits:
-          memory: 512M
-
-  # =============================================================================
-  # Application Services
-  # =============================================================================
-
-  aggregator:
-    build:
-      context: .
-      dockerfile: services/aggregator/Dockerfile
-    environment:
-      <<: *common-env
-      GRPC_PORT: 50051
-      SERVICE_NAME: aggregator
-    ports:
-      - "${AGGREGATOR_GRPC_PORT:-50051}:50051"
-    depends_on:
-      redis:
-        condition: service_healthy
-      timescaledb:
-        condition: service_healthy
-    healthcheck:
-      <<: *healthcheck-defaults
-      test: ["CMD", "/bin/grpc_health_probe", "-addr=:50051"]
-    deploy:
-      resources:
-        limits:
-          memory: 256M
-
-  gateway:
-    build:
-      context: .
-      dockerfile: services/gateway/Dockerfile
-    environment:
-      <<: *common-env
-      HTTP_PORT: 8000
-      AGGREGATOR_URL: aggregator:50051
-      SERVICE_NAME: gateway
-    ports:
-      - "${GATEWAY_PORT:-8000}:8000"
-    depends_on:
-      - aggregator
-      - redis
-    healthcheck:
-      <<: *healthcheck-defaults
-      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
-    deploy:
-      resources:
-        limits:
-          memory: 256M
-
-  alerts:
-    build:
-      context: .
-      dockerfile: services/alerts/Dockerfile
-    environment:
-      <<: *common-env
-      SERVICE_NAME: alerts
-    depends_on:
-      redis:
-        condition: service_healthy
-      timescaledb:
-        condition: service_healthy
-    healthcheck:
-      <<: *healthcheck-defaults
-      test: ["CMD", "python", "-c", "import sys; sys.exit(0)"]
-    deploy:
-      resources:
-        limits:
-          memory: 128M
-
-  # Collector runs separately on each machine being monitored
-  # For local testing, we run one instance
-  collector:
-    build:
-      context: .
-      dockerfile: services/collector/Dockerfile
-    environment:
-      <<: *common-env
-      AGGREGATOR_URL: aggregator:50051
-      MACHINE_ID: ${MACHINE_ID:-local-dev}
-      COLLECTION_INTERVAL: ${COLLECTION_INTERVAL:-5}
-      SERVICE_NAME: collector
-    depends_on:
-      - aggregator
-    deploy:
-      resources:
-        limits:
-          memory: 64M
-    # For actual system metrics, you might need:
-    # privileged: true
-    # pid: host
-
-volumes:
-  redis-data:
-  timescale-data:
-
-networks:
-  default:
-    name: sysmonstm
1
docker-compose.yml
Symbolic link
@@ -0,0 +1 @@
+ctrl/dev/docker-compose.yml