update docs

2026-05-11 20:13:11 -03:00
commit 2ffabb672e
40 changed files with 5869 additions and 0 deletions
--- a/docs/lambdas-md/lambda-10-observability.md
+++ b/docs/lambdas-md/lambda-10-observability.md
@@ -0,0 +1,93 @@
+# Observability
+
+> CloudWatch logs, structured JSON, X-Ray, Lambda Insights, EMF. Brief Prometheus/Grafana orientation.
+
+## CloudWatch Logs — what you get for free
+
+Every Lambda function automatically writes to a CloudWatch Log Group named `/aws/lambda/<function-name>`. Each execution environment gets its own Log Stream. Lambda writes three platform lines per invocation automatically:
+
+```
+START RequestId: abc-123 Version: $LATEST
+END RequestId: abc-123
+REPORT RequestId: abc-123  Duration: 312.45 ms  Billed Duration: 313 ms
+        Memory Size: 256 MB  Max Memory Used: 89 MB
+        Init Duration: 423.12 ms   # only on cold starts
+```
+
+The REPORT line is your free performance telemetry. `Init Duration` appears only on cold invocations. `Max Memory Used` helps right-size memory configuration.
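+
+To show how the REPORT line can be turned into numbers, here is a minimal sketch that pulls the fields out with a regular expression. The function name `parse_report` is hypothetical, and the sketch assumes the field labels appear exactly as in the sample above.
+
+```python
+import re
+
+# Matches "<label>: <number> <unit>" for the labels shown in the sample REPORT line.
+_FIELD = re.compile(
+    r"(Billed Duration|Init Duration|Duration|Max Memory Used|Memory Size)"
+    r": ([\d.]+) (ms|MB)"
+)
+
+def parse_report(line):
+    """Return the numeric fields found in a Lambda REPORT log line."""
+    return {label: float(value) for label, value, _unit in _FIELD.findall(line)}
+
+sample = (
+    "REPORT RequestId: abc-123\tDuration: 312.45 ms\tBilled Duration: 313 ms\t"
+    "Memory Size: 256 MB\tMax Memory Used: 89 MB\tInit Duration: 423.12 ms"
+)
+fields = parse_report(sample)
+# Share of configured memory actually used; a low ratio suggests room to downsize.
+memory_ratio = fields["Max Memory Used"] / fields["Memory Size"]
+```
+
+A warm invocation simply yields a dict without the `Init Duration` key, which is one way to count cold starts.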
+
+**Retention:** Default is "Never Expire." Set it explicitly — 7, 14, or 30 days covers most needs. Every MB of retained logs costs money.
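+
+Retention can be set from code as well as from the console. The sketch below wraps the CloudWatch Logs `put_retention_policy` call; the helper name `set_retention` and the function name `pdf-scanner` are illustrative assumptions, and the client is passed in so the helper can be exercised without AWS credentials.
+
+```python
+def set_retention(logs_client, function_name, days=14):
+    """Apply a retention policy to a function's default log group.
+
+    `logs_client` is expected to behave like boto3.client("logs").
+    `days` must be one of the values CloudWatch Logs accepts (e.g. 7, 14, 30).
+    """
+    logs_client.put_retention_policy(
+        logGroupName=f"/aws/lambda/{function_name}",
+        retentionInDays=days,
+    )
+```
+
+With boto3 installed, a call might look like `set_retention(boto3.client("logs"), "pdf-scanner", days=14)`.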
+
+## Structured logging
+
+Emit JSON instead of plain strings. CloudWatch Logs Insights can filter and aggregate JSON fields efficiently; plain strings require regex and are slow. Example:
+
+```python
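+import json
+import logging
+import os
+
+logger = logging.getLogger()
+logger.setLevel(logging.INFO)
+
+# Assumption: bucket and prefix come from environment variables; the names are illustrative.
+BUCKET = os.environ.get("BUCKET", "")
+PREFIX = os.environ.get("PREFIX", "")
+
+def handler(event, context):
+    # One JSON object per log line, so every key becomes a queryable field.
+    logger.info(json.dumps({
+        "event": "pdf_scan_start",
+        "bucket": BUCKET,
+        "prefix": PREFIX,
+        "request_id": context.aws_request_id,
+    }))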
+```
+
+With this, Logs Insights can run: `filter event = "pdf_scan_start" | stats count() by bin(5m)` in seconds.
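+
+Calling `json.dumps` at every log site gets repetitive. One possible alternative, sketched here under assumptions, is a `logging.Formatter` subclass that renders every record as JSON. The class name `JsonFormatter` and the `fields` convention for extra keys are hypothetical, and the sketch assumes the runtime has already attached a handler to the root logger.
+
+```python
+import json
+import logging
+
+class JsonFormatter(logging.Formatter):
+    """Render each log record as a single JSON object."""
+
+    def format(self, record):
+        payload = {"level": record.levelname, "event": record.getMessage()}
+        # Keys passed as logger.info(..., extra={"fields": {...}}) are merged in.
+        payload.update(getattr(record, "fields", {}))
+        return json.dumps(payload, default=str)
+
+root = logging.getLogger()
+root.setLevel(logging.INFO)
+for existing_handler in root.handlers:  # empty outside Lambda unless configured
+    existing_handler.setFormatter(JsonFormatter())
+
+def handler(event, context):
+    root.info("pdf_scan_start", extra={"fields": {"request_id": context.aws_request_id}})
+```
+
+The message string lands in the `event` key, so the query above keeps working unchanged.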
+
+## X-Ray tracing
+
+X-Ray gives you request traces across services — how long the Lambda itself ran vs how long S3 calls took. Three things must all be true:
+
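+1. **Tracing enabled on the function**: console toggle, `Tracing: Active` in SAM, `TracingConfig: {Mode: Active}` in CloudFormation, or `tracing: lambda.Tracing.ACTIVE` in CDK
+2. **X-Ray SDK instrumented in your code**: `from aws_xray_sdk.core import patch_all; patch_all()` wraps boto3 calls automatically. Without it you still get the Lambda segment, but no subsegments for downstream calls
+3. **IAM permission**: the execution role needs `xray:PutTraceSegments` and `xray:PutTelemetryRecords`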
+
+Without all three, traces are either absent or incomplete. People flip one and conclude X-Ray is broken.
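+
+The IAM requirement is the easiest of the three to miss. As a rough sketch, the policy document for it could be built like this; the function name `xray_write_policy` is hypothetical, and the wildcard resource reflects the assumption that these two actions are not scoped to individual resources.
+
+```python
+import json
+
+def xray_write_policy():
+    """IAM policy document granting the two X-Ray write actions."""
+    return {
+        "Version": "2012-10-17",
+        "Statement": [{
+            "Effect": "Allow",
+            "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
+            "Resource": "*",
+        }],
+    }
+
+policy_json = json.dumps(xray_write_policy(), indent=2)
+```
+
+The resulting JSON can be attached to the execution role as an inline policy.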
+
+## Lambda Insights
+
+Lambda Insights is a CloudWatch feature (not a separate service) that surfaces system-level metrics the REPORT line doesn't include: CPU time, memory utilisation, network traffic, and `/tmp` usage. To enable it:
+
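+- Add the Lambda Insights extension layer (`arn:aws:lambda:<region>:580247275435:layer:LambdaInsightsExtension:38`; the latest version number differs by region and architecture, so check the published list)
+- Attach the `CloudWatchLambdaInsightsExecutionRolePolicy` managed policy to the execution role. The extension ships its data as EMF log events to the `/aws/lambda-insights` log group, so it needs CloudWatch Logs write permissions, not `cloudwatch:PutMetricData`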
+
+It's useful when you suspect memory or CPU contention but the REPORT line's "Max Memory Used" isn't granular enough.
+
+## EMF (Embedded Metric Format)
+
+EMF lets you emit custom CloudWatch metrics by writing structured JSON to stdout. No `PutMetricData` API call is needed: the line lands in CloudWatch Logs like any other, and CloudWatch extracts the metric from it asynchronously. This is cheaper than calling CloudWatch from inside the handler, which adds latency and an API call to every invocation.
+
+```python
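+import json
+import time
+
+def emit_metric(name, value, unit="Count", **dims):
+    # One EMF document per call; CloudWatch reads the metric definition from "_aws".
+    print(json.dumps({
+        "_aws": {
+            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
+            "CloudWatchMetrics": [{
+                "Namespace": "MyApp",
+                "Dimensions": [list(dims.keys())],  # a single dimension set
+                "Metrics": [{"Name": name, "Unit": unit}]
+            }]
+        },
+        name: value,  # the metric value is a top-level key matching "Name"
+        **dims,       # dimension values are top-level keys as well
+    }))
+
+# usage: `unit` is the lowercase keyword; any other keyword becomes a dimension
+count = 3  # illustrative value
+emit_metric("PDFsProcessed", count, unit="Count", Function="pdf-scanner")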
+```
+
+## Prometheus & Grafana (brief)
+
+Prometheus uses a **pull model** — it scrapes HTTP endpoints. Lambda functions are ephemeral and have no persistent HTTP endpoint, so Prometheus can't scrape them directly. Approaches:
+
+- **EMF → CloudWatch → Grafana CloudWatch plugin**: easiest; Grafana queries CloudWatch as a data source
+- **Amazon Managed Service for Prometheus (AMP) + remote_write**: metrics are pushed to AMP via the Prometheus remote write API; Grafana (or Amazon Managed Grafana) reads from AMP. The remote_write request must be SigV4-signed and carry a snappy-compressed protobuf payload. The `prometheus_client` library does not implement remote write, so this path usually goes through a collector such as the AWS Distro for OpenTelemetry Lambda layer.
+- **StatsD / Pushgateway**: Lambda pushes to a persistent Pushgateway; Prometheus scrapes the gateway. More infrastructure to manage, plus a stale-metric risk: the Pushgateway keeps serving the last pushed value until it is explicitly deleted.
+
+For Lambda-centric dashboards, the CloudWatch → Grafana path is usually the simplest to operate.