Observability

CloudWatch logs, structured JSON, X-Ray, Lambda Insights, EMF. Brief Prometheus/Grafana orientation.

CloudWatch Logs — what you get for free

Every Lambda function automatically writes to a CloudWatch Log Group named /aws/lambda/<function-name>. Each execution environment gets its own Log Stream. Lambda writes three platform lines for every invocation (the REPORT line is wrapped here for readability):

START RequestId: abc-123 Version: $LATEST
END RequestId: abc-123
REPORT RequestId: abc-123  Duration: 312.45 ms  Billed Duration: 313 ms
        Memory Size: 256 MB  Max Memory Used: 89 MB
        Init Duration: 423.12 ms   # only on cold starts

The REPORT line is your free performance telemetry. Init Duration appears only on cold invocations. Max Memory Used helps right-size memory configuration.
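To use the REPORT line as telemetry outside the console, it can be parsed into numbers. The following is a minimal sketch, assuming the field names shown above; parse_report and its output keys are hypothetical names chosen for illustration, not part of any AWS library.

```python
import re

# Matches fields such as "Duration: 312.45 ms" or "Memory Size: 256 MB".
_FIELD = re.compile(
    r"(Duration|Billed Duration|Memory Size|Max Memory Used|Init Duration)"
    r":\s*([\d.]+)\s*(ms|MB)"
)


def parse_report(line):
    """Return REPORT fields as a dict, or None if this is not a REPORT line.

    Numeric fields are floats keyed like "duration_ms" or "memory_size_mb";
    "cold_start" is True when an Init Duration field is present.
    """
    if not line.startswith("REPORT"):
        return None
    fields = {}
    for name, value, unit in _FIELD.findall(line):
        key = name.lower().replace(" ", "_") + "_" + unit.lower()
        fields[key] = float(value)
    fields["cold_start"] = "init_duration_ms" in fields
    return fields


if __name__ == "__main__":
    sample = (
        "REPORT RequestId: abc-123\tDuration: 312.45 ms\tBilled Duration: 313 ms"
        "\tMemory Size: 256 MB\tMax Memory Used: 89 MB\tInit Duration: 423.12 ms"
    )
    print(parse_report(sample))
```

Comparing max_memory_used_mb against memory_size_mb over many invocations is one way to judge whether the configured memory can be lowered.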

Retention: Default is "Never Expire." Set it explicitly — 7, 14, or 30 days covers most needs. Every MB of retained logs costs money.
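Retention can also be set from code rather than the console. The sketch below assumes a boto3 CloudWatch Logs client is passed in; set_log_retention is a hypothetical helper, and restricting it to 7, 14, or 30 days is this sketch's own choice (the API accepts other values as well).

```python
def set_log_retention(logs_client, function_name, days=14):
    """Apply a retention policy to a Lambda function's log group.

    logs_client is expected to be boto3.client("logs") or a compatible stub.
    Returns the log group name that was updated.
    """
    # Only the values suggested in the text; this limit is an assumption of the sketch.
    if days not in (7, 14, 30):
        raise ValueError("this sketch only allows 7, 14, or 30 days")
    log_group = f"/aws/lambda/{function_name}"
    logs_client.put_retention_policy(
        logGroupName=log_group,
        retentionInDays=days,
    )
    return log_group


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   set_log_retention(boto3.client("logs"), "pdf-scanner", days=14)
```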

Structured logging

Emit JSON instead of plain strings. CloudWatch Logs Insights can filter and aggregate JSON fields efficiently; plain strings require regex and are slow. Example:

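import json
import logging
import os

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Assumed to come from environment variables; the variable names are illustrative.
BUCKET = os.environ.get("BUCKET", "")
PREFIX = os.environ.get("PREFIX", "")

def handler(event, context):
    # One JSON object per log line so Logs Insights can discover the fields.
    logger.info(json.dumps({
        "event": "pdf_scan_start",
        "bucket": BUCKET,
        "prefix": PREFIX,
        "request_id": context.aws_request_id,
    }))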

With this, a Logs Insights query such as filter event = "pdf_scan_start" | stats count() by bin(5m) typically returns in seconds.
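Calling json.dumps at every log site gets repetitive, so the same idea can be moved into a logging formatter. This is a minimal sketch using only the standard library; JsonFormatter, make_logger, and the "fields" convention for extra data are hypothetical names introduced here. In the Lambda Python runtime the root logger usually already has a handler attached, so there it may be preferable to set the formatter on that existing handler rather than add a new one.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed as extra={"fields": {...}} are merged into the object.
        payload.update(getattr(record, "fields", {}))
        # default=str keeps non-JSON types (datetimes, etc.) from raising.
        return json.dumps(payload, default=str)


def make_logger(name="app"):
    """Return a logger that writes JSON lines to the default stream (stderr)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("pdf_scan_start", extra={"fields": {"bucket": "example-bucket"}})
```

Because the message is stored under "event", the Logs Insights filter on event shown above applies unchanged.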

X-Ray tracing

X-Ray gives you request traces across services — how long the Lambda itself ran vs how long S3 calls took. Three things must all be true:

Tracing enabled on the function: console toggle, Tracing: Active in SAM, TracingConfig with Mode: Active in CloudFormation, or Tracing.ACTIVE in CDK
X-Ray SDK instrumented in your code: from aws_xray_sdk.core import patch_all; patch_all() wraps boto3 calls automatically
IAM permission: execution role needs xray:PutTraceSegments and xray:PutTelemetryRecords

Without all three, traces are either absent or incomplete. People flip one and conclude X-Ray is broken.
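The IAM item in the checklist is the easiest one to verify mechanically. The sketch below builds a policy document containing the two actions named above and checks whether a given policy grants them; xray_policy_document and missing_xray_actions are hypothetical helpers, and the check compares action names literally, so it does not understand wildcards such as xray:*.

```python
import json

# The two write actions named in the checklist above.
XRAY_ACTIONS = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]


def xray_policy_document():
    """Return an IAM policy document allowing the two X-Ray write actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(XRAY_ACTIONS),
                "Resource": "*",
            }
        ],
    }


def missing_xray_actions(policy):
    """List required X-Ray actions that no Allow statement names explicitly."""
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        # IAM allows Action to be a single string or a list of strings.
        granted.update([actions] if isinstance(actions, str) else actions)
    return [action for action in XRAY_ACTIONS if action not in granted]


if __name__ == "__main__":
    print(json.dumps(xray_policy_document(), indent=2))
```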

Lambda Insights

Lambda Insights is a CloudWatch feature (not a separate service) that surfaces system-level metrics: CPU usage, memory utilisation, network I/O, disk I/O — things the REPORT line doesn't include. To enable it:

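Add the Lambda Insights extension layer (arn:aws:lambda:<region>:580247275435:layer:LambdaInsightsExtension:38; the current version number differs by Region and architecture, so check the published list)
Attach the CloudWatchLambdaInsightsExecutionRolePolicy managed policy to the execution role. The extension writes its data as log events to the /aws/lambda-insights log group, so it needs CloudWatch Logs write permissions rather than cloudwatch:PutMetricData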

It's useful when you suspect memory or CPU contention but the REPORT line's "Max Memory Used" isn't granular enough.

EMF — Embedded Metrics Format

EMF lets you emit custom CloudWatch metrics by writing structured JSON to stdout. No PutMetricData API call is needed: the line reaches CloudWatch Logs like any other log output, and CloudWatch extracts and publishes the metric asynchronously. This is far more efficient than calling CloudWatch from inside the handler, which adds latency and cost to every invocation.

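import json
import time

def emit_metric(name, value, unit="Count", **dims):
    # Every keyword argument other than unit becomes a dimension.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                "Dimensions": [list(dims.keys())],
                "Metrics": [{"Name": name, "Unit": unit}]
            }]
        },
        name: value,
        **dims,
    }))

# usage (count stands in for whatever the handler tallied)
count = 3
# Note the lowercase unit=; a capitalised Unit= would be treated as a dimension.
emit_metric("PDFsProcessed", count, unit="Count", Function="pdf-scanner")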

Prometheus & Grafana (brief)

Prometheus uses a pull model — it scrapes HTTP endpoints. Lambda functions are ephemeral and have no persistent HTTP endpoint, so Prometheus can't scrape them directly. Approaches:

EMF → CloudWatch → Grafana CloudWatch plugin: easiest; Grafana queries CloudWatch as a data source
Amazon Managed Service for Prometheus (AMP) + remote_write: Lambda pushes metrics to AMP via the Prometheus remote write API; Grafana (or Amazon Managed Grafana) reads from AMP. The remote_write request must be SigV4-signed. The prometheus_client library does not implement remote write itself, so this path usually goes through a collector, such as the AWS Distro for OpenTelemetry Lambda layer, or a separate remote-write client.
StatsD exporter or Pushgateway: Lambda pushes to a long-lived intermediary and Prometheus scrapes that. More infrastructure to manage, and a Pushgateway keeps serving the last pushed value until it is deleted, so stale metrics are a risk.

For Lambda-centric dashboards, the CloudWatch → Grafana path is usually the simplest to operate.
