STUDY SCRIPT — AWS Lambda, complete coverage
For the interview on Tuesday
Same tone as the actor's script. Spoken voice, talking to one person, your rhythm. The difference: this includes everything from the lambda-* notes, in the order the files appear. Read it through once. Then re-read the sections where you stumble. The last section walks through lambda_function.py line by line and answers every follow-up question you couldn't field last time.
1. Overview
What we have
Let me start with what we have, because the rest of this only makes sense once you've held the thing in your hands.
There's a folder. Inside it, a Python file called lambda_function.py — that's the function. There's a docker-compose.yml that brings up MinIO on your laptop. MinIO is an S3-compatible object store. There's a Makefile that wraps the whole loop into about five commands. There's an invoke.py that calls the function locally with a minimal event, the same way Lambda would call it in AWS. There's a seed.py that uploads PDFs from a local directory to MinIO so we have something to scan.
What does the function do? It lists every PDF inside an S3 prefix. It generates a fifteen-minute presigned download link for each one. It writes them all out as JSONL into /tmp. It uploads that JSONL back to S3 as a manifest. And it returns one presigned URL pointing to the manifest. The recipient clicks the URL, gets back a list of links, all expire in fifteen minutes.
It runs locally against MinIO. The same handler, same signature, same code, would run on real AWS Lambda the day you deploy it. The only thing that changes is the S3_ENDPOINT_URL environment variable. In MinIO it points to http://localhost:9000. In AWS, you don't set it, and boto3 talks to real S3.
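Here's the shape of that switch, sketched with plain boto3 (the real function uses aioboto3, but the endpoint logic is the same):

import os
import boto3

# Locally, S3_ENDPOINT_URL=http://localhost:9000 points the client at MinIO.
# In AWS the variable is unset, the expression falls back to None, and boto3
# talks to real S3. Same code both places.
s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT_URL") or None)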
How to use this script
Read it in your voice. The reason that matters is — when you read someone else explaining a concept, you nod along. When you read the same concept as if you were the one explaining it, you immediately notice the gaps. The places where you'd stumble in front of an interviewer. Those are the places to study harder.
Each section corresponds to one of the lambda notes files. They're numbered the same way. So when something feels thin, when you read it and think "I can't actually answer a follow-up on this," you know exactly which file to open.
The last section walks through lambda_function.py line by line. That's the section you couldn't field in the last interview. We're going to fix it.
2. Mental model
Lambda is a Linux process whose lifecycle is managed for you
That's the one sentence to memorize. Most of the surprise about Lambda comes from forgetting that it's still a process.
Each invocation runs inside an execution environment. That environment is a Firecracker microVM. Firecracker is open source, written by AWS in Rust, designed to boot a stripped-down Linux kernel in about a hundred and twenty-five milliseconds. Inside the microVM there's the Lambda runtime — for us, that's Python 3.13 — and your code is unpacked into a directory called /var/task. There's /tmp for scratch. There's an environment that holds your variables and credentials.
AWS owns the VM. You own everything inside the process. The microVM is created on demand, kept warm for a while, then torn down when traffic stops. You don't pick a server, but there is a server, and it has memory, a clock, and a filesystem. Everything that surprises people about Lambda comes from forgetting that fact.
The two phases — the most useful split in all of Lambda
Every cold start has two phases.
The init phase is your module-level code. The imports at the top of the file. The construction of any client at module scope. The reading of environment variables. Anything that lives outside def handler. This phase has a hard cap of ten seconds. It's billed at the full configured memory, even if you don't use it all. And it runs once.
The handler phase is handler(event, context) itself. It runs every invocation. Billed per millisecond at the configured memory.
Subsequent invocations on the same warm environment skip the init phase entirely. They go straight to the handler.
Heavy work at module level — pay it once per cold start, free for every warm request after. Heavy work inside the handler — pay it every invocation. This single distinction is what most "Lambda is slow" complaints actually are.
In our function, look at the top. Five environment variable reads at module scope. Read once per cold start. Reused for every warm request.
BUCKET = os.environ.get("BUCKET_NAME", "my-company-reports-bucket")
PREFIX = os.environ.get("PREFIX", "2026/04/")
EXPIRY = int(os.environ.get("URL_EXPIRY_SECONDS", "900"))
ENDPOINT = os.environ.get("S3_ENDPOINT_URL") or None
QUEUE_MAX = int(os.environ.get("QUEUE_MAX", "2000"))
If you moved any of those into _run() or into the handler, you'd be doing the lookup on every single invocation forever. Free for the rest of the function's life, vs paid every request. That's the difference.
Globals across warm invocations
Anything assigned at module scope survives between handler calls on the same environment. The intended use is good: a boto3 client at module scope means TCP keep-alive, no re-handshake on every request, the SDK reuses the connection pool.
The unintended use is the foot-gun. A list at module scope, append to it inside the handler, it grows forever. The same warm container can serve thousands of invocations in a row. A counter at module scope, increment in the handler — wrong number on every request, and inconsistent across environments because each warm container has its own copy.
The rule: anything at module scope is shared across invocations on the same env, never across environments. If you need state shared across environments, externalize it. DynamoDB, Redis, S3.
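A minimal sketch of both sides of that rule:

import boto3

# Good: built once per cold start, connection pool reused on every warm invocation.
s3 = boto3.client("s3")

# Foot-guns: both survive warm invocations on this environment only.
seen_keys = []        # grows forever on a busy warm container
request_count = 0     # differs per environment, so the number means nothing

def handler(event, context):
    global request_count
    request_count += 1                  # looks like a counter, isn't one
    seen_keys.append(event.get("key"))  # looks like a cache, is a leak
    ...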
/tmp is real but local
Each environment has its own /tmp. Default 512 MB, configurable to 10 GB. It persists across warm invocations on that environment, so you can stash artifacts you'd rather not rebuild. But it is not shared between concurrent executions, and it's gone when the environment dies.
This is exactly why our function does this:
manifest_path = f"/tmp/{uuid.uuid4()}.jsonl"
Two parallel invocations on different environments are fine, they each have their own /tmp. But two warm invocations back to back on the same environment, if we used a fixed filename, would collide. UUID per invocation, no collision possible. We also os.unlink(manifest_path) at the end of _run() to make sure /tmp doesn't fill across warm invocations.
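A compressed sketch of the pattern (write_manifest is a made-up helper name, and the S3 upload step is elided):

import json
import os
import uuid

def write_manifest(rows):
    path = f"/tmp/{uuid.uuid4()}.jsonl"   # unique per invocation, warm reuse can't collide
    try:
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        # upload the file to S3 here, then presign it
    finally:
        os.unlink(path)                    # keep /tmp from filling across warm invocations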
Concurrency is horizontal
If two events arrive while one is being processed, AWS spins up a second execution environment. Each environment processes one invocation at a time, single-threaded relative to your handler. There's no thread pool to tune. There's no shared memory between environments. The "concurrency" you see in CloudWatch is the count of environments running in parallel.
The reuse window
Idle environments stick around for roughly five to fifteen minutes before being recycled. AWS won't promise a number. That's why a function that sees one request a minute almost never cold-starts, and a function that sees one request a day always does.
3. Limits
The numbers worth memorizing
There are about ten numbers an interviewer might ask you. Three of them you should never have to look up.
Fifteen minutes — maximum function timeout. Default is three seconds, which is too short for almost anything that talks to S3, so set it explicitly.
Ten gigabytes — maximum memory. Also the maximum size of /tmp. It's the same number for both, which suggests it's a microVM provisioning ceiling, not a Lambda product decision.
Six megabytes — maximum sync request and response. Six in, six out. Above that, the invoke fails with a payload-too-large error instead of returning your data. We design around this in our function by returning a manifest URL instead of inlining all the presigned URLs.
Compute and storage, in detail
Memory is configurable from 128 MB up to 10 240 MB. CPU scales linearly with memory. At 1769 MB you get a full vCPU. At higher tiers, multiple. So memory isn't just headroom — it's CPU. Often it's cheaper to bump memory because duration drops faster than cost rises. If your function is CPU-bound, doubling the memory might more than halve the duration.
The init phase has a hard ten-second cap. You can blow through this with a heavy ML model load, custom JIT warmups, anything that does serious work at module level.
/tmp defaults to 512 MB. Above that, you pay per invocation for the extra. It persists across warm invocations on the same env, vanishes on cold start.
Payloads and responses
Sync invocation request: 6 MB. Sync invocation response: 6 MB. Async invocation event: 256 KB — that's for Event invocations and most event-source-mapped triggers like S3, EventBridge, SNS. Larger payloads, you store in S3 and send a pointer. SQS, SNS, EventBridge messages also cap at 256 KB each.
Response streaming. There's a way around the 6 MB response limit if you use Function URLs or Lambda's response streaming mode. You flush chunks. The cap goes up to 20 MB soft, with a bandwidth ceiling. Not all clients support it.
Environment variables: 4 KB total. Per function, all keys plus values combined. If you have a big config that won't fit, you go to Parameter Store or Secrets Manager.
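If you ever hit that 4 KB wall, the fallback at module scope looks something like this; the parameter name is a placeholder:

import boto3

ssm = boto3.client("ssm")

# Read once per cold start, cached in the module for every warm invocation.
CONFIG = ssm.get_parameter(
    Name="/pdf-scanner/config",
    WithDecryption=True,
)["Parameter"]["Value"]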
Packaging limits
Zip upload directly: 50 MB. Above that, you upload via S3 first.
Zip unzipped — function plus all layers extracted: 250 MB. So aioboto3 plus its dependencies is around 50 MB. We have headroom but not infinite.
Container image: 10 GB per image. This is what you reach for when you'd otherwise blow the 250 MB zip ceiling. ML dependencies with native binaries, that kind of thing.
Layers: max five per function. Order matters. Later layers overwrite earlier ones. They count toward the 250 MB unzipped cap.
Concurrency and scaling
Account concurrent executions: default 1000 per region. It's a soft quota, you can request an increase via Service Quotas. The single most common throttling cause in production.
Burst concurrency: 500 to 3000 immediate, depending on the region. That's how many fresh environments AWS will spin up right now at a traffic spike. Beyond the burst, scaling adds 500 environments per minute. So a spike from zero to 5000 concurrent requests takes several minutes to fully absorb.
Reserved concurrency: from zero up to your account quota. It carves a slice of the account pool for a function. Setting it to zero effectively disables the function, which is sometimes useful as a circuit breaker.
Provisioned concurrency: zero by default. Pre-warmed environments. Eliminates cold starts for the warmed slots, costs you for the idle capacity.
Time and rate limits at the edges
API Gateway integration timeout: 29 seconds. Hard cap. Doesn't matter what your Lambda timeout says. If you need longer with API Gateway in front, you return a job ID and have the client poll. Function URLs allow up to 15 minutes.
Async invocation event age: 6 hours. If retries don't succeed in that window, the event gets dropped — or sent to a DLQ or on-failure destination if you configured one.
Async retry attempts: default is 2. So three attempts total, including the original. Configurable down to zero.
SQS visibility timeout requirement: at least six times the function timeout. AWS recommendation. Otherwise messages reappear while still being processed and you do the work twice.
The memorization hack: three numbers cover most interview questions. Fifteen minutes, ten gigabytes, six megabytes. Everything else is a footnote until you hit a specific design.
4. Cold starts
What triggers one
A cold start happens whenever Lambda must create a new execution environment. The first request after a deployment. A traffic spike beyond the number of warm environments. An idle environment that AWS has recycled — somewhere between five and fifteen minutes of inactivity, no promise.
Deployments always cold-start the incoming version. You can't avoid the first one, only reduce how long it takes.
The cold path
What actually happens during a cold start. AWS provisions a Firecracker microVM. Boots it, attaches the network, mounts the filesystem. Downloads and unpacks your code, or pulls the container image. Starts the language runtime. Runs your module-level code. Only then is your handler called.
The timeline:
1 — environment provisioning. MicroVM boot, network, filesystem. Not billed. AWS absorbs this.
2 — init phase. Your module-level code: imports, client construction, config reads. Billed at full configured memory. Capped at 10 seconds.
3 — handler phase. handler(event, context). Billed per millisecond.
CloudWatch shows this split. The REPORT line includes an "Init Duration" only on cold invocations. Warm invocations have no Init Duration line at all.
Real numbers
Python 3.13 with minimal deps: about 150 ms median, 400 ms p99.
Python 3.13 with our aioboto3 and aiofiles: about 300 ms median, 700 ms p99.
Node.js 22: 100 ms median, 300 ms p99.
Java 21 without SnapStart: one to two seconds median, three to five at p99.
Java 21 with SnapStart: about 200 ms median, 600 ms p99.
Container image, any runtime: add 100 to 300 ms. The first pull after a deploy can be 1 to 3 seconds.
Mitigations
Each one is interesting because it's its own product.
Provisioned Concurrency — pre-warms N environments so they're always in the warm state. Eliminates cold starts for those slots. You pay for them 24/7 even when idle. Use for latency-sensitive, predictable-traffic paths. Schedule the changes via Application Auto Scaling for cost efficiency.
ARM64. Graviton2 executes the init phase about 10% faster than x86 for CPU-bound init work. Combined with the 20% price reduction, ARM64 is the default choice unless a native wheel blocks you.
Smaller packages. Lambda downloads and unpacks your zip on every cold start. Trimming unused transitive dependencies, stripping test and doc files, shaves real time. Every megabyte of extracted code costs a few milliseconds.
Lazy imports. Move rarely-used or slow imports inside the handler, or behind a lazy-init guard. The most common win is heavy ML libraries you only need for inference. Import them on first call, cache the result in a module-level variable.
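The lazy-init guard looks roughly like this; heavy_inference_lib stands in for whatever slow import you're deferring:

_model = None

def _get_model():
    global _model
    if _model is None:
        import heavy_inference_lib                   # slow import paid on the first call only
        _model = heavy_inference_lib.load("/opt/model.bin")
    return _model                                     # warm invocations reuse the cached object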
SnapStart. Java only. Takes a snapshot of the initialized JVM state after init, restores from snapshot on cold starts. Collapses 1 to 5 seconds of JVM startup to about 200 ms. Not available for Python or Node.
When cold starts don't matter: batch jobs, async event pipelines, scheduled tasks. Nobody is waiting on the p99. Only optimize cold starts when a human is waiting synchronously for the response.
5. Concurrency
The fundamental model
Lambda concurrency equals the number of execution environments processing requests at the same instant. Each environment handles exactly one invocation at a time. There is no thread pool, no event loop shared across invocations. If two requests arrive simultaneously, AWS spins up two separate environments.
The equation
This is the cleanest equation in cloud computing. Memorize it.
Concurrency, roughly, equals requests per second times average duration in seconds.
A hundred RPS with a 200 ms average duration: 100 times 0.2 equals 20 concurrent environments.
A hundred RPS with a 500 ms average: 50 environments. With a 2-second average: 200. Same traffic, ten times the footprint, because the function is slower.
Latency optimization directly reduces your concurrency footprint. They're the same problem.
The account pool
Every AWS account has a regional concurrency quota. Default 1000 concurrent executions per region, shared across all functions in that region.
When the pool is full, new invocations get throttled. Synchronous calls get HTTP 429 TooManyRequestsException. Async calls get queued and retried. Raising the limit is a Service Quotas request — AWS typically grants up to 10 000 with a business justification.
This is the single most common production surprise. One function spikes and starves all others in the same region. Reserved concurrency is the fix.
Three types of concurrency
Unreserved. Draws from the shared regional pool on demand. Cost: invocation plus duration only. Use for most functions.
Reserved. Carves a slice of the regional pool exclusively for one function. It acts as both a floor and a ceiling. No extra charge. Use for protecting critical paths from noisy neighbors, or for capping cost runaway.
Provisioned. Pre-warms N environments. They stay initialized 24/7. Costs Provisioned Concurrency hours plus invocation cost on top. Use for latency-sensitive functions where cold starts are unacceptable.
Reserved concurrency edge cases
Setting reserved concurrency to zero disables the function entirely. Useful as a circuit breaker.
Reserved concurrency counts against the account pool even when idle. If you set 500 reserved on one function, only 500 remain for all other functions at the default 1000.
Reserved concurrency does not pre-warm. You still cold-start. You just can't scale past the cap.
Burst scaling
When traffic spikes from zero, Lambda can spin up environments quickly — but not infinitely fast. The burst limit is region-dependent, typically 500 to 3000 immediate. Beyond that, it adds 500 new environments per minute. A spike from 0 to 5000 concurrent requests takes several minutes to fully absorb. Provisioned Concurrency or pre-warming via a ping mechanism is the fix for sudden large spikes.
The interview-answer template: "Concurrency equals RPS times duration. Default pool is 1000 per region. Reserved carves a slice and prevents both starvation and runaway. Provisioned pre-warms to eliminate cold starts but you pay for idle capacity."
6. Triggers
Three invocation models
Every trigger falls into one of three models. The model determines retry behavior, error handling, and whether the caller can see the response.
Synchronous. The caller blocks for the response. Gets the result or the error directly. No retries — that's the caller's responsibility. Max event size: 6 MB request and response.
Asynchronous. The caller gets a 202 immediately. Lambda queues and retries internally. Two retries, three attempts total, over up to 6 hours. Max event size: 256 KB.
Poll-based, also called event source mapping or ESM. Lambda polls the source on your behalf and batches records. Keeps retrying until success or the record expires. Event size depends on the source.
The trigger catalog
API Gateway, REST or HTTP. Synchronous. 29-second integration timeout regardless of Lambda timeout. HTTP API is cheaper and lower-latency than REST API. API Gateway turns the HTTP request into the Lambda event and maps the handler's return value back into an HTTP response.
Function URL. Synchronous. A direct HTTPS endpoint on the function. No API Gateway layer. Supports up to 15 minutes timeout and response streaming. Simpler, cheaper, fewer features.
Application Load Balancer. Synchronous. Like API Gateway but routes at L7. Useful when Lambda is one target among EC2 or ECS targets. 29-second timeout.
S3 event notification. Asynchronous. Fires on object create, delete, etc. At-least-once delivery. A large PUT creates exactly one event per object, but notifications can duplicate. Common pattern: S3 to SNS to SQS to Lambda for fan-out plus replay.
SNS. Asynchronous. Fan-out: one message to multiple subscribers. At-least-once. Dead-letter queue lives on the subscription, not the topic.
EventBridge. Asynchronous. An event bus with content-based routing rules. Also the managed scheduler — cron and rate expressions, timezone-aware since 2022. At-least-once.
SQS. Poll-based. Lambda polls and batches up to 10 000 messages. Standard queues are at-least-once and unordered. FIFO queues are ordered per message group, exactly-once with dedup. Visibility timeout has to be at least 6 times the function timeout. Partial batch failure via batchItemFailures.
Kinesis Data Streams. Poll-based. One Lambda shard per stream shard. Records expire — 24 hours to a year. Lambda retries until success or expiry. Use bisect-on-error and batchItemFailures to avoid one bad record blocking an entire shard.
DynamoDB Streams. Poll-based. Captures item-level changes. Ordered per partition key. 24-hour retention. Same retry behavior as Kinesis. Use it for change-data-capture patterns.
Step Functions. Synchronous, when the state machine has a Task state pointing at the function. Step Functions calls it synchronously and waits for the result. Retries and timeouts are defined in the state machine, not in Lambda.
Cognito, SES, IoT, others. Service-specific. Cognito triggers like pre-signup or pre-token are sync and block the auth flow.
SQS vs SNS plus SQS
Use plain SQS to Lambda when you have one consumer and want to buffer, batch, and retry. Use SNS to SQS to Lambda when you need fan-out — multiple independent consumers each get a copy — or when the producer is an AWS service that speaks SNS natively. The SNS layer decouples producers from the queue topology.
7. IAM and permissions
Two independent permission layers
Lambda has two separate permission surfaces. Each must be correct independently. Confusing them is the most common "it works locally but not in AWS" failure.
Execution role: what can this Lambda function do once it's running? Call S3, write to DynamoDB, publish to SNS. You attach this at function creation.
Resource policy: who is allowed to invoke this Lambda function? API Gateway, another AWS account, EventBridge. AWS adds this automatically for most triggers when you wire them up through the console. You add it manually for cross-account grants.
Execution role
The execution role is an IAM role that Lambda assumes when running your function. Every Lambda must have one. The attached policies determine what AWS API calls the function can make. The minimum, for any function, is permission to write its own logs:
logs:CreateLogGroup
logs:CreateLogStream
logs:PutLogEvents
For our function, which reads and writes S3, you need at minimum:
s3:GetObject
s3:PutObject
s3:ListBucket # needed for the paginator; often forgotten
kms:Decrypt # if the bucket uses a CMK
The AWSLambdaBasicExecutionRole managed policy covers logs only. It is intentionally minimal. AWSLambdaVPCAccessExecutionRole adds the ENI permissions needed when the function is in a VPC.
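Sketched in code, with the bucket name from the defaults and the logs/KMS statements left out, the S3 half of that role could be created like this; the ARN scoping matters, as the mistakes below explain:

import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-company-reports-bucket/*",   # object ARN, with /*
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-company-reports-bucket",     # bucket ARN, no /*
        },
    ],
}

boto3.client("iam").create_policy(
    PolicyName="pdf-scanner-s3",              # policy name is a placeholder
    PolicyDocument=json.dumps(policy),
)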
Resource policy
The resource policy is attached to the Lambda function itself, not to an IAM identity. When you add an S3 event notification or an API Gateway integration through the console, AWS automatically adds a resource policy entry allowing that service to invoke the function. For cross-account invocations, you add this manually:
aws lambda add-permission \
--function-name my-function \
--principal 123456789012 \
--action lambda:InvokeFunction \
--statement-id cross-account-invoke
The four common mistakes
1: missing s3:ListBucket on the bucket resource. ListObjectsV2, which our paginator uses, requires ListBucket on the bucket ARN — not on the object ARN. Forgetting this causes AccessDenied on the paginator even when GetObject works fine on individual files. This is the most common one.
2: wrong resource ARN scope. s3:GetObject belongs on arn:aws:s3:::bucket-name/*. s3:ListBucket belongs on arn:aws:s3:::bucket-name. Without the wildcard. Swapping the two is a frequent typo.
3: KMS not in the execution role. If the bucket's objects are encrypted with a customer-managed key, you need both s3:GetObject and kms:Decrypt. The KMS key policy must also allow the role. Two separate policy documents, two separate denial points.
4: no resource policy for a manually wired trigger. If you wire EventBridge through the CLI or the SDK and skip the console, the trigger silently fails because there's no resource policy entry granting EventBridge lambda:InvokeFunction.
Diagnosing permission errors
CloudTrail is the ground truth. Filter by errorCode: "AccessDenied" and userIdentity.arn matching the execution role ARN. The event tells you exactly which action on which resource was denied. CloudWatch will show the error in the Lambda log if you let the exception propagate, but CloudTrail shows it even when the calling library swallows the error.
8. Packaging
Three deployment formats
Zip, direct upload. Up to 50 MB upload, 250 MB unzipped. Best for most Python and Node functions with pure-Python or pre-built wheels. Must match Lambda's architecture. No custom runtime.
Zip via S3. Same 250 MB unzipped limit, but the zip itself can be larger because you're uploading to S3 first. The S3 bucket has to be in the same region as the function.
Layers. 250 MB total — that's the function plus all layers combined. Best for shared dependencies across functions, like a company-wide logging layer. Maximum five layers per function. Later layers overwrite earlier ones in the merge.
Container image. Up to 10 GB per image. Best for ML models, native binary deps, custom runtimes. Slower first cold start because of the image pull. Larger attack surface.
Layers in practice
A layer is a zip file that Lambda extracts into /opt before running your function. Your code in /var/task can import from /opt/python for Python without any path manipulation.
Use cases. Shared internal libraries deployed independently of business logic. Large dependencies that change rarely — numpy, pandas — cached in a layer so deployments of the business logic stay fast. AWS-provided layers like the Lambda Insights extension or the X-Ray SDK.
Layers count toward the 250 MB unzipped limit. If you have 5 layers at 40 MB each, plus a 50 MB function zip, you're at 250 MB. No room left.
Container images
Container images must be based on AWS-provided base images, like public.ecr.aws/lambda/python:3.13, or implement the Lambda Runtime Interface. They must be stored in ECR — Elastic Container Registry — in the same region as the function.
The Lambda service caches images on the underlying host after the first pull, so subsequent cold starts on the same host are fast. The very first invocation after a new image is deployed can be slow for large images.
Container images bypass the 250 MB unzipped limit. That's why they're the standard for Python ML workloads that bundle PyTorch or TensorFlow.
ARM64 vs x86_64
Graviton2-based ARM64 is about 20% cheaper per GB-second than x86_64. Typically faster at compute-heavy work, too.
The decision tree. Check all your dependencies for ARM64 wheels. Run pip download with --platform manylinux2014_aarch64 and --only-binary :all:. If any fail, you either build from source (which needs a Dockerfile) or stay on x86. For pure-Python deps and most modern packages, ARM64 works out of the box. Native extensions like cryptography, numpy, psycopg2 — they've had ARM64 wheels on PyPI since around 2022. Check the exact version you need.
The common foot-gun
Lambda runs on Amazon Linux 2023. pip install on macOS produces wheels compiled for macOS, which will segfault or import-error on Lambda.
The fix is to build inside the Lambda runtime image:
docker run --rm --entrypoint "" \
    -v "$PWD":/var/task \
    public.ecr.aws/lambda/python:3.13 \
    pip install -r requirements.txt -t python/
zip -r layer.zip python/
Architecture matters here too. Use the :3.13-arm64 tag when building for ARM64.
For our project specifically: aioboto3 and aiofiles are pure-Python and have no native extensions, so they build cleanly on any architecture. The Makefile creates a local .venv for development. A real CI pipeline would build the deployment zip inside the Lambda image.
9. VPC and networking
Default: no VPC
By default, Lambda runs in an AWS-managed network with internet access. It can reach S3, DynamoDB, SQS, and other AWS services via their public endpoints.
Do not put Lambda in a VPC unless you have a specific reason. Most applications don't need it. This is a strong default. The mistakes that come from VPC placement are expensive in dollars and in latency.
When you actually need VPC
Connecting to RDS or Aurora, which live in private subnets. ElastiCache — Redis or Memcached — which is VPC-only by design. Private REST APIs or internal services on private subnets. Compliance requirements that mandate network isolation.
S3, DynamoDB, SQS, SNS, and most AWS managed services do not require VPC placement. They're public services with public endpoints.
ENI attachment and cold start
When Lambda is VPC-attached, each execution environment gets an Elastic Network Interface — an ENI — in your VPC. Pre-2019, ENIs were allocated per cold start, adding 10 to 30 seconds to init. AWS fixed this in 2019 with hyperplane ENIs that are shared across environments. Today the VPC cold start penalty is about 100 to 500 ms on the first cold start of a new deployment, then negligible. It's no longer the dealbreaker it used to be, but it's not zero.
Subnet and AZ placement
Specify at least two subnets in different AZs for availability. Lambda will distribute environments across AZs. If a subnet runs out of available ENI slots — IP exhaustion — Lambda scaling fails. Size subnets accordingly. A /24 with 254 IPs is often too small for a high-concurrency function.
The NAT money pit
VPC Lambda can't reach the internet by default. If your function needs to call an external API, or reach an AWS service for which there's no VPC endpoint, you need a NAT gateway in a public subnet.
NAT gateways cost: $0.045 per hour, which is about $32 a month, just to exist. Per AZ. Plus $0.045 per gigabyte of data processed.
A function that pushes 100 GB a month through NAT costs $4.50 in data alone, on top of the always-on hourly charge. Two AZs for HA: about $64 a month base cost before a single byte of traffic.
This is frequently the largest unexpected cost in VPC Lambda setups.
VPC endpoints — the free alternative
For AWS services, VPC endpoints bypass NAT and the public internet entirely.
Two types. Gateway endpoints — S3 and DynamoDB only. Free. They're route table entries. No data charge. Interface endpoints, also called PrivateLink — any AWS service. $0.01 per AZ per hour, plus $0.01 per gigabyte. Expensive at high throughput, but often cheaper than NAT for AWS-service-heavy workloads.
For a VPC Lambda that only talks to S3 and DynamoDB: create gateway endpoints for both. No NAT needed. Near-zero networking cost.
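Creating the S3 gateway endpoint is a single call; the VPC and route table IDs here are placeholders:

import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint for S3: a route-table entry, no hourly or per-GB charge.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                 # placeholder
    ServiceName="com.amazonaws.eu-west-1.s3",      # adjust to your region
    RouteTableIds=["rtb-0123456789abcdef0"],       # placeholder
)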
Security groups
VPC Lambda gets a security group. Outbound rules control where it can connect. The security group of the RDS or ElastiCache instance must allow inbound from the Lambda's security group.
A common pattern: create a dedicated Lambda SG, reference it in the database's SG inbound rules. Avoids IP-range rules that break when Lambda ENIs change.
10. Observability
CloudWatch Logs — what you get for free
Every Lambda function automatically writes to a CloudWatch Log Group named /aws/lambda/<function-name>. Each execution environment gets its own Log Stream.
Lambda writes three special lines automatically:
START RequestId: abc-123 Version: $LATEST
END RequestId: abc-123
REPORT RequestId: abc-123 Duration: 312.45 ms Billed Duration: 313 ms
Memory Size: 256 MB Max Memory Used: 89 MB
Init Duration: 423.12 ms # only on cold starts
The REPORT line is your free performance telemetry. Init Duration appears only on cold invocations. Max Memory Used helps you right-size memory configuration.
Retention defaults to "Never Expire." Set it explicitly. 7, 14, or 30 days covers most needs. Every megabyte of retained logs costs money.
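Setting retention is one call; the log group name assumes a function called pdf-scanner:

import boto3

logs = boto3.client("logs")

# Replace "Never Expire" with a bounded window.
logs.put_retention_policy(
    logGroupName="/aws/lambda/pdf-scanner",
    retentionInDays=14,
)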
Structured logging
Emit JSON instead of plain strings. CloudWatch Logs Insights can filter and aggregate JSON fields efficiently. Plain strings require regex and are slow.
import json, logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    logger.info(json.dumps({
        "event": "pdf_scan_start",
        "bucket": BUCKET,
        "prefix": PREFIX,
        "request_id": context.aws_request_id,
    }))
With this, Logs Insights can run something like filter event = "pdf_scan_start" | stats count() by bin(5m) in seconds.
X-Ray tracing
X-Ray gives you request traces across services. How long the Lambda itself ran, versus how long the S3 calls took.
Three things have to all be true for X-Ray to work.
1: tracing enabled on the function. Console toggle, or TracingConfig: Active in SAM or CDK.
2: the X-Ray SDK instrumented in your code. from aws_xray_sdk.core import patch_all; patch_all() wraps boto3 calls automatically.
3: IAM permission. The execution role needs xray:PutTraceSegments and xray:PutTelemetryRecords.
Without all three, traces are either absent or incomplete. People flip one and conclude X-Ray is broken.
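With all three switches flipped, the code side stays small; the subsegment name is just illustrative:

from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()   # instruments boto3, so every S3 call shows up as its own subsegment

def handler(event, context):
    with xray_recorder.in_subsegment("build_manifest"):   # optional custom subsegment
        ...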
Lambda Insights
Lambda Insights is a CloudWatch feature, not a separate service. It surfaces system-level metrics: CPU usage, memory utilization, network I/O, disk I/O — things the REPORT line doesn't include.
To enable: add the Lambda Insights extension layer (the ARN follows the pattern arn:aws:lambda:<region>:580247275435:layer:LambdaInsightsExtension:38), and add cloudwatch:PutMetricData to the execution role.
Useful when you suspect memory or CPU contention but the REPORT line's "Max Memory Used" isn't granular enough.
EMF — Embedded Metrics Format
EMF lets you emit custom CloudWatch metrics by writing structured JSON to stdout. No PutMetricData API call needed. CloudWatch extracts the metric from the log line asynchronously.
This is far more efficient than calling CloudWatch from inside the handler. A PutMetricData call adds latency and cost per invocation. EMF is essentially free.
import json, time

# `count` is whatever the handler computed, e.g. the number of PDFs listed
print(json.dumps({
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "MyApp",
            "Dimensions": [["Function"]],
            "Metrics": [{"Name": "PDFsProcessed", "Unit": "Count"}]
        }]
    },
    "PDFsProcessed": count,
    "Function": "pdf-scanner",
}))
Prometheus and Grafana, briefly
Prometheus uses a pull model. It scrapes HTTP endpoints. Lambda functions are ephemeral and have no persistent HTTP endpoint, so Prometheus can't scrape them directly. Three approaches:
EMF to CloudWatch to Grafana. Easiest. Grafana queries CloudWatch as a data source. Zero extra infrastructure.
Amazon Managed Prometheus with remote_write. Lambda pushes metrics to AMP via the Prometheus remote write API. Grafana, or Amazon Managed Grafana, reads from AMP. Requires the prometheus_client library and SIGV4 signing on the request.
A push gateway. Lambda pushes to a persistent push gateway. Prometheus scrapes the gateway. More infrastructure to manage, plus stale metric risk if the push gateway isn't flushed between invocations.
For Lambda-centric dashboards, the CloudWatch-to-Grafana path is usually the simplest to operate.
11. Async and errors
Sync vs async invocation
Synchronous, called RequestResponse. The caller blocks. Waits for the result. Response is visible to the caller. No retries — that's the caller's responsibility. Max event size: 6 MB.
Asynchronous, called Event. The caller gets a 202 immediately. The response is not visible to the caller. Lambda retries automatically — twice, three attempts total. Backoff: about 1 minute, then about 2 minutes. Event age limit: 6 hours. Max event size: 256 KB.
The async retry flow
When Lambda invokes asynchronously and the function throws an unhandled exception, or gets throttled, Lambda retries. Twice. Exponential backoff starting at about a minute.
If all three attempts fail, or if the event ages past 6 hours, Lambda sends the event to the configured failure destination or DLQ. If neither is configured, the event is silently dropped.
DLQ vs Destinations
These are two different mechanisms that overlap in purpose but have different capabilities.
Dead-Letter Queue, introduced in 2016. Triggers on failure only. Payload is the original event only. Targets: SQS or SNS.
Event Destinations, introduced in 2019. Triggers on either success or failure, with separate configurations for each. Payload includes the original event plus the result or error plus metadata. Targets: SQS, SNS, Lambda, EventBridge.
Use Destinations for new code. DLQ is still useful when the downstream consumer must be SQS and you don't need success notifications.
Idempotency
Because async invocations retry, and most event sources are at-least-once, your handler will occasionally execute more than once for the same logical event. Design handlers to be idempotent: the same input produces the same outcome regardless of how many times it runs.
The standard pattern is to use a unique key from the event — S3 ETag plus key, SQS MessageId, EventBridge detail.id — as a deduplication key. On first execution, write the key plus the result to DynamoDB with a TTL. On retry, check DynamoDB first. If already processed, return the cached result without re-running.
# inside the handler; `table` is a DynamoDB Table resource created at module
# scope, `now` is the current epoch time in seconds
dedup_key = event["Records"][0]["messageId"]
existing = table.get_item(Key={"id": dedup_key})
if existing.get("Item"):
    return existing["Item"]["result"]

result = do_the_work(event)
table.put_item(Item={
    "id": dedup_key,
    "result": result,
    "ttl": now + 86400,
})
return result
AWS PowerTools for Lambda — Python — has a built-in @idempotent decorator that implements this pattern with DynamoDB.
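Roughly, with Powertools, the hand-rolled snippet above collapses to this; the table name is a placeholder:

from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    idempotent,
)

persistence = DynamoDBPersistenceLayer(table_name="idempotency-store")  # placeholder table

@idempotent(persistence_store=persistence)
def handler(event, context):
    return do_the_work(event)   # runs once; retries get the stored result back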
For our function: it's idempotent in spirit, because re-running it produces a fresh manifest with new presigned URLs. The previous manifests stay in S3 unless we clean them up. If the requirement was "exactly one manifest per logical job," we'd add a dedup table.
Partial batch failures
When Lambda processes a batch of records and one record fails, the default behavior differs by source.
SQS by default: if the handler raises an exception, the entire batch is retried. One bad message blocks all others and can cause infinite retry loops.
With ReportBatchItemFailures enabled, you return a batchItemFailures list containing only the failed message IDs. Lambda re-queues only those. Successful messages are deleted.
def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
Enable ReportBatchItemFailures in the ESM configuration. Always implement partial-batch failure reporting for SQS and Kinesis handlers. A single poison-pill record can otherwise block an entire shard or queue indefinitely.
The intersection that bites: with partial failures, successful records in the batch are deleted from SQS. But if your function crashes before returning the failure list, the entire batch including the successes gets retried. Idempotency guards must cover every record, not just the ones in batchItemFailures.
12. Step Functions
When Lambda alone isn't enough
A single Lambda function works well for one discrete task. Problems start when you need to chain multiple tasks, retry selectively, wait on human approval, or fan out across thousands of items.
Doing this with Lambda alone means writing orchestration logic inside your functions. Tracking state, implementing retry delays, deciding what "done" means. Step Functions externalizes that orchestration into a state machine where every state transition is durable, auditable, and resumable.
Reach for Step Functions when you need: sequential steps with state passing, conditional branching, parallel fan-out with join, wait states longer than 15 minutes, retry-with-exponential-backoff built in.
Standard vs Express
Standard. Max duration: 1 year. Execution semantics: exactly-once per state. Full execution history in the AWS console. Pricing: $0.025 per 1000 state transitions. Use for: long-running business workflows, human approvals, compliance audit trails.
Express. Max duration: 5 minutes. Execution semantics: at-least-once. CloudWatch Logs only — no per-execution audit trail. Pricing: per request (about $1 per million) plus compute duration. Use for: high-volume short-duration event processing — IoT, streaming.
For most application orchestration, Standard is the right choice. The exactly-once semantic matters when steps have side effects — charging a card, sending an email. Express is for high-throughput pipelines where at-least-once is acceptable and cost per transition matters.
The Map state — fan-out
The Map state runs the same workflow branch for every item in an array, in parallel. This is the core fan-out primitive.
For our project, a Step Functions version could fan out across S3 prefixes — run one Lambda per prefix, collect results in a fan-in step:
{
    "Type": "Map",
    "ItemsPath": "$.prefixes",
    "MaxConcurrency": 10,
    "Iterator": {
        "StartAt": "ScanPrefix",
        "States": {
            "ScanPrefix": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:...:function:pdf-scanner",
                "End": true
            }
        }
    }
}
MaxConcurrency: 0 means unlimited — bounded only by the Lambda concurrency pool. Set an explicit cap to avoid saturating the account quota.
Other useful states
Wait — pause for a duration or until a timestamp. The only way to implement delays longer than 15 minutes without polling.
Choice — conditional branching on input values. Replaces if/else logic that would otherwise live inside a Lambda.
Parallel — run multiple independent branches simultaneously and join their results.
Task with SDK integrations — Step Functions can call DynamoDB, SQS, ECS, Glue, etc. directly without a Lambda wrapper. Reduces cost and latency for simple operations.
Step Functions vs Airflow
DAG definition. Step Functions: JSON or YAML state machine, called Amazon States Language. Airflow: Python code, DAG files.
Scheduling. Step Functions: event-driven, on-demand, cron via EventBridge. Airflow: built-in rich scheduler, cron, data-interval-aware.
Backfill. Step Functions: manual or custom. Airflow: first-class, built-in.
Operators. Step Functions: AWS services plus Lambda, AWS ecosystem only. Airflow: 600-plus providers including Spark, BigQuery, dbt, Kubernetes.
Infrastructure. Step Functions: serverless, zero infra. Airflow: managed Airflow (MWAA) starts at about $400 a month.
Debugging. Step Functions: console execution graph, CloudWatch for logs. Airflow: rich UI with task logs, Gantt charts, retry visualization.
Step Functions is the right choice when your workflow is AWS-native, event-driven, and you want zero infrastructure. Airflow is the right choice when you need complex scheduling, data-interval backfill, cross-cloud operators, or a data-engineering team that already knows Python DAGs.
13. Cost
The pricing formula
Two components, both with permanent free tiers.
Requests. $0.20 per million on x86, same on ARM. Free tier: 1 million per month, forever.
Duration. $0.0000166667 per GB-second on x86. $0.0000133334 per GB-second on ARM — about 20% cheaper. Free tier: 400 000 GB-seconds per month, forever.
GB-seconds equals memory configured in gigabytes times duration in seconds. A 512 MB function running for 300 ms is 0.5 times 0.3, which is 0.15 GB-seconds. At a million invocations, that's 150 000 GB-seconds. Well inside the free tier.
Duration is billed in 1 ms increments. The old 100 ms minimum is gone, since 2020.
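The whole formula fits in a few lines, free tier ignored:

def lambda_cost(invocations, avg_ms, memory_mb, arm=False):
    # GB-seconds = memory in GB x duration in seconds, summed over all invocations
    gb_s_price = 0.0000133334 if arm else 0.0000166667
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * 0.20
    return request_cost + gb_seconds * gb_s_price

# the worked example above: a million 300 ms invocations at 512 MB
print(lambda_cost(1_000_000, 300, 512))   # ~2.70 dollars before the free tier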
Memory vs cost — more can be cheaper
CPU scales linearly with memory. A function configured at 1769 MB gets a full vCPU. Below that, it's a fraction. Doubling memory often more than halves duration for CPU-bound work. The total GB-seconds cost stays the same or decreases. Latency drops.
AWS Lambda Power Tuning is a Step Functions state machine that automatically benchmarks your function at multiple memory sizes and produces a cost/performance curve. Run it before guessing at the right memory setting. The optimal point is almost never the default 128 MB.
ARM64 saves about 20%
ARM64 duration pricing is 20% cheaper than x86. Same request price.
If your function is compute-bound — not I/O-bound sleeping on S3 calls — ARM64 also runs faster, compounding the saving.
For I/O-bound functions like ours, which spend most of their time waiting on S3, the duration difference is smaller. The 20% price reduction still applies.
Provisioned Concurrency billing
Provisioned Concurrency is billed separately. $0.0000097222 per GB-second of provisioned time on x86. Even when idle.
Math: 10 environments at 512 MB provisioned for 24 hours. 10 times 0.5 GB times 86 400 seconds is 432 000 GB-seconds per day. About $4.20 a day. About $126 a month. Just for the warm slots. Before you count any actual invocation cost on top.
Provisioned Concurrency is for latency, not cost. It always increases your bill.
The hidden costs — the real bill
NAT Gateway. $0.045 per hour per AZ (about $32 a month) plus $0.045 per gigabyte. Often the largest line item for VPC Lambda.
API Gateway. REST API: $3.50 per million calls. HTTP API: $1 per million. Can dwarf Lambda cost at high RPS.
CloudWatch Logs. $0.50 per gigabyte for ingestion, $0.03 per gigabyte per month for storage. Verbose Lambda logs accumulate fast. Set retention.
Lambda Insights. Additional CloudWatch Logs plus custom metrics charges.
X-Ray. $5 per million traces, after the free 100 000 per month.
Data transfer. Traffic leaving a region or going through a NAT has per-gigabyte charges.
S3 API calls. LIST and GET requests are billed per 1000. A function that does 10 000 LIST calls per invocation, at a million invocations, is 10 billion API calls. Real money.
For our function, at 1000 invocations a day with 500 ms average duration and 256 MB memory: about $0.002 per day. Essentially free. Lambda's economics only require attention above about 100 000 invocations a day with non-trivial memory or duration.
14. Local dev
The local dev problem
Lambda has no local runtime by default. Without tooling, your only loop is: zip, upload, invoke, read CloudWatch logs, repeat. Minutes per cycle.
The tools below collapse that to seconds. Different trade-offs between fidelity, setup cost, and scope.
SAM CLI
What it is: AWS's official local Lambda emulator. Wraps Docker to run your function inside a container that matches the Lambda runtime exactly. Also emulates API Gateway.
sam local invoke -e event.json
sam local start-api
sam local invoke --debug-port 5858
Fidelity is high. Same Amazon Linux image, same runtime, same filesystem layout. Catches architecture issues, like an x86 wheel running on ARM64, that a plain venv would miss.
Downsides. Requires Docker. Slow to start because it pulls the image on first run. No MinIO or SQS or DynamoDB emulation built in. You wire those up separately.
Lambda Runtime Interface Emulator — RIE
A lightweight binary embedded in all AWS-provided Lambda base images. When you run the image locally, RIE exposes a local HTTP endpoint that accepts invocations in the Lambda API format. You don't need SAM CLI — just Docker:
docker build -t my-fn .
docker run -p 9000:8080 my-fn
curl -XPOST http://localhost:9000/2015-03-31/functions/function/invocations \
-d '{"key": "value"}'
Use RIE when you're building container-image Lambdas and want to test them without the SAM overhead.
LocalStack
A full AWS mock that emulates Lambda, S3, SQS, DynamoDB, API Gateway, and dozens more services in a single container. Community edition is free. Pro is $35 a month for more services and persistent state.
Use it when you need integration tests across multiple AWS services. An EventBridge rule that triggers a Lambda that writes to DynamoDB, all on your laptop. Without LocalStack you'd need a real AWS account for these tests.
Avoid it when you only need one service — just S3, use MinIO; just Lambda, use SAM or RIE. LocalStack's Lambda emulation has occasional edge-case differences from the real runtime.
docker run --rm -p 4566:4566 localstack/localstack
AWS_DEFAULT_REGION=us-east-1 \
AWS_ACCESS_KEY_ID=test \
AWS_SECRET_ACCESS_KEY=test \
aws --endpoint-url=http://localhost:4566 s3 ls
MinIO — what we use
MinIO is an S3-compatible object store that runs locally in Docker. It implements the S3 API precisely enough that boto3 or aioboto3 needs only an endpoint_url override to work against it.
It is not a Lambda emulator. It replaces S3 only.
make up # MinIO on :9000 (API) and :9001 (console)
SOURCE_DIR=~/pdfs make seed
make invoke
This is the lightest possible local setup. No Docker-in-Docker. No SAM overhead. Minimal latency. The function handler runs in your local Python process against a real S3-compatible store.
Differences from real Lambda — no execution environment lifecycle, no /tmp isolation between runs — are acceptable for the development loop, but not for environment-fidelity tests. For those you'd reach for SAM.
The decision matrix
Fast iteration on handler logic: MinIO plus invoke.py. Our setup.
Emulate Lambda runtime plus API Gateway locally: SAM CLI.
Test a container-image Lambda: Lambda RIE via Docker.
Integration test across multiple AWS services: LocalStack.
Full-fidelity staging before prod: real AWS account, separate environment.
15. CI/CD
Versions and aliases
Versions are immutable snapshots of a function's code and configuration. When you publish a version with aws lambda publish-version, AWS creates an immutable ARN like arn:...:function:my-fn:7. $LATEST is the only mutable version — it always reflects the most recent code upload.
Aliases are named pointers to a version. prod might point to version 7. staging might point to version 8.
Event source mappings, API Gateway integrations, Step Functions tasks — they should target aliases, not version ARNs. This decouples deployment (publishing a new version) from promotion (updating the alias).
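Concretely, "target the alias" looks like this from a caller's side; the payload is made up:

import json
import boto3

lam = boto3.client("lambda")

resp = lam.invoke(
    FunctionName="my-fn:prod",                  # alias-qualified, not :7 or :8
    Payload=json.dumps({"prefix": "2026/04/"}),
)
print(json.loads(resp["Payload"].read()))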
Traffic shifting — blue/green
An alias can split traffic across two versions with weighted routing.
aws lambda update-alias \
    --function-name my-fn \
    --name prod \
    --function-version 8 \
    --routing-config 'AdditionalVersionWeights={"7"=0.9}'
# 10% of prod traffic to v8, 90% still to v7
Start at 10% canary. Watch error rates in CloudWatch. Shift to 50%. Then 100%.
Rollback is instant — point the alias back to the stable version. No instance drain, no connection draining. Lambda is stateless. Cutover is atomic.
CodeDeploy integration
SAM and CDK can wire up CodeDeploy for automatic traffic shifting with automatic rollback on CloudWatch alarms. You declare the deployment preference in the template:
DeploymentPreference:
  Type: Canary10Percent5Minutes
  Alarms:
    - !Ref ErrorRateAlarm
CodeDeploy manages the alias weight changes and triggers the rollback if the alarm fires. Fully automated blue/green without manual traffic management.
Deployment tooling — the progression
AWS CLI or SDK. Good for one-off deployments, scripting, deep control. Verbose. No state management. Drift-prone at scale.
SAM, the CloudFormation extension. Good for Lambda-first projects. Built-in local testing. CodeDeploy integration. CloudFormation speed and YAML verbosity. AWS-only.
CDK. Good for complex infra in TypeScript or Python. Reusable constructs. Type safety. Still compiles to CloudFormation. Learning curve. Bootstrapping required.
Terraform with the AWS provider. Good for multi-cloud orgs, large existing Terraform estate, strong community modules. No built-in Lambda local testing. Plan-and-apply cycle slower than SAM deploy.
Serverless Framework. Multi-cloud serverless, plugin ecosystem. V3 to V4 became paid for teams. Community plugin quality varies.
A CI pipeline skeleton
jobs:
  deploy:
    steps:
      - uses: actions/checkout@v4
      - name: Build zip
        run: |
          docker run --rm --entrypoint "" -v "$PWD":/var/task \
            public.ecr.aws/lambda/python:3.13 \
            pip install -r requirements.txt -t package/
          cd package && zip -r ../function.zip . && cd ..
          zip function.zip lambda_function.py
      - name: Deploy
        run: |
          aws lambda update-function-code \
            --function-name my-fn --zip-file fileb://function.zip
          aws lambda wait function-updated --function-name my-fn
          aws lambda publish-version --function-name my-fn
          aws lambda update-alias --function-name my-fn \
            --name prod --function-version $VERSION
The wait function-updated call is important. update-function-code is asynchronous. publish-version has to wait for it to complete.
16. Pitfalls — the must-knows
Execution model
- Module-level state leaks across invocations. A list you append to in the handler grows forever on warm calls. A counter you increment is wrong by the second request. If it's mutable and lives at module scope, treat it as either a deliberate cache or a bug.
- Handler globals are shared by every invocation on that env, but not across envs. "I cached the result" works locally. In production, half your traffic gets the cached value, the other half doesn't, depending on which warm container they hit. Externalize, or accept the variance.
- /tmp is per-environment, not per-invocation. If you write /tmp/output.json with a fixed name, the next warm invocation finds yesterday's file. Always use a per-invocation suffix — UUID, request ID. Like our function does.
- Init phase has a hard 10-second cap. Importing TensorFlow, hydrating a 500 MB model, doing a network call at module scope — you can blow this budget on cold start. Defer expensive work until first handler call (lazy init), or move it to a layer that ships pre-warmed.
- asyncio.run in a sync handler creates a fresh event loop per invocation. Acceptable, but it means async clients can't be shared across invocations the way sync boto3 clients can. Profile before assuming async is faster. (More on this in the project walkthrough.)
Payload and size limits
- The 6 MB sync response cap bites outside your code. Returning a JSON list of 50 000 items "works" in the function, but the invoke fails with a payload-too-large error, so the API Gateway caller gets an error response instead of the list. The fix in our function — returning a presigned URL to a manifest file rather than the full list — is the standard pattern.
- API Gateway caps integration time at 29 seconds. Doesn't matter if your Lambda timeout is 15 minutes. For longer work, return a job ID and poll, or use Function URLs (15 min) with response streaming.
- Environment variables max 4 KB total. Big secrets — RSA keys, JSON config blobs — blow this. Parameter Store or Secrets Manager and read on init.
Concurrency and throttling
- Default account concurrency is 1000 per region. Most teams hit this before they realize. It sets a hard ceiling on RPS — at 100 ms latency, that's 10 000 RPS account-wide. At 1 second, 1000 RPS.
- Reserved concurrency at zero disables the function. Looks weird. Used as a circuit breaker.
- Provisioned concurrency double-bills. You pay for the warm slots and for invocations against them. Worth it for latency-sensitive paths. Wasteful for batch.
- Burst limit is regional and finite. A traffic spike from 0 to 5000 RPS will throttle until AWS scales up at +500 envs per minute. Provisioned concurrency or pre-warming is the fix.
Triggers, retries, idempotency
- Async invocation retries 2 times by default. Total 3 attempts. If your handler isn't idempotent, you can charge a card three times.
- S3, SNS, EventBridge invoke async — at-least-once. Plan for duplicates. SQS standard is also at-least-once. SQS FIFO and Kinesis are exactly-once-ish per shard but with their own quirks.
- SQS visibility timeout must be at least 6 times the function timeout. Otherwise the message comes back while you're still processing it, and you do the work twice or more.
- Partial batch failures need explicit signaling. Returning batchItemFailures for SQS or Kinesis tells AWS which records to retry. Otherwise the entire batch retries or none does.
- API Gateway error responses are JSON-shaped if you don't say otherwise. Throw an unhandled exception, the client sees a JSON body with errorMessage and errorType, status 502. Map errors yourself.
Networking, IAM, observability
- Putting Lambda in a VPC adds an ENI cold-start penalty — improved a lot in 2019, but still real for the first invocation. Only do it if you genuinely need private-subnet resources. Outbound internet from VPC Lambda needs NAT, which costs money 24/7.
- S3 access from a VPC Lambda needs a VPC gateway endpoint or NAT. Without one, your S3 calls hang and time out. Looks like a code bug, isn't.
- CloudWatch log groups default to "Never expire" retention. Verbose Lambdas can rack up real cost in CloudWatch Logs alone. Set retention — 7, 14, or 30 days — on every log group you create.
- Lambda execution role is implicit on every action. Forgetting s3:GetObject or kms:Decrypt on the bucket's CMK is the most common "but it works locally" failure. CloudTrail tells you what was denied.
- Resource policy versus execution role are different layers. Resource policy says "who can invoke this Lambda." Execution role says "what this Lambda can do." Both must allow.
- X-Ray needs an SDK call and tracing enabled on the function and IAM permission. Three switches. People flip one and conclude X-Ray is broken.
Deployment, dependencies, runtimes
- The boto3 in the Python runtime lags pip. If you need a recent API, bundle current boto3 in your zip.
- Native wheels must match Lambda's runtime architecture. pip install on a Mac and zip-uploading cryptography is a classic foot-gun. Build in a Docker image matching the Lambda runtime.
- ARM64 saves about 20% at the same memory, but some wheels are still x86-only. Audit your deps before flipping.
- Layers are merge-ordered. Later layers overwrite earlier. A "base" layer for shared dependencies works. Conflicting layers silently shadow each other.
- Container-image deploys are cached on the Lambda host. First cold start can be slow because of the image pull. Subsequent are normal. Keep images small even though the limit is 10 GB.
Time, scheduling, secrets
- EventBridge schedule (cron or rate) is always UTC. "9 AM" in your local time means something different in production. Use the newer EventBridge Scheduler (launched 2022) for time-zone-aware schedules.
- Async invocations have a 6-hour maximum event age. If retries fail past that, the event is silently dropped unless you've set a DLQ or on-failure destination.
- Secrets in env vars are visible to anyone with lambda:GetFunctionConfiguration. Encrypted at rest, plaintext in the console. Use Secrets Manager or Parameter Store for actual secrets. (Read-on-init sketch after this list.)
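The read-on-init pattern, as a minimal sketch. The secret name and its JSON shape are my own placeholders, not part of the project:

import json
import os

import boto3

# Init phase: runs once per cold start, cached for every warm invocation
_sm = boto3.client("secretsmanager")
_SECRET = json.loads(
    _sm.get_secret_value(SecretId=os.environ["SECRET_NAME"])["SecretString"]
)

def handler(event, context):
    api_key = _SECRET["api_key"]  # no Secrets Manager call on warm invocations
    ...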
The skim test: if you can re-state the cold-start split (Init / Handler), the 6 MB / 256 KB / 4 KB / 250 MB / 10 GB constants, and the difference between resource policy and execution role from memory — you'll handle most "tell me about Lambda" interview questions.
17. Adjacent — Glue, Prometheus, Grafana
AWS Glue
Glue is a managed Spark-based ETL service. Lambda and Glue solve different problems.
Runtime model. Lambda: serverless, up to 15 minutes, one handler at a time per env. Glue: managed Spark cluster, hours-long jobs, distributed compute.
Data scale. Lambda: up to a few gigabytes comfortably. Glue: terabytes to petabytes natively.
Language. Lambda: Python, Node, Java, Go, custom runtime. Glue: PySpark, Scala, plus Glue Studio for no-code.
Startup time. Lambda: milliseconds when warm. Glue: 1 to 2 minutes to provision the Spark cluster.
Cost model. Lambda: per request plus per millisecond. Glue: per DPU-hour (1 DPU is $0.44 an hour), billed per second with a 1-minute minimum on Glue 2.0 and later; the old 10-minute minimum only applies to Glue 1.0 jobs.
Use Lambda for light transforms, event reactions, API backends. Use Glue for large-scale joins, aggregations, schema inference on a data lake.
Key Glue concepts to know. DynamicFrame — Glue's DataFrame variant with schema flexibility. Glue Catalog — centralized metadata store for table schemas, also used by Athena. Job Bookmarks — Glue tracks processed S3 partitions to avoid reprocessing on incremental runs.
The decision is usually straightforward. If the data fits in Lambda's memory and the job finishes in under 15 minutes, use Lambda. If you're joining multiple large S3 datasets or transforming daily partition files, use Glue.
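To make the contrast concrete, here is a minimal PySpark-style sketch of the Glue side. The Glue Catalog database, table, and output path are placeholders, not anything that exists in this project:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read through the Glue Catalog; database and table names are placeholders
dyf = glue_ctx.create_dynamic_frame.from_catalog(
    database="reports_db", table_name="raw_pdf_metadata"
)

# Write back to S3 as Parquet; the path is a placeholder
glue_ctx.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-company-reports-bucket/curated/"},
    format="parquet",
)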
Prometheus and Grafana
Prometheus is a pull-based time-series metrics system. It scrapes HTTP /metrics endpoints on a schedule.
The fundamental tension with Lambda: Lambda functions are ephemeral. There's no persistent HTTP endpoint to scrape. The function may be at zero concurrency between invocations.
Three options for Lambda-to-Prometheus.
EMF to CloudWatch to Grafana with the CloudWatch plugin. No Prometheus involved. Grafana reads directly from CloudWatch. Easiest for AWS-native stacks.
Remote write to Amazon Managed Prometheus (AMP). The function pushes metrics to AMP via the Prometheus remote_write API at the end of each invocation. Grafana — or Amazon Managed Grafana — reads from AMP. Requires the prometheus_client library and SIGV4 signing on the request.
Push gateway. A persistent intermediate that Lambda pushes to. Prometheus scrapes the gateway. More infrastructure to manage, plus stale metric risk if the push gateway isn't flushed between invocations.
Grafana itself is a dashboarding layer. It doesn't store data — it queries data sources. CloudWatch is the data source most useful for Lambda observability. Built-in Grafana plugin. Queries CloudWatch Metrics and Logs Insights. Zero extra infrastructure. The standard choice for Lambda metrics: invocations, errors, duration, throttles, concurrent executions.
For a Lambda-only stack with no existing Prometheus investment, the practical answer is: EMF for custom metrics, CloudWatch for the built-in Lambda metrics, Grafana connected to CloudWatch. No extra infrastructure. Dashboards in an hour.
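EMF itself is nothing more than a JSON log line in a fixed shape. A sketch of what this function could print to get a custom metric for free; the namespace and dimension values are my own placeholders:

import json
import time

def emit_count(pdf_count: int) -> None:
    # Printing this shape to stdout is enough; CloudWatch extracts the metric
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "PdfManifest",   # placeholder namespace
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "PdfsProcessed", "Unit": "Count"}],
            }],
        },
        "FunctionName": "pdf-manifest",        # placeholder dimension value
        "PdfsProcessed": pdf_count,
    }))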
18. The project — walking through lambda_function.py
What the function does, end to end
In one paragraph: the function lists every PDF inside an S3 prefix. For each one, it generates a presigned download URL that expires in 15 minutes. It writes those (key, URL) pairs into a JSONL file in /tmp as it goes. When the listing is done, it uploads the JSONL to S3 as a manifest, generates one more presigned URL pointing to the manifest itself, deletes the local file, and returns the manifest URL plus the count.
The use case: you want to ship a batch of files to someone who isn't on your AWS account. Send them one URL. They open it, get back a list of links, every link works for 15 minutes, then everything dies.
Now let's walk through it. Top to bottom.
Imports and module-scope config
import asyncio
import json
import os
import uuid
import aioboto3
import aiofiles
Standard library first, third-party after. aioboto3 is the async version of boto3 — async S3 calls, so we can overlap I/O. aiofiles is async filesystem access — same reason.
BUCKET = os.environ.get("BUCKET_NAME", "my-company-reports-bucket")
PREFIX = os.environ.get("PREFIX", "2026/04/")
EXPIRY = int(os.environ.get("URL_EXPIRY_SECONDS", "900"))
ENDPOINT = os.environ.get("S3_ENDPOINT_URL") or None
QUEUE_MAX = int(os.environ.get("QUEUE_MAX", "2000"))
_DONE = object()
Five environment reads at module scope. Init phase. They run once per cold start, get cached as Python module attributes, and every warm invocation reuses them for free.
ENDPOINT is the trick that lets this run against MinIO locally. When you run on real Lambda, you don't set the env var, the value is None, and aioboto3 talks to real S3. When you run locally with MinIO, you set it to http://localhost:9000 and the same code talks to MinIO. The function doesn't know the difference.
_DONE is a sentinel. A unique singleton that we'll put on the queue to signal "no more items coming." We'll get to why in a moment. The reason it's an object() and not a string — a string could theoretically collide with a real S3 key. An object() instance has a unique identity. Comparing with is — not == — is unambiguous.
The handler — minimal on purpose
def handler(event, context):
    result = asyncio.run(_run())
    return {"statusCode": 200, "body": json.dumps(result)}
The handler is sync because Lambda's contract is sync. AWS calls handler(event, context) and waits for it to return.
Inside, we open an asyncio event loop with asyncio.run, run our async coroutine, get back a result, wrap it in an API-Gateway-style response shape with statusCode and body. The response shape is a habit — useful when the function gets fronted by API Gateway later. A pure Lambda invoke doesn't need it, but it doesn't hurt.
asyncio.run creates a fresh event loop per invocation. This is one of the small inefficiencies of doing async inside a sync Lambda handler. The cost is small — tens of microseconds — but it means async clients can't be shared across invocations the way sync boto3 clients could.
Why async at all in Lambda? Because Lambda's billing model is per-millisecond of wall-clock time. Anything you can overlap, you save money on. Our function does a lot of S3 calls — listing pages, generating presigned URLs, writing files. While S3 is preparing the next page of results, we can already be presigning and writing the previous page. That overlap directly reduces duration, which directly reduces cost and latency. That's why async.
_run() — the actual work
async def _run():
    session = aioboto3.Session()
    async with session.client("s3", endpoint_url=ENDPOINT) as s3:
        queue: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_MAX)
        manifest_path = f"/tmp/{uuid.uuid4()}.jsonl"
Open an aioboto3 session. Create an S3 client, with the optional endpoint override for MinIO. The async with block makes sure the client is properly closed when we're done — connections cleaned up, session closed.
Why is the session created inside _run instead of at module scope? Because aioboto3 async clients don't cleanly support cross-invocation reuse. The async context manager is tied to the event loop, and each invocation gets a fresh event loop via asyncio.run. Sync boto3 clients you'd put at module scope. Async ones, you create per invocation.
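For contrast, here is the classic sync pattern that does support cross-invocation reuse. A sketch only, not the project's code; it reuses the same env vars:

import os

import boto3

# Module scope: created once per cold start, reused by every warm invocation
_s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT_URL") or None)

def handler(event, context):
    # A warm invocation skips client construction, credential resolution,
    # and TLS handshakes on pooled connections
    response = _s3.list_objects_v2(Bucket=os.environ["BUCKET_NAME"])
    return {"statusCode": 200, "body": str(response.get("KeyCount", 0))}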
Inside the async with, two things. An asyncio.Queue with a maximum size of QUEUE_MAX — that's 2000 by default. And a path in /tmp with a UUID in the filename.
Why a queue at all? Because we want producer and consumer running concurrently. A queue is the standard channel for that.
Why bounded? Because if the producer is faster than the consumer, an unbounded queue grows in memory. Lambda tops out at 10 GB of memory, and the default configuration is only 128 MB. If we're scanning a bucket with a million PDFs and the producer loads them all into the queue before we presign even one, we OOM. The bounded queue gives us backpressure: when it's full, await queue.put(...) blocks until the consumer takes something off. Producer waits for consumer. Memory stays flat.
Why 2000? Big enough that the producer doesn't block on a normal-sized run. Small enough that even at 100 bytes per key, the queue is at most 200 KB of memory. Comfortable margin.
Why the UUID in the manifest path? Because /tmp persists across warm invocations on the same environment. Two invocations back to back, both writing to a fixed path like /tmp/manifest.jsonl, would collide. With uuid4, no collision possible.
The producer
async def producer():
    paginator = s3.get_paginator("list_objects_v2")
    async for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []) or []:
            key = obj["Key"]
            if key.lower().endswith(".pdf"):
                await queue.put(key)
    await queue.put(_DONE)
Why is producer defined inside _run? Two reasons. 1: it's a closure. It captures s3, queue from the enclosing scope without us having to pass them as arguments. Cleaner. 2: it's a private implementation detail of _run — nobody else needs to call it. Defining it inside makes that scope explicit.
What does it do? It uses S3's list_objects_v2 operation through a paginator. S3 returns at most 1000 objects per page. The paginator hides that — you async for page in paginator.paginate(...) and it transparently calls the next page when needed.
For each object on each page, check if it ends in .pdf (case-insensitive). If yes, put it on the queue.
When the paginator is exhausted — no more pages — put the _DONE sentinel on the queue. That tells the consumer "I'm done, you can stop reading."
Why a sentinel and not, say, closing the queue? Because asyncio.Queue doesn't have a "close" method. The standard pattern for "no more items" is the sentinel. The consumer checks if item is _DONE and breaks.
Note that await queue.put(key) will block if the queue is full. The producer pauses there until the consumer takes something off. That's the backpressure I mentioned. Memory bounded.
The consumer
async def consumer():
    count = 0
    async with aiofiles.open(manifest_path, "w") as f:
        while True:
            item = await queue.get()
            if item is _DONE:
                break
            url = await s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": BUCKET, "Key": item},
                ExpiresIn=EXPIRY,
            )
            await f.write(json.dumps({"key": item, "url": url}) + "\n")
            count += 1
    return count
Same closure pattern. Captures queue, manifest_path, s3 from the enclosing scope.
Open the manifest file for writing, async, in /tmp. The async with makes sure it's flushed and closed when we exit.
Loop forever. Take items off the queue. If the item is the sentinel, break. Otherwise it's a PDF key. Generate a presigned URL for it — 15 minutes by default. Write a JSONL line: a JSON object with key and url, plus a newline. Increment the count.
When the loop breaks, close the file (via the async with), return the count.
generate_presigned_url is a local computation, not a network call. It takes your AWS credentials, your bucket name, your key, the expiry, and your region, and produces a signed URL deterministically. No HTTP request. Fast.
Why JSONL — JSON Lines — and not a JSON array? Because JSONL streams. You can write one line at a time without buffering the whole array in memory. The reader can process one line at a time. If the manifest grows to gigabytes, JSONL stays usable.
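And the streaming property carries over to whoever receives the manifest URL. A sketch of a consumer, assuming the requests library (any HTTP client would do):

import json

import requests

def iter_manifest(manifest_url: str):
    # Stream the JSONL manifest and yield one (key, url) pair per line
    resp = requests.get(manifest_url, stream=True)
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            entry = json.loads(line)
            yield entry["key"], entry["url"]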
Running them together
prod_task = asyncio.create_task(producer())
count = await consumer()
await prod_task
This is the concurrency. asyncio.create_task(producer()) schedules the producer coroutine to run on the event loop, returns immediately with a task handle. The producer is now running in the background.
count = await consumer() runs the consumer in the foreground. It blocks until the consumer returns, which happens when the consumer sees the sentinel.
await prod_task makes sure the producer task has fully completed and any exceptions get raised. By the time the consumer sees the sentinel, the producer has put it on the queue, so prod_task should be done — but awaiting it makes that guarantee explicit and propagates errors.
Why this two-task structure? Because we want overlap. While S3 is preparing the next page of LIST results — network round trip — the consumer is presigning and writing the previous page. If we did it sequentially — list everything, then presign everything — we'd add the listing latency and the presigning latency. With overlap, we add the larger of the two.
For a small number of files, the difference is negligible. For a thousand or ten thousand files, async + queue cuts wall-clock time noticeably. Less duration, less cost.
Uploading the manifest
manifest_key = f"manifests/{uuid.uuid4()}.jsonl"
async with aiofiles.open(manifest_path, "rb") as f:
    body = await f.read()
await s3.put_object(
    Bucket=BUCKET,
    Key=manifest_key,
    Body=body,
    ContentType="application/x-ndjson",
)
Generate an S3 key for the manifest under manifests/, with another UUID. Read the local /tmp file as bytes. Upload it with put_object, setting the content type to application/x-ndjson — that's the registered MIME type for newline-delimited JSON.
Why not use s3.upload_file instead of read + put_object? Because upload_file doesn't have a clean async equivalent in aioboto3 that handles the multipart logic the same way. For files this size — hundreds of kilobytes to a few megabytes — read-then-put is fine. For very large files we'd want multipart upload.
Generating the manifest URL and cleaning up
manifest_url = await s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": manifest_key},
    ExpiresIn=EXPIRY,
)
os.unlink(manifest_path)
return {
    "count": count,
    "manifest_key": manifest_key,
    "manifest_url": manifest_url,
}
Generate a presigned URL for the manifest itself — 15 minutes again. Delete the local file from /tmp so it doesn't accumulate across warm invocations on the same environment. Return the count, the S3 key, and the URL.
The handler wraps that in {"statusCode": 200, "body": json.dumps(result)} and returns. Done.
Why this design? — answers to the interview questions you missed
Why presigned URLs, not return the data directly? Because the response is small — just a few hundred bytes — and the recipient doesn't need an AWS account to use the URL. The URL is signed by your credentials, expires in 15 minutes, and works for anyone who has it.
Why upload the manifest to S3 and return a URL to it, instead of returning the manifest contents in the response body? Because of the 6 MB sync response cap. Ten thousand presigned URLs in JSONL is around 3 to 5 MB. Twenty thousand blows the cap, and the cap is silent — the function succeeds, the caller gets a 413 with no warning. The manifest-in-S3 pattern has no upper bound.
Why async, not sync? Two reasons. 1: we want to overlap S3 LIST calls with presigning and file writes. Async + queue is the standard pattern for that. 2: even though presigning is local, we still have the LIST round trips and the final upload, both of which benefit from being non-blocking.
Why a producer and consumer instead of one loop that does both? Because the producer is bursty — when a page comes back, it has up to 1000 keys to dump on the queue. The consumer is steady. Decoupling them with a queue means the producer can race ahead while the consumer steadily drains, instead of LIST-then-presign-then-LIST-then-presign serially.
Why a bounded queue? For backpressure. Without the bound, the producer can outrun the consumer and exhaust memory. With the bound, when the queue fills, the producer's await queue.put(...) blocks until the consumer takes something off. Memory stays flat regardless of how many files we're scanning.
Why a sentinel and not closing the queue? Because asyncio.Queue doesn't have a close method. The sentinel is the standard "I'm done" signal. The consumer checks if item is _DONE and breaks the loop.
Why nested functions? Because they're closures over s3, queue, manifest_path from the enclosing scope. We don't have to pass those as arguments. They're also private implementation details of _run — defining them inside makes that scope explicit.
Why UUID in the /tmp filename? Because /tmp persists across warm invocations on the same environment. A fixed filename collides between back-to-back warm runs. UUID guarantees uniqueness.
Why _DONE = object() instead of a string sentinel? Because an object() instance has a unique identity that can't possibly collide with any real S3 key. Comparing with is (identity, not equality) is unambiguous.
Why os.unlink(manifest_path) at the end? Because /tmp persists across warm invocations and is at most 10 GB. If the function ran a thousand times on the same warm env without cleanup, /tmp would fill and subsequent invocations would fail.
Cold start vs warm — what you'd see in CloudWatch
First invocation, cold start. The REPORT line shows something like:
REPORT RequestId: ... Duration: 312.45 ms Init Duration: 423.12 ms
Init Duration: about 400 ms. That covers importing aioboto3 and aiofiles, reading the five env vars. Heavy because aioboto3 pulls in aiobotocore, which pulls in botocore.
Duration: about 300 ms. That's the actual scan: listing the bucket, presigning 50 PDFs, writing the manifest, uploading it.
Second invocation within 30 seconds — warm:
REPORT RequestId: ... Duration: 287.91 ms
No Init Duration line. We jumped straight to the handler. The real saving is the roughly 400 ms of init that simply isn't paid on a warm invocation, not the few milliseconds of difference in handler duration.
For a function that runs once a day, every invocation is cold. Init Duration matters. For a function that runs every few seconds, almost everything is warm. Init Duration is irrelevant.
What happens if it times out
The default function timeout is 3 seconds. Almost certainly not enough — set it explicitly to something like 30 or 60 seconds for this function. Maximum is 15 minutes.
If the function does time out, Lambda kills the process. The execution environment is still alive but the invocation is over. The local /tmp file may or may not have been deleted, depending on how far we got. If we wrote the manifest to S3 before the timeout, it's there. If not, the partial work is lost.
The function isn't quite idempotent in the strict sense — re-running it produces a fresh manifest with new UUIDs and new presigned URLs. The previous manifest stays in S3. If "exactly one manifest per logical job" was a requirement, we'd add a dedup table — DynamoDB with the request ID as the key — to skip re-runs.
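That dedup check can be one conditional write. A sketch, assuming a DynamoDB table named manifest-jobs with a string partition key job_id; neither exists in the current project:

import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

def already_processed(job_id: str) -> bool:
    # True if this job_id has been claimed before; claims it otherwise
    try:
        ddb.put_item(
            TableName="manifest-jobs",   # placeholder table name
            Item={"job_id": {"S": job_id}},
            ConditionExpression="attribute_not_exists(job_id)",
        )
        return False  # first time: the conditional put succeeded
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True  # duplicate: another run already claimed this job
        raise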
How would you scale this
Two natural scaling moves.
1 — fan out by prefix. Wrap this function in a Step Functions Map state. The orchestrator passes a list of prefixes. Each map iteration runs one Lambda for one prefix. Concurrency cap controlled by the Map state's MaxConcurrency, not by Lambda's account quota.
2 — go async with S3 events. Skip the LIST entirely. Subscribe the function to S3 ObjectCreated events filtered to *.pdf. The function fires once per upload, handles one file at a time, no producer/consumer needed because there's nothing to enumerate. Way simpler. Different use case — that's "process new files as they arrive," not "scan the existing bucket."
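A sketch of that event-driven variant. The record shape is the standard S3 notification event; everything else mirrors the existing function:

import json
import os
import urllib.parse

import boto3

EXPIRY = int(os.environ.get("URL_EXPIRY_SECONDS", "900"))
s3 = boto3.client("s3")

def handler(event, context):
    urls = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            urls.append(s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": bucket, "Key": key},
                ExpiresIn=EXPIRY,
            ))
    return {"statusCode": 200, "body": json.dumps({"urls": urls})}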
For the existing-bucket scan, the current design is right.
What I'd change before production
A few things.
1: make BUCKET and PREFIX come from the event payload, not from environment variables. Currently they're set at deploy time. If you want the same function to scan different prefixes on different invocations, the event-driven version is more flexible.
2: enable ReportBatchItemFailures if this becomes part of an SQS-fed pipeline later. Currently it's not, but it's good defensive design.
3: add structured logging. JSON to stdout, with request_id, bucket, prefix, count. Logs Insights can then aggregate.
4: emit an EMF metric for count. Free CloudWatch metric, no additional API calls. Lets you dashboard "PDFs processed per invocation" over time.
5: explicit error handling on the producer. Currently, if paginator.paginate raises, the producer task fails, the consumer keeps waiting on queue.get forever, and the function times out. Better: wrap the producer in a try/except that puts _DONE on the queue in a finally block, so the consumer always exits. (Sketched just after this list.)
6: trim the heavy imports. aioboto3 adds 200+ ms to the cold start. If cold start matters, consider sync boto3 — the function isn't actually doing enough I/O concurrency to make async pay off until file counts get large.
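Item 5 as code: a sketch of the fix, reusing the names from the existing function:

async def producer():
    try:
        paginator = s3.get_paginator("list_objects_v2")
        async for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []) or []:
                key = obj["Key"]
                if key.lower().endswith(".pdf"):
                    await queue.put(key)
    finally:
        # Always unblock the consumer, even if the LIST call raised;
        # awaiting prod_task in _run() then re-raises the original exception
        await queue.put(_DONE)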
Those are the six things. None of them are wrong about the current design — they're refinements for moving from "weekend project" to "production service."
That's everything in the notes. Thirty-one pitfalls, eighteen sections, one project, one set of answers to the questions that tripped you up last time.
Tuesday, you'll know it.