Pitfalls — The Must-Knows
The list to skim before the next interview or design review. Each item has bitten someone in production.
Execution model
- Module-level state leaks across invocations. A list you append to in the handler grows forever on warm calls. A counter you increment is wrong by the second request. If it's mutable and lives at module scope, treat it as either a deliberate cache or a bug.
- Handler globals are shared by every invocation on that env, but not across envs. "I cached the result" works locally; in production half your traffic gets the cached value, the other half doesn't, depending on which warm container they hit. Externalise (Redis, DynamoDB) or accept the variance.
- /tmp is per-environment, not per-invocation. If you write `/tmp/output.json` with a fixed name, the next warm invocation finds yesterday's file. Always use a per-invocation suffix (UUID, request ID).
- Init phase has a hard 10 s cap. If you import TensorFlow, hydrate a 500 MB model, or do a network call at module scope, you can blow this budget on cold start. Defer expensive work until first handler call (lazy init), or move it to a layer that ships pre-warmed.
- `asyncio.run` in a sync handler creates a fresh event loop per invocation. Acceptable, but it means async clients can't be shared across invocations the way sync boto3 clients can. Profile before assuming async is faster.
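The first two bullets above can be sketched together: a deliberate module-level cache initialised lazily (keeping expensive work out of the Init phase), and a per-invocation /tmp filename. `load_model` here is a placeholder for whatever expensive setup you defer, not a real library call.

```python
import os
import uuid

# Deliberate module-level cache: survives warm invocations of the
# same environment, rebuilt on every cold start.
_model = None

def _get_model():
    """Lazy init: defer expensive setup out of the 10 s Init phase
    to the first handler call."""
    global _model
    if _model is None:
        _model = load_model()  # placeholder for your expensive loader
    return _model

def tmp_path(prefix="output"):
    """Per-invocation /tmp filename. Warm containers reuse /tmp, so a
    fixed name would collide with a previous invocation's file."""
    return os.path.join("/tmp", f"{prefix}-{uuid.uuid4().hex}.json")
```

The trade-off with lazy init is that the first request pays the latency instead of the Init phase; that is usually acceptable because the Handler phase has the full function timeout, not 10 s.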
Payload & size limits
- 6 MB sync response cap is silent. Returning a JSON list of 50 000 items "works" in the function but the API GW caller gets a 413. The fix in `lambda_function.py` — return a presigned URL to a manifest file rather than the full list — is the standard pattern.
- API Gateway caps integration time at 29 s. Doesn't matter if your Lambda timeout is 15 minutes. For longer work, return a job ID and poll, or use Function URLs (15 min) with response streaming.
- Environment variables max out at 4 KB total. Big secrets (RSA keys, JSON config blobs) blow this. Use Parameter Store / Secrets Manager and read on init.
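The size-check logic behind the presigned-URL pattern can be sketched as below. `shape_response` and the injected `presign` callable are hypothetical names for illustration, not an AWS API; `presign` stands in for "upload the manifest to S3 and return a presigned URL".

```python
import json

SYNC_RESPONSE_CAP = 6 * 1024 * 1024  # 6 MB synchronous response limit

def shape_response(items, presign):
    """Return items inline if they fit under the 6 MB cap; otherwise
    hand the serialized body to `presign` (upload + presigned URL)
    and return only the URL, so the caller never sees a 413."""
    body = json.dumps(items)
    if len(body.encode("utf-8")) < SYNC_RESPONSE_CAP:
        return {"statusCode": 200, "body": body}
    return {
        "statusCode": 200,
        "body": json.dumps({"manifest_url": presign(body)}),
    }
```

Checking the encoded byte length (not `len(body)`) matters because the 6 MB cap is on the payload bytes, and non-ASCII characters serialize to more than one byte each.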
Concurrency & throttling
- Default account concurrency is 1 000 per region. Most teams hit this before they realise it exists. It sets a hard ceiling on RPS — at 100 ms latency, that's 10 000 RPS account-wide; at 1 s, 1 000 RPS.
- Reserved concurrency = 0 disables the function. Looks weird, but it's deliberately used as a circuit breaker.
- Provisioned concurrency double-bills. You pay for the warm slots and for invocations against them. Worth it for latency-sensitive paths; wasteful for batch.
- Burst limit is regional and finite. A traffic spike from 0 to 5 000 RPS will throttle until AWS scales up at +500 envs/min. Provisioned concurrency or pre-warming is the fix.
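The RPS ceiling above is just Little's law; a one-liner makes the arithmetic explicit:

```python
def max_rps(concurrency_limit, avg_latency_s):
    """Little's law: sustained RPS ceiling = concurrency / latency.
    1 000 concurrent executions at 100 ms each -> 10 000 RPS;
    the same 1 000 at 1 s each -> only 1 000 RPS."""
    return concurrency_limit / avg_latency_s
```

This is why shaving latency is also a concurrency-headroom win: halving average duration doubles the RPS you can push through the same account limit.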
Triggers, retries, idempotency
- Async invocation retries 2 times by default. Total 3 attempts. If your handler isn't idempotent, you can charge a card three times.
- S3, SNS, EventBridge invoke async — at-least-once. Plan for duplicates. SQS standard is also at-least-once. SQS FIFO and Kinesis are exactly-once-ish per shard but with their own quirks.
- SQS visibility timeout must be ≥ 6× function timeout. Otherwise the message comes back while you're still processing it, and you do the work twice (or more).
- Partial batch failures need explicit signalling. Returning `batchItemFailures` for SQS/Kinesis tells AWS which records to retry; otherwise the entire batch retries or none does.
- API Gateway error responses are JSON-shaped if you don't say otherwise. Throw an unhandled exception and the client sees `{"errorMessage": "...", "errorType": "..."}` with status 502. Map errors yourself.
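The partial-batch response has a fixed shape. A minimal SQS handler that reports only the failed message IDs might look like this (`process` is a placeholder for your business logic; the event source mapping must have `ReportBatchItemFailures` enabled for the return value to mean anything):

```python
def process(body):
    """Placeholder business logic: raise to mark the record as failed."""
    if body == "fail":
        raise ValueError("simulated failure")

def handler(event, context):
    """Report per-record failures so AWS retries only those records.
    Returning an empty batchItemFailures list means the whole batch
    succeeded; omitting the key entirely makes the whole batch retry
    on any error."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Note the handler still needs to be idempotent: the failed records come back, and with at-least-once delivery even the successful ones occasionally will.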
Networking, IAM, observability
- Putting Lambda in a VPC adds an ENI cold-start penalty (improved a lot in 2019, but still real for first invocation). Only do it if you genuinely need private-subnet resources. Outbound internet from VPC Lambda needs NAT, which costs money 24/7.
- S3 access from a VPC Lambda needs a VPC gateway endpoint or NAT. Without one, your S3 calls hang and time out — looks like a code bug, isn't.
- CloudWatch log groups default to "Never expire" retention. Verbose Lambdas can rack up real cost in CW Logs alone — set retention (7/14/30 days) on every log group you create.
- Lambda execution role is implicit on every action. Forgetting `s3:GetObject` or `kms:Decrypt` on the bucket's CMK is the most common "but it works locally" failure. CloudTrail tells you what was denied.
- Resource policy vs execution role are different layers. The resource policy says "who can invoke this Lambda"; the execution role says "what this Lambda can do". Both must allow.
- X-Ray needs the SDK instrumented in your code, tracing enabled on the function, and the IAM permission on the execution role. Three switches. People flip one and conclude X-Ray is broken.
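For the `s3:GetObject` / `kms:Decrypt` pair, the execution-role statement looks roughly like this (bucket name, account ID, and key ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/your-key-id"
    }
  ]
}
```

The second statement is the one people forget: an SSE-KMS-encrypted object needs the decrypt permission on the key itself, and the denial shows up in CloudTrail as a KMS error, not an S3 one.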
Deployment, dependencies, runtimes
- The boto3 in the Python runtime lags pip. If you need a recent API (e.g. new S3 features), bundle current boto3 in your zip. The runtime version is "good enough" for stable APIs, "sometimes wrong" for fresh ones.
- Native wheels must match Lambda's runtime architecture. `pip install` on a Mac and zip-uploading `cryptography` is a classic foot-gun. Build in a Docker image matching `public.ecr.aws/lambda/python:3.13`.
- arm64 saves ~20 % at the same memory, but some wheels are still x86-only. Audit your deps before flipping the architecture.
- Layers are merge-ordered; later layers overwrite earlier. A "base" layer for your shared dependencies works; conflicting layers silently shadow each other.
- Container-image deploys are cached on the Lambda host. First cold start can be slow (image pull); subsequent are normal. Keep images small even though the limit is 10 GB.
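If Docker isn't available, pip can also be told to fetch Lambda-compatible wheels directly. A sketch of the flags (adjust `--platform` and `--python-version` to your function's architecture and runtime; `--only-binary=:all:` fails loudly instead of silently compiling a host-native wheel):

```shell
pip install \
  --platform manylinux2014_x86_64 \
  --implementation cp \
  --python-version 3.13 \
  --only-binary=:all: \
  --target package \
  cryptography
```

For arm64 functions, swap in `manylinux2014_aarch64`; if a dependency has no wheel for that platform, this command errors out, which is exactly the audit the arm64 bullet above asks for.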
Time, scheduling, secrets
- EventBridge schedule (cron/rate) is always UTC. "9 AM" in your local time means something different in production. Use the new EventBridge Scheduler (2022) for time-zone-aware schedules.
- Async invocations have a 6-hour maximum event age. If retries fail past that, the event is silently dropped unless you've set a DLQ or an on-failure destination.
- Secrets in env vars are visible to anyone with `lambda:GetFunctionConfiguration`. Encrypted at rest, plaintext in the console. Use Secrets Manager / Parameter Store for actual secrets.
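The "read on init, cache for warm calls" pattern from the env-var bullets can be sketched as below. The module-level cache is the deliberate kind from the execution-model section: fetched once per environment, reused across warm invocations. The `fetch` parameter is an assumption added here so the caching logic is testable without AWS; the default path uses the real Secrets Manager `get_secret_value` call.

```python
_cache = {}

def get_secret(name, fetch=None):
    """Fetch a secret once per environment and cache it at module
    scope, keeping it out of the 4 KB env-var budget and out of
    lambda:GetFunctionConfiguration output."""
    if name not in _cache:
        if fetch is None:
            import boto3  # bundled in the Lambda Python runtime
            client = boto3.client("secretsmanager")
            fetch = lambda n: client.get_secret_value(SecretId=n)["SecretString"]
        _cache[name] = fetch(name)
    return _cache[name]
```

One caveat: a cached secret survives until the environment is recycled, so after rotating a secret, expect warm containers to keep serving the old value for a while (or add a TTL to the cache).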
⚠️ Skim test: if you can re-state the cold-start split (Init / Handler), the 6 MB / 256 KB / 4 KB / 250 MB / 10 GB constants, and the difference between resource policy and execution role from memory, you'll handle most "tell me about Lambda" interview questions.