Pitfalls — The Must-Knows
The list to skim before the next interview or design review. Each item has bitten someone in production.
Execution model
- Module-level state leaks across invocations. A list you append to in the handler grows forever on warm calls. A counter you increment is wrong by the second request. If it's mutable and lives at module scope, treat it as either a deliberate cache or a bug.
- Handler globals are shared by every invocation on that env, but not across envs. "I cached the result" works locally; in production half your traffic gets the cached value, the other half doesn't, depending on which warm container they hit. Externalise (Redis, DynamoDB) or accept the variance.
- /tmp is per-environment, not per-invocation. If you write `/tmp/output.json` with a fixed name, the next warm invocation finds yesterday's file. Always use a per-invocation suffix (UUID, request ID).
- Init phase has a hard 10 s cap. If you import TensorFlow, hydrate a 500 MB model, or do a network call at module scope, you can blow this budget on cold start. Defer expensive work until first handler call (lazy init), or move it to a layer that ships pre-warmed.
- `asyncio.run` in a sync handler creates a fresh event loop per invocation. Acceptable, but it means async clients can't be shared across invocations the way sync boto3 clients can. Profile before assuming async is faster.
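The first two bullets above can be sketched together: a deliberate module-level cache initialised lazily (keeping expensive work out of the Init phase), and a per-invocation /tmp filename. `load_model` here is a placeholder for whatever expensive setup you defer, not a real library call.

```python
import os
import uuid

# Deliberate module-level cache: survives warm invocations of the
# same environment, rebuilt on every cold start.
_model = None

def _get_model():
    """Lazy init: defer expensive setup out of the 10 s Init phase
    to the first handler call."""
    global _model
    if _model is None:
        _model = load_model()  # placeholder for your expensive loader
    return _model

def tmp_path(prefix="output"):
    """Per-invocation /tmp filename. Warm containers reuse /tmp, so a
    fixed name would collide with a previous invocation's file."""
    return os.path.join("/tmp", f"{prefix}-{uuid.uuid4().hex}.json")
```

The trade-off with lazy init is that the first request pays the latency instead of the Init phase; that is usually acceptable because the Handler phase has the full function timeout, not 10 s.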
Payload & size limits
- 6 MB sync response cap is silent. Returning a JSON list of 50 000 items "works" in the function but the API GW caller gets a 413. The fix in `lambda_function.py` — return a presigned URL to a manifest file rather than the full list — is the standard pattern.
- API Gateway caps integration time at 29 s. Doesn't matter if your Lambda timeout is 15 minutes. For longer work, return a job ID and poll, or use Function URLs (15 min) with response streaming.
- Environment variables max out at 4 KB total. Big secrets (RSA keys, JSON config blobs) blow this. Use Parameter Store / Secrets Manager and read on init.
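The size-check logic behind the presigned-URL pattern can be sketched as below. `shape_response` and the injected `presign` callable are hypothetical names for illustration, not an AWS API; `presign` stands in for "upload the manifest to S3 and return a presigned URL".

```python
import json

SYNC_RESPONSE_CAP = 6 * 1024 * 1024  # 6 MB synchronous response limit

def shape_response(items, presign):
    """Return items inline if they fit under the 6 MB cap; otherwise
    hand the serialized body to `presign` (upload + presigned URL)
    and return only the URL, so the caller never sees a 413."""
    body = json.dumps(items)
    if len(body.encode("utf-8")) < SYNC_RESPONSE_CAP:
        return {"statusCode": 200, "body": body}
    return {
        "statusCode": 200,
        "body": json.dumps({"manifest_url": presign(body)}),
    }
```

Checking the encoded byte length (not `len(body)`) matters because the 6 MB cap is on the payload bytes, and non-ASCII characters serialize to more than one byte each.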
Concurrency & throttling
- Default account concurrency is 1 000 per region. Most teams hit this before they realise it exists. It sets a hard ceiling on RPS — at 100 ms latency, that's 10 000 RPS account-wide; at 1 s, 1 000 RPS.
- Reserved concurrency = 0 disables the function. Looks weird, but it's deliberately used as a circuit breaker.
- Provisioned concurrency double-bills. You pay for the warm slots and for invocations against them. Worth it for latency-sensitive paths; wasteful for batch.
- Burst limit is regional and finite. A traffic spike from 0 to 5 000 RPS will throttle until AWS scales up at +500 envs/min. Provisioned concurrency or pre-warming is the fix.
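The RPS ceiling above is just Little's law; a one-liner makes the arithmetic explicit:

```python
def max_rps(concurrency_limit, avg_latency_s):
    """Little's law: sustained RPS ceiling = concurrency / latency.
    1 000 concurrent executions at 100 ms each -> 10 000 RPS;
    the same 1 000 at 1 s each -> only 1 000 RPS."""
    return concurrency_limit / avg_latency_s
```

This is why shaving latency is also a concurrency-headroom win: halving average duration doubles the RPS you can push through the same account limit.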
Triggers, retries, idempotency
- Async invocation retries 2 times by default. Total 3 attempts. If your handler isn't idempotent, you can charge a card three times.
- S3, SNS, EventBridge invoke async — at-least-once. Plan for duplicates. SQS standard is also at-least-once. SQS FIFO and Kinesis are exactly-once-ish per shard but with their own quirks.
- SQS visibility timeout must be ≥ 6× function timeout. Otherwise the message comes back while you're still processing it, and you do the work twice (or more).
- Partial batch failures need explicit signalling. Returning `batchItemFailures` for SQS/Kinesis tells AWS which records to retry; otherwise the entire batch retries or none does.
- API Gateway error responses are JSON-shaped if you don't say otherwise. Throw an unhandled exception and the client sees `{"errorMessage": "...", "errorType": "..."}` with status 502. Map errors yourself.
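The partial-batch response has a fixed shape. A minimal SQS handler that reports only the failed message IDs might look like this (`process` is a placeholder for your business logic; the event source mapping must have `ReportBatchItemFailures` enabled for the return value to mean anything):

```python
def process(body):
    """Placeholder business logic: raise to mark the record as failed."""
    if body == "fail":
        raise ValueError("simulated failure")

def handler(event, context):
    """Report per-record failures so AWS retries only those records.
    Returning an empty batchItemFailures list means the whole batch
    succeeded; omitting the key entirely makes the whole batch retry
    on any error."""
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Note the handler still needs to be idempotent: the failed records come back, and with at-least-once delivery even the successful ones occasionally will.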
Networking, IAM, observability
- Putting Lambda in a VPC adds an ENI cold-start penalty (improved a lot in 2019, but still real for first invocation). Only do it if you genuinely need private-subnet resources. Outbound internet from VPC Lambda needs NAT, which costs money 24/7.
- S3 access from a VPC Lambda needs a VPC gateway endpoint or NAT. Without one, your S3 calls hang and time out — looks like a code bug, isn't.
- CloudWatch log groups default to "Never expire" retention. Verbose Lambdas can rack up real cost in CW Logs alone — set retention (7/14/30 days) on every log group you create.
- Lambda execution role is implicit on every action. Forgetting `s3:GetObject` or `kms:Decrypt` on the bucket's CMK is the most common "but it works locally" failure. CloudTrail tells you what was denied.
- Resource policy vs execution role are different layers. The resource policy says "who can invoke this Lambda"; the execution role says "what this Lambda can do". Both must allow.
- X-Ray needs the SDK instrumented in your code, tracing enabled on the function, and the IAM permission on the execution role. Three switches. People flip one and conclude X-Ray is broken.
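For the `s3:GetObject` / `kms:Decrypt` pair, the execution-role statement looks roughly like this (bucket name, account ID, and key ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/your-key-id"
    }
  ]
}
```

The second statement is the one people forget: an SSE-KMS-encrypted object needs the decrypt permission on the key itself, and the denial shows up in CloudTrail as a KMS error, not an S3 one.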
Deployment, dependencies, runtimes
- The boto3 in the Python runtime lags pip. If you need a recent API (e.g. new S3 features), bundle current boto3 in your zip. The runtime version is "good enough" for stable APIs, "sometimes wrong" for fresh ones.
- Native wheels must match Lambda's runtime architecture. `pip install` on a Mac and zip-uploading `cryptography` is a classic foot-gun. Build in a Docker image matching `public.ecr.aws/lambda/python:3.13`.
- arm64 saves ~20 % at the same memory, but some wheels are still x86-only. Audit your deps before flipping the architecture.
- Layers are merge-ordered; later layers overwrite earlier. A "base" layer for your shared dependencies works; conflicting layers silently shadow each other.
- Container-image deploys are cached on the Lambda host. First cold start can be slow (image pull); subsequent are normal. Keep images small even though the limit is 10 GB.
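If Docker isn't available, pip can also be told to fetch Lambda-compatible wheels directly. A sketch of the flags (adjust `--platform` and `--python-version` to your function's architecture and runtime; `--only-binary=:all:` fails loudly instead of silently compiling a host-native wheel):

```shell
pip install \
  --platform manylinux2014_x86_64 \
  --implementation cp \
  --python-version 3.13 \
  --only-binary=:all: \
  --target package \
  cryptography
```

For arm64 functions, swap in `manylinux2014_aarch64`; if a dependency has no wheel for that platform, this command errors out, which is exactly the audit the arm64 bullet above asks for.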
Time, scheduling, secrets
- EventBridge schedule (cron/rate) is always UTC. "9 AM" in your local time means something different in production. Use the new EventBridge Scheduler (2022) for time-zone-aware schedules.
- Async invocations have a 6-hour maximum event age. If retries fail past that, the event is silently dropped unless you've set a DLQ or an on-failure destination.
- Secrets in env vars are visible to anyone with `lambda:GetFunctionConfiguration`. Encrypted at rest, plaintext in the console. Use Secrets Manager / Parameter Store for actual secrets.
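The "read on init, cache for warm calls" pattern from the env-var bullets can be sketched as below. The module-level cache is the deliberate kind from the execution-model section: fetched once per environment, reused across warm invocations. The `fetch` parameter is an assumption added here so the caching logic is testable without AWS; the default path uses the real Secrets Manager `get_secret_value` call.

```python
_cache = {}

def get_secret(name, fetch=None):
    """Fetch a secret once per environment and cache it at module
    scope, keeping it out of the 4 KB env-var budget and out of
    lambda:GetFunctionConfiguration output."""
    if name not in _cache:
        if fetch is None:
            import boto3  # bundled in the Lambda Python runtime
            client = boto3.client("secretsmanager")
            fetch = lambda n: client.get_secret_value(SecretId=n)["SecretString"]
        _cache[name] = fetch(name)
    return _cache[name]
```

One caveat: a cached secret survives until the environment is recycled, so after rotating a secret, expect warm containers to keep serving the old value for a while (or add a TTL to the cache).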
⚠️ Skim test: if you can re-state the cold-start split (Init / Handler), the 6 MB / 256 KB / 4 KB / 250 MB / 10 GB constants, and the difference between resource policy and execution role from memory, you'll handle most "tell me about Lambda" interview questions.