# Pitfalls — The Must-Knows
> The list to skim before the next interview or design review. Each item has bitten someone in production.
## Execution model
1. **Module-level state leaks across invocations.** A list you append to in the handler grows forever on warm calls. A counter you increment is wrong by the second request. If it's mutable and lives at module scope, treat it as either a deliberate cache or a bug.
2. **Handler globals are shared by every invocation on the same execution environment, but not across environments.** "I cached the result" works locally; in production half your traffic gets the cached value and the other half doesn't, depending on which warm environment they hit. Externalise (Redis, DynamoDB) or accept the variance.
3. **/tmp is per-environment, not per-invocation.** If you write `/tmp/output.json` with a fixed name, the next warm invocation finds yesterday's file. Always use a per-invocation suffix (UUID, request ID).
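Pitfalls 1 and 3 in one minimal sketch — illustrative only; `CACHE` and the output-file naming are assumptions, not the source's code:

```python
import os
import uuid

# Deliberate module-level cache: it survives warm invocations on THIS
# execution environment, and is never shared across environments.
CACHE: dict[str, str] = {}

def handler(event, context):
    # Per-invocation /tmp name: a fixed path like /tmp/output.json would
    # still be there (stale) on the next warm invocation.
    out_path = os.path.join("/tmp", f"output-{uuid.uuid4()}.json")
    with open(out_path, "w") as f:
        f.write("{}")
    return {"path": out_path}
```

Two warm calls produce two distinct paths; anything mutable at module scope that is *not* a deliberate cache like `CACHE` should be moved inside the handler.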
4. **The init phase has a hard 10 s cap.** If you import TensorFlow, hydrate a 500 MB model, or make a network call at module scope, you can blow this budget on cold start. Defer expensive work until the first handler call (lazy init), or pay the init cost ahead of traffic with provisioned concurrency.
5. **`asyncio.run` in a sync handler creates a fresh event loop per invocation.** That's acceptable, but it means async clients can't be shared across invocations the way sync boto3 clients can. Profile before assuming async is faster.
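A sketch of that pattern — `fake_fetch` stands in for real async I/O (aiohttp, aioboto3, etc.):

```python
import asyncio

async def fake_fetch(i):
    await asyncio.sleep(0)  # yield control, as real I/O would
    return i * 2

async def fetch_all(ids):
    # fan out concurrently; gather preserves input order
    return list(await asyncio.gather(*(fake_fetch(i) for i in ids)))

def handler(event, context):
    # asyncio.run builds and tears down a fresh event loop on every
    # call, so async clients created inside it cannot outlive the
    # invocation the way a module-level sync boto3 client can
    return asyncio.run(fetch_all(event["ids"]))
```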
## Payload & size limits
6. **The 6 MB sync response cap is silent.** Returning a JSON list of 50 000 items "works" in the function, but Lambda rejects the oversized response and the API Gateway caller sees a 502. The fix in `lambda_function.py` — return a presigned URL to a manifest file rather than the full list — is the standard pattern.
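A minimal guard for that cap — the presigned-URL branch is indicated, not implemented; real code would upload the manifest to S3 and call `generate_presigned_url`:

```python
import json

SYNC_RESPONSE_CAP = 6 * 1024 * 1024  # 6 MB synchronous payload limit

def build_response(items):
    body = json.dumps(items)
    if len(body.encode("utf-8")) > SYNC_RESPONSE_CAP:
        # Too big to return inline: upload to S3 and hand back a
        # presigned URL instead of the list itself (stubbed here).
        return {"statusCode": 303,
                "body": json.dumps({"url": "<presigned-url>"})}
    return {"statusCode": 200, "body": body}
```

Checking the serialised size *before* returning turns a mysterious 502 at the gateway into an explicit branch in your own code.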
7. **API Gateway caps integration time at 29 s.** Doesn't matter if your Lambda timeout is 15 minutes. For longer work, return a job ID and poll, or use Function URLs (15 min) with response streaming.
8. **Environment variables max out at 4 KB total.** Big secrets (RSA keys, JSON config blobs) blow this. Use Parameter Store / Secrets Manager and read them at init.
## Concurrency & throttling
9. **Default account concurrency is 1 000 per region.** Most teams hit this before they realise it exists. It sets a hard ceiling on RPS — at 100 ms latency, that's 10 000 RPS account-wide; at 1 s, 1 000 RPS.
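The ceiling arithmetic is just Little's law, as a throwaway helper:

```python
def max_rps(concurrency: int, avg_latency_s: float) -> float:
    """Little's law: in-flight requests = arrival rate x time per request,
    so the sustainable rate is concurrency / latency."""
    return concurrency / avg_latency_s
```

`max_rps(1000, 0.1)` gives 10 000 RPS; stretch latency to 1 s and the same 1 000 slots cap you at 1 000 RPS — shaving latency buys throughput without touching the quota.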
10. **Reserved concurrency = 0 disables the function.** It looks odd, but it's deliberately used as a circuit breaker: no invocations get through until you raise it.
11. **Provisioned concurrency double-bills.** You pay for the warm slots *and* for invocations against them. Worth it for latency-sensitive paths; wasteful for batch.
12. **Burst limit is regional and finite.** A traffic spike from 0 to 5 000 RPS will throttle until AWS scales up at +500 envs/min. Provisioned concurrency or pre-warming is the fix.
## Triggers, retries, idempotency
13. **Async invocation retries 2 times by default.** Total 3 attempts. If your handler isn't idempotent, you can charge a card three times.
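A minimal dedupe sketch for that card-charging scenario. Note the in-memory set only dedupes within one warm environment (pitfall 2!) — production code uses a DynamoDB conditional write keyed on the event ID; `event_id` and `charge` are illustrative names:

```python
_seen: set[str] = set()  # per-environment only; use a DynamoDB
                         # conditional write for real idempotency

def handle_once(event_id: str, charge) -> bool:
    if event_id in _seen:
        return False      # retry of an event we already handled: skip
    _seen.add(event_id)
    charge()              # the side effect we must not repeat
    return True
```

With at-least-once delivery, the dedupe check has to live somewhere durable and shared; the structure stays the same, only the `_seen` store changes.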
14. **S3, SNS, EventBridge invoke async — at-least-once.** Plan for duplicates. SQS standard is also at-least-once. SQS FIFO and Kinesis are exactly-once-ish per shard but with their own quirks.
15. **SQS visibility timeout must be ≥ 6× function timeout.** Otherwise the message comes back while you're still processing it, and you do the work twice (or more).
16. **Partial batch failures need explicit signalling.** Returning `batchItemFailures` for SQS/Kinesis tells AWS which records to retry — provided `ReportBatchItemFailures` is enabled on the event source mapping; otherwise the entire batch retries or none does.
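The SQS shape of that signalling — `process` is a stand-in for your per-record logic, and `ReportBatchItemFailures` must be enabled on the event source mapping for the return value to mean anything:

```python
def process(record):
    # stand-in: fail on records whose body says so
    if record["body"] == "bad":
        raise ValueError("cannot process")

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            # report only the failed message IDs; the
            # successfully processed records get deleted
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list means the whole batch succeeded; omitting the key entirely makes Lambda treat any raised exception as a full-batch failure.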
17. **API Gateway masks Lambda errors unless you map them.** Throw an unhandled exception behind a proxy integration and the client gets a 502 with a generic `{"message": "Internal server error"}` body — the real `errorMessage`/`errorType` appear only in the logs. Map errors to proper status codes yourself.
## Networking, IAM, observability
18. **Putting Lambda in a VPC adds an ENI cold-start penalty** (improved a lot in 2019, but still real for first invocation). Only do it if you genuinely need private-subnet resources. Outbound internet from VPC Lambda needs NAT, which costs money 24/7.
19. **S3 access from a VPC Lambda needs a VPC gateway endpoint or NAT.** Without one, your S3 calls hang and time out — looks like a code bug, isn't.
20. **CloudWatch log groups default to "Never expire" retention.** Verbose Lambdas can rack up real cost in CW Logs alone — set retention (7/14/30 days) on every log group you create.
21. **Lambda execution role is implicit on every action.** Forgetting `s3:GetObject` or `kms:Decrypt` on the bucket's CMK is the most common "but it works locally" failure. CloudTrail tells you what was denied.
22. **Resource policy vs execution role are different layers.** Resource policy says "who can *invoke* this Lambda"; execution role says "what this Lambda can *do*". Both must allow.
23. **X-Ray needs an SDK call *and* tracing enabled on the function *and* IAM permission.** Three switches. People flip one and conclude X-Ray is broken.
## Deployment, dependencies, runtimes
24. **The boto3 in the Python runtime lags pip.** If you need a recent API (e.g. new S3 features), bundle current boto3 in your zip. The runtime version is "good enough" for stable APIs, "sometimes wrong" for fresh ones.
25. **Native wheels must match Lambda's runtime architecture.** `pip install` on a Mac and zip-uploading `cryptography` is a classic foot-gun. Build in a Docker image matching `public.ecr.aws/lambda/python:3.13`.
26. **arm64 saves ~20 % at the same memory** but *some* wheels are still x86-only. Audit your deps before flipping the architecture.
27. **Layers are merge-ordered; later layers overwrite earlier.** A "base" layer for your shared dependencies works; conflicting layers silently shadow each other.
28. **Container-image deploys are cached on the Lambda host.** First cold start can be slow (image pull); subsequent are normal. Keep images small even though the limit is 10 GB.
## Time, scheduling, secrets
29. **EventBridge schedule (cron/rate) is always UTC.** "9 AM" in your local time means something different in production. Use the new EventBridge Scheduler (2022) for time-zone-aware schedules.
30. **Async invocations have a 6-hour event age.** If retries fail past that, the event is silently dropped unless you've set a DLQ or on-failure destination.
31. **Secrets in env vars are visible to anyone with `lambda:GetFunctionConfiguration`.** Encrypted at rest, plaintext in the console. Use Secrets Manager / Parameter Store for actual secrets.
> ⚠️ **Skim test:** if you can re-state the cold-start split (Init / Handler), the 6 MB / 256 KB / 4 KB / 250 MB / 10 GB constants, and the difference between resource policy and execution role from memory, you'll handle most "tell me about Lambda" interview questions.