# sign_pdfs_optimized — Walkthrough
> Production-refined fork of `sign_pdfs`. Applies 8 improvements at once; meant to be compared side-by-side with the original in the local tester.
The original `sign_pdfs` is the "before" snapshot — it is left untouched. This file (`sign_pdfs_optimized/handler.py`) is the "after": same contract, same output shape, all the rough edges from the original fixed.
---
## Block 1 — Imports and sentinel
```python
import asyncio, json, os, time, uuid
import aioboto3
import aiofiles
_DONE = object()
```
Nothing new at the import level. `_DONE` is a sentinel object used to signal consumers that the producer has finished. Using a unique object (rather than `None` or a string) means no PDF key will ever accidentally match it.
**Original issue fixed here:** the original put a single `_DONE` on the queue regardless of how many consumers were running. With N consumers, one of them would consume the sentinel and exit, but the others would block forever waiting for a key that never arrives. The fix is one sentinel per consumer — that happens in the producer's `finally` block (Block 4).
---
## Block 2 — Config from event (`_cfg`)
```python
def _cfg(event: dict) -> dict:
    return {
        "bucket": event.get("bucket") or os.environ.get("BUCKET_NAME", "my-company-reports-bucket"),
        "prefix": event.get("prefix") or os.environ.get("PREFIX", "2026/04/"),
        "expiry": int(event.get("expiry_seconds") or os.environ.get("URL_EXPIRY_SECONDS", "900")),
        "endpoint": os.environ.get("S3_ENDPOINT_URL") or None,
        "page_size": int(event.get("page_size") or 1000),
        "concurrency": int(event.get("concurrency") or 4),
        "queue_max": int(os.environ.get("QUEUE_MAX", "2000")),
    }
```
**Improvement 1 — event-driven config.** The original hard-coded `bucket`, `prefix`, and `expiry` as module-level constants read once from env vars. That means one function deployment = one bucket/prefix combination. Here the event is the primary source; env vars are fallbacks. The same deployed function can serve `2026/04/`, `2026/05/`, or an entirely different bucket by changing only the event payload — no redeployment.
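For instance, two payloads the same deployment could receive (all values illustrative; `archive-bucket` is hypothetical):
```python
# Same deployed function, two different jobs; only the event changes.
event_april = {"bucket": "my-company-reports-bucket", "prefix": "2026/04/"}
event_archive = {
    "bucket": "archive-bucket",   # hypothetical second bucket
    "prefix": "2026/05/",
    "expiry_seconds": 300,        # overrides URL_EXPIRY_SECONDS
    "concurrency": 8,             # overrides the default of 4
}

assert _cfg(event_archive)["expiry"] == 300
assert _cfg(event_april)["concurrency"] == 4
```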
`endpoint` is env-only because it is infrastructure (the MinIO URL in local dev, absent in production) — callers should never need to specify it per-request.
`queue_max` bounds the in-memory queue so the producer can't buffer an unbounded number of keys before the consumers have a chance to drain them. With 1000 keys per page and 2000 queue slots, the producer runs at most two pages ahead of the consumers.
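The mechanism behind that bound is `asyncio.Queue(maxsize=...)`: a `put` on a full queue suspends the producer until a consumer frees a slot. A tiny self-contained sketch (not handler code):
```python
import asyncio

async def _demo_backpressure():
    q = asyncio.Queue(maxsize=2)  # stand-in for queue_max
    await q.put("a.pdf")
    await q.put("b.pdf")
    assert q.full()
    # A third put() would suspend here until someone calls q.get();
    # in the handler, that pause is what keeps the producer from
    # listing pages faster than the consumers can presign them.

asyncio.run(_demo_backpressure())
```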
---
## Block 3 — Structured logging (`_log`) and EMF metrics (`_emit_emf`)
```python
def _log(event_type: str, **fields):
    print(json.dumps({"event": event_type, **fields}))


def _emit_emf(metrics: dict, **dims):
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "eth/sign_pdfs_optimized",
                "Dimensions": [list(dims.keys())],
                "Metrics": [
                    {"Name": k, "Unit": "Bytes" if k.endswith("Bytes") else "Count"}
                    for k in metrics
                ],
            }],
        },
        **dims, **metrics,
    }))
```
**Improvement 2 — structured JSON logging.** The original printed nothing. `_log` emits one JSON object per line to stdout. CloudWatch Logs Insights can then query fields directly:
```
filter event = "complete" | stats avg(duration_ms) by bin(5m)
```
The local runner's `_extract_json_logs` picks up these lines and surfaces them in the tester's "Structured logs" panel.
**Improvement 3 — EMF metrics.** `_emit_emf` writes a CloudWatch Embedded Metrics Format block to stdout. The Lambda runtime (and locally, the runner's `_extract_emf_metrics`) parses this and publishes real CloudWatch metrics — no `PutMetricData` API call and no extra latency, though the metrics themselves still bill at standard custom-metric rates. The unit is inferred from the metric name: anything ending in `Bytes` → `"Bytes"`, everything else → `"Count"`.
EMF is emitted once at the end of the invocation, not once per item. Putting it inside a loop would flood CloudWatch with thousands of data points per invocation.
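For concreteness, a hypothetical call with made-up values:
```python
# Illustrative only: one EMF block at the end of an invocation.
_emit_emf({"PDFsProcessed": 1200, "ManifestBytes": 48213}, Function="sign_pdfs_optimized")
# Prints a single JSON line whose "_aws" header instructs CloudWatch to
# publish PDFsProcessed as a Count and ManifestBytes as Bytes under the
# "eth/sign_pdfs_optimized" namespace, dimensioned by Function.
```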
---
## Block 4 — Producer with `try/finally`
```python
async def producer():
    nonlocal pages
    try:
        paginator = s3.get_paginator("list_objects_v2")
        async for page in paginator.paginate(
            Bucket=cfg["bucket"], Prefix=cfg["prefix"],
            PaginationConfig={"PageSize": cfg["page_size"]},
        ):
            pages += 1
            for obj in page.get("Contents", []) or []:
                key = obj["Key"]
                if key.lower().endswith(".pdf"):
                    await queue.put(key)
    finally:
        for _ in range(cfg["concurrency"]):
            await queue.put(_DONE)
```
**Improvement 4 — guaranteed consumer exit.** The `finally` block runs whether the producer succeeds, raises a `ClientError`, or is cancelled. Each of the N consumers gets its own `_DONE` sentinel. Without `finally`, an exception in the producer leaves all consumers blocked on `queue.get()` forever — the handler hangs and eventually times out at 15 minutes.
**Improvement 7 — `PageSize=1000`.** S3's `ListObjectsV2` returns at most 1000 keys per page by default, but specifying it explicitly avoids the paginator defaulting to a smaller value on some SDK versions. Fewer pages = fewer round-trip API calls for large prefixes.
---
## Block 5 — Consumer with per-item error handling
```python
async def consumer():
    nonlocal errors
    local = 0
    while True:
        item = await queue.get()
        if item is _DONE:
            return local
        try:
            url = await s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": cfg["bucket"], "Key": item},
                ExpiresIn=cfg["expiry"],
            )
            line = json.dumps({"key": item, "url": url}) + "\n"
            async with write_lock:
                await f.write(line)
            local += 1
        except Exception as exc:
            errors += 1
            _log("presign_error", request_id=request_id, key=item, error=str(exc))
```
**Improvement 5 — per-item error isolation.** The original let a single `generate_presigned_url` failure propagate out of the consumer and crash the whole batch. Here, each item is wrapped in its own `try/except`. A bad key increments `errors` and logs the details; the consumer continues with the next key. The caller receives `{"count": N, "errors": M}` and can decide whether to retry the M failures.
`local` counts successful presigns per consumer. The N return values are summed after `asyncio.gather` to get the total count.
The `write_lock` serialises writes to the shared JSONL file — without it, concurrent `await f.write(line)` calls would interleave partial lines.
---
## Block 6 — N concurrent consumers (`asyncio.gather`)
```python
prod_task = asyncio.create_task(producer())
counts = await asyncio.gather(*(consumer() for _ in range(cfg["concurrency"])))
await prod_task
```
**Improvement 8 — concurrent consumers.** The original ran one consumer. Here `asyncio.gather` starts N consumer coroutines simultaneously, all reading from the same queue. Since `generate_presigned_url` is an async S3 API call, each consumer yields while waiting for the response, letting the others proceed. With N=4 and 1000 PDFs, peak in-flight presign requests is 4 instead of 1 — throughput roughly scales with `concurrency` up to the point where the producer can't keep the queue full.
`asyncio.create_task(producer())` schedules the producer to run concurrently with the consumers rather than sequentially before them. The `await prod_task` after gather ensures any producer exception is re-raised (not silently swallowed as it would be in a fire-and-forget task).
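The orchestration can be exercised outside Lambda. Here is a runnable miniature of the same pattern (bounded queue, one sentinel per consumer, `create_task` plus `gather`) with the S3 work stubbed out; names and counts are made up:
```python
import asyncio

_DONE = object()

async def _demo(n_consumers: int = 4) -> int:
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)

    async def producer():
        try:
            for i in range(10):
                await queue.put(f"report-{i}.pdf")
        finally:
            for _ in range(n_consumers):  # one sentinel per consumer (Block 4)
                await queue.put(_DONE)

    async def consumer() -> int:
        done = 0
        while True:
            item = await queue.get()
            if item is _DONE:
                return done
            done += 1  # stand-in for the presign + write in Block 5

    prod_task = asyncio.create_task(producer())
    counts = await asyncio.gather(*(consumer() for _ in range(n_consumers)))
    await prod_task  # re-raises any producer exception
    return sum(counts)

print(asyncio.run(_demo()))  # 10
```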
---
## Block 7 — Streaming manifest upload
```python
manifest_bytes = os.path.getsize(manifest_path)
manifest_key = f"manifests/{uuid.uuid4()}.jsonl"
with open(manifest_path, "rb") as f:
    await s3.put_object(
        Bucket=cfg["bucket"], Key=manifest_key, Body=f,
        ContentType="application/x-ndjson",
    )
```
**Improvement 6 — streaming upload.** The original read the entire manifest into memory with `await f.read()` before uploading. For a bucket with 100k PDFs, the JSONL manifest could be tens of MB — a full `read()` doubles peak RAM usage. Passing a sync file handle as `Body` lets aiobotocore read the file in chunks internally. Memory usage stays flat regardless of manifest size.
`os.path.getsize` is called before opening the file for upload so the size is available for the EMF block even after the file is deleted.
---
## Block 8 — Cleanup, metrics, and response
```python
os.unlink(manifest_path)
result = {
    "count": count,
    "errors": errors,
    "manifest_key": manifest_key,
    "manifest_url": manifest_url,
}
response_bytes = len(json.dumps(result))
duration_ms = (time.monotonic() - t0) * 1000
_log("complete", request_id=request_id, count=count, errors=errors,
     pages=pages, duration_ms=round(duration_ms, 2))
_emit_emf({
    "PDFsProcessed": count, "S3ListPages": pages,
    "PresignCount": count, "ManifestBytes": manifest_bytes,
    "ResponseBytes": response_bytes,
}, Function="sign_pdfs_optimized")
return result
```
`os.unlink` removes the temp file after upload. `/tmp` is shared across warm invocations — not cleaning up leaks storage toward the 512 MB (default) or 10 GB (max) `/tmp` cap.
`response_bytes = len(json.dumps(result))` captures how large the sync response would be. Lambda's synchronous response limit is 6 MB. This function returns a URL, not a body, so `response_bytes` is typically under 300 bytes — but measuring it makes the headroom concrete.
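The handler doesn't enforce the limit; if a future variant ever returned inline data, a guard could be a hypothetical addition like:
```python
# Hypothetical guard, not in the handler: fail loudly rather than let the
# platform reject an oversized synchronous response.
_RESPONSE_LIMIT_BYTES = 6 * 1024 * 1024  # Lambda sync response limit (6 MB)

if response_bytes > _RESPONSE_LIMIT_BYTES:
    _log("response_too_large", request_id=request_id, response_bytes=response_bytes)
    raise RuntimeError(f"response is {response_bytes} bytes, over the 6 MB sync limit")
```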
The final `_log("complete", ...)` and `_emit_emf(...)` emit together at the end so all counts are final. In the tester, the Structured logs panel shows the `complete` event and the EMF panel shows the five metric values.
---
## Block 9 — Entrypoint
```python
def handler(event, context):
    request_id = getattr(context, "aws_request_id", str(uuid.uuid4()))
    result = asyncio.run(_run(event, request_id))
    return {"statusCode": 200, "body": json.dumps(result)}
```
`asyncio.run` creates a fresh event loop for each invocation. Lambda's Python runtime is synchronous at the handler boundary; `asyncio.run` is the correct bridge between the sync entrypoint and the async internals. A module-level `loop = asyncio.get_event_loop()` would work on the first invocation but is deprecated and unreliable across warm invocations.
`getattr(context, "aws_request_id", ...)` falls back to a generated UUID when the context object doesn't have the attribute — which is the case in the local tester, where the runner passes a minimal stub.
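A sketch of that fallback in action; `_StubContext` is hypothetical, standing in for whatever minimal object the runner passes:
```python
# Hypothetical minimal context, like the local tester's stub.
class _StubContext:
    pass  # deliberately has no aws_request_id attribute

# getattr falls back to a fresh UUID, so _run still receives a usable request_id.
resp = handler({"bucket": "my-company-reports-bucket", "prefix": "2026/04/"}, _StubContext())
```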
---
## Summary of improvements
| # | What changed | Why it matters |
|---|---|---|
| 1 | Config from event, env as fallback | One function serves many prefixes |
| 2 | Structured JSON logging (`_log`) | Queryable in Logs Insights; visible in tester |
| 3 | EMF metrics (`_emit_emf`) | CloudWatch metrics with no API-call overhead |
| 4 | Producer `try/finally` with N sentinels | Consumers always exit; handler never hangs |
| 5 | Per-item `try/except` in consumer | One bad key doesn't crash the batch |
| 6 | Sync file handle as `Body` | Flat memory for any manifest size |
| 7 | `PageSize=1000` | Fewer S3 round-trips on large prefixes |
| 8 | N concurrent consumers via `asyncio.gather` | Presign throughput scales with `concurrency` |
---
## Further improvements (not yet applied)
These are the next natural steps if this function were going to production. They were left out intentionally — each adds infrastructure or AWS-side configuration that goes beyond the handler itself.
### Idempotency
The function is not idempotent in the strict sense. Each invocation with the same `(bucket, prefix)` event produces a new manifest at a new UUID key. If Lambda retries the invocation (async invocations retry up to 2 times by default, and S3/SNS/EventBridge are at-least-once), you accumulate duplicate manifests in the `manifests/` prefix.
The standard fix is a DynamoDB dedup table:
```python
import hashlib, boto3 as _boto3

_ddb = _boto3.resource("dynamodb")
_table = _ddb.Table(os.environ["DEDUP_TABLE"])

def _dedup_key(cfg: dict) -> str:
    raw = f"{cfg['bucket']}#{cfg['prefix']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# at the top of _run(), before any S3 work:
dedup_key = _dedup_key(cfg)
resp = _table.get_item(Key={"id": dedup_key})
if "Item" in resp:
    return json.loads(resp["Item"]["result"])  # cached: skip all S3 work

# ... do the work ...

# at the end, before returning:
_table.put_item(Item={
    "id": dedup_key,
    "result": json.dumps(result),
    "ttl": int(time.time()) + 86400,
})
```
The dedup key is derived from the logical job identity (`bucket + prefix`), not the `request_id`. Using `request_id` would only guard against Lambda's own retries of the same invocation; using the business key guards against a caller submitting the same job twice.
AWS PowerTools for Lambda has a built-in `@idempotent` decorator that implements this exact pattern, including TTL management and in-progress locking.
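A sketch of the decorator-based version, assuming the same `DEDUP_TABLE` env var; the JMESPath expression selects the business key, mirroring `_dedup_key`:
```python
import os

from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

persistence = DynamoDBPersistenceLayer(table_name=os.environ["DEDUP_TABLE"])
config = IdempotencyConfig(
    event_key_jmespath="[bucket, prefix]",  # business key, not request_id
    expires_after_seconds=86400,            # same 24 h TTL as the manual version
)

@idempotent(persistence_store=persistence, config=config)
def handler(event, context):
    ...
```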
**What it requires:** a DynamoDB table, `dynamodb:GetItem` + `dynamodb:PutItem` IAM permissions on the execution role, and the PowerTools layer (or `aws-lambda-powertools` in requirements).
### Manifest lifecycle rule
Every invocation writes a new object under `manifests/`. Without cleanup, this prefix grows unbounded. The fix is an S3 lifecycle rule on the bucket — not a handler change:
```json
{
  "Rules": [{
    "ID": "expire-manifests",
    "Filter": { "Prefix": "manifests/" },
    "Status": "Enabled",
    "Expiration": { "Days": 1 }
  }]
}
```
Objects under `manifests/` are deleted by S3 automatically after 1 day. The presigned URLs in those manifests are already short-lived (15 minutes by default), so there's no reason to keep the manifest longer than the URL validity window.
**What it requires:** a `PutBucketLifecycleConfiguration` call during infrastructure provisioning (CDK/Terraform/console) — nothing in the handler.
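For completeness, the equivalent one-off boto3 call (the bucket name is illustrative; in practice this lives in CDK or Terraform):
```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-company-reports-bucket",  # illustrative
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-manifests",
            "Filter": {"Prefix": "manifests/"},
            "Status": "Enabled",
            "Expiration": {"Days": 1},
        }]
    },
)
```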
### Idempotency + manifest lifecycle together
With both in place: a retry returns the same `manifest_url` pointing to the same (still-live) manifest object; after 24 hours the manifest is gone and the dedup record has expired, so the next invocation starts fresh. The combination is clean.
### ReportBatchItemFailures (SQS only)
If this function were triggered by an SQS event source mapping (one message = one `(bucket, prefix)` job), the consumer-level `errors` field isn't enough — Lambda needs to know *which SQS messages* failed so it can re-queue only those. Return a `batchItemFailures` list instead of raising:
```python
def handler(event, context):
    failures = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        try:
            asyncio.run(_run(body, context.aws_request_id))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```
Without this, a single failed message causes the entire batch to retry, including messages that succeeded — work is repeated and the queue can stall on a poison-pill message indefinitely.
**What it requires:** `ReportBatchItemFailures` enabled on the ESM configuration (CDK/Terraform) and restructuring the handler to iterate over `event["Records"]`. Not applicable to direct (RequestResponse) invocations like the local tester uses.
### arm64 / Graviton2
No code change needed. Switch the function's architecture to `arm64` in the deployment config:
```yaml
# SAM
Architectures: [arm64]
```
Graviton2 costs ~20% less per GB-second and typically runs the init phase ~10% faster. The only blocker is native-code wheels: `aiobotocore` ships pure Python so there's no binary incompatibility here. Worth doing as a zero-effort cost and latency win.