# sign_pdfs_optimized — Walkthrough

> Production-refined fork of `sign_pdfs`. Applies 8 improvements at once; meant to be compared side-by-side with the original in the local tester.

The original `sign_pdfs` is the "before" snapshot — it is left untouched. This file (`sign_pdfs_optimized/handler.py`) is the "after": same contract, same output shape, all the rough edges from the original fixed.

---
## Block 1 — Imports and sentinel

```python
import asyncio, json, os, time, uuid
import aioboto3
import aiofiles

_DONE = object()
```

Nothing new at the import level. `_DONE` is a sentinel object used to signal consumers that the producer has finished. Using a unique object (rather than `None` or a string) means no PDF key will ever accidentally match it.

**Original issue fixed here:** the original put a single `_DONE` on the queue regardless of how many consumers were running. With N consumers, one of them would consume the sentinel and exit, but the others would block forever waiting for a key that never arrives. The fix is one sentinel per consumer — that happens in the producer's `finally` block (Block 4).

---
## Block 2 — Config from event (`_cfg`)

```python
def _cfg(event: dict) -> dict:
    return {
        "bucket": event.get("bucket") or os.environ.get("BUCKET_NAME", "my-company-reports-bucket"),
        "prefix": event.get("prefix") or os.environ.get("PREFIX", "2026/04/"),
        "expiry": int(event.get("expiry_seconds") or os.environ.get("URL_EXPIRY_SECONDS", "900")),
        "endpoint": os.environ.get("S3_ENDPOINT_URL") or None,
        "page_size": int(event.get("page_size") or 1000),
        "concurrency": int(event.get("concurrency") or 4),
        "queue_max": int(os.environ.get("QUEUE_MAX", "2000")),
    }
```

**Improvement 1 — event-driven config.** The original hard-coded `bucket`, `prefix`, and `expiry` as module-level constants read once from env vars. That means one function deployment = one bucket/prefix combination. Here the event is the primary source; env vars are fallbacks. The same deployed function can serve `2026/04/`, `2026/05/`, or an entirely different bucket by changing only the event payload — no redeployment.

`endpoint` is env-only because it is infrastructure (the MinIO URL in local dev, absent in production) — callers should never need to specify it per-request.

`queue_max` bounds the in-memory queue so the producer can't buffer unbounded keys before consumers have a chance to drain it. With 1000 keys per page and 2000 queue slots, the producer is at most two pages ahead of the consumers.

---
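
The resolution order in `_cfg` above is easy to exercise in isolation. A trimmed copy (bucket and prefix values here are illustrative):

```python
import os

def _cfg(event: dict) -> dict:
    # trimmed copy of the handler's _cfg: event first, env fallback, then default
    return {
        "bucket": event.get("bucket") or os.environ.get("BUCKET_NAME", "my-company-reports-bucket"),
        "prefix": event.get("prefix") or os.environ.get("PREFIX", "2026/04/"),
        "expiry": int(event.get("expiry_seconds") or os.environ.get("URL_EXPIRY_SECONDS", "900")),
    }

# No overrides in the event: env vars (or the baked-in defaults) win
assert _cfg({})["expiry"] == int(os.environ.get("URL_EXPIRY_SECONDS", "900"))

# Event overrides: the same deployment serves a different bucket/prefix
cfg = _cfg({"bucket": "archive-bucket", "prefix": "2026/05/", "expiry_seconds": 300})
assert (cfg["bucket"], cfg["prefix"], cfg["expiry"]) == ("archive-bucket", "2026/05/", 300)
```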
## Block 3 — Structured logging (`_log`) and EMF metrics (`_emit_emf`)

```python
def _log(event_type: str, **fields):
    print(json.dumps({"event": event_type, **fields}))


def _emit_emf(metrics: dict, **dims):
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "eth/sign_pdfs_optimized",
                "Dimensions": [list(dims.keys())],
                "Metrics": [
                    {"Name": k, "Unit": "Bytes" if k.endswith("Bytes") else "Count"}
                    for k in metrics
                ],
            }],
        },
        **dims, **metrics,
    }))
```

**Improvement 2 — structured JSON logging.** The original printed nothing. `_log` emits one JSON object per line to stdout. CloudWatch Logs Insights can then query fields directly:

```
filter event = "complete" | stats avg(duration_ms) by bin(5m)
```

The local runner's `_extract_json_logs` picks up these lines and surfaces them in the tester's "Structured logs" panel.

**Improvement 3 — EMF metrics.** `_emit_emf` writes a CloudWatch Embedded Metrics Format block to stdout. The Lambda runtime (and locally, the runner's `_extract_emf_metrics`) parses this and publishes real CloudWatch metrics — no `PutMetricData` API call, no extra latency, no per-metric cost. The unit is inferred from the metric name: anything ending in `Bytes` → `"Bytes"`, everything else → `"Count"`.

EMF is emitted once at the end of the invocation, not once per item. Putting it inside a loop would flood CloudWatch with thousands of data points per invocation.

---
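
The emitted shape can be checked without CloudWatch. This sketch mirrors `_emit_emf` but returns the JSON line instead of printing it, then verifies the unit inference (the metric values are made up):

```python
import json, time

def _emit_emf(metrics: dict, **dims) -> str:
    # same structure as the handler's _emit_emf, returned instead of printed
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "eth/sign_pdfs_optimized",
                "Dimensions": [list(dims.keys())],
                "Metrics": [
                    {"Name": k, "Unit": "Bytes" if k.endswith("Bytes") else "Count"}
                    for k in metrics
                ],
            }],
        },
        **dims, **metrics,
    })

line = _emit_emf({"PDFsProcessed": 42, "ManifestBytes": 5120}, Function="sign_pdfs_optimized")
doc = json.loads(line)
units = {m["Name"]: m["Unit"] for m in doc["_aws"]["CloudWatchMetrics"][0]["Metrics"]}
assert units == {"PDFsProcessed": "Count", "ManifestBytes": "Bytes"}  # unit inference
assert doc["PDFsProcessed"] == 42 and doc["Function"] == "sign_pdfs_optimized"
```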
## Block 4 — Producer with `try/finally`

```python
async def producer():
    nonlocal pages
    try:
        paginator = s3.get_paginator("list_objects_v2")
        async for page in paginator.paginate(
            Bucket=cfg["bucket"], Prefix=cfg["prefix"],
            PaginationConfig={"PageSize": cfg["page_size"]},
        ):
            pages += 1
            for obj in page.get("Contents", []) or []:
                key = obj["Key"]
                if key.lower().endswith(".pdf"):
                    await queue.put(key)
    finally:
        for _ in range(cfg["concurrency"]):
            await queue.put(_DONE)
```

**Improvement 4 — guaranteed consumer exit.** The `finally` block runs whether the producer succeeds, raises a `ClientError`, or is cancelled. Each of the N consumers gets its own `_DONE` sentinel. Without `finally`, an exception in the producer leaves all consumers blocked on `queue.get()` forever — the handler hangs and eventually times out at 15 minutes.

**Improvement 7 — `PageSize=1000`.** S3's `ListObjectsV2` returns at most 1000 keys per page, so this pins the paginator at the service maximum instead of leaving it to defaults, and it makes explicit the assumption the Block 2 queue bound relies on (`queue_max` = two full pages). Fewer pages = fewer round-trip API calls for large prefixes.

---
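
The one-sentinel-per-consumer shutdown contract above can be demonstrated with a self-contained toy queue (no S3 involved; the keys are invented):

```python
import asyncio

_DONE = object()  # unique sentinel: no real key can be identical to it

async def main(n_consumers: int = 3):
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)

    async def producer():
        try:
            for i in range(10):
                await queue.put(f"reports/doc-{i}.pdf")  # made-up keys
        finally:
            # one sentinel per consumer, even if the loop above raised
            for _ in range(n_consumers):
                await queue.put(_DONE)

    async def consumer():
        handled = 0
        while True:
            item = await queue.get()
            if item is _DONE:
                return handled
            handled += 1

    prod = asyncio.create_task(producer())
    counts = await asyncio.gather(*(consumer() for _ in range(n_consumers)))
    await prod
    return counts

counts = asyncio.run(main())
assert sum(counts) == 10  # every key consumed exactly once, all consumers exited
```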
## Block 5 — Consumer with per-item error handling

```python
async def consumer():
    nonlocal errors
    local = 0
    while True:
        item = await queue.get()
        if item is _DONE:
            return local
        try:
            url = await s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": cfg["bucket"], "Key": item},
                ExpiresIn=cfg["expiry"],
            )
            line = json.dumps({"key": item, "url": url}) + "\n"
            async with write_lock:
                await f.write(line)
            local += 1
        except Exception as exc:
            errors += 1
            _log("presign_error", request_id=request_id, key=item, error=str(exc))
```

**Improvement 5 — per-item error isolation.** The original let a single `generate_presigned_url` failure propagate out of the consumer and crash the whole batch. Here, each item is wrapped in its own `try/except`. A bad key increments `errors` and logs the details; the consumer continues with the next key. The caller receives `{"count": N, "errors": M}` and can decide whether to retry the M failures.

`local` counts successful presigns per consumer. The N return values are summed after `asyncio.gather` to get the total count.

The `write_lock` serialises writes to the shared JSONL file — without it, concurrent `await f.write(line)` calls would interleave partial lines.

---
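
The isolation behaviour is easy to sanity-check with a stand-in presigner; the poisoned key and URL format here are invented:

```python
import asyncio, json

async def presign(key: str) -> str:
    # stand-in for s3.generate_presigned_url; fails on one poisoned key
    if key == "reports/corrupt.pdf":
        raise ValueError("presign failed")
    return f"https://example.com/{key}?sig=abc"

async def run(keys):
    lines, errors = [], 0
    for key in keys:
        try:
            url = await presign(key)
            lines.append(json.dumps({"key": key, "url": url}))
        except Exception:
            errors += 1  # count and continue; the batch survives
    return lines, errors

lines, errors = asyncio.run(run(["reports/a.pdf", "reports/corrupt.pdf", "reports/b.pdf"]))
assert len(lines) == 2 and errors == 1  # one bad key, two good results
```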
## Block 6 — N concurrent consumers (`asyncio.gather`)

```python
prod_task = asyncio.create_task(producer())
counts = await asyncio.gather(*(consumer() for _ in range(cfg["concurrency"])))
await prod_task
```

**Improvement 8 — concurrent consumers.** The original ran one consumer. Here `asyncio.gather` starts N consumer coroutines simultaneously, all reading from the same queue. Every `await` in the consumer (the queue read, the presign call, the locked file write) is a yield point, so while one consumer is suspended the others make progress. With N=4 and 1000 PDFs, up to 4 items are in flight at once instead of 1 — throughput roughly scales with `concurrency` up to the point where the producer can't keep the queue full.

`asyncio.create_task(producer())` schedules the producer to run concurrently with the consumers rather than sequentially before them. The `await prod_task` after gather ensures any producer exception is re-raised (not silently swallowed as it would be in a fire-and-forget task).

---
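
A minimal sketch of why the final `await prod_task` matters; the error message is invented:

```python
import asyncio

async def failing_producer():
    # stand-in for a producer whose list call blows up
    raise RuntimeError("list_objects_v2 failed")

async def demo() -> str:
    task = asyncio.create_task(failing_producer())
    await asyncio.sleep(0)  # let the task run (and fail) in the background
    try:
        await task          # awaiting the task re-raises the stored exception
    except RuntimeError as exc:
        return str(exc)
    return "no error"

assert asyncio.run(demo()) == "list_objects_v2 failed"
```

Without the final `await`, the exception would sit unobserved in the task object and surface only as a "Task exception was never retrieved" warning at shutdown.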
## Block 7 — Streaming manifest upload

```python
manifest_bytes = os.path.getsize(manifest_path)
manifest_key = f"manifests/{uuid.uuid4()}.jsonl"

with open(manifest_path, "rb") as f:
    await s3.put_object(
        Bucket=cfg["bucket"], Key=manifest_key, Body=f,
        ContentType="application/x-ndjson",
    )
```

**Improvement 6 — streaming upload.** The original read the entire manifest into memory with `await f.read()` before uploading. For a bucket with 100k PDFs, the JSONL manifest could be tens of MB — a full `read()` doubles peak RAM usage. Passing a sync file handle as `Body` lets aiobotocore read the file in chunks internally. Memory usage stays flat regardless of manifest size.

`os.path.getsize` is called before opening the file for upload so the size is available for the EMF block even after the file is deleted.

---
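
The flat-memory claim can be illustrated with a hypothetical chunked sender; aiobotocore's actual internal chunk size will differ:

```python
import os, tempfile

# Hypothetical chunked sender: only `chunk_size` bytes are ever resident,
# which is why handing put_object a file handle keeps memory flat.
def upload_in_chunks(fh, send, chunk_size=64 * 1024):
    total = 0
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            return total
        total += send(chunk)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'{"key": "a.pdf"}\n' * 10_000)  # a 170 KB stand-in manifest
    path = tmp.name

sent = []
with open(path, "rb") as fh:
    total = upload_in_chunks(fh, lambda c: sent.append(len(c)) or len(c))

assert total == os.path.getsize(path)  # every byte accounted for
assert max(sent) <= 64 * 1024          # never more than one chunk in memory
os.unlink(path)
```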
## Block 8 — Cleanup, metrics, and response

```python
os.unlink(manifest_path)

result = {
    "count": count,
    "errors": errors,
    "manifest_key": manifest_key,
    "manifest_url": manifest_url,
}
response_bytes = len(json.dumps(result))
duration_ms = (time.monotonic() - t0) * 1000

_log("complete", request_id=request_id, count=count, errors=errors,
     pages=pages, duration_ms=round(duration_ms, 2))
_emit_emf({
    "PDFsProcessed": count, "S3ListPages": pages,
    "PresignCount": count, "ManifestBytes": manifest_bytes,
    "ResponseBytes": response_bytes,
}, Function="sign_pdfs_optimized")

return result
```

`os.unlink` removes the temp file after upload. `/tmp` is shared across warm invocations — not cleaning up leaks storage toward the 512 MB (default) or 10 GB (max) `/tmp` cap.

`response_bytes = len(json.dumps(result))` captures how large the sync response would be. Lambda's synchronous response limit is 6 MB. This function returns a URL, not a body, so `response_bytes` is typically under 300 bytes — but measuring it makes the headroom concrete.

The final `_log("complete", ...)` and `_emit_emf(...)` emit together at the end so all counts are final. In the tester, the Structured logs panel shows the `complete` event and the EMF panel shows the five metric values.

---
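
For a sense of that headroom, a made-up worst-case payload (100k PDFs) still measures only a couple of hundred bytes:

```python
import json

# Hypothetical worst-case result: the body is still just a key and a URL
result = {
    "count": 100_000,
    "errors": 3,
    "manifest_key": "manifests/0c9d2f6a-example.jsonl",
    "manifest_url": "https://example-bucket.s3.amazonaws.com/manifests/0c9d2f6a-example.jsonl?X-Amz-Expires=900",
}
response_bytes = len(json.dumps(result))
assert response_bytes < 300              # a few hundred bytes in practice
assert response_bytes < 6 * 1024 * 1024  # Lambda's synchronous response cap
```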
## Block 9 — Entrypoint

```python
def handler(event, context):
    request_id = getattr(context, "aws_request_id", str(uuid.uuid4()))
    result = asyncio.run(_run(event, request_id))
    return {"statusCode": 200, "body": json.dumps(result)}
```

`asyncio.run` creates a fresh event loop for each invocation. Lambda's Python runtime is synchronous at the handler boundary; `asyncio.run` is the correct bridge between the sync entrypoint and the async internals. A module-level `loop = asyncio.get_event_loop()` would work on the first invocation but is deprecated and unreliable across warm invocations.

`getattr(context, "aws_request_id", ...)` falls back to a generated UUID when the context object doesn't have the attribute — which is the case in the local tester, where the runner passes a minimal stub.

---
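
The fallback path can be exercised with a stub context object like the local tester's (the stub class here is hypothetical):

```python
import uuid

class _StubContext:  # hypothetical stand-in for what a local tester passes
    pass             # note: no aws_request_id attribute

def request_id_of(context) -> str:
    # same getattr fallback as the handler's first line
    return getattr(context, "aws_request_id", str(uuid.uuid4()))

rid = request_id_of(_StubContext())
assert len(rid) == 36  # fell back to a freshly generated UUID string
```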
## Summary of improvements

| # | What changed | Why it matters |
|---|---|---|
| 1 | Config from event, env as fallback | One function serves many prefixes |
| 2 | Structured JSON logging (`_log`) | Queryable in Logs Insights; visible in tester |
| 3 | EMF metrics (`_emit_emf`) | Free CloudWatch metrics, no API call overhead |
| 4 | Producer `try/finally` with N sentinels | Consumers always exit; handler never hangs |
| 5 | Per-item `try/except` in consumer | One bad key doesn't crash the batch |
| 6 | Sync file handle as `Body` | Flat memory for any manifest size |
| 7 | `PageSize=1000` | Fewer S3 round-trips on large prefixes |
| 8 | N concurrent consumers via `asyncio.gather` | Presign throughput scales with `concurrency` |

---
## Further improvements (not yet applied)

These are the next natural steps if this function were going to production. They were left out intentionally — each adds infrastructure or AWS-side configuration that goes beyond the handler itself.

### Idempotency

The function is not idempotent in the strict sense. Each invocation with the same `(bucket, prefix)` event produces a new manifest at a new UUID key. If Lambda retries the invocation (async invocations retry up to 2 times by default, and S3/SNS/EventBridge are at-least-once), you accumulate duplicate manifests in the `manifests/` prefix.

The standard fix is a DynamoDB dedup table:

```python
import hashlib, boto3 as _boto3

_ddb = _boto3.resource("dynamodb")
_table = _ddb.Table(os.environ["DEDUP_TABLE"])

def _dedup_key(cfg: dict) -> str:
    raw = f"{cfg['bucket']}#{cfg['prefix']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

# at the top of _run(), before any S3 work:
dedup_key = _dedup_key(cfg)
resp = _table.get_item(Key={"id": dedup_key})
if "Item" in resp:
    return json.loads(resp["Item"]["result"])  # cached — skip all S3 work

# ... do the work ...

# at the end, before returning:
_table.put_item(Item={
    "id": dedup_key,
    "result": json.dumps(result),
    "ttl": int(time.time()) + 86400,
})
```

The dedup key is derived from the logical job identity (`bucket + prefix`), not the `request_id`. Using `request_id` would only guard against Lambda's own retries of the same invocation; using the business key guards against a caller submitting the same job twice.

AWS PowerTools for Lambda has a built-in `@idempotent` decorator that implements this exact pattern, including TTL management and in-progress locking.

**What it requires:** a DynamoDB table, `dynamodb:GetItem` + `dynamodb:PutItem` IAM permissions on the execution role, and the PowerTools layer (or `aws-lambda-powertools` in requirements).
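
The business-key property is easy to verify; `_dedup_key` is copied from the sketch above so this runs standalone (bucket and prefix values are illustrative):

```python
import hashlib

def _dedup_key(cfg: dict) -> str:
    raw = f"{cfg['bucket']}#{cfg['prefix']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

a = _dedup_key({"bucket": "b", "prefix": "2026/04/"})
b = _dedup_key({"bucket": "b", "prefix": "2026/04/"})
c = _dedup_key({"bucket": "b", "prefix": "2026/05/"})
assert a == b       # same logical job -> same key, whatever the request_id
assert a != c       # different prefix -> different job
assert len(a) == 32 # truncated sha256 hex digest
```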
### Manifest lifecycle rule

Every invocation writes a new object under `manifests/`. Without cleanup, this prefix grows unbounded. The fix is an S3 lifecycle rule on the bucket — not a handler change:

```json
{
  "Rules": [{
    "ID": "expire-manifests",
    "Filter": { "Prefix": "manifests/" },
    "Status": "Enabled",
    "Expiration": { "Days": 1 }
  }]
}
```

Objects under `manifests/` are deleted by S3 automatically after 1 day. The presigned URLs in those manifests are already short-lived (15 minutes by default), so there's no reason to keep the manifest longer than the URL validity window.

**What it requires:** a `PutBucketLifecycleConfiguration` call during infrastructure provisioning (CDK/Terraform/console) — nothing in the handler.

### Idempotency + manifest lifecycle together

With both in place: a retry returns the same `manifest_key` pointing to the same (still-live) manifest object; after 24 hours the manifest is gone and the dedup record has expired, so the next invocation starts fresh. One caveat: the cached `manifest_url` embeds a presigned signature that expires after `expiry` seconds (15 minutes by default), so a late retry gets a valid key but may need to re-presign the URL. Matching the dedup TTL (86400 s) to the lifecycle expiry (1 day) keeps the two cleanups in step.
### ReportBatchItemFailures (SQS only)

If this function were triggered by an SQS event source mapping (one message = one `(bucket, prefix)` job), the consumer-level `errors` field isn't enough — Lambda needs to know *which SQS messages* failed so it can re-queue only those. Return a `batchItemFailures` list instead of raising:

```python
def handler(event, context):
    failures = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        try:
            asyncio.run(_run(body, context.aws_request_id))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this, a single failed message causes the entire batch to retry, including messages that succeeded — work is repeated and the queue can stall on a poison-pill message indefinitely.

**What it requires:** `ReportBatchItemFailures` enabled on the ESM configuration (CDK/Terraform) and restructuring the handler to iterate over `event["Records"]`. Not applicable to direct (RequestResponse) invocations like the local tester uses.
### arm64 / Graviton2

No code change needed. Switch the function's architecture to `arm64` in the deployment config:

```yaml
# SAM
Architectures: [arm64]
```

Graviton2 costs ~20% less per GB-second and typically runs the init phase ~10% faster. The only blocker is native-code wheels: `aiobotocore` ships pure Python so there's no binary incompatibility here. Worth doing as a zero-effort cost and latency win.