Tree of eth/ — the sandbox plus this study site.
```
eth/
├── lambda_function.py  — handler: async PDF scan → presigned URLs → JSONL manifest
├── invoke.py           — local runner: calls handler() with a minimal event, prints result
├── seed.py             — uploads PDFs from a local directory to MinIO
├── requirements.txt    — aioboto3, aiofiles (+ transitive: aiobotocore, botocore…)
├── docker-compose.yml  — runs MinIO on :9000 (S3 API) and :9001 (web console)
├── Makefile            — install / up / down / seed / invoke / graphs / docs
├── def/
│   └── task.md         — original interview exercise specification
└── docs/
    ├── index.html      — this study site (single-page, no build step)
    ├── viewer.html     — pan/zoom SVG viewer (opened by graph links)
    └── graphs/
        ├── system_overview.dot / .svg    — caller → handler → MinIO/S3 → manifest
        ├── lifecycle.dot / .svg          — init / handler / freeze / thaw / shutdown
        └── cold_warm_timeline.dot / .svg — cold vs warm invocation timeline
```
aioboto3 adds ~200 ms to the cold start. If cold start matters and file counts are small, sync boto3 with threading is simpler and starts faster. Async pays off only when file counts are large enough that the overlap is significant.

Architectural questions that came up, worth keeping so they don't have to be re-derived.
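A minimal sketch of the simpler sync alternative: a `ThreadPoolExecutor` fan-out over keys. The `presign` function here is a hypothetical stub standing in for boto3's `client.generate_presigned_url`, so the sketch runs without AWS credentials; the fan-out shape is what matters.

```python
from concurrent.futures import ThreadPoolExecutor

def presign(key: str) -> str:
    # Hypothetical stub: a real handler would call
    #   s3.generate_presigned_url("get_object", Params={"Bucket": b, "Key": key})
    # boto3 clients are documented as thread-safe, so one client can be shared.
    return f"https://example-bucket.s3.amazonaws.com/{key}?X-Amz-Signature=stub"

def presign_all(keys: list[str], workers: int = 8) -> list[str]:
    # Plain threads: no event loop, no aioboto3 import cost at cold start.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(presign, keys))  # map preserves input order

urls = presign_all(["a.pdf", "b.pdf", "c.pdf"])
```

For small file counts the pool is barely exercised, which is the point: the code stays synchronous and debuggable while keeping the same per-key structure as the async version.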
**Why not split producer and consumer into separate Lambda functions?**
`generate_presigned_url` is local computation — no network call, just CPU — so the consumer is never blocked on anything external. The async coroutine pattern already provides the overlap benefit (S3 LIST wait ↔ presign + write) without any infrastructure overhead. Splitting into two Lambdas would mean: the in-process `asyncio.Queue` becomes SQS (latency plus cost per message), two cold starts, two IAM roles, coordination logic — and the actual bottleneck (S3 LIST pagination) would be unchanged.

Split producer and consumer into separate Lambdas when the consumer does real per-item I/O (external API calls, content downloads), when per-item processing takes seconds rather than milliseconds, or when you need independent retry semantics. None of those apply here. Scale-out for this function belongs one level up: a Step Functions Map state across prefixes (one Lambda per prefix), not within-prefix producer/consumer separation.
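The in-process pattern described above can be sketched with the stdlib alone. Here `fake_pages` is a hypothetical stand-in for the aioboto3 `list_objects_v2` paginator (the sleep models the network wait per LIST page), and `presign` stands in for `generate_presigned_url`; the queue lets presign-and-write work proceed while the next LIST page is still in flight.

```python
import asyncio

async def fake_pages():
    # Stand-in for the aioboto3 list_objects_v2 paginator (hypothetical);
    # the sleep models the network wait for each LIST page.
    for page in (["a.pdf", "b.pdf"], ["c.pdf"]):
        await asyncio.sleep(0.01)
        yield page

def presign(key: str) -> str:
    # Stand-in for generate_presigned_url: local CPU work, no I/O.
    return f"https://bucket.example/{key}?sig=stub"

async def producer(queue: asyncio.Queue) -> None:
    async for page in fake_pages():
        for key in page:
            await queue.put(key)
    await queue.put(None)  # sentinel: no more keys

async def consumer(queue: asyncio.Queue, manifest: list[str]) -> None:
    while (key := await queue.get()) is not None:
        manifest.append(presign(key))  # real handler: append a JSONL line

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    manifest: list[str] = []
    # Both sides run on one event loop: the consumer presigns page N
    # while the producer awaits page N+1 from S3.
    await asyncio.gather(producer(queue), consumer(queue, manifest))
    return manifest

urls = asyncio.run(main())
```

Replacing `asyncio.Queue` with SQS buys nothing here because the consumer's work is CPU-only; the queue exists purely to decouple the two coroutines' pacing within one process.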
| Target | What it does |
|---|---|