Complete two-stage build pipeline: separate dependency discovery from compilation #995
Description
Why
Building Python packages from source involves two fundamentally different kinds of work:
- Discovery — figuring out what to build: resolving versions from package indexes, running PEP 517 hooks (`get_requires_for_build_wheel`, `get_requires_for_build_sdist`) to find build dependencies, and downloading source code. These operations involve network access and execute arbitrary code from upstream packages.
- Compilation — actually building it: running `build_wheel()` and `build_sdist()` on already-downloaded, already-resolved source code to produce wheels.
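The split above can be expressed as a simple classifier. The hook names are standard PEP 517; the two sets and the helper function are illustrative, not fromager's API:

```python
# Discovery hooks may run arbitrary upstream code and need network access
# for resolution; compilation hooks operate on already-resolved sources.
DISCOVERY_HOOKS = {
    "get_requires_for_build_wheel",
    "get_requires_for_build_sdist",
}

COMPILATION_HOOKS = {
    "build_wheel",
    "build_sdist",
}


def allowed_in_stage2(hook_name: str) -> bool:
    """Return True if a hook is safe to run in the compilation stage."""
    return hook_name in COMPILATION_HOOKS


print(allowed_in_stage2("build_wheel"))                   # True
print(allowed_in_stage2("get_requires_for_build_wheel"))  # False
```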
Today, fromager separates these into two commands (`bootstrap` and `build-sequence`), with `build-order.json` and `graph.json` as the bridge between them.
But the separation is incomplete — `build-sequence` still falls back to running PEP 517 discovery hooks and making network requests when cached data is missing.
Security
Discovery hooks are the primary attack surface in Python packaging. A compromised package's `get_requires_for_build_wheel()` runs arbitrary code.
If we confine all discovery to Stage 1 and ensure Stage 2 never runs discovery hooks, we create a clear boundary: untrusted discovery code executes once, in a controlled environment, and only auditable data crosses into the build stage.
Offline and air-gapped builds
A complete separation enables: run Stage 1 on a connected machine, package everything into a transferable directory, build on an air-gapped machine using only local files.
Auditability
When the build plan (`build-order.json` + `graph.json`) contains everything needed to build, the entire plan becomes a reviewable, diffable artifact. Security teams can inspect every package, version, source URL, and build dependency before any compilation happens.
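A review pass over such a plan could be as simple as the sketch below. The field names (`name`, `version`, `source_url`) are assumptions for illustration; the actual `build-order.json` schema may differ:

```python
import json

# Hypothetical build plan; in practice this would be the contents of
# build-order.json produced by Stage 1.
plan_json = """
[
  {"name": "example-pkg", "version": "1.0",
   "source_url": "https://example.org/example-pkg-1.0.tar.gz"},
  {"name": "other-pkg", "version": "2.3",
   "source_url": "https://example.org/other-pkg-2.3.tar.gz"}
]
"""

# Print one reviewable line per planned build: package, version, source.
for entry in json.loads(plan_json):
    print(f"{entry['name']}=={entry['version']}  <-  {entry['source_url']}")
```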
Current state
The two-stage architecture is partially implemented:
- Stage 1 (`bootstrap`) is complete. It produces `build-order.json` and `graph.json`, downloads all sources, and caches build requirement files (`build-system-requirements.txt`, `build-backend-requirements.txt`, `build-sdist-requirements.txt`) alongside unpacked sources. `graph.json` already records all dependency edges annotated by type (`build-system`, `build-backend`, `build-sdist`, `install`, `toplevel`).
- Stage 2 (`build-sequence` / `build-parallel`) is functional, but it has fallback paths that re-run PEP 517 discovery hooks and access the network when cached files are missing.
- `download-sequence` downloads sdist archives and optionally pre-built wheels, but skips git and override source types.
The caching mechanism in `dependencies.py` already provides the foundation — if `build-*-requirements.txt` files exist, cached data is returned without running hooks. The gap is that there is no formal way to package these files for transfer to a separate build environment, and no mode to prevent Stage 2 from falling back to hook execution when cached data is absent.
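The cached-only lookup, plus the missing strict mode, could look roughly like this. Function names, the `strict` flag, and the exception are illustrative assumptions, not fromager's actual API; only the `build-*-requirements.txt` naming comes from the description above:

```python
from pathlib import Path


class MissingBuildRequirements(Exception):
    """Raised in strict mode when cached data is absent."""


def cached_build_requirements(source_dir: Path, kind: str,
                              strict: bool) -> list[str]:
    # If the cached requirements file exists, return its contents
    # without running any PEP 517 hooks.
    cache_file = source_dir / f"build-{kind}-requirements.txt"
    if cache_file.exists():
        return [line.strip()
                for line in cache_file.read_text().splitlines()
                if line.strip()]
    if strict:
        # Strict mode: refuse to fall back to discovery hooks or the network.
        raise MissingBuildRequirements(
            f"{cache_file} missing; refusing to run discovery hooks")
    # Fallback path that Stage 2 should never take (placeholder).
    return run_discovery_hook(kind)


def run_discovery_hook(kind: str) -> list[str]:
    # Stand-in for invoking the PEP 517 hook; Stage 1 only.
    raise NotImplementedError
```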
Goal
Stage 2 should be able to run with zero network access and zero PEP 517 discovery hook execution, using only data artifacts produced by Stage 1. The only untrusted code that Stage 2 should execute is compilation hooks (`build_wheel`, `build_sdist`) — the operations that produce the actual output.
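One way to enforce that contract is a process-wide stage flag that every discovery entry point checks before doing anything. This is a minimal sketch under assumed names (`compilation_stage`, `DiscoveryForbidden`), not a proposal for fromager's actual implementation:

```python
import contextlib

_IN_COMPILATION_STAGE = False


class DiscoveryForbidden(RuntimeError):
    """Raised when discovery is attempted during the compilation stage."""


@contextlib.contextmanager
def compilation_stage():
    """Within this context, any attempt at discovery raises."""
    global _IN_COMPILATION_STAGE
    _IN_COMPILATION_STAGE = True
    try:
        yield
    finally:
        _IN_COMPILATION_STAGE = False


def require_discovery_allowed(operation: str) -> None:
    # Called at the top of every discovery entry point.
    if _IN_COMPILATION_STAGE:
        raise DiscoveryForbidden(
            f"{operation} attempted during compilation stage")
```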
Related: #797