Complete two-stage build pipeline: separate dependency discovery from compilation #995

@LalatenduMohanty

Why

Building Python packages from source involves two fundamentally different kinds of work:

  1. Discovery — figuring out what to build: resolving versions from package indexes, running PEP 517 hooks (get_requires_for_build_wheel, get_requires_for_build_sdist) to find build dependencies, downloading source code. These operations involve network access and executing arbitrary code from upstream packages.

  2. Compilation — actually building it: running build_wheel() and build_sdist() on already-downloaded, already-resolved source code to produce wheels.
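The split between the two kinds of hooks can be sketched with a toy example. This is not fromager's code: the two functions below stand in for an arbitrary PEP 517 build backend (their signatures follow the PEP, but the bodies are invented), and the two "stages" are plain functions showing which hooks run where and that only serialized data crosses the boundary.

```python
import json

# --- Stand-in PEP 517 backend (in reality: setuptools, hatchling, etc.) ---
def get_requires_for_build_wheel(config_settings=None):
    # Discovery hook: in a real backend this may run arbitrary upstream code.
    return ["setuptools>=61", "wheel"]

def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
    # Compilation hook: produces the actual output artifact.
    return f"{wheel_directory}/toy-1.0-py3-none-any.whl"

def stage1_discover():
    """Stage 1: run discovery hooks once; emit an auditable data artifact."""
    plan = {"build_requires": get_requires_for_build_wheel()}
    return json.dumps(plan)  # only data, never code, crosses to Stage 2

def stage2_build(plan_json, wheel_directory="dist"):
    """Stage 2: consume the plan; never call discovery hooks again."""
    plan = json.loads(plan_json)
    # plan["build_requires"] would be installed into the build env here,
    # from local files, without re-running get_requires_for_build_wheel().
    return build_wheel(wheel_directory)

plan = stage1_discover()
wheel = stage2_build(plan)
```

The point of the sketch is the direction of data flow: Stage 2 takes a JSON plan as input and only ever calls the compilation hook.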

Today, fromager separates these into two commands (bootstrap and build-sequence), with build-order.json and graph.json as the bridge between them.

But the separation is incomplete — build-sequence still falls back to running PEP 517 discovery hooks and making network requests when cached data is missing.

Security

Discovery hooks are the primary attack surface in Python packaging. A compromised package's get_requires_for_build_wheel() runs arbitrary code.

If we confine all discovery to Stage 1 and ensure Stage 2 never runs discovery hooks, we create a clear boundary: untrusted discovery code executes once, in a controlled environment, and only auditable data crosses to the build stage.

Offline and air-gapped builds

A complete separation enables a workflow where Stage 1 runs on a connected machine, everything is packaged into a transferable directory, and the build runs on an air-gapped machine using only local files.

Auditability

When the build plan (build-order.json + graph.json) contains everything needed to build, the entire plan becomes a reviewable, diffable artifact. Security teams can inspect every package, version, source URL, and build dependency before any compilation happens.

Current state

The two-stage architecture is partially implemented:

  • Stage 1 (bootstrap) is complete. It produces build-order.json, graph.json, downloads all sources, and caches build requirement files (build-system-requirements.txt, build-backend-requirements.txt, build-sdist-requirements.txt) alongside unpacked sources. graph.json already records all dependency edges annotated by type (build-system, build-backend, build-sdist, install, toplevel).

  • Stage 2 (build-sequence / build-parallel) is functional but has fallback paths that re-run PEP 517 discovery hooks and access the network when cached files are missing.

  • download-sequence downloads sdist archives and optionally pre-built wheels, but skips git and override source types.
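Because graph.json annotates every edge with its dependency type, Stage 2 can select exactly the edges it needs without re-discovery. The sketch below is hypothetical: the exact graph.json schema is fromager's, and the structure shown here (an `edges` list with `type` fields) is an assumption for illustration only.

```python
import json

# Invented sample data in an assumed shape -- not fromager's real schema.
graph_json = """
{
  "edges": [
    {"from": "toplevel", "to": "flit_core==3.9.0", "type": "build-system"},
    {"from": "toplevel", "to": "requests==2.32.0", "type": "toplevel"},
    {"from": "requests==2.32.0", "to": "urllib3==2.2.1", "type": "install"}
  ]
}
"""

graph = json.loads(graph_json)

def edges_of_type(graph, edge_type):
    """Return all dependency edges carrying the given type annotation."""
    return [e for e in graph["edges"] if e["type"] == edge_type]

build_deps = edges_of_type(graph, "build-system")
```

With edges typed this way, a reviewer (or Stage 2 itself) can answer "what build dependencies does this plan pull in?" by filtering a data file, without executing any package code.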

The caching mechanism in dependencies.py already provides the foundation — if build-*-requirements.txt files exist, cached data is returned without running hooks. The gap is that there is no formal way to package these files for transfer to a separate build environment, and no mode to prevent Stage 2 from falling back to hook execution when cached data is absent.
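The missing "no fallback" mode could look roughly like the following. This is a sketch, not dependencies.py's actual API: the function name, the `strict` flag, and the exception are invented here; only the `build-*-requirements.txt` file layout comes from the description above.

```python
import tempfile
from pathlib import Path

class MissingCachedRequirements(Exception):
    """Raised in strict mode instead of falling back to PEP 517 hook execution."""

def cached_build_requirements(source_dir, kind="system", strict=True):
    # kind is one of "system", "backend", "sdist", matching the cached
    # build-system-requirements.txt etc. files next to the unpacked source.
    cache = Path(source_dir) / f"build-{kind}-requirements.txt"
    if cache.exists():
        return [ln.strip() for ln in cache.read_text().splitlines() if ln.strip()]
    if strict:
        raise MissingCachedRequirements(
            f"{cache} is missing; refusing to run discovery hooks"
        )
    return None  # non-strict caller would fall back to the hook (today's behavior)

# Usage: a source tree whose Stage 1 cache is present.
with tempfile.TemporaryDirectory() as src:
    Path(src, "build-system-requirements.txt").write_text("setuptools>=61\nwheel\n")
    reqs = cached_build_requirements(src, kind="system")
```

The design point is that strictness turns a silent fallback into a hard, auditable failure: a missing cache file means Stage 1 was incomplete, not that Stage 2 should quietly go back to the network.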

Goal

Stage 2 should be able to run with zero network access and zero PEP 517 discovery hook execution, using only data artifacts produced by Stage 1. The only untrusted code that Stage 2 should execute is compilation hooks (build_wheel, build_sdist) — the operations that produce the actual output.

Related: #797

Labels

enhancement (New feature or request)
