
[Misc] Harden xet cargo build against transient network failures #641

Merged
slin1237 merged 1 commit into main from fix/xet-cargo-network-resilience
Jun 27, 2026

@pallasathena92

Collaborator

What this PR does

The model-agent image build intermittently fails while compiling the
pkg/xet Rust library because a crate download gets cut off mid-transfer.
This hardens the cargo build against transient network failures and speeds
up rebuilds by caching downloads.

Why we need it

make model-agent-image fails non-deterministically at the XET build step:

error: failed to get async-trait as a dependency of package ome-xet-binding
Caused by: download of as/yn/async-trait failed
Caused by: curl failed
Caused by: [18] Transferred a partial file

curl error 18 (CURLE_PARTIAL_FILE) means the connection delivered fewer
bytes than expected. The same download succeeds on retry, so the failure
is at the transport level, not a bad dependency.

Root cause

  • cargo downloads crates over HTTP/2 with multiplexing by default; on
    flaky networks / proxies / Docker bridges, a dropped multiplexed stream
    truncates a download → curl-18.
  • The build had no retry tuning (cargo's default is net.retry = 3) and
    no fetch caching, so a single blip could abort the whole image build,
    and every rebuild re-downloaded everything.

Changes

pkg/xet/.cargo/config.toml (new)

| Setting | Why |
| --- | --- |
| http.multiplexing = false | Force HTTP/1.1, which removes the curl-18 trigger (primary fix) |
| net.retry = 4 | Tolerate transient blips; a small bump over cargo's default, still low enough to surface real infra issues |
| net.git-fetch-with-cli = true | Use system git for the huggingface/xet-core git deps (more robust than libgit2) |
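
For reference, the three settings above would map onto cargo's config file roughly as follows. This is a sketch reconstructed from the table, not a copy of the committed file; the comments are illustrative, and the section layout follows cargo's documented `[http]` and `[net]` tables.

```toml
# pkg/xet/.cargo/config.toml (sketch reconstructed from the settings table)

[http]
# Disable HTTP/2 multiplexing so crate downloads fall back to HTTP/1.1.
multiplexing = false

[net]
# Retry transient network errors up to 4 times before failing.
retry = 4
# Fetch git dependencies with the system git binary instead of libgit2.
git-fetch-with-cli = true
```

Cargo looks for `.cargo/config.toml` in the current directory and its parents, so this file should only take effect when cargo is invoked from within pkg/xet (or a subdirectory), which is assumed to be how the image build runs it.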

dockerfiles/model-agent.Dockerfile

  • Cache-mount /root/.cargo/registry and /root/.cargo/git for the XET
    build step, so crates and git deps aren't re-downloaded on rebuild
    (faster + resilient). target/ is intentionally not cached so the
    subsequent ls … target/release/libxet.* step still sees the artifacts.
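
A cache-mounted build step of this kind might look like the fragment below. It is a minimal sketch assuming BuildKit is enabled; the `WORKDIR` path and the exact `cargo build` invocation are assumptions for illustration and are not taken from the actual Dockerfile.

```dockerfile
# syntax=docker/dockerfile:1
# Sketch only: WORKDIR and the build command are hypothetical.
WORKDIR /workspace/pkg/xet

# Cache mounts persist across builds but are NOT written into the image layer.
# Registry and git caches are mounted; target/ is left alone so the compiled
# libxet artifacts remain in the layer for later steps.
RUN --mount=type=cache,target=/root/.cargo/registry \
    --mount=type=cache,target=/root/.cargo/git \
    cargo build --release
```

Because a cache mount's contents exist only while the `RUN` instruction executes, anything written there is invisible to later instructions. That is why caching target/ would hide the build output from the following `ls` step unless the artifacts were copied out within the same `RUN`.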

Testing

  • make model-agent-image builds successfully

Checklist

  • Tests added/updated (if applicable)
  • Docs updated (if applicable)
  • make test passes locally

@github-actions (bot) added the docker (Dockerfile changes) and xet (Xet/HuggingFace acceleration changes) labels on Jun 27, 2026
@slin1237 merged commit d676af7 into main on Jun 27, 2026
13 checks passed
@slin1237 deleted the fix/xet-cargo-network-resilience branch on June 27, 2026 01:45

2 participants