Skip to content

Layr-Labs/mlxfast-challenge-dev

Repository files navigation

mlxfast — DeepSeek V4 Flash

A benchmark arena for memory-bandwidth-optimal LLM inference on Apple Silicon. Run DeepSeek V4 Flash without loading all 256 experts into RAM — and beat the baseline score.

See TASK.md for the full problem statement, scoring formula, and approach space.

Quickstart

# Check local tools, build the Swift harness/MLX metallib, and fetch weights if needed
./setup.sh

# Optional: split dense weights into weights/ and write the expert streaming
# manifest. setup.sh prints this command with the exact reference path it used.
MLXFAST_OFFLINE_WRITABLE_PATHS="${PWD}/weights" \
  .github/scripts/run-offline.sh .build/release/mlxfast-swift transform \
  --reference .cache/huggingface/hub/models--mlx-community--DeepSeek-V4-Flash-4bit/snapshots/main \
  --output weights

# Run the checked-in public correctness gate.
.build/release/mlxfast-swift correctness --weights weights

# Fast edit-loop signal: uses the public 512-token correctness prompt, checks
# the prefill next token plus 16 teacher-forced decode tokens, writes
# score.local-iterate.json, and prints it to stdout.
./benchmark.sh --local-iterate

# Run the Darkbloom-compatible benchmark entrypoint.
# Official benchmark runs use the organizer-supplied hidden correctness_golden.json.
./benchmark.sh

# Local submit check used by Yukon before upload: runs the public 512-token
# prompt through a longer checked timing window, writes score.json with
# score: null, and prints it to stdout.
./benchmark.sh --local-submit

# Or call the Swift CLI directly
.build/release/mlxfast-swift correctness --weights weights
.build/release/mlxfast-swift preflight
.build/release/mlxfast-swift benchmark --local-iterate
.build/release/mlxfast-swift benchmark --score-path score.json
.build/release/mlxfast-swift benchmark --local-submit --score-path score.json

# If required model artifacts are missing, the benchmark emits a valid failed
# score.json instead of a ranked score.

The benchmark writes score.json in the format consumed by Darkbloom. score.json is a generated local output and is not tracked. Public correctness-only workflow runs use the checked-in correctness_prompts/public_longcopy_gate_english_512_256.json golden and matching prompt text. Official benchmark runs use a hidden correctness_golden.json supplied by the benchmark operator, or a harness path set with MLXFAST_CORRECTNESS_GOLDEN_PATH=/path/to/correctness_golden.json. benchmark.sh also writes score.json.sha256 and benchmark-integrity.json, which record the score file hash, golden hash, transformed weights/ hash, and transform source hash for run auditing.

Full model setup needs a large local or mounted SSD. The reference checkpoint is mlx-community/DeepSeek-V4-Flash-4bit, with 33 safetensors shards totaling about 141 GiB. setup.sh downloads the checkpoint from the fast Darkbloom/R2 mirror by default into a repo-local Hugging Face-style cache under .cache/huggingface/hub/models--mlx-community--DeepSeek-V4-Flash-4bit/snapshots/main/. It verifies cached files against fixtures/reference_deepseek_v4_flash_4bit.sha256 and redownloads only files that are missing, truncated, or hash-mismatched. A compatibility symlink is created at reference_weights/DeepSeek-V4-Flash-4bit for older commands, but current setup and CI pass the canonical cache directory to transform explicitly. The downloader uses resumable curl requests, prints numbered shard progress with elapsed time, and checks for at least 170 GiB free by default. After a full SHA-256 verification, setup writes .mlxfast-reference-cache.lock next to the checkpoint; later setup runs use cheap size/mtime checks against that lock and skip the full 141 GiB hash pass when the cache is unchanged. Use MLXFAST_REFERENCE_CACHE_DIR=/Volumes/ssd/hf-cache/.../snapshots/main or MLXFAST_REFERENCE_DIR=/Volumes/ssd/DeepSeek-V4-Flash-4bit to point at a larger volume, or MLXFAST_SKIP_WEIGHTS_DOWNLOAD=1 ./setup.sh when the checkpoint will be supplied separately. If you use a custom cache path, either copy the exact transform command printed by setup.sh or set MLXFAST_REFERENCE_DIR before running transform or benchmark.sh. The Swift CLI also honors MLXFAST_REFERENCE_DIR, MLXFAST_WEIGHTS_PATH, MLXFAST_CORRECTNESS_GOLDEN_PATH, and MLXFAST_SCORE_PATH as defaults; explicit CLI flags take precedence. For benchmark.sh, use those MLXFAST_* environment variables for path overrides; pass --weights, --golden, and --score-path only to .build/release/mlxfast-swift benchmark directly. Set MLXFAST_REFERENCE_BASE_URL to use another HTTP checkpoint prefix, including Hugging Face. Run ./setup.sh --help for the full local setup knobs.

For manual GitHub Actions benchmark runs, dispatch benchmark.yml on the trusted repository workflow. The workflow uses the protected MLXFAST_REFERENCE_BASE_URL secret when present, otherwise the fixed Darkbloom reference mirror. It downloads the reference checkpoint into the same repo-local Hugging Face-style cache path used by local setup, passes that path explicitly to the offline transform, then prepares the correctness golden after transform completes. Correctness-only workflow runs use the checked-in public correctness_prompts/public_longcopy_gate_english_512_256.json fixture. Full benchmark runs require the precomputed hidden R2 object correctness_prompts/golden_prompt_benchmark_transcription_gate_english_512_256.json. Full benchmark runs also require the private R2 correctness_prompts/gpqa_reference_cases.json object. The workflow tokenizes 5 token-budget-valid hidden GPQA multiple-choice prompts locally and attaches them as short-answer behavior gates before correctness runs. Each private GPQA case must include precomputed reference accepted_token_sequences or accepted_responses; GPQA answer keys are metadata, not an exact-token oracle, and the benchmark workflow never regenerates this reference object. The official workflow checks the first generated GPQA answer token for each case, using the stable prefix of any longer precomputed reference sequence. During that hidden behavior correctness pass, it also records TTFT by timing prompt prefill through the first greedy answer token. The uploaded score records only aggregate TTFT counts and timings; generated first-token IDs, accepted token IDs, prompts, and answers stay out of GitHub logs and artifacts. The same correctness pass captures short hidden GPQA continuations and sends only those private answer bundles to Claude for a semantic pass/fail judge. This requires the ORG_ANTHROPIC_API_KEY repository secret. The score artifact records only aggregate semantic counts and the judge model name; prompts, references, candidate answers, and judge text stay in the private runner directory. The private workflow treats semantic GPQA as a hard gate with a 3/5 threshold, calibrated to the unmodified DeepSeek V4 Flash baseline rather than to better-than-baseline GPQA answer quality. When dispatched with run_benchmark=true, the workflow fans out across 5 independent machines instead of running everything on one: three correctness-slice-N jobs each check one range_N-defined slice of the hidden base correctness case, benchmark-timing runs only the timed prefill/decode measurement, and benchmark-gates runs only the anchor/ free-run/behavior/GPQA gates. A cheap validate-slice-ranges job checks the three ranges before any of those machines start. A combine job (no checkpoint download, no private secrets) then downloads all 5 results and ANDs every machine's real verdict into the final score. Dispatching with the default run_benchmark=false instead runs the same checks serially on one machine. If none of those is configured, a full benchmark fails; it will not use a committed prompt, committed golden, or Actions cache fallback for ranked scoring. Final hidden goldens should come from protected storage. Private endpoints can pass headers through MLXFAST_REFERENCE_AUTH_HEADER and MLXFAST_CORRECTNESS_GOLDEN_AUTH_HEADER repository secrets. Private R2 golden downloads use the R2_ACCESS_KEY_ID, R2_BUCKET_ENDPOINT, and R2_SECRET_ACCESS_KEY secrets. See docs/private-benchmark-security.md for the private prompt and artifact handling model.

Why this challenge exists

DeepSeek V4 Flash has 256 routed experts per layer, 6 activated per token. The checkpoint is too large to keep fully resident on typical Apple Silicon machines. The baseline ships with SSD streaming: expert tensors stay on disk and only the routed tensors needed for the current forward pass are materialized.

That baseline is functional but naive. Expert reads block the forward pass, there is no prefetching, no cross-layer reuse, and the weights are stored in their original 4-bit form. Every one of these is an optimisation target. The generated weights/ tree is expected to stay small: it is a runtime artifact overlay on top of the frozen reference checkpoint, not a second full model copy. Submissions may change both the Swift transform and Swift runtime to adjust metadata, caching, or streaming strategy, as long as the generated runnable artifacts pass the hidden correctness and benchmark checks.

The modifiable surface

Unlike typical inference benchmarks, the entire model execution pipeline is in scope. Submissions should focus on the Swift targets listed in benchmark.json:

Path What it controls
Sources/MLXFastModel/ DeepSeek V4 Flash runtime, MLX Swift array bridge, dense/expert loading, SSD streaming, decode/prefill logic. Primary target.
Sources/MLXFastTransform/ Offline weight transform from frozen reference safetensors into benchmark-ready weights/.

The repository is Swift-only: setup, transform, correctness, and benchmark all run through the Swift package. Correctness, scoring, timing, and provenance are trusted harness code outside editablePaths; only the model and transform targets are contestant-editable.

Submissions are made with the Yukon CLI (mlxfast), a separate tool that manages your account and uploads across all Yukon benchmarks. The mlxfast-swift binary runs the benchmark domain only (transform, correctness, benchmark, preflight, verify-transform) and no longer logs in or uploads.

mlxfast login <api-key> --api https://yukon-api.fly.dev
mlxfast clone <benchmark-id-or-name>     # fresh checkout; an existing repo auto-links by its git remote
mlxfast submit --model "Claude Opus 4.8" \
  --note "Changed expert streaming prefetch policy."
mlxfast submissions

mlxfast submit reads benchmark.json and uploads only the paths listed in editablePaths as submission.tar.gz, POSTed to Yukon with Authorization: Bearer <api-key> and an idempotency key. Generated weights/, reference checkpoints, golden files, and local scores live outside editablePaths and are never uploaded; the backend re-enforces the editable surface server-side after upload. --model is required and is recorded for the leaderboard. MLXFAST_API_URL / MLXFAST_API_TOKEN (or the YUKON_* equivalents) configure the endpoint and token for scripted runs. Before uploading, Yukon runs the contract preSubmitCommand, which is ./benchmark.sh --local-submit for this benchmark. That local-submit pass is the local submit gate: it uses the public/local oracle, writes and prints score.json, and stops obviously broken or slower changes before they spend official runner time.

Local Commands

Use these two benchmark modes for local development:

Command Purpose What it checks Output
./benchmark.sh --local-iterate Fast edit-loop signal, usually under 2 minutes after setup. Public 512-token prompt, standalone prefill next-token check, decode seed-prefill check, and 16 teacher-forced decode checks. score.local-iterate.json with score: null.
./benchmark.sh --local-submit Yukon pre-submit gate, intended to be about 10 minutes after setup. Same public prompt, standalone prefill next-token check, decode seed-prefill check, and 1023 teacher-forced decode checks from a longer public fixture. score.json with score: null.

Neither local mode produces an official leaderboard score. Official ranking still runs the hidden benchmark oracle and hidden correctness gates on the trusted runner.

Scoring

decode_speedup = baseline_decode_sec_per_token / decode_sec_per_token
prefill_speedup = baseline_prefill_sec_per_token / prefill_sec_per_token
score = decode_speedup^0.75 * prefill_speedup^0.25

Higher score is better. A baseline implementation on the official runner scores about 1.0; improvements should move the score upward. Decode is weighted more heavily because it dominates interactive generation, while prefill still matters for prompt processing. Both phases must also stay within 5% of the official baseline:

decode_speedup >= 0.95
prefill_speedup >= 0.95

The floor prevents a submission from sacrificing one serving phase badly to improve the other. The exact baseline timings are emitted in each score.json and kept in MLXFastConstants after trusted Blacksmith calibration. On the current Blacksmith M4 baseline, that means decode must be at most 3.8280590145312505 seconds/token and prefill must be at most 0.18242698079358555 seconds/token. For scoring, decode is trusted parent wall-clock time for decode setup plus the checked decode-token window, not worker-reported per-step time. That charges prompt-specific seed prefill to the decode phase so submitted model code cannot hide speculative decode work before the timer starts. The harness records bandwidth_source=trusted_core_expert_slot_bank_reads and derives bandwidth_gb_per_token from trusted-core expert slot-bank file bytes during the decode window. Bandwidth, RAM, and expert-read metrics are reported for operator review and future guardrails; they are not primary score factors. Correctness is a hard gate. See TASK.md for the full correctness specification. The official run times the benchmark before correctness so the correctness gate cannot warm the measured model path. It then checks 64 public correctness positions plus the hidden GPQA behavior checks. Public local correctness uses the checked-in correctness fixture. When a local edit-loop signal is enough, --local-iterate uses that public 512-token prompt, times standalone prefill separately, then times decode including seed prefill plus 16 teacher-forced decode tokens, writes score.local-iterate.json, prints it, and leaves score null because it is not a ranked benchmark score. The submit hook --local-submit uses the same public prompt with a longer 1024-token fixture: it times the same standalone prefill and decode-seed phases plus 1023 teacher-forced decode tokens in one continuous trajectory, writes score.json, and also leaves score null because official ranking still requires the hidden benchmark oracle on the trusted runner. The score payload includes the official baseline timings, computed speedups, wall-clock phase timings, final process RSS, expert streaming counters, and transformed-weights digest.

Architecture

Sources/
  MLXFastCLI/                Swift command-line entrypoint
  MLXFastCore/               score.json, golden cases, shared contracts
  MLXFastTransform/          Swift offline weight transform
  MLXFastModel/              editable DeepSeek V4 Flash Swift runtime
  MLXFastHarness/            trusted correctness, golden, and benchmark runner
weights/                     transformed weights (harness loads from here)
  experts/
    manifest.json            baseline byte ranges for streamed expert tensors
.cache/huggingface/hub/...   canonical frozen 4-bit reference checkpoint cache
reference_weights/...        compatibility symlink to the reference cache
correctness_prompts/         public correctness prompt and checked-in golden
correctness_golden.json      hidden benchmark correctness cases and token oracle
score.json                   written after each benchmark run

The baseline runtime loads dense/shared tensors from weights/ and streams routed expert tensors from the frozen reference checkpoint named by MLXFAST_REFERENCE_DIR, falling back to the compatibility symlink when that environment variable is not set.

The standard preflight/benchmark path enforces a default 25 GiB cap on the generated weights/ tree before correctness or timing runs. Change it with MLXFAST_MAX_WEIGHTS_BYTES; use 0, none, or unlimited only for organizer debugging. For stricter organizer-side provenance, set MLXFAST_VERIFY_TRANSFORM=1 when running benchmark.sh. That re-runs the submitted Swift transform into a clean temporary directory and fails unless weights/ is byte-equal to that fresh run. This checks determinism and stale files; it does not require the baseline weights/ layout. verify-transform uses the same default cap and can also be changed with mlxfast-swift verify-transform --max-bytes N.

The public correctness-only prompt and golden live in correctness_prompts/. Private prompt manifests and hidden benchmark golden files are not committed or generated by the benchmark workflow. In private benchmark CI, the normal path downloads the precomputed correctness_prompts/golden_prompt_benchmark_transcription_gate_english_512_256.json object from R2, then downloads correctness_prompts/gpqa_reference_cases.json and merges it into the local golden as 5 hidden GPQA behavior checks. Generate final hidden benchmark goldens outside the public repository and upload the resulting file to the protected private R2 path. The benchmark workflow stores its local golden copy under $RUNNER_TEMP, not the repository workspace, and uploads only hash and byte-count sidecars. The semantic GPQA answer and judge result files are also kept under the private runner directory and are not uploaded.

The Swift make-golden generator has been removed from the public harness so CI only consumes precomputed fixtures. The last commit on this branch containing that generator is bcc9438fabf95a9b371d5749dd64f2f5ccc60fd5.

Each base correctness prompt must contain exactly 512 token IDs. The benchmark prompt must contain at least 512 token IDs. The precomputed golden file stores exact expected tokens for each 512-token correctness prompt continuation, the 512-token prefill check, the 512-token decode seed, and at least 128 tokens for the timed decode window. During correctness, the harness checks the first 64 public continuation positions by default, plus hidden behavior gates in official benchmark runs. It checks those continuation positions teacher-forced: after each accepted step it feeds the golden previous token back into the model. This keeps the gate stable across Apple GPU/software differences by preventing one earlier mismatch from cascading into unrelated later-token failures. A token is accepted only when it matches the expected token, except for a true top-logit tie within the tiny 1e-6 logit tolerance used by the harness.

Private fixtures can also include a correctness_gates object with hidden anchor logits, short free-run prefixes, and answer-token behavior checks. Those gates are additive: public local correctness still works with the checked-in fixture, while official benchmark fixtures can cover more adversarial behavior without exposing prompt or answer data. Behavior checks compare accepted answer prefixes against up to max_new_tokens generated tokens, which lets hidden GPQA questions require only a one-letter answer while tolerating tokenizer whitespace variants.

Requirements

  • Apple Silicon Mac, 24 GB+ unified memory (M2 or newer)
  • macOS Sequoia or later
  • Swift 6 through Xcode or Xcode Command Line Tools
  • Xcode Metal Toolchain for mlx.metallib; ./setup.sh tries xcodebuild -downloadComponent MetalToolchain, but users with only Command Line Tools may need full Xcode installed, opened once, and licensed with sudo xcodebuild -license accept
  • CMake, installed by ./setup.sh via Homebrew when missing and used by tools/build-mlx-metallib.sh to build mlx.metallib

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors