A benchmark arena for memory-bandwidth-optimal LLM inference on Apple Silicon. Run DeepSeek V4 Flash without loading all 256 experts into RAM — and beat the baseline score.
See TASK.md for the full problem statement, scoring formula, and approach space.
```bash
# Check local tools, build the Swift harness/MLX metallib, and fetch weights if needed
./setup.sh

# Optional: split dense weights into weights/ and write the expert streaming
# manifest. setup.sh prints this command with the exact reference path it used.
MLXFAST_OFFLINE_WRITABLE_PATHS="${PWD}/weights" \
  .github/scripts/run-offline.sh .build/release/mlxfast-swift transform \
  --reference .cache/huggingface/hub/models--mlx-community--DeepSeek-V4-Flash-4bit/snapshots/main \
  --output weights

# Run the checked-in public correctness gate.
.build/release/mlxfast-swift correctness --weights weights

# Fast edit-loop signal: uses the public 512-token correctness prompt, checks
# the prefill next token plus 16 teacher-forced decode tokens, writes
# score.local-iterate.json, and prints it to stdout.
./benchmark.sh --local-iterate

# Run the Darkbloom-compatible benchmark entrypoint.
# Official benchmark runs use the organizer-supplied hidden correctness_golden.json.
./benchmark.sh

# Local submit check used by Yukon before upload: runs the public 512-token
# prompt through a longer checked timing window, writes score.json with
# score: null, and prints it to stdout.
./benchmark.sh --local-submit

# Or call the Swift CLI directly
.build/release/mlxfast-swift correctness --weights weights
.build/release/mlxfast-swift preflight
.build/release/mlxfast-swift benchmark --local-iterate
.build/release/mlxfast-swift benchmark --score-path score.json
.build/release/mlxfast-swift benchmark --local-submit --score-path score.json

# If required model artifacts are missing, the benchmark emits a valid failed
# score.json instead of a ranked score.
```

The benchmark writes score.json in the format consumed by Darkbloom.
score.json is a generated local output and is not tracked. Public
correctness-only workflow runs use the checked-in
correctness_prompts/public_longcopy_gate_english_512_256.json golden and
matching prompt text. Official benchmark runs use a hidden
correctness_golden.json supplied by the benchmark operator, or a harness path
set with MLXFAST_CORRECTNESS_GOLDEN_PATH=/path/to/correctness_golden.json.
benchmark.sh also writes score.json.sha256 and benchmark-integrity.json,
which record the score file hash, golden hash, transformed weights/ hash, and
transform source hash for run auditing.
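
As an illustration of how the hash sidecar could be used for a local audit, here is a minimal Python sketch that recomputes a file's SHA-256 and compares it with the sidecar. It assumes the sidecar begins with the hex digest (the layout produced by shasum -a 256); the actual score.json.sha256 layout may differ, and sidecar_matches is a hypothetical helper, not part of the harness.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_matches(score_path: Path, sidecar_path: Path) -> bool:
    """Compare a file's digest with the first token of its sidecar.

    Assumption: the sidecar starts with the hex digest, as in the output of
    `shasum -a 256`. The real sidecar layout may differ.
    """
    tokens = sidecar_path.read_text().split()
    if not tokens:
        return False
    return tokens[0].lower() == sha256_of_file(score_path)
```

Calling sidecar_matches(Path("score.json"), Path("score.json.sha256")) returns False as soon as the score file is edited after the run.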
Full model setup needs a large local or mounted SSD. The reference checkpoint is
mlx-community/DeepSeek-V4-Flash-4bit, with 33 safetensors shards totaling about
141 GiB. setup.sh downloads the checkpoint from the fast Darkbloom/R2 mirror by
default into a repo-local Hugging Face-style cache under
.cache/huggingface/hub/models--mlx-community--DeepSeek-V4-Flash-4bit/snapshots/main/.
It verifies cached files against fixtures/reference_deepseek_v4_flash_4bit.sha256
and redownloads only files that are missing, truncated, or hash-mismatched. A
compatibility symlink is created at reference_weights/DeepSeek-V4-Flash-4bit
for older commands, but current setup and CI pass the canonical cache directory
to transform explicitly. The downloader uses resumable curl requests, prints
numbered shard progress with elapsed time, and checks for at least 170 GiB free
by default. After a full SHA-256 verification, setup writes
.mlxfast-reference-cache.lock next to the checkpoint; later setup runs use
cheap size/mtime checks against that lock and skip the full 141 GiB hash pass
when the cache is unchanged. Use
MLXFAST_REFERENCE_CACHE_DIR=/Volumes/ssd/hf-cache/.../snapshots/main or
MLXFAST_REFERENCE_DIR=/Volumes/ssd/DeepSeek-V4-Flash-4bit to point at a larger
volume, or MLXFAST_SKIP_WEIGHTS_DOWNLOAD=1 ./setup.sh when the checkpoint will
be supplied separately. If you use a custom cache path, either copy the exact
transform command printed by setup.sh or set MLXFAST_REFERENCE_DIR before
running transform or benchmark.sh. The Swift CLI also honors
MLXFAST_REFERENCE_DIR, MLXFAST_WEIGHTS_PATH,
MLXFAST_CORRECTNESS_GOLDEN_PATH, and MLXFAST_SCORE_PATH as defaults;
explicit CLI flags take precedence. For benchmark.sh, use those MLXFAST_*
environment variables for path overrides; pass --weights, --golden, and
--score-path only to .build/release/mlxfast-swift benchmark directly. Set
MLXFAST_REFERENCE_BASE_URL to use
another HTTP checkpoint prefix, including Hugging Face. Run ./setup.sh --help
for the full local setup knobs.
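
The precedence rule above (explicit CLI flag first, then the MLXFAST_* environment variable, then a built-in default) can be sketched in a few lines of Python. This is an illustration only: resolve_path is a hypothetical helper, the default values are placeholders, and treating an empty variable as unset is an assumption about the Swift CLI.

```python
import os
from typing import Mapping, Optional


def resolve_path(flag_value: Optional[str], env_name: str, default: str,
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a path: explicit CLI flag, then environment variable, then default."""
    env = os.environ if env is None else env
    if flag_value:
        return flag_value  # an explicit flag always wins
    env_value = env.get(env_name, "")
    if env_value:
        return env_value  # assumption: an empty variable counts as unset
    return default


example_env = {"MLXFAST_WEIGHTS_PATH": "/Volumes/ssd/weights"}
from_env = resolve_path(None, "MLXFAST_WEIGHTS_PATH", "weights", example_env)
from_flag = resolve_path("./w", "MLXFAST_WEIGHTS_PATH", "weights", example_env)
```

Here from_env resolves to the environment value and from_flag to the flag value, which matches the rule that benchmark.sh overrides go through environment variables while direct CLI calls can use flags.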
For manual GitHub Actions benchmark runs, dispatch benchmark.yml on the
trusted repository workflow. The workflow uses the protected
MLXFAST_REFERENCE_BASE_URL secret when present, otherwise the fixed Darkbloom
reference mirror. It downloads the reference checkpoint into the same
repo-local Hugging Face-style cache path used by local setup, passes that path
explicitly to the offline transform, then prepares the correctness golden after
transform completes. Correctness-only workflow runs use the checked-in public
correctness_prompts/public_longcopy_gate_english_512_256.json fixture. Full
benchmark runs require the precomputed hidden R2 object
correctness_prompts/golden_prompt_benchmark_transcription_gate_english_512_256.json.
Full benchmark runs also require the private R2
correctness_prompts/gpqa_reference_cases.json object. The workflow tokenizes
5 token-budget-valid hidden GPQA multiple-choice prompts locally and attaches
them as short-answer behavior gates before correctness runs. Each private GPQA
case must include precomputed reference accepted_token_sequences or
accepted_responses; GPQA answer keys are metadata, not an exact-token oracle,
and the benchmark workflow never regenerates this reference object.
The official workflow checks the first generated GPQA answer token for each
case, using the stable prefix of any longer precomputed reference sequence.
During that hidden behavior correctness pass, it also records TTFT by timing
prompt prefill through the first greedy answer token. The uploaded score records
only aggregate TTFT counts and timings; generated first-token IDs, accepted
token IDs, prompts, and answers stay out of GitHub logs and artifacts.
The same correctness pass captures short hidden GPQA continuations and sends
only those private answer bundles to Claude for a semantic pass/fail judge. This
requires the ORG_ANTHROPIC_API_KEY repository secret. The score artifact records
only aggregate semantic counts and the judge model name; prompts, references,
candidate answers, and judge text stay in the private runner directory.
The private workflow treats semantic GPQA as a hard gate with a 3/5 threshold,
calibrated to the unmodified DeepSeek V4 Flash baseline rather than to
better-than-baseline GPQA answer quality.
When dispatched with run_benchmark=true, the workflow fans out across 5
independent machines instead of running everything on one: three
correctness-slice-N jobs each check one range_N-defined slice of the
hidden base correctness case, benchmark-timing runs only the timed
prefill/decode measurement, and benchmark-gates runs only the anchor/
free-run/behavior/GPQA gates. A cheap validate-slice-ranges job checks the
three ranges before any of those machines start. A combine job (no
checkpoint download, no private secrets) then downloads all 5 results and ANDs
every machine's real verdict into the final score. Dispatching with the
default run_benchmark=false instead runs the same checks serially on one
machine.
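
The combine step amounts to a logical AND over the five per-machine verdicts. A minimal Python sketch follows; the job names come from the description above, but the slice numbering (1 to 3), the name-to-bool result shape, and the choice to treat a missing result as a failure are assumptions, not the workflow's actual data format.

```python
from typing import Mapping

# Hypothetical job identifiers; the real workflow may number slices differently.
EXPECTED_JOBS = (
    "correctness-slice-1",
    "correctness-slice-2",
    "correctness-slice-3",
    "benchmark-timing",
    "benchmark-gates",
)


def combine_verdicts(results: Mapping[str, bool]) -> bool:
    """AND all expected verdicts; a missing or non-True result fails the run."""
    return all(results.get(job) is True for job in EXPECTED_JOBS)
```

Failing closed on a missing result means a machine that never reports cannot be mistaken for a pass.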
If no hidden golden source is configured, a full benchmark fails; it will not use a
committed prompt, committed golden, or Actions cache fallback for ranked
scoring. Final hidden goldens should come from protected storage. Private
endpoints can pass headers through
MLXFAST_REFERENCE_AUTH_HEADER and MLXFAST_CORRECTNESS_GOLDEN_AUTH_HEADER
repository secrets. Private R2 golden downloads use the R2_ACCESS_KEY_ID,
R2_BUCKET_ENDPOINT, and R2_SECRET_ACCESS_KEY secrets. See
docs/private-benchmark-security.md for
the private prompt and artifact handling model.
DeepSeek V4 Flash has 256 routed experts per layer, 6 activated per token. The checkpoint is too large to keep fully resident on typical Apple Silicon machines. The baseline ships with SSD streaming: expert tensors stay on disk and only the routed tensors needed for the current forward pass are materialized.
That baseline is functional but naive. Expert reads block the forward pass,
there is no prefetching, no cross-layer reuse, and the weights are stored in
their original 4-bit form. Every one of these is an optimisation target.
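
To make one of these targets concrete, the sketch below shows a small least-recently-used cache keyed by (layer, expert), written in Python for brevity. It is not the baseline's design and not a recommended solution: ExpertCache and its load callback are hypothetical, and the example only counts how many reads a given capacity would avoid for a given routing trace.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """Tiny LRU cache keyed by (layer, expert) that counts avoided reads."""

    def __init__(self, capacity: int, load: Callable[[Hashable], bytes]):
        self.capacity = capacity
        self.load = load  # hypothetical loader, standing in for an SSD read
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> bytes:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = self.load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


cache = ExpertCache(capacity=2, load=lambda key: b"expert-bytes")
for key in [(0, 1), (0, 2), (0, 1), (0, 3), (0, 2)]:
    cache.get(key)
```

With capacity 2, this trace produces one hit and four misses. Replaying a real routing trace through a structure like this is one way to estimate whether reuse is worth the resident memory before changing the Swift runtime.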
The generated weights/ tree is expected to stay small: it is a runtime
artifact overlay on top of the frozen reference checkpoint, not a second full
model copy. Submissions may change both the Swift transform and Swift runtime
to adjust metadata, caching, or streaming strategy, as long as the generated
runnable artifacts pass the hidden correctness and benchmark checks.
Unlike typical inference benchmarks, the entire model execution pipeline is
in scope. Submissions should focus on the Swift targets listed in
benchmark.json:
| Path | What it controls |
|---|---|
| Sources/MLXFastModel/ | DeepSeek V4 Flash runtime, MLX Swift array bridge, dense/expert loading, SSD streaming, decode/prefill logic. Primary target. |
| Sources/MLXFastTransform/ | Offline weight transform from frozen reference safetensors into benchmark-ready weights/. |
The repository is Swift-only: setup, transform, correctness, and benchmark all
run through the Swift package. Correctness, scoring, timing, and provenance are
trusted harness code outside editablePaths; only the model and transform
targets are contestant-editable.
Submissions are made with the Yukon CLI (mlxfast), a separate tool that
manages your account and uploads across all Yukon benchmarks. The
mlxfast-swift binary runs the benchmark domain only (transform, correctness,
benchmark, preflight, verify-transform) and no longer logs in or uploads.
```bash
mlxfast login <api-key> --api https://yukon-api.fly.dev
mlxfast clone <benchmark-id-or-name>  # fresh checkout; an existing repo auto-links by its git remote
mlxfast submit --model "Claude Opus 4.8" \
  --note "Changed expert streaming prefetch policy."
mlxfast submissions
```

mlxfast submit reads benchmark.json and uploads only the paths listed in
editablePaths as submission.tar.gz, POSTed to Yukon with
Authorization: Bearer <api-key> and an idempotency key. Generated weights/,
reference checkpoints, golden files, and local scores live outside
editablePaths and are never uploaded; the backend re-enforces the editable
surface server-side after upload. --model is required and is recorded for the
leaderboard. MLXFAST_API_URL / MLXFAST_API_TOKEN (or the YUKON_*
equivalents) configure the endpoint and token for scripted runs.
Before uploading, Yukon runs the contract preSubmitCommand, which is
./benchmark.sh --local-submit for this benchmark. That pass is the local submit
gate: it uses the public/local oracle, writes and prints
score.json, and stops obviously broken or slower changes before they spend
official runner time.
Use these two benchmark modes for local development:
| Command | Purpose | What it checks | Output |
|---|---|---|---|
| ./benchmark.sh --local-iterate | Fast edit-loop signal, usually under 2 minutes after setup. | Public 512-token prompt, standalone prefill next-token check, decode seed-prefill check, and 16 teacher-forced decode checks. | score.local-iterate.json with score: null. |
| ./benchmark.sh --local-submit | Yukon pre-submit gate, intended to be about 10 minutes after setup. | Same public prompt, standalone prefill next-token check, decode seed-prefill check, and 1023 teacher-forced decode checks from a longer public fixture. | score.json with score: null. |

Neither local mode produces an official leaderboard score. Official ranking still runs the hidden benchmark oracle and hidden correctness gates on the trusted runner.

```
decode_speedup  = baseline_decode_sec_per_token / decode_sec_per_token
prefill_speedup = baseline_prefill_sec_per_token / prefill_sec_per_token
score           = decode_speedup^0.75 * prefill_speedup^0.25
```

Higher score is better. A baseline implementation on the official runner scores
about 1.0; improvements should move the score upward. Decode is weighted more
heavily because it dominates interactive generation, while prefill still matters
for prompt processing.
Both phases must also stay within 5% of the official baseline:

```
decode_speedup  >= 0.95
prefill_speedup >= 0.95
```

The floor prevents a submission from sacrificing one serving phase badly to
improve the other. The exact baseline timings are emitted in each score.json
and kept in MLXFastConstants after trusted Blacksmith calibration.
On the current Blacksmith M4 baseline, that means decode must be at most
3.8280590145312505 seconds/token and prefill must be at most
0.18242698079358555 seconds/token.
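
As a worked example of the formula and the floor, the Python sketch below computes both from seconds-per-token timings. The timings in the usage lines are made-up round numbers, not the Blacksmith baseline, and score_run is a hypothetical helper. How the harness reports a floor violation is not covered here, so the sketch only returns a flag.

```python
from typing import Tuple

FLOOR = 0.95  # minimum allowed speedup for each phase


def score_run(baseline_decode: float, decode: float,
              baseline_prefill: float, prefill: float) -> Tuple[float, bool]:
    """Return (score, meets_floor) from seconds-per-token timings."""
    decode_speedup = baseline_decode / decode
    prefill_speedup = baseline_prefill / prefill
    score = decode_speedup ** 0.75 * prefill_speedup ** 0.25
    meets_floor = decode_speedup >= FLOOR and prefill_speedup >= FLOOR
    return score, meets_floor


# Illustrative timings only: decode twice as fast, prefill unchanged.
faster_decode = score_run(1.0, 0.5, 0.1, 0.1)
# Decode twice as fast, but prefill 10% slower than baseline.
slower_prefill = score_run(1.0, 0.5, 0.1, 0.11)
```

Doubling decode speed with unchanged prefill gives a score of 2^0.75, roughly 1.68. The second case still computes a score above 1.0, but its prefill speedup is about 0.91, so it violates the 0.95 floor.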
For scoring, decode is trusted parent wall-clock time for decode setup plus the
checked decode-token window, not worker-reported per-step time. That charges
prompt-specific seed prefill to the decode phase so submitted model code cannot
hide speculative decode work before the timer starts.
The harness records bandwidth_source=trusted_core_expert_slot_bank_reads and
derives bandwidth_gb_per_token from trusted-core expert slot-bank file bytes
during the decode window. Bandwidth, RAM, and expert-read metrics are reported
for operator review and future guardrails; they are not primary score factors.
Correctness is a hard gate. See TASK.md for the full correctness specification.
The official run times the benchmark before correctness so the correctness gate
cannot warm the measured model path. It then checks 64 public correctness
positions plus the hidden GPQA behavior checks.
Public local correctness uses the checked-in correctness fixture. When a local
edit-loop signal is enough, --local-iterate uses that public 512-token prompt,
times standalone prefill separately, then times decode including seed prefill
plus 16 teacher-forced decode tokens, writes score.local-iterate.json, prints
it, and leaves score null because it is not a ranked benchmark score. The
submit hook --local-submit uses the same public prompt with a longer 1024-token
fixture: it times the same standalone prefill and decode-seed phases plus 1023
teacher-forced decode tokens in one continuous trajectory, writes
score.json, and also leaves score null because official ranking still
requires the hidden benchmark oracle on the trusted runner.
The score payload includes the official baseline timings, computed speedups,
wall-clock phase timings, final process RSS, expert streaming counters, and
transformed-weights digest.

```
Sources/
  MLXFastCLI/                  Swift command-line entrypoint
  MLXFastCore/                 score.json, golden cases, shared contracts
  MLXFastTransform/            Swift offline weight transform
  MLXFastModel/                editable DeepSeek V4 Flash Swift runtime
  MLXFastHarness/              trusted correctness, golden, and benchmark runner
weights/                       transformed weights (harness loads from here)
  experts/
    manifest.json              baseline byte ranges for streamed expert tensors
.cache/huggingface/hub/...     canonical frozen 4-bit reference checkpoint cache
reference_weights/...          compatibility symlink to the reference cache
correctness_prompts/           public correctness prompt and checked-in golden
correctness_golden.json        hidden benchmark correctness cases and token oracle
score.json                     written after each benchmark run
```

The baseline runtime loads dense/shared tensors from weights/ and streams
routed expert tensors from the frozen reference checkpoint named by
MLXFAST_REFERENCE_DIR, falling back to the compatibility symlink when that
environment variable is not set.
The standard preflight/benchmark path enforces a default 25 GiB cap on the
generated weights/ tree before correctness or timing runs. Change it with
MLXFAST_MAX_WEIGHTS_BYTES; use 0, none, or unlimited only for organizer
debugging. For stricter organizer-side provenance, set
MLXFAST_VERIFY_TRANSFORM=1 when running benchmark.sh. That re-runs the
submitted Swift transform into a clean temporary directory and fails unless
weights/ is byte-equal to that fresh run. This checks determinism and stale
files; it does not require the baseline weights/ layout. verify-transform
uses the same default cap, which can also be changed with
mlxfast-swift verify-transform --max-bytes N.
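
The size cap described above can be approximated locally before running the benchmark. The Python sketch below parses a cap value the way the text describes (a byte count, or 0, none, or unlimited to disable) and sums file sizes under a directory. It is an assumption that the harness counts bytes this way; symlink handling, case-insensitive keywords, and the helper names are all hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_CAP_BYTES = 25 * 1024 ** 3  # 25 GiB default quoted in the text


def parse_cap(raw: Optional[str]) -> Optional[int]:
    """Return the cap in bytes, or None when the cap is disabled."""
    if raw is None or raw.strip() == "":
        return DEFAULT_CAP_BYTES
    value = raw.strip().lower()
    if value in ("0", "none", "unlimited"):
        return None
    return int(value)  # assumption: a plain integer byte count


def tree_size(root: Path) -> int:
    """Sum regular-file sizes under root, skipping symlinks."""
    total = 0
    for directory, _, files in os.walk(root):
        for name in files:
            path = Path(directory) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def within_cap(root: Path, raw_cap: Optional[str]) -> bool:
    """True when the tree fits under the cap or the cap is disabled."""
    cap = parse_cap(raw_cap)
    return cap is None or tree_size(root) <= cap
```

For example, within_cap(Path("weights"), os.environ.get("MLXFAST_MAX_WEIGHTS_BYTES")) gives a quick local signal before preflight runs.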
The public correctness-only prompt and golden live in correctness_prompts/.
Private prompt manifests and hidden benchmark golden files are not committed or
generated by the benchmark workflow. In private benchmark CI, the normal path
downloads the precomputed
correctness_prompts/golden_prompt_benchmark_transcription_gate_english_512_256.json
object from R2, then downloads
correctness_prompts/gpqa_reference_cases.json and merges it into the local
golden as 5 hidden GPQA behavior checks. Generate
final hidden benchmark goldens outside the public repository and upload the
resulting file to the protected private R2 path. The benchmark workflow stores
its local golden copy under $RUNNER_TEMP, not the repository workspace, and
uploads only hash and byte-count sidecars. The semantic GPQA answer and judge
result files are also kept under the private runner directory and are not
uploaded.
The Swift make-golden generator has been removed from the public harness so CI
only consumes precomputed fixtures. The last commit on this branch containing
that generator is bcc9438fabf95a9b371d5749dd64f2f5ccc60fd5.
Each base correctness prompt must contain exactly 512 token IDs. The benchmark
prompt must contain at least 512 token IDs. The precomputed golden file stores
exact expected tokens for each 512-token correctness prompt continuation, the
512-token prefill check, the 512-token decode seed, and at least 128 tokens for
the timed decode window. During correctness, the harness checks the first 64
public continuation positions by default, plus hidden
behavior gates in official benchmark runs. It checks those continuation
positions teacher-forced: after each accepted step it feeds the golden token
for that step, rather than the model's own output, back into the model. This keeps the gate stable across
Apple GPU/software differences by preventing one earlier mismatch from
cascading into unrelated later-token failures. A token is accepted only when it
matches the expected token, except for a true top-logit tie within the tiny
1e-6 logit tolerance used by the harness.
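
The teacher-forced check can be sketched in Python as follows. This is a reading of the description above, not the harness code: step is a hypothetical callable that feeds one token and returns next-token logits, the tie rule is interpreted as "the expected token's logit is within 1e-6 of the top logit", and continuing past a mismatch to count failures is an assumption.

```python
from typing import Callable, Sequence

TIE_TOLERANCE = 1e-6  # logit tolerance quoted in the text


def token_accepted(logits: Sequence[float], expected: int) -> bool:
    """Accept an exact argmax match or a top-logit tie within tolerance."""
    top = max(logits)
    return top - logits[expected] <= TIE_TOLERANCE


def teacher_forced_check(step: Callable[[int], Sequence[float]],
                         seed_token: int,
                         golden: Sequence[int]) -> int:
    """Return the number of mismatched positions.

    `step(token)` is a hypothetical callable that feeds one token to the
    model and returns the logits for the next token.
    """
    mismatches = 0
    previous = seed_token
    for expected in golden:
        logits = step(previous)
        if not token_accepted(logits, expected):
            mismatches += 1
        previous = expected  # always feed the golden token, never the model's
    return mismatches


def toy_step(token: int) -> Sequence[float]:
    """Toy model over 4 tokens that always predicts (token + 1) % 4."""
    logits = [0.0] * 4
    logits[(token + 1) % 4] = 1.0
    return logits
```

With the toy model, the golden sequence [1, 3, 0, 1] from seed token 0 yields exactly one mismatch (position two), and the later positions still pass because the golden token is fed back each time.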
Private fixtures can also include a correctness_gates object with hidden
anchor logits, short free-run prefixes, and answer-token behavior checks.
Those gates are additive: public local correctness still works with the
checked-in fixture, while official benchmark fixtures can cover more adversarial
behavior without exposing prompt or answer data. Behavior checks compare
accepted answer prefixes against up to max_new_tokens generated tokens, which
lets hidden GPQA questions require only a one-letter answer while tolerating
tokenizer whitespace variants.
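
One way to read the behavior check is as a prefix match between the generated tokens and any accepted sequence, limited to max_new_tokens. The Python sketch below shows that reading; the function name, the token IDs in the usage lines, and the exact matching rule are assumptions, and the private harness may handle longer reference sequences differently.

```python
from typing import Sequence


def behavior_passes(generated: Sequence[int],
                    accepted_token_sequences: Sequence[Sequence[int]],
                    max_new_tokens: int) -> bool:
    """True when the generated window starts with any accepted prefix."""
    window = list(generated[:max_new_tokens])
    for accepted in accepted_token_sequences:
        prefix = list(accepted)
        if prefix and window[:len(prefix)] == prefix:
            return True
    return False


# Hypothetical token IDs for "A" (32) and " A" (319), two whitespace variants.
accepted = [[32], [319]]
with_space = behavior_passes([319, 13], accepted, max_new_tokens=1)
wrong_letter = behavior_passes([50, 13], accepted, max_new_tokens=1)
```

Listing both whitespace variants as accepted sequences is what lets a one-letter answer pass regardless of how the tokenizer splits the leading space.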
- Apple Silicon Mac, 24 GB+ unified memory (M2 or newer)
- macOS Sequoia or later
- Swift 6 through Xcode or Xcode Command Line Tools
- Xcode Metal Toolchain for mlx.metallib; ./setup.sh tries xcodebuild -downloadComponent MetalToolchain, but users with only Command Line Tools may need full Xcode installed, opened once, and licensed with sudo xcodebuild -license accept
- CMake, installed by ./setup.sh via Homebrew when missing and used by tools/build-mlx-metallib.sh to build mlx.metallib