feat: PuLID-Flux identity-injection support by RapidMark · Pull Request #1542 · leejet/stable-diffusion.cpp

RapidMark · 2026-05-22T00:58:25Z

This PR adds support for PuLID-Flux
identity preservation to the Flux denoise loop. Given a single source
portrait, generated images preserve the source person's face across
arbitrary scenes and prompts.

What's included

- src/pulid.hpp: PuLIDPerceiverAttentionCA, the cross-attention
  module mirroring the PyTorch reference at
  ToTheBeginning/PuLID/.../encoders_transformer.py.
  Pure-ggml graph; runs on CPU / CUDA / Vulkan / Metal without
  backend-specific code.
- src/flux.hpp: adds 20 pulid_ca.<i> child blocks to Flux
  (constructed conditionally when params.pulid_enabled is set),
  inserts the cross-attention call between transformer blocks at the
  intervals the PyTorch reference uses (every 2nd double block, every
  4th single block), and threads two new optional parameters
  (pulid_id, pulid_id_weight) through forward, forward_orig,
  forward_chroma_radiance, forward_flux_chroma, compute, and
  build_graph.
- src/stable-diffusion.cpp: loads pulid_*.safetensors via
  model_loader.init_from_file under the existing
  model.diffusion_model. prefix so PuLID-CA tensors bind to the new
  blocks naturally. PuLID-encoder keys (which live in the precompute
  tool, not in C++) are correctly identified as unknown. Adds
  load_pulid_id_embedding() to parse a small .pulidembd binary
  file and wrap its content as a sd::Tensor<float> passed via
  DiffusionParams.
- include/stable-diffusion.h: public API: sd_pulid_params_t
  (per-generation embedding path + weight), pulid_weights_path on
  sd_ctx_params_t, pulid_params on sd_img_gen_params_t.
- examples/common/common.{cpp,h}: three new CLI flags:
  --pulid-weights <path>, --pulid-id-embedding <path>, and
  --pulid-id-weight <float>.
- src/diffusion_model.hpp: extends DiffusionParams to carry the
  new identity embedding + weight; FluxModel::compute forwards both
  through.
- docs/pulid.md: usage, binary format spec, supported PuLID weight
  versions (v0.9.0 / v0.9.1; v1.1 deferred), memory budget notes, and
  a three-way SHA-256 falsification recipe.
- scripts/pulid_extract_id.py: reference precompute tool that
  produces the .pulidembd binary from a source portrait. Lives
  outside the C++ build because identity extraction (insightface +
  EVA-CLIP-L + IDFormer) is a heavy PyTorch stack that would be
  impractical to port to ggml just to run once per source person.
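
The injection schedule described for src/flux.hpp (every 2nd double block, every 4th single block, 20 pulid_ca blocks in total) can be sanity-checked with a short sketch. This is an illustration only, not the PR's C++ code: the block counts (19 double, 38 single) are an assumption taken from the Flux.1 architecture rather than from this PR, and the helper name injection_points and the double-then-single index ordering are hypothetical.

```python
# Hypothetical sketch of the PuLID injection schedule; not the PR's C++ code.
# Assumption: Flux.1 has 19 double-stream and 38 single-stream blocks
# (these counts are not stated in the PR description).
NUM_DOUBLE_BLOCKS = 19
NUM_SINGLE_BLOCKS = 38
DOUBLE_INTERVAL = 2  # "every 2nd double block"
SINGLE_INTERVAL = 4  # "every 4th single block"


def injection_points(num_blocks: int, interval: int) -> list:
    """Return the block indices after which a cross-attention call is inserted."""
    return [i for i in range(num_blocks) if i % interval == 0]


double_points = injection_points(NUM_DOUBLE_BLOCKS, DOUBLE_INTERVAL)
single_points = injection_points(NUM_SINGLE_BLOCKS, SINGLE_INTERVAL)

# Assumption: pulid_ca.<i> indices follow call order, double blocks first.
schedule = [("double", b) for b in double_points]
schedule += [("single", b) for b in single_points]

for ca_index, (kind, block) in enumerate(schedule):
    print(f"pulid_ca.{ca_index} runs after {kind} block {block}")
```

Under these assumptions the schedule has 10 double-block and 10 single-block injection points, which matches the 20 child blocks listed above.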

Why split extraction from injection

PuLID-Flux's identity extractor is a stack of three large PyTorch
components (insightface face detection with ArcFace recognition +
EVA-CLIP-L vision encoder + IDFormer perceiver-resampler). Porting all
three to C++/ggml would add ~5000 lines for code that runs once per
source person and produces a 131 KB output. By making sd.cpp consume a
precomputed binary file, the C++ surface area stays small (~600 lines),
the heavy ML stack only needs to run once per person on any backend
that supports PyTorch, and the C++ side is decoupled from the active
development on insightface / EVA-CLIP / IDFormer.

Binary format

offset 0   : magic "PULIDV01"      (8 bytes ASCII)
offset 8   : num_tokens (uint32 LE)
offset 12  : token_dim (uint32 LE)
offset 16  : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
offset 17  : reserved zeros        (15 bytes; header total = 32)
offset 32  : tokens, row-major LE

Typical size: (32, 2048, fp16) is 32 x 2048 x 2 = 131,072 payload
bytes plus the 32-byte header, about 131 KB.
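
To make the layout concrete, here is a minimal sketch of a writer and reader for the header described above, using only the Python standard library. It is not the code in scripts/pulid_extract_id.py or load_pulid_id_embedding(); the function names pack_pulidembd and parse_pulidembd are hypothetical, and the payload is treated as opaque bytes rather than decoded into floats.

```python
import struct

MAGIC = b"PULIDV01"
# Little-endian: 8-byte magic, uint32 num_tokens, uint32 token_dim,
# uint8 dtype, then 15 reserved zero bytes.
HEADER_FORMAT = "<8sIIB15x"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 32 bytes
BYTES_PER_ELEMENT = {0: 2, 1: 2, 2: 4}  # 0=fp16, 1=bf16, 2=fp32


def pack_pulidembd(num_tokens: int, token_dim: int, dtype: int, tokens: bytes) -> bytes:
    """Build a .pulidembd blob from raw row-major little-endian token bytes."""
    expected = num_tokens * token_dim * BYTES_PER_ELEMENT[dtype]
    if len(tokens) != expected:
        raise ValueError(f"payload is {len(tokens)} bytes, expected {expected}")
    return struct.pack(HEADER_FORMAT, MAGIC, num_tokens, token_dim, dtype) + tokens


def parse_pulidembd(blob: bytes):
    """Validate the header and return (num_tokens, token_dim, dtype, tokens)."""
    if len(blob) < HEADER_SIZE:
        raise ValueError("file is shorter than the 32-byte header")
    magic, num_tokens, token_dim, dtype = struct.unpack_from(HEADER_FORMAT, blob)
    if magic != MAGIC:
        raise ValueError("bad magic, expected PULIDV01")
    if dtype not in BYTES_PER_ELEMENT:
        raise ValueError(f"unknown dtype code {dtype}")
    tokens = blob[HEADER_SIZE:]
    expected = num_tokens * token_dim * BYTES_PER_ELEMENT[dtype]
    if len(tokens) != expected:
        raise ValueError(f"payload is {len(tokens)} bytes, expected {expected}")
    return num_tokens, token_dim, dtype, tokens


# Stand-in payload of zeros with the typical shape (32, 2048, fp16).
blob = pack_pulidembd(32, 2048, 0, bytes(32 * 2048 * 2))
print(len(blob))  # 131104 = 32-byte header + 131072 payload bytes
```

Checking the payload length against num_tokens x token_dim x element size catches truncated files before any tensor is built from them.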

Verification

The three-way SHA-256 falsification recipe in docs/pulid.md
distinguishes "the feature is wired but doesn't do anything" from
"the feature is actively altering the diffusion trajectory":

| Run | Expected hash relation |
|---|---|
| A: no `--pulid-*` flags | baseline |
| B: PuLID flags, `--pulid-id-weight 0.0` | byte-identical to A |
| C: PuLID flags, `--pulid-id-weight 1.0` | differs, preserves source identity |
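
The hash relations in the table can be checked mechanically. The sketch below is an assumption about how one might script it, not the recipe in docs/pulid.md: the helper names sha256_of_file and check_three_way are hypothetical, and the demo hashes stand-in byte strings instead of real output images. Note that a hash comparison only covers the "A == B" and "A != C" relations; whether C preserves the source identity still has to be judged visually.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_three_way(hash_a: str, hash_b: str, hash_c: str) -> dict:
    """Compare the hashes of runs A (no PuLID), B (weight 0.0), C (weight 1.0)."""
    return {
        # B must be byte-identical to A: zero weight is a no-op.
        "zero_weight_matches_baseline": hash_a == hash_b,
        # C must differ from A: the feature actually alters the output.
        "full_weight_differs": hash_a != hash_c,
    }


# Demo with stand-in bytes in place of the three generated images.
baseline = hashlib.sha256(b"image-A").hexdigest()
zero_weight = hashlib.sha256(b"image-A").hexdigest()
full_weight = hashlib.sha256(b"image-C").hexdigest()

result = check_three_way(baseline, zero_weight, full_weight)
print(result)
print("PASS" if all(result.values()) else "FAIL")
```

With real runs, the three hashes would come from sha256_of_file on the output images, generated with the same seed and settings.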

Verified on three backends with the same source code:

- Vulkan-AMD (RX 6700 XT, -DSD_VULKAN=ON): A == B byte-identical,
  A != C, C visually preserves source identity.
- Vulkan-NVIDIA (RTX 3060, same binary, --backend "diffusion=vulkan1"):
  A == B, A != C, C visually equivalent to the AMD output at the same
  seed (different bytes per the usual cross-backend nondeterminism).
- CUDA-NVIDIA (RTX 3060, separate -DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
  build against CUDA 13.2): A == B byte-identical, A != C, C visually
  preserves source identity.

PuLIDPerceiverAttentionCA's pure-ggml graph code runs unchanged across
all three backends -- no backend-specific conditionals were needed.

Per-image sampling times at 512x512 / 4 steps / Flux Schnell Q4 + PuLID:

| Backend | Sampling (s) | Notes |
|---|---:|---|
| AMD 6700 XT (Vulkan) | 22 | 12 GB consumer card |
| NVIDIA 3060 (Vulkan) | 11 | same binary as AMD |
| NVIDIA 3060 (CUDA) | 9.6 | separate `-DSD_CUDA=ON` build |

batch_count=3 was tested separately and shows the amortization expected
from a long-lived worker: per-image sampling drops from 19.6 s (cold) to
~11 s (warm) because the model stays resident across batch iterations.

Tested with Flux Schnell Q4_K_S + PuLID v0.9.1 at 512x512 / 4 steps,
and Flux Dev Q4_K_S + PuLID v0.9.1 at 768x768 / 20 steps. Flux Dev +
PuLID at 1024x1024 runs out of memory on a 12 GB card unless the VAE is
routed to the CPU backend via --backend "vae=cpu" (--vae-on-cpu alone is
not enough, since it only offloads the weights, not the compute
buffer). This is existing stable-diffusion.cpp behavior, not a
PuLID-specific issue, but it is documented in docs/pulid.md because
PuLID users will hit it.

Tested with batch_count > 1 (verified each image gets the same
identity, different composition).

Not yet supported (called out in docs/pulid.md)

- PuLID v1.1 (pulid_v1.1.safetensors) -- has a renamed key layout
  (id_adapter_attn_layers.* vs pulid_ca.*) and potentially a
  different module structure. Follow-up PR.
- Multiple ID images fused into one embedding (the reference Python
  pipeline supports this; the current precompute tool accepts only
  one portrait per run).
- The --true-cfg negative-prompt branch -- PuLID only injects on the
  positive conditioning path in the reference implementation, and this
  PR matches that behavior.

Backward compatibility

Non-PuLID generations are unaffected. The params.pulid_enabled flag
defaults to false and is only set when the model loader sees a
pulid_ca.* tensor in the loaded safetensors file. A regression run
of Flux Schnell Q4 without --pulid-* flags produces byte-identical
output to pre-patch.

File summary

include/stable-diffusion.h          +34 / -0
src/stable-diffusion.cpp           +120 / -0
src/diffusion_model.hpp              +5 / -1
src/flux.hpp                       +106 / -10
src/pulid.hpp                      +127 / -0   (new)
examples/common/common.h             +6 / -0
examples/common/common.cpp          +19 / -0
docs/pulid.md                      +220 / -0   (new)
scripts/pulid_extract_id.py        +135 / -0   (new)

Total ~770 added lines, ~10 changed. No removed functionality.


…onParams

Upstream leejet#1569 ("simplify diffusion model runner params") split the monolithic DiffusionParams into per-model Extra structs. Re-seated the PuLID-Flux feature onto the new architecture:

- diffusion_model.hpp: pulid_id + pulid_id_weight added to FluxDiffusionExtra.
- flux.hpp: compute(DiffusionParams) now reads extra->pulid_id / extra->pulid_id_weight and threads them through to build_graph (the PuLID cross-attention code itself merged cleanly).
- stable-diffusion.cpp: the FluxDiffusionExtra construction carries the PuLID id embedding + weight; obsolete monolithic param assignments dropped.

Verified end-to-end on three GPUs/backends (compiles + the 3-way off / zero-weight / on PuLID falsification all pass; zero-weight is byte-identical to baseline, weight 1.0 alters output and preserves identity):

- AMD R9700 (RDNA4, ROCm)
- AMD RX 6700 XT (RDNA2, Vulkan)
- NVIDIA RTX 3060 (Vulkan)

Green-Sky · 2026-06-01T16:36:57Z

I think gguf should be used as the container for binary data instead.

RapidMark · 2026-06-01T16:44:50Z

Yeah, fair point — gguf makes more sense than me rolling my own header here. Nice bonus is it can load through the same init_from_file path the pulid_ca weights already use, so I can drop the custom loader entirely. I'll redo the extract script + reader and test it on a couple backends before pushing an update. Thanks!

RapidMark · 2026-06-01T23:31:10Z

Superseding this with #1595. I rebased onto current master and reworked the id-embedding into a gguf container like you suggested, @Green-Sky — it loads through the same init_from_file path as the pulid_ca weights now, no bespoke header. This branch had drifted onto an older master and the history got messy, so a clean PR was easier than untangling a force-push. Carrying on over there.

RapidMark force-pushed the cloudhands/pulid-flux branch from 616d8d0 to aef4d29 · May 22, 2026 01:39

RapidMark mentioned this pull request Jun 1, 2026

feat: PuLID-Flux identity-injection support #1595

Open

RapidMark closed this Jun 1, 2026


RapidMark wants to merge 2 commits into leejet:master from CloudhandsAI:cloudhands/pulid-flux
