feat: PuLID-Flux identity-injection support #1542
This PR adds support for [PuLID-Flux](https://github.com/ToTheBeginning/PuLID) identity preservation to the Flux denoise loop. Given a single source portrait, generated images preserve the source person's face across arbitrary scenes and prompts.

### What's included

- `src/pulid.hpp`: `PuLIDPerceiverAttentionCA`, the cross-attention module mirroring the PyTorch reference at [ToTheBeginning/PuLID/.../encoders_transformer.py](https://github.com/ToTheBeginning/PuLID/blob/main/pulid/encoders_transformer.py). Pure-ggml graph; runs on CPU / CUDA / Vulkan / Metal without backend-specific code.
- `src/flux.hpp`: adds 20 `pulid_ca.<i>` child blocks to `Flux` (constructed conditionally when `params.pulid_enabled` is set), inserts the cross-attention call between transformer blocks at the intervals the PyTorch reference uses (every 2nd double block, every 4th single block), and threads two new optional parameters (`pulid_id`, `pulid_id_weight`) through `forward`, `forward_orig`, `forward_chroma_radiance`, `forward_flux_chroma`, `compute`, and `build_graph`.
- `src/stable-diffusion.cpp`: loads `pulid_*.safetensors` via `model_loader.init_from_file` under the existing `model.diffusion_model.` prefix so PuLID-CA tensors bind to the new blocks naturally. PuLID-encoder keys (which live in the precompute tool, not in C++) are correctly identified as unknown. Adds `load_pulid_id_embedding()` to parse a small `.pulidembd` binary file and wraps its content as a `sd::Tensor<float>` passed via `DiffusionParams`.
- `include/stable-diffusion.h`: public API: `sd_pulid_params_t` (per-generation embedding path + weight), `pulid_weights_path` on `sd_ctx_params_t`, `pulid_params` on `sd_img_gen_params_t`.
- `examples/common/common.{cpp,h}`: three new CLI flags: `--pulid-weights <path>`, `--pulid-id-embedding <path>`, and `--pulid-id-weight <float>`.
- `src/diffusion_model.hpp`: extends `DiffusionParams` to carry the new identity embedding + weight; `FluxModel::compute` forwards both through.
- `docs/pulid.md`: usage, binary format spec, supported PuLID weight versions (v0.9.0 / v0.9.1; v1.1 deferred), memory budget notes, and a three-way SHA-256 falsification recipe.
- `scripts/pulid_extract_id.py`: reference precompute tool that produces the `.pulidembd` binary from a source portrait. Lives outside the C++ build because identity extraction (insightface + EVA-CLIP-L + IDFormer) is a heavy PyTorch stack that would be impractical to port to ggml just to run once per source person.

### Why split extraction from injection

PuLID-Flux's identity extractor is a stack of three large PyTorch models (ArcFace face detector + EVA-CLIP-L vision encoder + IDFormer perceiver-resampler). Porting all three to C++/ggml would add ~5000 lines for code that runs once per source person and produces a 131 KB output. By making sd.cpp consume a precomputed binary file, the C++ surface area is small (~600 lines), the heavy ML stack only needs to run once per person on any backend that supports PyTorch, and adding PuLID is decoupled from the active development on insightface / EVA-CLIP / IDFormer.

### Binary format

```
offset 0  : magic "PULIDV01" (8 bytes ASCII)
offset 8  : num_tokens (uint32 LE)
offset 12 : token_dim  (uint32 LE)
offset 16 : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
offset 17 : reserved zeros (15 bytes; header total = 32)
offset 32 : tokens, row-major LE
```

Typical (32, 2048, fp16) = 131 KB.
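The header layout above can be exercised with a few lines of standard-library Python. This is a minimal sketch, not the code in `scripts/pulid_extract_id.py` or `load_pulid_id_embedding()`: the function names `pack_header` and `parse_embedding` are hypothetical, and the payload-length check is an assumption about what a careful reader would validate.

```python
import struct

MAGIC = b"PULIDV01"
# "<" = little-endian, no padding: 8s magic, I num_tokens, I token_dim,
# B dtype, 15x reserved zero bytes -> 32 bytes total.
HEADER = struct.Struct("<8sIIB15x")
# Bytes per element for each dtype code listed in the format table.
DTYPE_SIZES = {0: 2, 1: 2, 2: 4}  # 0=fp16, 1=bf16, 2=fp32


def pack_header(num_tokens: int, token_dim: int, dtype_code: int) -> bytes:
    """Build the 32-byte header (hypothetical helper, for illustration)."""
    if dtype_code not in DTYPE_SIZES:
        raise ValueError(f"unknown dtype code {dtype_code}")
    return HEADER.pack(MAGIC, num_tokens, token_dim, dtype_code)


def parse_embedding(blob: bytes):
    """Return (num_tokens, token_dim, dtype_code, payload) or raise ValueError."""
    if len(blob) < HEADER.size:
        raise ValueError("file is shorter than the 32-byte header")
    magic, num_tokens, token_dim, dtype_code = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError(f"bad magic {magic!r}")
    if dtype_code not in DTYPE_SIZES:
        raise ValueError(f"unknown dtype code {dtype_code}")
    payload = blob[HEADER.size:]
    expected = num_tokens * token_dim * DTYPE_SIZES[dtype_code]
    if len(payload) != expected:
        raise ValueError(f"payload is {len(payload)} bytes, expected {expected}")
    return num_tokens, token_dim, dtype_code, payload


if __name__ == "__main__":
    # Typical shape from the description: 32 tokens x 2048 dims, fp16.
    blob = pack_header(32, 2048, 0) + bytes(32 * 2048 * 2)
    n, d, code, payload = parse_embedding(blob)
    print(n, d, code, len(payload), len(blob))
```

For the typical (32, 2048, fp16) case the payload is 32 × 2048 × 2 = 131072 bytes, so the whole file is 131104 bytes including the header, which is where the 131 KB figure comes from.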
### Verification

The three-way SHA-256 falsification recipe in docs/pulid.md distinguishes "the feature is wired but doesn't do anything" from "the feature is actively altering the diffusion trajectory":

| Run                                     | Expected hash relation                     |
|-----------------------------------------|--------------------------------------------|
| A: no `--pulid-*` flags                 | baseline                                   |
| B: PuLID flags, `--pulid-id-weight 0.0` | byte-identical to A                        |
| C: PuLID flags, `--pulid-id-weight 1.0` | differs, preserves source identity         |

Verified on three backends with the same source code:

- **Vulkan-AMD** (RX 6700 XT, `-DSD_VULKAN=ON`): A == B byte-identical, A != C, C visually preserves source identity.
- **Vulkan-NVIDIA** (RTX 3060, same binary, `--backend "diffusion=vulkan1"`): A == B, A != C, C visually equivalent to the AMD output at the same seed (different bytes per the usual cross-backend nondeterminism).
- **CUDA-NVIDIA** (RTX 3060, separate `-DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86` build against CUDA 13.2): A == B byte-identical, A != C, C visually preserves source identity.

PerceiverAttentionCA's pure-ggml graph code runs unchanged across all three backends -- no backend-specific conditionals were needed.

Per-image sampling times at 512x512 / 4 steps / Flux Schnell Q4 + PuLID:

| Backend                | Sampling (s) | Notes                          |
|------------------------|-------------:|--------------------------------|
| AMD 6700 XT (Vulkan)   |           22 | 12 GB consumer card            |
| NVIDIA 3060 (Vulkan)   |           11 | same binary as AMD             |
| NVIDIA 3060 (CUDA)     |          9.6 | separate `-DSD_CUDA=ON` build  |

batch_count=3 was tested separately and confirms the long-lived-worker amortization story: per-image sampling drops from 19.6 s (cold) to ~11 s (warm) as the model stays resident across batch iterations.

Tested with Flux Schnell Q4_K_S + PuLID v0.9.1 at 512x512 / 4 steps, and Flux Dev Q4_K_S + PuLID v0.9.1 at 768x768 / 20 steps.
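The A / B / C hash comparison from the verification table could be automated roughly as follows. This is a sketch under assumptions: the helper names and the output file names are hypothetical, and it only checks the two hash relations; whether run C preserves the source identity still has to be judged visually.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path) -> str:
    """Hash a file in 1 MiB chunks so large outputs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_three_way(run_a, run_b, run_c) -> dict:
    """Compare outputs of run A (no PuLID), B (weight 0.0), C (weight 1.0)."""
    a, b, c = (sha256_file(p) for p in (run_a, run_b, run_c))
    return {
        # B must be byte-identical to the baseline.
        "zero_weight_matches_baseline": a == b,
        # C must differ, otherwise the feature is wired but inert.
        "full_weight_differs": a != c,
    }


if __name__ == "__main__":
    # Stand-in files; in practice these would be the three generated images.
    with tempfile.TemporaryDirectory() as tmp:
        paths = [Path(tmp) / name for name in ("a.png", "b.png", "c.png")]
        for path, data in zip(paths, (b"same", b"same", b"different")):
            path.write_bytes(data)
        print(check_three_way(*paths))
```

Both entries should be `True` for a passing run; a `False` in the first entry means the zero-weight path is not a true no-op, and a `False` in the second means the embedding is not reaching the denoise loop.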
1024x1024 + Dev + PuLID OOMs on a 12 GB card unless the VAE is routed to the CPU backend via `--backend "vae=cpu"` (not just `--vae-on-cpu`, which only offloads weights, not the compute buffer); this is existing stable-diffusion.cpp behavior, not a PuLID-specific issue, but documented in docs/pulid.md because PuLID users will hit it.

Tested with batch_count > 1 (verified each image gets the same identity, different composition).

### Not yet supported (called out in docs/pulid.md)

- PuLID v1.1 (`pulid_v1.1.safetensors`) -- has renamed key layout (`id_adapter_attn_layers.*` vs `pulid_ca.*`) and potentially different module structure. Follow-up PR.
- Multiple ID images fused into one embedding (the reference Python pipeline supports this; the current precompute tool accepts only one portrait per run).
- The `--true-cfg` negative-prompt branch -- PuLID only injects on the positive conditioning path in the reference implementation; this matches.

### Backward compatibility

Non-PuLID generations are unaffected. The `params.pulid_enabled` flag defaults to false and is only set when the model loader sees a `pulid_ca.*` tensor in the loaded safetensors file. A regression run of Flux Schnell Q4 without `--pulid-*` flags produces byte-identical output to pre-patch.

### File summary

```
include/stable-diffusion.h     +34 / -0
src/stable-diffusion.cpp       +120 / -0
src/diffusion_model.hpp        +5 / -1
src/flux.hpp                   +106 / -10
src/pulid.hpp                  +127 / -0   (new)
examples/common/common.h       +6 / -0
examples/common/common.cpp     +19 / -0
docs/pulid.md                  +220 / -0   (new)
scripts/pulid_extract_id.py    +135 / -0   (new)
```

Total ~770 added lines, ~10 changed. No removed functionality.
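As a cross-check on the block count listed under "What's included", the injection schedule can be enumerated in a few lines. This is an illustrative sketch, not code from `src/flux.hpp`: it assumes the standard Flux.1 depth of 19 double blocks and 38 single blocks, and assumes injection fires on block indices that are multiples of the interval, starting at block 0.

```python
# Assumptions (not stated in this PR): Flux.1 dev/schnell block counts.
NUM_DOUBLE_BLOCKS = 19
NUM_SINGLE_BLOCKS = 38
# Intervals from the description: every 2nd double block, every 4th single block.
DOUBLE_INTERVAL = 2
SINGLE_INTERVAL = 4


def injection_schedule(num_double=NUM_DOUBLE_BLOCKS, num_single=NUM_SINGLE_BLOCKS,
                       double_interval=DOUBLE_INTERVAL, single_interval=SINGLE_INTERVAL):
    """Return (block_kind, block_index, pulid_ca_index) for each injection point."""
    schedule = []
    ca_idx = 0  # index into the pulid_ca.<i> child blocks
    for i in range(num_double):
        if i % double_interval == 0:
            schedule.append(("double", i, ca_idx))
            ca_idx += 1
    for i in range(num_single):
        if i % single_interval == 0:
            schedule.append(("single", i, ca_idx))
            ca_idx += 1
    return schedule


if __name__ == "__main__":
    for kind, block, ca in injection_schedule():
        print(f"pulid_ca.{ca} runs after {kind} block {block}")
```

Under these assumptions the schedule has 10 injection points in the double blocks and 10 in the single blocks, which is consistent with the 20 `pulid_ca.<i>` blocks this PR adds.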
616d8d0 to aef4d29
…onParams

Upstream leejet#1569 ("simplify diffusion model runner params") split the monolithic DiffusionParams into per-model Extra structs. Re-seated the PuLID-Flux feature onto the new architecture:

- diffusion_model.hpp: pulid_id + pulid_id_weight added to FluxDiffusionExtra.
- flux.hpp: compute(DiffusionParams) now reads extra->pulid_id / extra->pulid_id_weight and threads them through to build_graph (the PuLID cross-attention code itself merged cleanly).
- stable-diffusion.cpp: the FluxDiffusionExtra construction carries the PuLID id embedding + weight; obsolete monolithic param assignments dropped.

Verified end-to-end on three GPUs/backends (compiles + the 3-way off / zero-weight / on PuLID falsification all pass; zero-weight is byte-identical to baseline, weight 1.0 alters output and preserves identity):

- AMD R9700 (RDNA4, ROCm)
- AMD RX 6700 XT (RDNA2, Vulkan)
- NVIDIA RTX 3060 (Vulkan)
I think gguf should be used as the container for binary data instead. |
Yeah, fair point — gguf makes more sense than me rolling my own header here. Nice bonus is it can load through the same |
Superseding this with #1595. I rebased onto current master and reworked the id-embedding into a gguf container like you suggested, @Green-Sky — it loads through the same |