Standalone LoRA adapter for the SAM 3 vision backbone (vitdet fused-qkv seam),
modeled on Sam_LoRA with the plumbing fixed:
α/r scaling, proper registration, name-keyed safetensors save/load, and weight merging.

Install the dependencies:

    pip install -r requirements.txt
    # the sam3 package must be importable to wrap a real model:
    # pip install -e /path/to/sam3

Wrap a built model, train, then save, reload, or merge the adapter:

    import torch
    from sam3 import build_sam3_image_model
    from sam3_lora import LoRA_SAM3

    model = build_sam3_image_model(...)  # your built SAM 3 image model
    lora = LoRA_SAM3(model, r=4, alpha=4, target=("q", "v"))
    opt = torch.optim.AdamW(lora.lora_parameters(), lr=1e-4)
    # ... train: use `model` (now patched in place) as usual ...
    lora.save_lora("adapter.safetensors")
    lora.load_lora("adapter.safetensors")
    merged = lora.merge_and_unload()  # plain model for inference

`LoRA_SAM3` walks the model, finds every attention `qkv` that is a fused
`nn.Linear(dim, 3*dim)` (duck-typed, `torch.compile`-safe), freezes the whole model,
and replaces each `qkv` with a `_LoRAQKV` wrapper. The wrapper adds
`(alpha / r) * B_role(A_role(x))` to each targeted slice (q and v by default). The
addition is out-of-place, so it is transparent to flash/RoPE attention kernels. B is
zero-initialized, so the wrapped model is numerically identical to the original until
training begins.
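
The wrapper described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the package's `_LoRAQKV`: the class name `LoRAQKVSketch`, its attribute names, and the assumption that the fused output is laid out as q | k | v along the last dimension are all hypothetical.

```python
import torch
import torch.nn as nn


class LoRAQKVSketch(nn.Module):
    """Hypothetical stand-in for a LoRA wrapper around a fused qkv projection."""

    def __init__(self, qkv: nn.Linear, r: int = 4, alpha: float = 4.0,
                 target=("q", "v")):
        super().__init__()
        self.qkv = qkv
        self.dim = qkv.in_features
        self.scaling = alpha / r
        self.target = tuple(target)
        # One (A, B) pair per adapted role.
        self.lora_A = nn.ModuleDict(
            {role: nn.Linear(self.dim, r, bias=False) for role in self.target})
        self.lora_B = nn.ModuleDict(
            {role: nn.Linear(r, self.dim, bias=False) for role in self.target})
        # B starts at zero, so the LoRA delta is zero before training.
        for b in self.lora_B.values():
            nn.init.zeros_(b.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.qkv(x)  # (..., 3 * dim), assumed layout: q | k | v
        parts = list(out.split(self.dim, dim=-1))
        for i, role in enumerate(("q", "k", "v")):
            if role in self.target:
                delta = self.lora_B[role](self.lora_A[role](x))
                # Out-of-place add: builds a new tensor instead of mutating `out`.
                parts[i] = parts[i] + self.scaling * delta
        return torch.cat(parts, dim=-1)


torch.manual_seed(0)
base = nn.Linear(8, 24)  # fused projection with dim = 8
wrapped = LoRAQKVSketch(base, r=4, alpha=4.0)
x = torch.randn(2, 5, 8)
same_at_init = torch.allclose(wrapped(x), base(x))
```

With `alpha == r` the scaling is 1.0, and `same_at_init` holds because every B matrix is zero until an optimizer step changes it.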
| arg | default | meaning |
|---|---|---|
| `r` | `4` | LoRA rank |
| `alpha` | `r` (→ scaling 1.0) | `scaling = alpha / r` |
| `target` | `("q", "v")` | which of q/k/v to adapt |
| `blocks` | all | indices (in discovery order) of blocks to adapt |

Only the vision backbone's `attn.qkv` projections are adapted. The DETR decoder and text encoder
use parameter-based attention (`in_proj_weight`) and are intentionally out of scope.
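
To make the scope concrete, here is a rough sketch of duck-typed discovery. The helpers `looks_like_fused_qkv` and `find_fused_qkv` and the `FakeAttention` class are hypothetical, not part of `sam3_lora`. `nn.MultiheadAttention` stands in for parameter-based attention, since it stores its projection as `in_proj_weight` rather than as a `qkv` submodule.

```python
import torch.nn as nn


def looks_like_fused_qkv(layer) -> bool:
    """Duck-typed check: anything exposing in_features and out_features == 3 * in_features."""
    in_f = getattr(layer, "in_features", None)
    out_f = getattr(layer, "out_features", None)
    return in_f is not None and out_f == 3 * in_f


def find_fused_qkv(model: nn.Module):
    """Return (name, module) pairs, in discovery order, for modules with a fused `qkv`."""
    return [(name, module) for name, module in model.named_modules()
            if looks_like_fused_qkv(getattr(module, "qkv", None))]


class FakeAttention(nn.Module):  # hypothetical stand-in for a ViT attention block
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)


model = nn.ModuleDict({
    "backbone": nn.ModuleList([FakeAttention(16) for _ in range(3)]),
    # Parameter-based attention: has in_proj_weight but no `.qkv`, so it is skipped.
    "decoder": nn.MultiheadAttention(16, num_heads=4),
})

found = find_fused_qkv(model)
names = [name for name, _ in found]
print(names)  # ['backbone.0', 'backbone.1', 'backbone.2']

# A `blocks`-style selection picks entries by their index in discovery order.
blocks = (0, 2)
selected = [found[i][0] for i in blocks]
```

Here `selected` ends up as `['backbone.0', 'backbone.2']`, which is how a `blocks` argument indexed in discovery order would behave under these assumptions.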

Run the tests:

    python -m pytest -v

13 unit tests run anywhere torch is installed (they build a tiny fake ViT with the same
fused-qkv seam). One additional integration test runs against the real
`sam3.model.vitdet.Attention` and is skipped automatically if `sam3` is not importable.
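
Weight merging can be checked on the same kind of tiny fused-qkv stand-in. The sketch below is not taken from the package's test suite. It assumes the fused weight rows are laid out as q | k | v and uses random tensors in place of trained LoRA factors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, r, alpha = 8, 4, 4.0
scaling = alpha / r

qkv = nn.Linear(dim, 3 * dim)  # fused projection; rows assumed to be q | k | v
# Random stand-ins for trained LoRA factors (A: r x dim, B: dim x r).
A_q, B_q = torch.randn(r, dim), torch.randn(dim, r)
A_v, B_v = torch.randn(r, dim), torch.randn(dim, r)
x = torch.randn(2, 5, dim)

with torch.no_grad():
    # Unmerged path: base projection plus scaled low-rank deltas on q and v.
    q, k, v = qkv(x).split(dim, dim=-1)
    unmerged = torch.cat([
        q + scaling * (x @ A_q.T @ B_q.T),
        k,
        v + scaling * (x @ A_v.T @ B_v.T),
    ], dim=-1)

    # Merged path: fold scaling * (B @ A) into the q and v rows of a copy.
    merged = nn.Linear(dim, 3 * dim)
    merged.load_state_dict(qkv.state_dict())
    merged.weight[:dim] += scaling * (B_q @ A_q)
    merged.weight[2 * dim:] += scaling * (B_v @ A_v)

    max_diff = (merged(x) - unmerged).abs().max().item()
```

Because `x @ (B @ A).T` equals `x @ A.T @ B.T`, the merged layer reproduces the adapted output up to floating-point error, so `max_diff` should be close to zero.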