drm/compositor: fall back to primary-plane cursor on map_mut failure #2033

Open
poelzi wants to merge 1 commit into Smithay:master from poelzi:nomem

Conversation

poelzi commented May 14, 2026

Summary

  • The cursor-plane fast path in DrmCompositor::render_cursor_plane calls .expect("Lost track of cursor device") on the outer Result of gbm::BufferObject::map_mut. That panic was written assuming the only failure mode was a vanished gbm device, but on NVIDIA the kernel allocator can return ENOMEM here, typically after a misbehaving client (e.g. Chromium's video decoder hitting a CHECK and dying under SIGILL) leaks GPU memory in nvkms. A transient per-frame allocation failure then takes the entire compositor down through render_frame.
  • This patch handles the outer error the same way the four sibling failure points immediately above already do (create_buffer, add_framebuffer, copy_element_to_cursor_bo, missing underlying_storage): log at debug! and return None. The caller falls back to compositing the cursor on the primary plane for this frame, and the cursor-plane fast path is automatically retried on subsequent frames once kernel memory frees up.
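
For readers without the diff at hand, the shape of the change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual smithay code: `map_mut_stub`, `CursorPlaneState`, the simplified `render_cursor_plane` signature and the `fail_with` parameter are all hypothetical stand-ins, and `eprintln!` stands in for the `debug!` logging macro.

```rust
use std::io;

/// Hypothetical stand-in for the state returned when the cursor plane is used.
#[derive(Debug, PartialEq)]
pub struct CursorPlaneState {
    pub bytes_written: usize,
}

/// Hypothetical stand-in for `gbm::BufferObject::map_mut`: runs `f` on the
/// mapped bytes, or fails with an OS error when `fail_with` is set.
pub fn map_mut_stub<F, S>(buf: &mut [u8], fail_with: Option<i32>, f: F) -> io::Result<S>
where
    F: FnOnce(&mut [u8]) -> S,
{
    match fail_with {
        Some(code) => Err(io::Error::from_raw_os_error(code)),
        None => Ok(f(buf)),
    }
}

/// Simplified cursor fast path: `None` tells the caller to composite the
/// cursor on the primary plane for this frame instead.
pub fn render_cursor_plane(buf: &mut [u8], fail_with: Option<i32>) -> Option<CursorPlaneState> {
    let bytes_written = match map_mut_stub(buf, fail_with, |mapped| {
        // Placeholder for copying the cursor image into the mapped buffer.
        mapped.fill(0xff);
        mapped.len()
    }) {
        Ok(n) => n,
        Err(err) => {
            // Previously an `.expect(..)` panicked here; now log and fall back.
            eprintln!("failed to map cursor buffer for rendering: {err}");
            return None;
        }
    };
    Some(CursorPlaneState { bytes_written })
}
```

In this sketch, calling `render_cursor_plane(&mut buf, Some(12))` simulates the ENOMEM case (errno 12 on Linux) and yields `None` rather than a panic, so the next frame is free to try the fast path again.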

Repro

Observed in the wild on a niri + NVIDIA setup running smithay 0.7.0:

panicked at smithay-0.7.0/src/backend/drm/compositor/mod.rs:3353:18:
Lost track of cursor device: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }

Sequence: chromium Media thread SIGILL → kernel logs [drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* Failed to map NvKmsKapiMemory → 5s later niri panics at the line above → every Wayland client loses its socket.

Test plan

  • cargo check --no-default-features --features "backend_drm,backend_gbm,renderer_pixman" — clean.
  • cargo test --lib under the default + relevant features — 71/71 pass, no behavior change.
  • cargo clippy — no new lints (one pre-existing io_other_error warning in gbm.rs:215 is unrelated).
  • Real-world: rebuild niri (or your compositor of choice) against this and confirm the Chromium → NVIDIA-ENOMEM sequence no longer kills the session. The cursor may briefly fall back to primary-plane compositing until GPU memory recovers, which is the intended behavior.

Notes

  • Scope is intentionally minimal. The other .unwrap() / .expect() sites in DrmCompositor were audited and are state-machine invariants (HashMap entries the code just inserted, Options the conditional just proved Some); they do not touch the GPU allocator path.
  • Doesn't address the upstream NVIDIA driver leak — that's not fixable here — only the compositor's reaction to it.

The cursor-plane fast path called .expect("Lost track of cursor device")
on the result of gbm::BufferObject::map_mut. That panic was written
assuming the only way the outer Result could fail was a vanished gbm
device handle. In practice, on NVIDIA the kernel allocator can return
ENOMEM here — typically after a misbehaving client (e.g. Chromium's
video decoder crashing under SIGILL) leaks GPU memory in nvkms — and
the panic propagated all the way up render_frame, killing the entire
compositor over a transient per-frame allocation failure.

Treat the outer map_mut error the same way the four sibling failure
points immediately above already do (create_buffer, add_framebuffer,
copy_element_to_cursor_bo, missing underlying_storage): log at debug!
and return None, letting the caller composite the cursor on the
primary plane for this frame. Once kernel memory frees up the cursor
plane is re-tried automatically.

Repro path on the user's machine: chromium SIGILL in the Media thread →
nv_drm_gem_nvkms_map starts returning ENOMEM → niri panics at
mod.rs:3353 with "Lost track of cursor device: Os { code: 12,
kind: OutOfMemory }" → every Wayland client loses its socket.

Drakulix left a comment

Member

Thanks! But I think we can be a bit more concise here.

debug!("failed to map cursor buffer for rendering: {err}");
return None;
}
};

Could this simply be an inspect_err followed by .ok()??

Also I appreciate the PR description and long commit message, but this comment here is really not necessary. We simply fall back in an error case. If anybody wonders how to even reach this condition, they can use git blame.
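
A rough sketch of what the suggested `inspect_err` plus `.ok()?` form could look like, using only the standard library. `try_map_cursor` and `mapped_len` are hypothetical names invented for this illustration, `eprintln!` again stands in for `debug!`, and `Result::inspect_err` requires Rust 1.76 or newer.

```rust
use std::io;

/// Hypothetical stand-in for the fallible mapping call; returns the number
/// of mapped bytes, or an OS error when `fail_with` is set.
pub fn try_map_cursor(fail_with: Option<i32>) -> io::Result<usize> {
    match fail_with {
        Some(code) => Err(io::Error::from_raw_os_error(code)),
        None => Ok(64 * 64 * 4),
    }
}

/// Logs the error as a side effect, then converts the `Result` into an
/// `Option` so that `?` returns `None` early on failure.
pub fn mapped_len(fail_with: Option<i32>) -> Option<usize> {
    let len = try_map_cursor(fail_with)
        .inspect_err(|err| eprintln!("failed to map cursor buffer for rendering: {err}"))
        .ok()?;
    Some(len)
}
```

Compared with an explicit `match`, this keeps the fallback on a single expression chain; the behavior (log, then return `None`) is intended to be the same.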

mati865 commented May 22, 2026

Contributor

Also, when making a PR with an LLM, at least check that it didn't remove the template.
