[ET-VK] Fix embedding_q4gsw out-of-bounds access with dynamic shapes by SS-JIA · Pull Request #18558 · pytorch/executorch

SS-JIA · 2026-03-29T01:59:31Z

Stack from ghstack (oldest at bottom):

-> [ET-VK] Fix embedding_q4gsw out-of-bounds access with dynamic shapes #18558

The embedding_q4gsw shader used push constants for num_indices,
out_height, and embed_dim that were captured at graph build time and
never updated when input tensors were dynamically resized. This caused
out-of-bounds GPU memory reads when the actual input was smaller than
the initial allocation, resulting in VK_ERROR_DEVICE_LOST on Mali GPUs.

The fix derives all shape-dependent values (embed_dim, out_height,
num_indices) from the output tensor's sizes UBO, which is automatically
updated on resize. Only truly constant values (group_size,
is_linear_weight) remain as push constants.
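
As a rough illustration of why the two approaches behave differently, the Python sketch below models a shape value copied once at graph build time next to one read from a sizes buffer that is rewritten on resize. Everything in it is hypothetical: `OutputTensor`, `build_push_constants`, and `shape_from_sizes` are not ET-VK APIs, the `[num_indices, embed_dim]` layout is assumed, and the values 64 and 32 are placeholders.

```python
from dataclasses import dataclass


@dataclass
class OutputTensor:
    """Hypothetical stand-in for a tensor whose sizes live in a UBO."""
    sizes: list  # assumed layout: [num_indices, embed_dim]

    def resize(self, new_sizes):
        # Update in place, mimicking a sizes UBO that is rewritten on resize.
        self.sizes[:] = new_sizes


def build_push_constants(out, group_size):
    # Old approach: shape values are copied once, at graph build time.
    return {
        "num_indices": out.sizes[0],
        "embed_dim": out.sizes[1],
        "group_size": group_size,
    }


def shape_from_sizes(out):
    # New approach: shape values are read from the sizes buffer at dispatch time.
    return {"num_indices": out.sizes[0], "embed_dim": out.sizes[1]}


out = OutputTensor(sizes=[256, 64])       # graph built for 256 tokens
stale = build_push_constants(out, group_size=32)
out.resize([7, 64])                       # actual input has 7 tokens
fresh = shape_from_sizes(out)

print(stale["num_indices"], fresh["num_indices"])  # prints "256 7"
```

The snapshot keeps reporting 256 after the resize, while the value read from the sizes buffer follows the tensor. A value such as `group_size` does not depend on the input shape, so copying it once is harmless.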

Root cause: With a 7-token input on a graph built for 256 tokens, the
local workgroup rounding created an extra thread (y=7) that passed the
stale bounds check (7 >= 256 == false) and read past the 7-element
indices buffer.
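
The arithmetic behind this can be checked with a small Python sketch. It assumes a local workgroup size of 8 along y, which the PR does not state; the helper names are hypothetical and only model the rounding and the bounds check, not the shader itself.

```python
import math

LOCAL_SIZE_Y = 8  # assumed local workgroup size along y; not stated in the PR


def dispatched_threads(n, local_size):
    # The global dispatch size is rounded up to a whole number of workgroups.
    return math.ceil(n / local_size) * local_size


def out_of_bounds_threads(actual_n, bound, local_size=LOCAL_SIZE_Y):
    """Return y indices that pass the `y >= bound` early-exit check
    yet fall outside a buffer holding `actual_n` elements."""
    total = dispatched_threads(actual_n, local_size)
    return [y for y in range(total) if not (y >= bound) and y >= actual_n]


print(out_of_bounds_threads(7, bound=256))  # stale bound: prints [7]
print(out_of_bounds_threads(7, bound=7))    # bound from sizes UBO: prints []
```

With the stale bound of 256, thread y=7 is not rejected and would index past the 7-element buffer. With the bound taken from the resized output, the same thread exits early.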

Differential Revision: D98642319

ghstack-source-id: 359350851
Pull Request resolved: #18558

pytorch-bot · 2026-03-29T01:59:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18558

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job, 2 Unrelated Failures

As of commit 33dfb14 with merge base 24751f1:

NEW FAILURE - The following job has failed:

pull / android / run-emulator (gh)
The process '/usr/bin/sh' failed with exit code 1

CANCELLED JOB - The following job was cancelled. Please retry:

pull / unittest / macos / macos-job (gh)
##[error]The operation was canceled.

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / windows / windows-job (gh) (trunk failure)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-03-29T02:00:13Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

meta-cla Bot added the CLA Signed label Mar 29, 2026

manuelcandales approved these changes Mar 30, 2026

meta-codesync Bot merged commit 2ad79a8 into gh/SS-JIA/513/base Mar 30, 2026
158 of 165 checks passed

meta-codesync Bot deleted the gh/SS-JIA/513/head branch March 30, 2026 16:42

meta-codesync Bot temporarily deployed to cherry-pick-bot March 30, 2026 16:42 Inactive

pytorchbot mentioned this pull request Mar 30, 2026

[ET-VK] Fix embedding_q4gsw out-of-bounds access with dynamic shapes #18584

Merged

SS-JIA mentioned this pull request Mar 30, 2026

[ET-VK] Add fused HuggingFace RoPE operator (apply_rotary_emb_hf) #18592

Merged
