Q2_0 group 64: CUDA backend #43
Draft
khosravipasha wants to merge 1 commit into
Pull request overview
Adds CUDA backend support for GGML_TYPE_Q2_0 (group size 64) across the main CUDA execution paths (MMVQ mat-vec, MMQ mat-mat, row extraction, and dequantization/conversion), plus build/template plumbing to instantiate the needed kernels.
Changes:
- Implement Q2_0×Q8_1 CUDA dot product and wire it into the MMVQ (mul_mat_vec_q) dispatch path.
- Add MMQ (mul_mat_q) support for Q2_0 via new tile loader, type traits, and template instantiation generation.
- Enable Q2_0 for CUDA get_rows and the conversion/dequantization utilities, and mark the relevant ops as supported by the CUDA backend.
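As a rough host-side reference for what a Q2_0×Q8_1 dot product computes, the sketch below may help when checking the kernel output. The struct layouts, the `*Sketch` names, the use of `float` scales, and the mapping of 2-bit codes to the values -2..1 are all assumptions made for illustration; none of them are taken from this PR, and the real Q8_1 block also carries a sum term that is omitted here.

```cpp
#include <array>
#include <cstdint>

// ASSUMPTION: hypothetical sizes and layouts, not copied from the PR.
constexpr int QK2_0_SKETCH = 64;  // group size named in the PR title
constexpr int QK8_SKETCH   = 32;  // Q8_1 block size

struct BlockQ2Sketch {
    float d;                                   // one scale per group (real code likely uses fp16)
    std::array<uint8_t, QK2_0_SKETCH / 4> qs;  // 2 bits per element, 4 elements per byte
};

struct BlockQ8Sketch {
    float d;                                   // per-block scale
    std::array<int8_t, QK8_SKETCH> qs;         // 8-bit quantized activations
};

// ASSUMPTION: a 2-bit code c in 0..3 decodes to c - 2; the actual mapping may differ.
inline int decode_q2_sketch(const BlockQ2Sketch& b, int i) {
    const int code = (b.qs[i / 4] >> (2 * (i % 4))) & 0x3;
    return code - 2;
}

// Dot product of one Q2 group against the Q8 blocks that cover the same elements.
inline float dot_q2_q8_sketch(const BlockQ2Sketch& x, const BlockQ8Sketch* y) {
    float acc = 0.0f;
    for (int chunk = 0; chunk < QK2_0_SKETCH / QK8_SKETCH; ++chunk) {
        int32_t sumi = 0;  // integer accumulation within one 32-element chunk
        for (int j = 0; j < QK8_SKETCH; ++j) {
            sumi += decode_q2_sketch(x, chunk * QK8_SKETCH + j) * y[chunk].qs[j];
        }
        acc += x.d * y[chunk].d * static_cast<float>(sumi);
    }
    return acc;
}
```

Accumulating in integers per chunk and applying both scales once per chunk mirrors the general pattern used by the existing ggml vec-dot kernels, which is why the sketch is organized around 32-element chunks rather than a flat loop.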
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ggml/src/ggml-cuda/vecdotq.cuh | Adds vec_dot_q2_0_q8_1 and VDR macros for Q2_0. |
| ggml/src/ggml-cuda/mmvq.cu | Routes Q2_0 through MMVQ vec-dot dispatch and type switch. |
| ggml/src/ggml-cuda/mmq.cuh | Adds Q2_0 MMQ tile loading and type trait wiring; updates q8_1 ds layout selection. |
| ggml/src/ggml-cuda/mmq.cu | Enables Q2_0 in MMQ type dispatch and MMQ usage heuristic. |
| ggml/src/ggml-cuda/template-instances/mmq-instance-q2_0.cu | New generated MMQ instantiation TU for Q2_0. |
| ggml/src/ggml-cuda/template-instances/generate_cu_files.py | Includes Q2_0 in the MMQ instantiation generation list. |
| ggml/src/ggml-cuda/ggml-cuda.cu | Marks Q2_0 as supported for relevant CUDA ops in capability checks. |
| ggml/src/ggml-cuda/getrows.cu | Adds Q2_0 case to CUDA get_rows dispatch using dequantize_q2_0. |
| ggml/src/ggml-cuda/dequantize.cuh | Introduces dequantize_q2_0 for CUDA dequantization kernels. |
| ggml/src/ggml-cuda/convert.cu | Enables Q2_0 conversions to fp16/fp32 (contiguous + non-contiguous) via new dequantizer. |
| ggml/src/ggml-cuda/common.cuh | Adds CUDA type traits (qk/qr/qi) for GGML_TYPE_Q2_0. |
| ggml/src/ggml-cpu/arch-fallback.h | Adds arch-fallback alias for ggml_vec_dot_q2_0_q8_0_generic on x86. |
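To make the `qk`/`qr`/`qi` traits in the table more concrete, here is a small sketch of how they usually relate for ggml quantization types: `qi` counts the 32-bit integers of quantized data per block and is derived as `qk / (4 * qr)`. The struct name and the values `qk = 64` and `qr = 4` are assumptions inferred from the PR title and the 2-bit packing, not values confirmed from the diff.

```cpp
// Number of 32-bit integers of quant data per block, following the
// relationship used by existing ggml types (e.g. Q4_0: 32 / (4 * 2) = 4).
constexpr int qi_from(int qk, int qr) {
    return qk / (4 * qr);
}

// ASSUMPTION: hypothetical trait values for Q2_0 with group size 64.
struct Q2_0TraitsSketch {
    static constexpr int qk = 64;               // elements per block
    static constexpr int qr = 4;                // elements packed per byte at 2 bits each
    static constexpr int qi = qi_from(qk, qr);  // 16 bytes of quants = 4 ints
};

static_assert(Q2_0TraitsSketch::qi == 4, "64 elements at 2 bits fit in four 32-bit ints");
```

If the real block size or packing ratio differs, `qi` changes with it, so these three values are worth checking together against the block struct.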
Comment on lines +728 to +731:

    // Q2_0: 128 elements with ONE scale, 2 bits per element (4 elements per byte)
    // Q8_1: 32 elements per block with individual scales
    // iqs selects which of the 4 chunks of 32 elements to process (0-3)
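The chunk arithmetic in the quoted comment can be checked with a short sketch. It assumes that each chunk lines up with one 32-element Q8_1 block and that chunks are packed contiguously at 2 bits per element; the function names are hypothetical. Under those assumptions, a 128-element group yields the 4 chunks the comment describes, while the group size of 64 named in the PR title would yield 2, so the comment and the title may be worth reconciling.

```cpp
constexpr int Q8_1_BLOCK_SKETCH = 32;  // elements per Q8_1 block

// Number of 32-element chunks (valid iqs values) in one Q2_0 group.
constexpr int chunks_per_group(int group_size) {
    return group_size / Q8_1_BLOCK_SKETCH;
}

// ASSUMPTION: chunks are stored back to back, 4 elements per byte,
// so each chunk occupies 32 / 4 = 8 bytes of the quant array.
constexpr int chunk_byte_offset(int iqs) {
    return iqs * Q8_1_BLOCK_SKETCH / 4;
}

static_assert(chunks_per_group(128) == 4, "matches the quoted comment");
static_assert(chunks_per_group(64) == 2, "matches the group size in the PR title");
```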
DRAFT PR for testing and review