Q2_0 group 64: CUDA backend #43

Draft
khosravipasha wants to merge 1 commit into pr/q2_0-cpu from pr/q2_0-cuda

@khosravipasha

Collaborator

DRAFT PR for testing and review

Copilot AI left a comment

Pull request overview

Adds CUDA backend support for GGML_TYPE_Q2_0 (group size 64) across the main CUDA execution paths (MMVQ mat-vec, MMQ mat-mat, row extraction, and dequantization/conversion), plus build/template plumbing to instantiate the needed kernels.

Changes:

  • Implement Q2_0×Q8_1 CUDA dot product and wire it into the MMVQ (mul_mat_vec_q) dispatch path.
  • Add MMQ (mul_mat_q) support for Q2_0 via new tile loader, type traits, and template instantiation generation.
  • Enable Q2_0 for CUDA getrows + conversion/dequantization utilities, and mark relevant ops as supported by the CUDA backend.
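As a rough illustration of what a Q2_0×Q8_1 dot product has to compute, here is a host-side C++ reference sketch. It is not the PR's kernel. The struct names, the `float` scales (ggml stores half-precision scales), the low-bits-first packing order, and the `code - 2` value mapping are all assumptions made for the example; only the group size of 64, the 2 bits per element, and the 32-element Q8_1 blocks come from the PR text.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Sizes taken from the PR text: Q2_0 group of 64, Q8_1 blocks of 32.
constexpr int QK2_SKETCH = 64;
constexpr int QK8_SKETCH = 32;

// Hypothetical layout: one scale per group, four 2-bit codes per byte.
struct BlockQ2Sketch {
    float d;                                  // assumption: float instead of half
    std::array<uint8_t, QK2_SKETCH / 4> qs;   // 16 bytes for 64 elements
};

// Hypothetical Q8 block: one scale plus 32 signed 8-bit values.
struct BlockQ8Sketch {
    float d;
    std::array<int8_t, QK8_SKETCH> qs;
};

// Extract the 2-bit code of element i (assumed order: low bits first).
inline int code_at(const BlockQ2Sketch& b, int i) {
    assert(i >= 0 && i < QK2_SKETCH);
    return (b.qs[i / 4] >> (2 * (i % 4))) & 0x3;
}

// Dequantize element i; the "code - 2" offset is an assumed mapping.
inline float dequant_at(const BlockQ2Sketch& b, int i) {
    return b.d * static_cast<float>(code_at(b, i) - 2);
}

// Dot product of one Q2 group with the Q8 blocks that cover it.
// Integer products are accumulated per 32-element chunk, then scaled once.
inline float dot_q2_q8(const BlockQ2Sketch& x, const BlockQ8Sketch* y) {
    float acc = 0.0f;
    for (int c = 0; c < QK2_SKETCH / QK8_SKETCH; ++c) {
        int32_t sumi = 0;
        for (int j = 0; j < QK8_SKETCH; ++j) {
            sumi += (code_at(x, c * QK8_SKETCH + j) - 2) * y[c].qs[j];
        }
        acc += x.d * y[c].d * static_cast<float>(sumi);
    }
    return acc;
}
```

Accumulating in integers within each chunk and applying the two scales once per chunk mirrors the general shape of quantized vec-dot kernels; a CUDA version would typically replace the inner loop with packed integer operations.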

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ggml/src/ggml-cuda/vecdotq.cuh | Adds vec_dot_q2_0_q8_1 and VDR macros for Q2_0. |
| ggml/src/ggml-cuda/mmvq.cu | Routes Q2_0 through MMVQ vec-dot dispatch and type switch. |
| ggml/src/ggml-cuda/mmq.cuh | Adds Q2_0 MMQ tile loading and type trait wiring; updates q8_1 ds layout selection. |
| ggml/src/ggml-cuda/mmq.cu | Enables Q2_0 in MMQ type dispatch and MMQ usage heuristic. |
| ggml/src/ggml-cuda/template-instances/mmq-instance-q2_0.cu | New generated MMQ instantiation TU for Q2_0. |
| ggml/src/ggml-cuda/template-instances/generate_cu_files.py | Includes Q2_0 in the MMQ instantiation generation list. |
| ggml/src/ggml-cuda/ggml-cuda.cu | Marks Q2_0 as supported for relevant CUDA ops in capability checks. |
| ggml/src/ggml-cuda/getrows.cu | Adds Q2_0 case to CUDA get_rows dispatch using dequantize_q2_0. |
| ggml/src/ggml-cuda/dequantize.cuh | Introduces dequantize_q2_0 for CUDA dequantization kernels. |
| ggml/src/ggml-cuda/convert.cu | Enables Q2_0 conversions to fp16/fp32 (contiguous + non-contiguous) via new dequantizer. |
| ggml/src/ggml-cuda/common.cuh | Adds CUDA type traits (qk/qr/qi) for GGML_TYPE_Q2_0. |
| ggml/src/ggml-cpu/arch-fallback.h | Adds arch-fallback alias for ggml_vec_dot_q2_0_q8_0_generic on x86. |

Comment on lines +728 to +731
// Q2_0: 128 elements with ONE scale, 2 bits per element (4 elements per byte)
// Q8_1: 32 elements per block with individual scales
// iqs selects which of the 4 chunks of 32 elements to process (0-3)
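The chunk arithmetic in these comments can be checked with a small C++ sketch. The helper names are hypothetical, and the sketch assumes that chunks are laid out contiguously and that `iqs` indexes them in order. It evaluates both the 128-element group stated in the comment and the 64-element group named in the PR title, since the number of 32-element chunks differs between the two.

```cpp
#include <cassert>

// Q8_1 block size, as stated in the diff comment.
constexpr int kQ8BlockSize = 32;

// Number of Q8_1 blocks needed to cover one Q2_0 group.
constexpr int chunks_per_group(int group_size) {
    return group_size / kQ8BlockSize;
}

// First element index covered by chunk iqs (assumes contiguous chunks).
constexpr int chunk_first_element(int iqs) {
    return iqs * kQ8BlockSize;
}

// Byte offset of chunk iqs in the packed data: 4 elements per byte,
// so each 32-element chunk occupies 8 bytes.
constexpr int chunk_byte_offset(int iqs) {
    return iqs * (kQ8BlockSize / 4);
}

static_assert(chunks_per_group(128) == 4, "128-element group: iqs in 0..3");
static_assert(chunks_per_group(64) == 2, "64-element group: iqs in 0..1");
```

Under these assumptions, a 128-element group matches the comment's `iqs` range of 0-3, while a 64-element group would only admit `iqs` values 0 and 1.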
