Skip to content

feat: optional InferenceService annotation to skip model-ready nodeSelector (Karpenter)#1

Open
ossianpe wants to merge 22 commits intomainfrom
add_annotation_for_autoscaling
Open

feat: optional InferenceService annotation to skip model-ready nodeSelector (Karpenter)#1
ossianpe wants to merge 22 commits intomainfrom
add_annotation_for_autoscaling

Conversation

@ossianpe
Copy link
Copy Markdown

@ossianpe ossianpe commented Apr 1, 2026

Problem

Engine/decoder pods get a required nodeSelector of the form models.ome.io/clusterbasemodel.<name>=Ready. The model-agent applies that label only after weights are on disk. Autoscalers such as Karpenter may not provision nodes when the pod requires labels that no NodePool advertises yet, causing a scheduling deadlock on cold GPU pools.

Additionally, GHA has been configured to build the ome-manager image. Pipeline results for a build can be seen here https://github.com/vsco/ome/actions/runs/23865820885

Solution

  • New annotation on InferenceService: ome.io/skip-model-ready-node-selector: "true" (default behavior unchanged when absent).
  • When set, omit the model-ready nodeSelector; accelerator/runtime merged selectors still apply.
  • BenchmarkJob: if the referenced InferenceService has the annotation, the benchmark pod skips the same selector for consistency.

Implementation

  • constants.SkipModelReadyNodeSelectorAnnotationKey
  • IsSkipModelReadyNodeSelector() + gate in UpdatePodSpecNodeSelector.
  • Unit tests, charts/ome-resources/README.md (Autoscaling section).

Makefile

Optional BASE_IMAGE=ubuntu:24.04 for linux/amd64 builds on Apple Silicon (OL10 + QEMU x86-64-v3 issue).


Note: An earlier PR was mistakenly opened against sgl-project/ome; close that one if you only want review on this fork.

Made with Cursor

Copy link
Copy Markdown

@JDavis10213 JDavis10213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the version in semver format so that we can also push up the chart with the changes that I requested in the shared-infra PR. I will note them here just to be safe. :)

I would like to have the helm chart updated so that it fixes the configmap to provide the types mapping and to use our ghcr to pull the chart instead of pulling from upstream.

I would also like to add that I would like to have @calebwilliams-vsco review this from a code perspective. See if it aligns with how the flow was meant to be in the code or just get some more eyes on this.

@github-actions github-actions bot added the crd label Apr 2, 2026
"inferenceService", inferenceService.Namespace+"/"+inferenceService.Name,
"benchmarkJob", benchmarkJob.Name,
"baseModel", baseModelMeta.Name)
} else {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a comment here as to what's happening when this annotation doesn't skip the model ready node selector?

Copy link
Copy Markdown

@calebwilliams-vsco calebwilliams-vsco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants