feat: optional InferenceService annotation to skip model-ready nodeSelector (Karpenter) #1
…-node-selector set
Self-contained workflow (workflow_dispatch). Delete this file before PR to sgl-project/ome. Does not modify upstream dev-images/release workflows. Made-with: Cursor
Can we use a version in semver format so that we can also push up the chart with the changes that I requested in the shared-infra PR? I will note them here just to be safe. :)
I would like the Helm chart updated so that it fixes the ConfigMap to provide the types mapping, and so that the chart is pulled from our GHCR instead of from upstream.
I would also like @calebwilliams-vsco to review this from a code perspective, to check whether it aligns with how the flow was meant to work in the code, or just to get some more eyes on this.
        "inferenceService", inferenceService.Namespace+"/"+inferenceService.Name,
        "benchmarkJob", benchmarkJob.Name,
        "baseModel", baseModelMeta.Name)
    } else {
Can we have a comment here explaining what happens when this annotation doesn't skip the model-ready node selector?
…sco/ome into add_annotation_for_autoscaling
Problem
Engine/decoder pods get a required `nodeSelector` of the form `models.ome.io/clusterbasemodel.<name>=Ready`. The model-agent applies that label only after weights are on disk. Autoscalers such as Karpenter may not provision nodes when the pod requires labels that no NodePool advertises yet, causing a scheduling deadlock on cold GPU pools.

Additionally, GHA has been configured to build the ome-manager image. Pipeline results for a build can be seen here: https://github.com/vsco/ome/actions/runs/23865820885
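The deadlock described above can be illustrated with a minimal sketch of nodeSelector matching. This is not code from the PR: `selectorMatches`, the model name `example-model`, and the node label values are hypothetical, and the real Kubernetes scheduler and Karpenter evaluate far more than plain label equality.

```go
package main

import "fmt"

// selectorMatches mirrors basic nodeSelector semantics: every key/value
// pair in the selector must be present on the node's labels.
// Hypothetical helper for illustration only.
func selectorMatches(selector, nodeLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Required selector of the form described in the Problem section.
	// "example-model" is an assumed model name.
	selector := map[string]string{
		"models.ome.io/clusterbasemodel.example-model": "Ready",
	}

	// Labels a freshly provisioned node might carry (assumed values).
	// The model-ready label is absent because no weights are on disk yet.
	freshNode := map[string]string{"karpenter.sh/nodepool": "gpu"}
	fmt.Println("fresh node matches:", selectorMatches(selector, freshNode))

	// Once the model-agent has downloaded the weights, it adds the label.
	readyNode := map[string]string{
		"karpenter.sh/nodepool":                        "gpu",
		"models.ome.io/clusterbasemodel.example-model": "Ready",
	}
	fmt.Println("labeled node matches:", selectorMatches(selector, readyNode))
}
```

On a cold pool no node carries the label, so the pod stays pending; and since the label only appears after a node exists and has the weights, an autoscaler that refuses to create a non-matching node never breaks the cycle.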
Solution
- Opt-in annotation on the InferenceService: `ome.io/skip-model-ready-node-selector: "true"` (default behavior unchanged when absent).
- When set, the pod spec omits the model-ready `nodeSelector`; accelerator/runtime merged selectors still apply.

Implementation
- `constants.SkipModelReadyNodeSelectorAnnotationKey`
- `IsSkipModelReadyNodeSelector()` + gate in `UpdatePodSpecNodeSelector`.
- `charts/ome-resources/README.md` (Autoscaling section).

Makefile
Optional `BASE_IMAGE=ubuntu:24.04` for `linux/amd64` builds on Apple Silicon (OL10 + QEMU x86-64-v3 issue).

Note: An earlier PR was mistakenly opened against `sgl-project/ome`; close that one if you only want review on this fork.

Made with Cursor
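For reviewers, the annotation gate listed under Implementation could look roughly like the sketch below. Only the annotation key and the `models.ome.io/clusterbasemodel.<name>=Ready` label form come from this PR description; the function signatures, the strict `"true"` comparison, `modelReadyLabelKey`, `updateNodeSelector`, and the example model name are assumptions made for illustration and may differ from the actual code in this branch.

```go
package main

import "fmt"

// Annotation key as given in the PR description.
const SkipModelReadyNodeSelectorAnnotationKey = "ome.io/skip-model-ready-node-selector"

// IsSkipModelReadyNodeSelector reports whether the opt-in annotation is set.
// Assumption: only the exact string "true" enables the skip.
func IsSkipModelReadyNodeSelector(annotations map[string]string) bool {
	return annotations[SkipModelReadyNodeSelectorAnnotationKey] == "true"
}

// modelReadyLabelKey builds the label key for a cluster base model.
// Hypothetical helper; the prefix follows the form quoted in the Problem section.
func modelReadyLabelKey(modelName string) string {
	return "models.ome.io/clusterbasemodel." + modelName
}

// updateNodeSelector is a simplified stand-in for the gate in
// UpdatePodSpecNodeSelector: it always keeps the merged accelerator/runtime
// selectors and adds the model-ready selector unless the annotation opts out.
func updateNodeSelector(annotations, merged map[string]string, modelName string) map[string]string {
	out := make(map[string]string, len(merged)+1)
	for k, v := range merged {
		out[k] = v
	}
	if !IsSkipModelReadyNodeSelector(annotations) {
		out[modelReadyLabelKey(modelName)] = "Ready"
	}
	return out
}

func main() {
	// Assumed accelerator/runtime selector for the example.
	merged := map[string]string{"node.kubernetes.io/instance-type": "p4d.24xlarge"}

	// No annotation: default behavior, model-ready selector is added.
	byDefault := updateNodeSelector(nil, merged, "example-model")
	fmt.Println("default selector count:", len(byDefault))

	// Annotation set: only the merged selectors remain.
	optIn := map[string]string{SkipModelReadyNodeSelectorAnnotationKey: "true"}
	skipped := updateNodeSelector(optIn, merged, "example-model")
	fmt.Println("opt-in selector count:", len(skipped))
}
```

Copying the merged map rather than mutating it keeps the sketch free of side effects; whether the real `UpdatePodSpecNodeSelector` mutates the pod spec in place is not stated in this description.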