feat(tree): add WorkloadTree data model and builder#87
Conversation
Implements the WorkloadTree data model described in issue #85. Adds pkg/tree/ with WorkloadTree (raw extraction result), WorkloadView (display layer), Build() function, PodMatcher interface, and KartaPodMatcher (default implementation). Also adds two exported accessor methods to StructureSummary (ChildrenOf, ParentOf) used by the tree builder. Includes design documents at docs/design/workload-tree/design.md and docs/design/workload-policy-controller/design.md. Fixes #85 Signed-off-by: Roee Gil <roee.gil@run.ai>
Adds four integration-style tests that exercise Build() and KartaPodMatcher against real Karta definitions from docs/samples/: - PyTorchJob minimal (fixed replicas) and extended (elastic policy) - RayCluster minimal (single worker group) and extended (multi-instance) Each scenario has YAML fixtures for workload, pods, events, and expected output under pkg/tree/testdata/. Also extends the WorkloadTree model to surface data exposed by the quickstart controller example: - WorkloadTree.Events []corev1.Event (passed in alongside pods) - WorkloadStatus.Conditions []resource.Condition (raw conditions from the workload status, already extracted by GetStatus()) - ComponentNode.HasPodDefinition bool (mirrors comp.HasPodDefinition()) Build() gains an events parameter; existing callers pass nil. Refs #85 Signed-off-by: Roee Gil <roee.gil@run.ai>
Use the exact quickstart workload YAMLs as inputs so the tests verify the same objects the quickstart example runs against. Extends instanceExpectation with a Children field to handle the 3-level LWS hierarchy (group -> leader/worker). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
This file belongs to a separate initiative and was accidentally included. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
Reflect fields and Build() signature added after the initial draft: - WorkloadTree.Events - WorkloadStatus.Conditions (with expanded WorkloadStatus block) - ComponentNode.HasPodDefinition - events parameter in Build() - updated test spec count to 40 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
Replace direct resource package calls (GetStatus, GetScale, GetPodTemplateSpec) in steps 1-3 with a single tree.Build() call and depth-first traversal of WorkloadTree. Steps 4-5 (mutation) keep using ComponentFactory directly. Also fixes the run command — the quickstart has its own go.mod and must be run with `go run .` from its own directory, not from the repo root. Removes "controller" framing from comments and output strings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughIntroduces the ChangesWorkloadTree feature implementation
Sequence Diagram(s)sequenceDiagram
participant Caller
participant Build
participant StructureSummary
participant ComponentFactory
participant PodMatcher
participant populateTotals
Caller->>Build: Build(ctx, karta, factory, pods, matcher)
Build->>StructureSummary: NewStructureSummary(karta)
Build->>ComponentFactory: GetRootComponent().GetStatus()
ComponentFactory-->>Build: matched status phases
loop for each child component
Build->>ComponentFactory: claimPods via PodMatcher.MatchComponent
PodMatcher-->>Build: claimed pods
Build->>PodMatcher: MatchInstance + ExtractReplicaKey per pod
PodMatcher-->>Build: instance+replica keys
Build->>ComponentFactory: GetExtractedInstances, GetScale
ComponentFactory-->>Build: extracted instances + scale
Build->>Build: build InstanceNode(s), recurse children
end
Build->>populateTotals: populateTotals(tree)
populateTotals-->>Build: PodCount + Resources rolled up
Build-->>Caller: *WorkloadTree
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/design/workload-tree/design.md (1)
16-108: ⚡ Quick winAdd language identifiers to fenced code blocks.
Nine fenced code blocks lack language identifiers (directory tree at line 16 and schema definitions at lines 33, 40, 46, 54, 77, 86, 96, 104). Adding
textas the language specifier satisfies the linter and clarifies that these are language-agnostic schema representations.📝 Example fix for the directory tree block
-``` +```text pkg/tree/ tree.go WorkloadTree, WorkloadStatus, ComponentNode, InstanceNodeApply the same pattern to the eight schema blocks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/workload-tree/design.md` around lines 16 - 108, Add the missing fenced-block language identifiers by changing the nine unlabeled code fences in the document to use "text" so the linter recognizes them; specifically update the directory tree block and the schema blocks that describe WorkloadTree, WorkloadStatus, ComponentNode, InstanceNode, WorkloadView, ComponentView, InstanceView, and ResourceSummary so each opening triple-backtick is followed by "text". Ensure every unlabeled block in the file is updated consistently.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tree/pod_matcher.go`:
- Around line 21-24: Change the constructor NewKartaPodMatcher to return the
PodMatcher interface instead of the concrete *KartaPodMatcher: update its
signature to return PodMatcher, have it still instantiate &KartaPodMatcher{}
internally, and adjust any callers or tests to accept the PodMatcher interface
where they currently expect *KartaPodMatcher; keep the concrete type
KartaPodMatcher unchanged.
In `@pkg/tree/view_builder_test.go`:
- Around line 53-73: The test is missing assertions for aggregated memory;
update the test (the It block using NewWorkloadView and variable view) to assert
view.Resources.MemoryBytes equals the combined memory of both pods and also
assert the child component's memory (view.Children[0].Resources.MemoryBytes)
matches that same total; for these pods (each with resource.MustParse("8Gi"))
add expectations for MemoryBytes == 17179869184 (int64) or compute 16Gi in bytes
to validate correct aggregation.
---
Nitpick comments:
In `@docs/design/workload-tree/design.md`:
- Around line 16-108: Add the missing fenced-block language identifiers by
changing the nine unlabeled code fences in the document to use "text" so the
linter recognizes them; specifically update the directory tree block and the
schema blocks that describe WorkloadTree, WorkloadStatus, ComponentNode,
InstanceNode, WorkloadView, ComponentView, InstanceView, and ResourceSummary so
each opening triple-backtick is followed by "text". Ensure every unlabeled block
in the file is updated consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 6560adf6-e900-45c5-a296-4cfb2aa809df
📒 Files selected for processing (37)
docs/design/workload-tree/design.mddocs/examples/quickstart/README.mddocs/examples/quickstart/main.gopkg/instructions/summary.gopkg/tree/builder.gopkg/tree/builder_test.gopkg/tree/pod_matcher.gopkg/tree/pod_matcher_mock.gopkg/tree/pod_matcher_test.gopkg/tree/suite_test.gopkg/tree/testdata/jobset/events.yamlpkg/tree/testdata/jobset/expected.yamlpkg/tree/testdata/jobset/pods.yamlpkg/tree/testdata/lws/events.yamlpkg/tree/testdata/lws/expected.yamlpkg/tree/testdata/lws/pods.yamlpkg/tree/testdata/pytorch/events_extended.yamlpkg/tree/testdata/pytorch/events_minimal.yamlpkg/tree/testdata/pytorch/expected_extended.yamlpkg/tree/testdata/pytorch/expected_minimal.yamlpkg/tree/testdata/pytorch/pods_extended.yamlpkg/tree/testdata/pytorch/pods_minimal.yamlpkg/tree/testdata/pytorch/workload_extended.yamlpkg/tree/testdata/pytorch/workload_minimal.yamlpkg/tree/testdata/raycluster/events_extended.yamlpkg/tree/testdata/raycluster/events_minimal.yamlpkg/tree/testdata/raycluster/expected_extended.yamlpkg/tree/testdata/raycluster/expected_minimal.yamlpkg/tree/testdata/raycluster/pods_extended.yamlpkg/tree/testdata/raycluster/pods_minimal.yamlpkg/tree/testdata/raycluster/workload_extended.yamlpkg/tree/testdata/raycluster/workload_minimal.yamlpkg/tree/tree.gopkg/tree/view.gopkg/tree/view_builder.gopkg/tree/view_builder_test.gopkg/tree/yaml_examples_test.go
- Return PodMatcher interface from NewKartaPodMatcher constructor - Add missing MemoryBytes assertion in view_builder_test - Add text language tag to bare fenced code blocks in design.md - Revert "In production" to "In a real controller" in quickstart Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
Drop the WorkloadView display layer (view.go, view_builder.go and its test) and remove the Events and Conditions fields from WorkloadTree. The library now produces only the raw component hierarchy; display concerns such as resource aggregation, ready counts, and event/condition surfacing belong to consumers like the CLI. Build() no longer takes an events parameter. Tests, test fixtures, and the workload-tree and CLI design docs are updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/design/cli/high-level-design.md (1)
283-283:⚠️ Potential issue | 🔴 CriticalFix Build function signature in documentation to match actual implementation.
Line 283 documents the signature as
func Build(ctx context.Context, ri *ResourceInterface, workload client.Object, pods []corev1.Pod, matcher PodMatcher), but the actual implementation inpkg/tree/builder.gousesfunc Build(ctx context.Context, karta *v1alpha1.Karta, factory *resource.ComponentFactory, pods []corev1.Pod, matcher PodMatcher). The documented parameters are incorrect and will mislead users. Update parameter names and types to match the actual function.Additionally, remove bold formatting from lines 61-67 (Semantic role names, Pods grouped by their role, Normalized phase, Desired vs current replicas, Collapsed plumbing) and use inline code or plain text instead, as shown in the refactored display layer section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/cli/high-level-design.md` at line 283, Update the Build function signature in the documentation at line 283 to match the actual implementation in pkg/tree/builder.go. Replace the incorrect parameters ri *ResourceInterface and workload client.Object with the correct ones: karta *v1alpha1.Karta and factory *resource.ComponentFactory. Additionally, remove the bold formatting from lines 61-67 (the list items covering Semantic role names, Pods grouped by their role, Normalized phase, Desired vs current replicas, and Collapsed plumbing) and replace it with inline code formatting or plain text to maintain consistent styling with the refactored display layer section.
🧹 Nitpick comments (1)
docs/design/cli/high-level-design.md (1)
61-67: ⚡ Quick winRemove bold emphasis from key concept names.
Lines 61–67 use bold (
**Semantic role names**,**Pods grouped by their role**,**Normalized phase**, etc.) to highlight concepts for emphasis. This violates the Markdown style guideline. The refactored display layer section (lines 309–317) correctly avoids bold emphasis, using inline code and plain text instead.Remove bold formatting from all concept labels in this subsection.
Proposed style fix for lines 61–67
- **Semantic role names** - Frontend / PrefillWorker / DecodeWorker come from the label mapping declared in the Karta YAML. - **Pods grouped by their role, not by their owner** - pods playing different roles are shown as separate groups even when they share the same owning resource. - **Normalized phase** - `Running` has the same meaning across PyTorchJob, RayCluster, Dynamo, etc. - **Desired vs current replicas** - `3/4 replicas` shows the scale target vs. the actual pod count, per role. - **Collapsed plumbing** - intermediate objects (DynamoComponentDeployment, LeaderWorkerSet) are optionally hidden to reduce noise while keeping the semantic hierarchy visible. + Semantic role names - Frontend / PrefillWorker / DecodeWorker come from the label mapping declared in the Karta YAML. + Pods grouped by their role, not by their owner - pods playing different roles are shown as separate groups even when they share the same owning resource. + Normalized phase - `Running` has the same meaning across PyTorchJob, RayCluster, Dynamo, etc. + Desired vs current replicas - `3/4 replicas` shows the scale target vs. the actual pod count, per role. + Collapsed plumbing - intermediate objects (DynamoComponentDeployment, LeaderWorkerSet) are optionally hidden to reduce noise while keeping the semantic hierarchy visible.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/cli/high-level-design.md` around lines 61 - 67, In the high-level design section of the documentation, remove the bold emphasis (the `**` markers) from all concept labels: "Semantic role names", "Pods grouped by their role", "Normalized phase", "Desired vs current replicas", and "Collapsed plumbing". Replace them with plain text format, following the style pattern used in the refactored display layer section which uses inline code and plain text instead of bold for consistency with the Markdown style guidelines.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@docs/design/cli/high-level-design.md`:
- Line 283: Update the Build function signature in the documentation at line 283
to match the actual implementation in pkg/tree/builder.go. Replace the incorrect
parameters ri *ResourceInterface and workload client.Object with the correct
ones: karta *v1alpha1.Karta and factory *resource.ComponentFactory.
Additionally, remove the bold formatting from lines 61-67 (the list items
covering Semantic role names, Pods grouped by their role, Normalized phase,
Desired vs current replicas, and Collapsed plumbing) and replace it with inline
code formatting or plain text to maintain consistent styling with the refactored
display layer section.
---
Nitpick comments:
In `@docs/design/cli/high-level-design.md`:
- Around line 61-67: In the high-level design section of the documentation,
remove the bold emphasis (the `**` markers) from all concept labels: "Semantic
role names", "Pods grouped by their role", "Normalized phase", "Desired vs
current replicas", and "Collapsed plumbing". Replace them with plain text
format, following the style pattern used in the refactored display layer section
which uses inline code and plain text instead of bold for consistency with the
Markdown style guidelines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 144a31b4-e00c-46f3-bb0a-9e52dec27a51
📒 Files selected for processing (14)
docs/design/cli/high-level-design.mddocs/design/workload-tree/design.mddocs/examples/quickstart/README.mddocs/examples/quickstart/main.gopkg/tree/builder.gopkg/tree/builder_test.gopkg/tree/testdata/jobset/expected.yamlpkg/tree/testdata/lws/expected.yamlpkg/tree/testdata/pytorch/expected_extended.yamlpkg/tree/testdata/pytorch/expected_minimal.yamlpkg/tree/testdata/raycluster/expected_extended.yamlpkg/tree/testdata/raycluster/expected_minimal.yamlpkg/tree/tree.gopkg/tree/yaml_examples_test.go
💤 Files with no reviewable changes (7)
- pkg/tree/testdata/jobset/expected.yaml
- pkg/tree/testdata/raycluster/expected_minimal.yaml
- pkg/tree/testdata/pytorch/expected_minimal.yaml
- pkg/tree/testdata/pytorch/expected_extended.yaml
- pkg/tree/testdata/raycluster/expected_extended.yaml
- pkg/tree/testdata/lws/expected.yaml
- pkg/tree/tree.go
✅ Files skipped from review due to trivial changes (2)
- docs/design/workload-tree/design.md
- docs/examples/quickstart/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/examples/quickstart/main.go
- pkg/tree/builder_test.go
Add PodCount and Resources fields to WorkloadTree, ComponentNode, and InstanceNode. Each carries the cumulative desired totals for its subtree: pod count plus CPU (millicores), memory (bytes), and GPUs. Totals are computed from the desired spec - each pod-bearing component's replica count times its extracted pod template requests. CPU and memory come from container requests; GPUs come from the nvidia.com/gpu request, falling back to its limit when only the limit is set. GPUs is a float so fractional allocations (MIG slices, GPU sharing) are preserved. Add unit tests for the aggregation and extend the YAML example fixtures to assert the totals at every node against real Karta definitions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tree/tree.go (1)
8-8:⚠️ Potential issue | 🔴 CriticalImport
k8s.io/apimachinery/pkg/runtime/schemaand useschema.GroupVersionKindinstead ofmetav1.GroupVersionKind.
GroupVersionKindis defined ink8s.io/apimachinery/pkg/runtime/schema, notpkg/apis/meta/v1. The current code at line 51 will fail to compile.Fix
import ( corev1 "k8s.io/api/core/v1" - metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/runtime/schema" "github.com/run-ai/karta/pkg/resource" )type ComponentNode struct { Name string // Kind is nil for logical grouping components (those without a Kind in the definition). - Kind *metav1.GroupVersionKind + Kind *schema.GroupVersionKind HasPodDefinition bool PodCount int Resources Resources Instances []InstanceNode }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tree/tree.go` at line 8, The code is using metav1.GroupVersionKind which does not exist in that package. Add a new import statement for k8s.io/apimachinery/pkg/runtime/schema (aliased as schema), then locate where GroupVersionKind is used in the code (referenced at line 51) and change the reference from metav1.GroupVersionKind to schema.GroupVersionKind to use the correct package where this type is actually defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/tree/tree.go`:
- Line 8: The code is using metav1.GroupVersionKind which does not exist in that
package. Add a new import statement for k8s.io/apimachinery/pkg/runtime/schema
(aliased as schema), then locate where GroupVersionKind is used in the code
(referenced at line 51) and change the reference from metav1.GroupVersionKind to
schema.GroupVersionKind to use the correct package where this type is actually
defined.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 9b2eb13f-7ff3-4d1b-acd6-378fbc610032
📒 Files selected for processing (12)
docs/design/workload-tree/design.mdpkg/tree/builder.gopkg/tree/testdata/jobset/expected.yamlpkg/tree/testdata/lws/expected.yamlpkg/tree/testdata/pytorch/expected_extended.yamlpkg/tree/testdata/pytorch/expected_minimal.yamlpkg/tree/testdata/raycluster/expected_extended.yamlpkg/tree/testdata/raycluster/expected_minimal.yamlpkg/tree/totals.gopkg/tree/totals_test.gopkg/tree/tree.gopkg/tree/yaml_examples_test.go
✅ Files skipped from review due to trivial changes (4)
- pkg/tree/testdata/jobset/expected.yaml
- pkg/tree/testdata/lws/expected.yaml
- docs/design/workload-tree/design.md
- pkg/tree/testdata/pytorch/expected_minimal.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/tree/testdata/raycluster/expected_minimal.yaml
- pkg/tree/testdata/pytorch/expected_extended.yaml
- pkg/tree/testdata/raycluster/expected_extended.yaml
- pkg/tree/builder.go
Remove docs/design/workload-tree/design.md and revert the CLI high-level-design.md to its prior form, then document the cumulative PodCount and Resources subtree totals (CPU, memory, GPUs) on WorkloadTree, ComponentNode, and InstanceNode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/design/cli/high-level-design.md (1)
301-302:⚠️ Potential issue | 🟡 MinorUpdate the Build() function signature to match the actual implementation. The second and third parameters are incorrect: the documented
ri *ResourceInterface, workload client.Objectshould bekarta *v1alpha1.Karta, factory *resource.ComponentFactory.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/cli/high-level-design.md` around lines 301 - 302, Update the Build() function signature in the design documentation to match the actual implementation. Replace the second and third parameters from `ri *ResourceInterface, workload client.Object` with `karta *v1alpha1.Karta, factory *resource.ComponentFactory` to accurately reflect what the function accepts in the real codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@docs/design/cli/high-level-design.md`:
- Around line 301-302: Update the Build() function signature in the design
documentation to match the actual implementation. Replace the second and third
parameters from `ri *ResourceInterface, workload client.Object` with `karta
*v1alpha1.Karta, factory *resource.ComponentFactory` to accurately reflect what
the function accepts in the real codebase.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 9cdfe695-6ecc-4bb0-bd2e-172dc5320b01
📒 Files selected for processing (1)
docs/design/cli/high-level-design.md
Drop live pods from the WorkloadTree output and replace the single
PodCount/Resources totals with min/max ranges driven by the Scale object.
Each pod-bearing instance contributes its min and max replica counts
(Scale.MinReplicas/MaxReplicas, falling back to Scale.Replicas, and 0 when
no scale is defined) times its per-pod requests. Totals are exposed as
PodCount{Min,Max} and ResourceRange{Min,Max}; fixed-replica components
collapse to equal bounds. Pods remain a Build input for instance and
replica-key discovery but are no longer stored on the tree.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
tree.Build no longer takes a pod list or PodMatcher. The builder derives the desired structure entirely from the workload spec: each component expands into instances from its InstanceIdPath, and a replica-selector component collapses to a single instance. Per-replica and per-pod data is layered on by live consumers via the pkg/resource pod querier, not the tree. Removes the PodMatcher abstraction, its mock, and the pod test fixtures; updates the quickstart example, the CLI high-level design, and the LWS golden file (which now reflects a single collapsed group node). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
Apply Ron Lev's review on PR #87: - Build now takes only the ComponentFactory and derives the Karta from it via a new ComponentFactory.GetKarta(); rewrite the Build doc comment. - Resources holds k8s resource.Quantity (CPU/Memory/GPU) instead of raw int64/float64, so units round-trip exactly and fractional GPU is preserved. - WorkloadStatus is now optional (*WorkloadStatus); nil means no status was evaluated, e.g. an offline pre-submission tree. - Totals derive pod-bearing from leaf instances rather than HasPodDefinition, and set ResourcesIncomplete when a node has desired pods whose per-pod resources cannot be resolved, instead of silently reporting zero. - Rename effectiveReplicas to replicaBounds. - Document why every component (including single-instance) carries an InstanceNode and why child components are built per instance. - Assert exact pod counts and replica bounds in the builder tests; regenerate the golden fixtures for the Quantity serialization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
Verify fractional nvidia.com/gpu requests are preserved exactly through the totals pipeline: - Unit tests asserting exact Quantity values when scaling a fractional GPU by the replica count, summing fractions across components, ranging over the scale envelope, and reading a fractional GPU from the limit. - A PyTorchJob golden fixture (master 0.25, worker 0.5 x 2) whose root total is 1.25 GPU, exercising fraction plus integer aggregation and serialization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
Address remaining review feedback on the workload tree builder: - Add a childComponents helper that resolves a parent's child names to []*resource.Component, so buildComponentNodes and buildInstanceNodes work with components instead of raw name strings. Keeps the name resolution in the tree package so pkg/instructions gains no dependency on pkg/resource. - Drop the unused karta parameter threaded through the build helpers. - Trim the buildInstanceNodes doc comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>
ronlv10
left a comment
There was a problem hiding this comment.
Overall looks good to me - I think that the solution of the yamls is an overkill and would made us to update the yamls for any change and on the other hand is hard to verify the output (need to look on 3 files in order to verify
I suggest
- convert them to test if there are not cover - e.g, pytorch and dyanmo and check the tree
- maybe create a diagram/image and add to the doc (you can open an issue ) so we would be able to visuals in our docs how the tree looks like - I think it could bring value
There was a problem hiding this comment.
I think you changed the deisgn doc too much here
it's fine to correction but for example you deleted the pod matcher - which still relevant to the CLI but we decided to split it differently
in the past we didn't really update design docs(before ai and pr that is a design doc)- I think it's fine to make correction after different decision but not to change entirely - for example it's fine to update what is workload tree post-discussion , that we added resource , etc
Isan-Rivkin
left a comment
There was a problem hiding this comment.
Minor comments, overall like but Im still in partial review will continue tomorrow.
Wanted to publish something not to delay the minor stuff.
Regarding the testing of expected yamls, didnt review as discussed
| instanceKeys = []string{""} | ||
| } | ||
| keys := append([]string(nil), instanceKeys...) | ||
| sort.Strings(keys) |
There was a problem hiding this comment.
taken from Map, using sort keep the sorted list aligned, instead of getting each call a different list arrangement
The tree data model now carries only cumulative pod counts, not resource requests. Resources, ResourceRange, and ResourcesIncomplete are removed from WorkloadTree, ComponentNode, and InstanceNode, along with the per-pod resource extraction and aggregation logic in totals.go. Per-pod resource usage is left to consumers that fetch live pods, matching the pod-matching step described in the CLI design. Child components are now resolved via ComponentFactory.GetChildComponentsOf, built from the structure's OwnerRef relationships, removing the builder's dependency on instructions.StructureSummary. Each instance's subtree is built once and cloned per instance so per-node pod counts stay independent. Removes the YAML golden-file tests and testdata that asserted resource totals. Signed-off-by: Roee Gil <roee.gil@run.ai>
152402b to
d0d091f
Compare
ChildrenOf and ParentOf had zero callers after the tree builder switched to ComponentFactory.GetChildComponentsOf. The backing childrenMap and parentMap fields are retained; both are still used internally. Signed-off-by: Roee Gil <roee.gil@run.ai>
A pod-bearing leaf with no scale in the spec now contributes one desired pod instead of zero, matching Kubernetes' default replica count for a workload that declares none. replicaBounds returns 1,1 for a nil scale; the non-nil branch is unchanged. Signed-off-by: Roee Gil <roee.gil@run.ai>
Adds a Milvus fixture (status-only root with querynode, datanode, and a scaleless proxy StatefulSet child) and builder tests covering component discovery, per-component pod counts, the single-pod default for the scaleless child, and root rollup. Signed-off-by: Roee Gil <roee.gil@run.ai>
|
|
||
| // GetChildComponentsOf retrieves the direct child components of the named parent, | ||
| // resolved from the structure's OwnerRef relationships. | ||
| func (f *ComponentFactory) GetChildComponentsOf(parentName string) ([]*Component, error) { |
There was a problem hiding this comment.
why you put it here? I think it should be in the tree..
There was a problem hiding this comment.
The factory has the karta and the object. The tree comes afterwards
| baseChildren, err := buildComponentNodes(ctx, children, factory) | ||
| if err != nil { |
There was a problem hiding this comment.
baseChildren -> instanceChildren
There was a problem hiding this comment.
But it returns a component child list
| // ranged over the scale envelope. A pod-bearing leaf counts its own replicas | ||
| // (defaulting to one when it declares no scale); a non-pod leaf counts nothing; | ||
| // a non-leaf rolls up its children. | ||
| func instancePodCount(children []ComponentNode, scale *resource.Scale, hasPodDefinition bool) PodCount { |
There was a problem hiding this comment.
I think you're missing here a case where instance has pods and also children
not sure if it's a valid case but not sure it doesn't and we miss it here
There was a problem hiding this comment.
var total PodCount
if hasPodDefinition {
minReplicas, maxReplicas := replicaBounds(scale)
total = PodCount{Min: minReplicas, Max: maxReplicas}
}
for _, child := range children {
total.Min += child.PodCount.Min
total.Max += child.PodCount.Max
}
return total
probably something like this
There was a problem hiding this comment.
This logic breaks in the case where the instance has a replica and also has children. We ignore the instance and replica count. This sums it up.
There was a problem hiding this comment.
TBH I think that we should have a builder that in the test you override what's need for the test and then when reading the test is clear and you don't need to jump to this file - but that's not the first case so I get it if it too much now
| *~ | ||
| .vscode/ | ||
| .idea | ||
| /docs/examples/quickstart/quickstart |
There was a problem hiding this comment.
Add to ignore: this is the binary created when running go build
| // replicaBounds returns the min and max desired replica counts for a scale. | ||
| // A nil scale defaults to one replica: assumes that a pod-bearing component that declares no | ||
| // scale still runs a single pod. | ||
| func replicaBounds(scale *resource.Scale) (minReplicas, maxReplicas int) { |
There was a problem hiding this comment.
Min-only scale yields Max < Min, for example PodCount{Min:3, Max:0}.
- suggest to add a test with min only
- define the correct behavior, indicate it's optional with a pointer or something
| // its components, their instances, normalized status, and the desired pod counts | ||
| // ranged over the scale envelope. The factory carries the Karta definition and | ||
| // the workload object it was built from. | ||
| func Build( |
There was a problem hiding this comment.
The build is getting Stack Overflow in case of a circular ownerRef.
We should handle this instead of crashing because this is user-input that can crash the cli.
- Suggest to add a test case for this
Note: the validator handles this error cleanly NewKartaValidator().Validate() returns error: root component cannot have owner ref
apiVersion: run.ai/v1alpha1
kind: Karta
metadata:
name: cyclic
spec:
structureDefinition:
rootComponent:
name: root
ownerRef: a
kind:
group: example.com
version: v1
kind: Example
childComponents:
- name: a
ownerRef: root
kind:
group: example.com
version: v1
kind: ExampleChild| Name: component.Name(), | ||
| Kind: component.Kind(), | ||
| HasPodDefinition: component.HasPodDefinition(), | ||
| PodCount: sumNodes(instances), |
There was a problem hiding this comment.
Question, for a pod-bearing root with no children (batch-job.yaml, nimservice.yaml), the tree is empty and the workload's desired pod count is unrepresentable in any field. Intended?
For example below, tree pod count and children is 0.
karta:
apiVersion: run.ai/v1alpha1
kind: Karta
metadata:
name: batch-v1-job
spec:
structureDefinition:
rootComponent:
name: job
kind:
group: batch
version: v1
kind: Job
scaleDefinition:
replicasPath: .spec.parallelism // 1
specDefinition:
podTemplateSpecPath: .spec.templateand workload
apiVersion: batch/v1
kind: Job
metadata:
name: pi
namespace: default
spec:
parallelism: 5
template:
spec:
containers:
- name: pi
image: perl:5.34
What does this PR do?
Implements the
WorkloadTreedata model and builder for the Karta library.Adds
pkg/tree/with:WorkloadTree— raw extraction result (component hierarchy, matched pods, scale, status)WorkloadView— display layer with aggregated resource sums, ready counts, node listsBuild()— assembles aWorkloadTreefrom a Karta definition,ComponentFactory, and pod listPodMatcherinterface — decouples tree structure from pod matching strategyKartaPodMatcher— default implementation usingPodSelectordefinitionsNewWorkloadView()— computes display-layer values from aWorkloadTreeAlso adds two exported accessor methods to
StructureSummary(ChildrenOf,ParentOf) used by the tree builder.Includes design document:
docs/design/workload-tree/design.mdRelated issue(s)
Fixes #85
Checklist
git commit -s)make check)Summary by CodeRabbit
Release Notes
New Features
Documentation