feat(tree): add WorkloadTree data model and builder by rogirun · Pull Request #87 · run-ai/karta

rogirun · 2026-06-10T07:10:00Z

What does this PR do?

Implements the WorkloadTree data model and builder for the Karta library.

Adds pkg/tree/ with:

WorkloadTree — raw extraction result (component hierarchy, matched pods, scale, status)
WorkloadView — display layer with aggregated resource sums, ready counts, node lists
Build() — assembles a WorkloadTree from a Karta definition, ComponentFactory, and pod list
PodMatcher interface — decouples tree structure from pod matching strategy
KartaPodMatcher — default implementation using PodSelector definitions
NewWorkloadView() — computes display-layer values from a WorkloadTree

Also adds two exported accessor methods to StructureSummary (ChildrenOf, ParentOf) used by the tree builder.

Includes design document:

docs/design/workload-tree/design.md

Related issue(s)

Fixes #85

Checklist

All commits are signed off with DCO (git commit -s)
New/modified files have SPDX license and copyright headers
Documentation updated (if applicable)
Tests pass (make check)
No proprietary or internal information included

Summary by CodeRabbit

Release Notes

New Features
- Added WorkloadTree API to inspect workload hierarchies with pod counts and resource aggregation across components and instances.
- Introduced customizable pod-matching interface for flexible workload introspection.
- Extended component structure queries with new parent/child relationship lookups.
Documentation
- Updated quickstart guide with clearer execution instructions and improved example walkthroughs.
- Enhanced design documentation for workload inspection schema.

Implements the WorkloadTree data model described in issue #85. Adds pkg/tree/ with WorkloadTree (raw extraction result), WorkloadView (display layer), Build() function, PodMatcher interface, and KartaPodMatcher (default implementation). Also adds two exported accessor methods to StructureSummary (ChildrenOf, ParentOf) used by the tree builder. Includes design documents at docs/design/workload-tree/design.md and docs/design/workload-policy-controller/design.md. Fixes #85 Signed-off-by: Roee Gil <roee.gil@run.ai>

Adds four integration-style tests that exercise Build() and KartaPodMatcher against real Karta definitions from docs/samples/: - PyTorchJob minimal (fixed replicas) and extended (elastic policy) - RayCluster minimal (single worker group) and extended (multi-instance) Each scenario has YAML fixtures for workload, pods, events, and expected output under pkg/tree/testdata/. Also extends the WorkloadTree model to surface data exposed by the quickstart controller example: - WorkloadTree.Events []corev1.Event (passed in alongside pods) - WorkloadStatus.Conditions []resource.Condition (raw conditions from the workload status, already extracted by GetStatus()) - ComponentNode.HasPodDefinition bool (mirrors comp.HasPodDefinition()) Build() gains an events parameter; existing callers pass nil. Refs #85 Signed-off-by: Roee Gil <roee.gil@run.ai>

Use the exact quickstart workload YAMLs as inputs so the tests verify the same objects the quickstart example runs against. Extends instanceExpectation with a Children field to handle the 3-level LWS hierarchy (group -> leader/worker). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

This file belongs to a separate initiative and was accidentally included. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

Reflect fields and Build() signature added after the initial draft: - WorkloadTree.Events - WorkloadStatus.Conditions (with expanded WorkloadStatus block) - ComponentNode.HasPodDefinition - events parameter in Build() - updated test spec count to 40 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

Replace direct resource package calls (GetStatus, GetScale, GetPodTemplateSpec) in steps 1-3 with a single tree.Build() call and depth-first traversal of WorkloadTree. Steps 4-5 (mutation) keep using ComponentFactory directly. Also fixes the run command — the quickstart has its own go.mod and must be run with `go run .` from its own directory, not from the repo root. Removes "controller" framing from comments and output strings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

coderabbitai · 2026-06-10T13:52:30Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Introduces the pkg/tree package implementing a WorkloadTree data model with Build function, PodMatcher interface, KartaPodMatcher default implementation, and subtree pod-count/resource aggregation. Refactors the quickstart example to traverse the tree instead of querying per-component factory methods. Adds comprehensive Ginkgo tests and YAML fixtures for PyTorchJob, RayCluster, LeaderWorkerSet, and JobSet workloads. Updates CLI design documentation.

Changes

WorkloadTree feature implementation

Layer / File(s)	Summary
WorkloadTree data model and StructureSummary accessors `pkg/tree/tree.go`, `pkg/instructions/summary.go`	Defines `WorkloadTree`, `WorkloadStatus`, `ComponentNode`, `InstanceNode`, and `Resources` structs with cumulative subtree `PodCount`/`Resources` fields; adds `ChildrenOf` and `ParentOf` accessor methods to `StructureSummary`.
PodMatcher interface, KartaPodMatcher, and tests `pkg/tree/builder.go`, `pkg/tree/pod_matcher.go`, `pkg/tree/pod_matcher_mock.go`, `pkg/tree/pod_matcher_test.go`, `pkg/tree/suite_test.go`	Defines `PodMatcher` interface for component membership, instance key derivation, and replica-group key extraction; implements `KartaPodMatcher` delegating to `PodQuerier`; provides GoMock mock and Ginkgo tests for all three methods.
Build function and recursive component/instance orchestration `pkg/tree/builder.go`	Implements `Build`, which creates a `StructureSummary`, loads root matched status for phases, recursively claims pods per component via `PodMatcher`, groups claimed pods by instance+replica keys, ensures extracted instances exist even with no matched pods, and produces deterministically ordered instance nodes with nested child components.
Builder unit tests `pkg/tree/builder_test.go`	Covers phase extraction, per-component pod assignment with sibling isolation, multi-instance routing, replicated instance groups, empty-pod component nodes with `ExtractedInstance`, and matcher error propagation using `stubPodMatcher` and `errorMatcher` test doubles.
Subtree pod count and resource aggregation `pkg/tree/totals.go`, `pkg/tree/totals_test.go`	Implements `populateTotals` rolling up CPU/memory/GPU resource totals and pod counts from pod templates through instances to tree root; applies replica scaling per instance with GPU fallback from limits when requests are zero.
Fixture-backed end-to-end workload validation `pkg/tree/yaml_examples_test.go`, `pkg/tree/testdata/pytorch/`, `pkg/tree/testdata/raycluster/`, `pkg/tree/testdata/lws/`, `pkg/tree/testdata/jobset/`	Adds YAML expectation structs and per-workload scenarios (PyTorchJob minimal/extended, RayCluster minimal/extended, LeaderWorkerSet, JobSet) that call `Build` with `KartaPodMatcher` and assert output matches expected fixture files.
Quickstart migration to tree-based traversal `docs/examples/quickstart/main.go`, `docs/examples/quickstart/README.md`	Refactors quickstart `run()` steps 1–3 to build and traverse `WorkloadTree` for phase/replica/resource inspection; removes per-component factory reads; updates usage text, import, and README walkthrough.
CLI high-level design doc updates `docs/design/cli/high-level-design.md`	Adds `PodCount` and `Resources` to `WorkloadTree`, `ComponentNode`, and `InstanceNode` schema; defines `Resources` type; clarifies totals are cumulative desired-scale values not live pod counts.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant Build
  participant StructureSummary
  participant ComponentFactory
  participant PodMatcher
  participant populateTotals

  Caller->>Build: Build(ctx, karta, factory, pods, matcher)
  Build->>StructureSummary: NewStructureSummary(karta)
  Build->>ComponentFactory: GetRootComponent().GetStatus()
  ComponentFactory-->>Build: matched status phases
  loop for each child component
    Build->>ComponentFactory: claimPods via PodMatcher.MatchComponent
    PodMatcher-->>Build: claimed pods
    Build->>PodMatcher: MatchInstance + ExtractReplicaKey per pod
    PodMatcher-->>Build: instance+replica keys
    Build->>ComponentFactory: GetExtractedInstances, GetScale
    ComponentFactory-->>Build: extracted instances + scale
    Build->>Build: build InstanceNode(s), recurse children
  end
  Build->>populateTotals: populateTotals(tree)
  populateTotals-->>Build: PodCount + Resources rolled up
  Build-->>Caller: *WorkloadTree

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

run-ai/karta#31: Directly implements the WorkloadTree design introduced in docs/design/cli/high-level-design.md, which is updated in this PR to reflect the new PodCount/Resources totals fields.
run-ai/karta#84: Introduced the quickstart controller example in docs/examples/quickstart/main.go that this PR refactors to use pkg/tree.WorkloadTree traversal.

Suggested labels

enhancement

Suggested reviewers

ronlv10
AviadHayumi
shaked-bouktus
Isan-Rivkin

🐰 A tree of pods now blooms with care,
Each replica key claimed with flair!
GPU totals roll up the height,
From leaf to root — a glorious sight.
The WorkloadTree stands tall and bright! 🌳

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(tree): add WorkloadTree data model and builder' accurately describes the primary changes—introducing the WorkloadTree data model and builder functionality in the tree package.
Linked Issues check	✅ Passed	The pull request implements the WorkloadTree data model and builder as required by issue `#85`, including the core data structures, Build function, PodMatcher interface, and KartaPodMatcher implementation.
Out of Scope Changes check	✅ Passed	All changes directly support the WorkloadTree implementation: new tree package files, StructureSummary accessor methods, quickstart documentation updates, and comprehensive test coverage are all in scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/workload-tree-data-model

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

docs/design/workload-tree/design.md (1)
16-108: ⚡ Quick win

Add language identifiers to fenced code blocks.

Nine fenced code blocks lack language identifiers (directory tree at line 16 and schema definitions at lines 33, 40, 46, 54, 77, 86, 96, 104). Adding text as the language specifier satisfies the linter and clarifies that these are language-agnostic schema representations.
📝 Example fix for the directory tree block
-```
+```text
 pkg/tree/
   tree.go           WorkloadTree, WorkloadStatus, ComponentNode, InstanceNode
Apply the same pattern to the eight schema blocks.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/workload-tree/design.md` around lines 16 - 108, Add the missing
fenced-block language identifiers by changing the nine unlabeled code fences in
the document to use "text" so the linter recognizes them; specifically update
the directory tree block and the schema blocks that describe WorkloadTree,
WorkloadStatus, ComponentNode, InstanceNode, WorkloadView, ComponentView,
InstanceView, and ResourceSummary so each opening triple-backtick is followed by
"text". Ensure every unlabeled block in the file is updated consistently.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tree/pod_matcher.go`:
- Around line 21-24: Change the constructor NewKartaPodMatcher to return the
PodMatcher interface instead of the concrete *KartaPodMatcher: update its
signature to return PodMatcher, have it still instantiate &KartaPodMatcher{}
internally, and adjust any callers or tests to accept the PodMatcher interface
where they currently expect *KartaPodMatcher; keep the concrete type
KartaPodMatcher unchanged.

In `@pkg/tree/view_builder_test.go`:
- Around line 53-73: The test is missing assertions for aggregated memory;
update the test (the It block using NewWorkloadView and variable view) to assert
view.Resources.MemoryBytes equals the combined memory of both pods and also
assert the child component's memory (view.Children[0].Resources.MemoryBytes)
matches that same total; for these pods (each with resource.MustParse("8Gi"))
add expectations for MemoryBytes == 17179869184 (int64) or compute 16Gi in bytes
to validate correct aggregation.

---

Nitpick comments:
In `@docs/design/workload-tree/design.md`:
- Around line 16-108: Add the missing fenced-block language identifiers by
changing the nine unlabeled code fences in the document to use "text" so the
linter recognizes them; specifically update the directory tree block and the
schema blocks that describe WorkloadTree, WorkloadStatus, ComponentNode,
InstanceNode, WorkloadView, ComponentView, InstanceView, and ResourceSummary so
each opening triple-backtick is followed by "text". Ensure every unlabeled block
in the file is updated consistently.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 6560adf6-e900-45c5-a296-4cfb2aa809df

📥 Commits

Reviewing files that changed from the base of the PR and between 041e434 and 36c769c.

📒 Files selected for processing (37)

docs/design/workload-tree/design.md
docs/examples/quickstart/README.md
docs/examples/quickstart/main.go
pkg/instructions/summary.go
pkg/tree/builder.go
pkg/tree/builder_test.go
pkg/tree/pod_matcher.go
pkg/tree/pod_matcher_mock.go
pkg/tree/pod_matcher_test.go
pkg/tree/suite_test.go
pkg/tree/testdata/jobset/events.yaml
pkg/tree/testdata/jobset/expected.yaml
pkg/tree/testdata/jobset/pods.yaml
pkg/tree/testdata/lws/events.yaml
pkg/tree/testdata/lws/expected.yaml
pkg/tree/testdata/lws/pods.yaml
pkg/tree/testdata/pytorch/events_extended.yaml
pkg/tree/testdata/pytorch/events_minimal.yaml
pkg/tree/testdata/pytorch/expected_extended.yaml
pkg/tree/testdata/pytorch/expected_minimal.yaml
pkg/tree/testdata/pytorch/pods_extended.yaml
pkg/tree/testdata/pytorch/pods_minimal.yaml
pkg/tree/testdata/pytorch/workload_extended.yaml
pkg/tree/testdata/pytorch/workload_minimal.yaml
pkg/tree/testdata/raycluster/events_extended.yaml
pkg/tree/testdata/raycluster/events_minimal.yaml
pkg/tree/testdata/raycluster/expected_extended.yaml
pkg/tree/testdata/raycluster/expected_minimal.yaml
pkg/tree/testdata/raycluster/pods_extended.yaml
pkg/tree/testdata/raycluster/pods_minimal.yaml
pkg/tree/testdata/raycluster/workload_extended.yaml
pkg/tree/testdata/raycluster/workload_minimal.yaml
pkg/tree/tree.go
pkg/tree/view.go
pkg/tree/view_builder.go
pkg/tree/view_builder_test.go
pkg/tree/yaml_examples_test.go

- Return PodMatcher interface from NewKartaPodMatcher constructor - Add missing MemoryBytes assertion in view_builder_test - Add text language tag to bare fenced code blocks in design.md - Revert "In production" to "In a real controller" in quickstart Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

Drop the WorkloadView display layer (view.go, view_builder.go and its test) and remove the Events and Conditions fields from WorkloadTree. The library now produces only the raw component hierarchy; display concerns such as resource aggregation, ready counts, and event/condition surfacing belong to consumers like the CLI. Build() no longer takes an events parameter. Tests, test fixtures, and the workload-tree and CLI design docs are updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/design/cli/high-level-design.md (1)
283-283: ⚠️ Potential issue | 🔴 Critical

Fix Build function signature in documentation to match actual implementation.

Line 283 documents the signature as func Build(ctx context.Context, ri *ResourceInterface, workload client.Object, pods []corev1.Pod, matcher PodMatcher), but the actual implementation in pkg/tree/builder.go uses func Build(ctx context.Context, karta *v1alpha1.Karta, factory *resource.ComponentFactory, pods []corev1.Pod, matcher PodMatcher). The documented parameters are incorrect and will mislead users. Update parameter names and types to match the actual function.

Additionally, remove bold formatting from lines 61-67 (Semantic role names, Pods grouped by their role, Normalized phase, Desired vs current replicas, Collapsed plumbing) and use inline code or plain text instead, as shown in the refactored display layer section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/cli/high-level-design.md` at line 283, Update the Build function
signature in the documentation at line 283 to match the actual implementation in
pkg/tree/builder.go. Replace the incorrect parameters ri *ResourceInterface and
workload client.Object with the correct ones: karta *v1alpha1.Karta and factory
*resource.ComponentFactory. Additionally, remove the bold formatting from lines
61-67 (the list items covering Semantic role names, Pods grouped by their role,
Normalized phase, Desired vs current replicas, and Collapsed plumbing) and
replace it with inline code formatting or plain text to maintain consistent
styling with the refactored display layer section.

🧹 Nitpick comments (1)

docs/design/cli/high-level-design.md (1)

61-67: ⚡ Quick win

Remove bold emphasis from key concept names.

Lines 61–67 use bold (**Semantic role names**, **Pods grouped by their role**, **Normalized phase**, etc.) to highlight concepts for emphasis. This violates the Markdown style guideline. The refactored display layer section (lines 309–317) correctly avoids bold emphasis, using inline code and plain text instead.

Remove bold formatting from all concept labels in this subsection.

Proposed style fix for lines 61–67

- **Semantic role names** - Frontend / PrefillWorker / DecodeWorker come from the label mapping declared in the Karta YAML.
- **Pods grouped by their role, not by their owner** - pods playing different roles are shown as separate groups even when they share the same owning resource.
- **Normalized phase** - `Running` has the same meaning across PyTorchJob, RayCluster, Dynamo, etc.
- **Desired vs current replicas** - `3/4 replicas` shows the scale target vs. the actual pod count, per role.
- **Collapsed plumbing** - intermediate objects (DynamoComponentDeployment, LeaderWorkerSet) are optionally hidden to reduce noise while keeping the semantic hierarchy visible.
+ Semantic role names - Frontend / PrefillWorker / DecodeWorker come from the label mapping declared in the Karta YAML.
+ Pods grouped by their role, not by their owner - pods playing different roles are shown as separate groups even when they share the same owning resource.
+ Normalized phase - `Running` has the same meaning across PyTorchJob, RayCluster, Dynamo, etc.
+ Desired vs current replicas - `3/4 replicas` shows the scale target vs. the actual pod count, per role.
+ Collapsed plumbing - intermediate objects (DynamoComponentDeployment, LeaderWorkerSet) are optionally hidden to reduce noise while keeping the semantic hierarchy visible.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/cli/high-level-design.md` around lines 61 - 67, In the high-level
design section of the documentation, remove the bold emphasis (the `**` markers)
from all concept labels: "Semantic role names", "Pods grouped by their role",
"Normalized phase", "Desired vs current replicas", and "Collapsed plumbing".
Replace them with plain text format, following the style pattern used in the
refactored display layer section which uses inline code and plain text instead
of bold for consistency with the Markdown style guidelines.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/design/cli/high-level-design.md`:
- Line 283: Update the Build function signature in the documentation at line 283
to match the actual implementation in pkg/tree/builder.go. Replace the incorrect
parameters ri *ResourceInterface and workload client.Object with the correct
ones: karta *v1alpha1.Karta and factory *resource.ComponentFactory.
Additionally, remove the bold formatting from lines 61-67 (the list items
covering Semantic role names, Pods grouped by their role, Normalized phase,
Desired vs current replicas, and Collapsed plumbing) and replace it with inline
code formatting or plain text to maintain consistent styling with the refactored
display layer section.

---

Nitpick comments:
In `@docs/design/cli/high-level-design.md`:
- Around line 61-67: In the high-level design section of the documentation,
remove the bold emphasis (the `**` markers) from all concept labels: "Semantic
role names", "Pods grouped by their role", "Normalized phase", "Desired vs
current replicas", and "Collapsed plumbing". Replace them with plain text
format, following the style pattern used in the refactored display layer section
which uses inline code and plain text instead of bold for consistency with the
Markdown style guidelines.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 144a31b4-e00c-46f3-bb0a-9e52dec27a51

📥 Commits

Reviewing files that changed from the base of the PR and between 67366c6 and f22b189.

📒 Files selected for processing (14)

docs/design/cli/high-level-design.md
docs/design/workload-tree/design.md
docs/examples/quickstart/README.md
docs/examples/quickstart/main.go
pkg/tree/builder.go
pkg/tree/builder_test.go
pkg/tree/testdata/jobset/expected.yaml
pkg/tree/testdata/lws/expected.yaml
pkg/tree/testdata/pytorch/expected_extended.yaml
pkg/tree/testdata/pytorch/expected_minimal.yaml
pkg/tree/testdata/raycluster/expected_extended.yaml
pkg/tree/testdata/raycluster/expected_minimal.yaml
pkg/tree/tree.go
pkg/tree/yaml_examples_test.go

💤 Files with no reviewable changes (7)

pkg/tree/testdata/jobset/expected.yaml
pkg/tree/testdata/raycluster/expected_minimal.yaml
pkg/tree/testdata/pytorch/expected_minimal.yaml
pkg/tree/testdata/pytorch/expected_extended.yaml
pkg/tree/testdata/raycluster/expected_extended.yaml
pkg/tree/testdata/lws/expected.yaml
pkg/tree/tree.go

✅ Files skipped from review due to trivial changes (2)

docs/design/workload-tree/design.md
docs/examples/quickstart/README.md

🚧 Files skipped from review as they are similar to previous changes (2)

docs/examples/quickstart/main.go
pkg/tree/builder_test.go

Add PodCount and Resources fields to WorkloadTree, ComponentNode, and InstanceNode. Each carries the cumulative desired totals for its subtree: pod count plus CPU (millicores), memory (bytes), and GPUs. Totals are computed from the desired spec - each pod-bearing component's replica count times its extracted pod template requests. CPU and memory come from container requests; GPUs come from the nvidia.com/gpu request, falling back to its limit when only the limit is set. GPUs is a float so fractional allocations (MIG slices, GPU sharing) are preserved. Add unit tests for the aggregation and extend the YAML example fixtures to assert the totals at every node against real Karta definitions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/tree/tree.go (1)

8-8: ⚠️ Potential issue | 🔴 Critical

Import k8s.io/apimachinery/pkg/runtime/schema and use schema.GroupVersionKind instead of metav1.GroupVersionKind.

GroupVersionKind is defined in k8s.io/apimachinery/pkg/runtime/schema, not pkg/apis/meta/v1. The current code at line 51 will fail to compile.

Fix

import (
	corev1 "k8s.io/api/core/v1"
-	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/runtime/schema"

	"github.com/run-ai/karta/pkg/resource"
)

type ComponentNode struct {
	Name string
	// Kind is nil for logical grouping components (those without a Kind in the definition).
-	Kind             *metav1.GroupVersionKind
+	Kind             *schema.GroupVersionKind
	HasPodDefinition bool
	PodCount         int
	Resources        Resources
	Instances        []InstanceNode
}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tree/tree.go` at line 8, The code is using metav1.GroupVersionKind which
does not exist in that package. Add a new import statement for
k8s.io/apimachinery/pkg/runtime/schema (aliased as schema), then locate where
GroupVersionKind is used in the code (referenced at line 51) and change the
reference from metav1.GroupVersionKind to schema.GroupVersionKind to use the
correct package where this type is actually defined.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/tree/tree.go`:
- Line 8: The code is using metav1.GroupVersionKind which does not exist in that
package. Add a new import statement for k8s.io/apimachinery/pkg/runtime/schema
(aliased as schema), then locate where GroupVersionKind is used in the code
(referenced at line 51) and change the reference from metav1.GroupVersionKind to
schema.GroupVersionKind to use the correct package where this type is actually
defined.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9b2eb13f-7ff3-4d1b-acd6-378fbc610032

📥 Commits

Reviewing files that changed from the base of the PR and between f22b189 and 5430be0.

📒 Files selected for processing (12)

docs/design/workload-tree/design.md
pkg/tree/builder.go
pkg/tree/testdata/jobset/expected.yaml
pkg/tree/testdata/lws/expected.yaml
pkg/tree/testdata/pytorch/expected_extended.yaml
pkg/tree/testdata/pytorch/expected_minimal.yaml
pkg/tree/testdata/raycluster/expected_extended.yaml
pkg/tree/testdata/raycluster/expected_minimal.yaml
pkg/tree/totals.go
pkg/tree/totals_test.go
pkg/tree/tree.go
pkg/tree/yaml_examples_test.go

✅ Files skipped from review due to trivial changes (4)

pkg/tree/testdata/jobset/expected.yaml
pkg/tree/testdata/lws/expected.yaml
docs/design/workload-tree/design.md
pkg/tree/testdata/pytorch/expected_minimal.yaml

🚧 Files skipped from review as they are similar to previous changes (4)

pkg/tree/testdata/raycluster/expected_minimal.yaml
pkg/tree/testdata/pytorch/expected_extended.yaml
pkg/tree/testdata/raycluster/expected_extended.yaml
pkg/tree/builder.go

Remove docs/design/workload-tree/design.md and revert the CLI high-level-design.md to its prior form, then document the cumulative PodCount and Resources subtree totals (CPU, memory, GPUs) on WorkloadTree, ComponentNode, and InstanceNode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

docs/design/cli/high-level-design.md (1)
301-302: ⚠️ Potential issue | 🟡 Minor

Update the Build() function signature to match the actual implementation. The second and third parameters are incorrect: the documented ri *ResourceInterface, workload client.Object should be karta *v1alpha1.Karta, factory *resource.ComponentFactory.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/cli/high-level-design.md` around lines 301 - 302, Update the
Build() function signature in the design documentation to match the actual
implementation. Replace the second and third parameters from `ri
*ResourceInterface, workload client.Object` with `karta *v1alpha1.Karta, factory
*resource.ComponentFactory` to accurately reflect what the function accepts in
the real codebase.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/design/cli/high-level-design.md`:
- Around line 301-302: Update the Build() function signature in the design
documentation to match the actual implementation. Replace the second and third
parameters from `ri *ResourceInterface, workload client.Object` with `karta
*v1alpha1.Karta, factory *resource.ComponentFactory` to accurately reflect what
the function accepts in the real codebase.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9cdfe695-6ecc-4bb0-bd2e-172dc5320b01

📥 Commits

Reviewing files that changed from the base of the PR and between 5430be0 and 24db4f0.

📒 Files selected for processing (1)

docs/design/cli/high-level-design.md

Drop live pods from the WorkloadTree output and replace the single PodCount/Resources totals with min/max ranges driven by the Scale object. Each pod-bearing instance contributes its min and max replica counts (Scale.MinReplicas/MaxReplicas, falling back to Scale.Replicas, and 0 when no scale is defined) times its per-pod requests. Totals are exposed as PodCount{Min,Max} and ResourceRange{Min,Max}; fixed-replica components collapse to equal bounds. Pods remain a Build input for instance and replica-key discovery but are no longer stored on the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

tree.Build no longer takes a pod list or PodMatcher. The builder derives the desired structure entirely from the workload spec: each component expands into instances from its InstanceIdPath, and a replica-selector component collapses to a single instance. Per-replica and per-pod data is layered on by live consumers via the pkg/resource pod querier, not the tree. Removes the PodMatcher abstraction, its mock, and the pod test fixtures; updates the quickstart example, the CLI high-level design, and the LWS golden file (which now reflects a single collapsed group node). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

Apply Ron Lev's review on PR #87: - Build now takes only the ComponentFactory and derives the Karta from it via a new ComponentFactory.GetKarta(); rewrite the Build doc comment. - Resources holds k8s resource.Quantity (CPU/Memory/GPU) instead of raw int64/float64, so units round-trip exactly and fractional GPU is preserved. - WorkloadStatus is now optional (*WorkloadStatus); nil means no status was evaluated, e.g. an offline pre-submission tree. - Totals derive pod-bearing from leaf instances rather than HasPodDefinition, and set ResourcesIncomplete when a node has desired pods whose per-pod resources cannot be resolved, instead of silently reporting zero. - Rename effectiveReplicas to replicaBounds. - Document why every component (including single-instance) carries an InstanceNode and why child components are built per instance. - Assert exact pod counts and replica bounds in the builder tests; regenerate the golden fixtures for the Quantity serialization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

Verify fractional nvidia.com/gpu requests are preserved exactly through the totals pipeline: - Unit tests asserting exact Quantity values when scaling a fractional GPU by the replica count, summing fractions across components, ranging over the scale envelope, and reading a fractional GPU from the limit. - A PyTorchJob golden fixture (master 0.25, worker 0.5 x 2) whose root total is 1.25 GPU, exercising fraction plus integer aggregation and serialization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

Address remaining review feedback on the workload tree builder: - Add a childComponents helper that resolves a parent's child names to []*resource.Component, so buildComponentNodes and buildInstanceNodes work with components instead of raw name strings. Keeps the name resolution in the tree package so pkg/instructions gains no dependency on pkg/resource. - Drop the unused karta parameter threaded through the build helpers. - Trim the buildInstanceNodes doc comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

ronlv10

Overall looks good to me - I think that the solution of the yamls is an overkill and would made us to update the yamls for any change and on the other hand is hard to verify the output (need to look on 3 files in order to verify

I suggest

convert them to test if there are not cover - e.g, pytorch and dyanmo and check the tree
maybe create a diagram/image and add to the doc (you can open an issue ) so we would be able to visuals in our docs how the tree looks like - I think it could bring value

ronlv10 · 2026-06-29T09:17:16Z

I think you changed the deisgn doc too much here
it's fine to correction but for example you deleted the pod matcher - which still relevant to the CLI but we decided to split it differently

in the past we didn't really update design docs(before ai and pr that is a design doc)- I think it's fine to make correction after different decision but not to change entirely - for example it's fine to update what is workload tree post-discussion , that we added resource , etc

Isan-Rivkin

Minor comments, overall like but Im still in partial review will continue tomorrow.
Wanted to publish something not to delay the minor stuff.
Regarding the testing of expected yamls, didnt review as discussed

Isan-Rivkin · 2026-06-30T12:17:25Z

+		instanceKeys = []string{""}
+	}
+	keys := append([]string(nil), instanceKeys...)
+	sort.Strings(keys)


taken from Map, using sort keep the sorted list aligned, instead of getting each call a different list arrangement

The tree data model now carries only cumulative pod counts, not resource requests. Resources, ResourceRange, and ResourcesIncomplete are removed from WorkloadTree, ComponentNode, and InstanceNode, along with the per-pod resource extraction and aggregation logic in totals.go. Per-pod resource usage is left to consumers that fetch live pods, matching the pod-matching step described in the CLI design. Child components are now resolved via ComponentFactory.GetChildComponentsOf, built from the structure's OwnerRef relationships, removing the builder's dependency on instructions.StructureSummary. Each instance's subtree is built once and cloned per instance so per-node pod counts stay independent. Removes the YAML golden-file tests and testdata that asserted resource totals. Signed-off-by: Roee Gil <roee.gil@run.ai>

ChildrenOf and ParentOf had zero callers after the tree builder switched to ComponentFactory.GetChildComponentsOf. The backing childrenMap and parentMap fields are retained; both are still used internally. Signed-off-by: Roee Gil <roee.gil@run.ai>

A pod-bearing leaf with no scale in the spec now contributes one desired pod instead of zero, matching Kubernetes' default replica count for a workload that declares none. replicaBounds returns 1,1 for a nil scale; the non-nil branch is unchanged. Signed-off-by: Roee Gil <roee.gil@run.ai>

Adds a Milvus fixture (status-only root with querynode, datanode, and a scaleless proxy StatefulSet child) and builder tests covering component discovery, per-component pod counts, the single-pod default for the scaleless child, and root rollup. Signed-off-by: Roee Gil <roee.gil@run.ai>

ronlv10 · 2026-07-01T13:42:55Z


+// GetChildComponentsOf retrieves the direct child components of the named parent,
+// resolved from the structure's OwnerRef relationships.
+func (f *ComponentFactory) GetChildComponentsOf(parentName string) ([]*Component, error) {


why you put it here? I think it should be in the tree..

The factory has the karta and the object. The tree comes afterwards

ronlv10 · 2026-07-01T13:46:24Z

+	baseChildren, err := buildComponentNodes(ctx, children, factory)
+	if err != nil {


baseChildren -> instanceChildren

But it returns a component child list

ronlv10 · 2026-07-01T13:53:24Z

+// ranged over the scale envelope. A pod-bearing leaf counts its own replicas
+// (defaulting to one when it declares no scale); a non-pod leaf counts nothing;
+// a non-leaf rolls up its children.
+func instancePodCount(children []ComponentNode, scale *resource.Scale, hasPodDefinition bool) PodCount {


I think you're missing here a case where instance has pods and also children
not sure if it's a valid case but not sure it doesn't and we miss it here

var total PodCount if hasPodDefinition { minReplicas, maxReplicas := replicaBounds(scale) total = PodCount{Min: minReplicas, Max: maxReplicas} } for _, child := range children { total.Min += child.PodCount.Min total.Max += child.PodCount.Max } return total

probably something like this

This logic breaks in the case where the instance has a replica and also has children. We ignore the instance and replica count. This sums it up.

ronlv10 · 2026-07-01T14:08:03Z

TBH I think that we should have a builder that in the test you override what's need for the test and then when reading the test is clear and you don't need to jump to this file - but that's not the first case so I get it if it too much now

ronlv10 · 2026-07-01T14:08:20Z

 *~
 .vscode/
 .idea
+/docs/examples/quickstart/quickstart


Add to ignore: this is the binary created when running go build

Isan-Rivkin · 2026-07-02T12:54:22Z

+// replicaBounds returns the min and max desired replica counts for a scale.
+// A nil scale defaults to one replica: assumes that a pod-bearing component that declares no
+// scale still runs a single pod.
+func replicaBounds(scale *resource.Scale) (minReplicas, maxReplicas int) {


Min-only scale yields Max < Min, for example PodCount{Min:3, Max:0}.

suggest to add a test with min only

define the correct behavior, indicate it's optional with a pointer or something

Isan-Rivkin · 2026-07-02T12:59:38Z

+// its components, their instances, normalized status, and the desired pod counts
+// ranged over the scale envelope. The factory carries the Karta definition and
+// the workload object it was built from.
+func Build(


The build is getting Stack Overflow in case of a circular ownerRef.
We should handle this instead of crashing because this is user-input that can crash the cli.

Suggest to add a test case for this

Note: the validator handles this error cleanly NewKartaValidator().Validate() returns error: root component cannot have owner ref

apiVersion: run.ai/v1alpha1 kind: Karta metadata: name: cyclic spec: structureDefinition: rootComponent: name: root ownerRef: a kind: group: example.com version: v1 kind: Example childComponents: - name: a ownerRef: root kind: group: example.com version: v1 kind: ExampleChild

Isan-Rivkin · 2026-07-02T13:19:38Z

+			Name:             component.Name(),
+			Kind:             component.Kind(),
+			HasPodDefinition: component.HasPodDefinition(),
+			PodCount:         sumNodes(instances),


Question, for a pod-bearing root with no children (batch-job.yaml, nimservice.yaml), the tree is empty and the workload's desired pod count is unrepresentable in any field. Intended?

For example below, tree pod count and children is 0.

karta:

apiVersion: run.ai/v1alpha1 kind: Karta metadata: name: batch-v1-job spec: structureDefinition: rootComponent: name: job kind: group: batch version: v1 kind: Job scaleDefinition: replicasPath: .spec.parallelism // 1 specDefinition: podTemplateSpecPath: .spec.template

and workload

apiVersion: batch/v1 kind: Job metadata: name: pi namespace: default spec: parallelism: 5 template: spec: containers: - name: pi image: perl:5.34

rogirun and others added 6 commits June 10, 2026 10:09

chore(docs): remove workload-policy-controller design from branch

ffe6038

This file belongs to a separate initiative and was accidentally included. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Roee Gil <roee.gil@run.ai>

rogirun marked this pull request as ready for review June 10, 2026 13:37

rogirun requested review from AviadHayumi, Isan-Rivkin, ronlv10, shaked-bouktus and yuval-gr as code owners June 10, 2026 13:37

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread pkg/tree/pod_matcher.go Outdated

Comment thread pkg/tree/view_builder_test.go Outdated

yuval-gr reviewed Jun 10, 2026

View reviewed changes

Comment thread docs/examples/quickstart/main.go Outdated

rogirun and others added 2 commits June 10, 2026 17:29

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

rogirun and others added 2 commits June 24, 2026 12:42

ronlv10 reviewed Jun 27, 2026

View reviewed changes

rogirun and others added 3 commits June 28, 2026 15:46

rogirun requested a review from ronlv10 June 28, 2026 13:38

ronlv10 reviewed Jun 29, 2026

View reviewed changes

Isan-Rivkin reviewed Jun 30, 2026

View reviewed changes

Comment thread pkg/tree/testdata/jobset/expected.yaml Outdated

Isan-Rivkin requested changes Jun 30, 2026

View reviewed changes

rogirun force-pushed the feat/workload-tree-data-model branch from 152402b to d0d091f Compare July 1, 2026 10:32

rogirun added 3 commits July 1, 2026 13:57

wip

0e4795f

wip

8f9bf7c

rogirun requested review from Isan-Rivkin and ronlv10 July 1, 2026 11:25

rogirun added 2 commits July 1, 2026 15:19

ronlv10 reviewed Jul 1, 2026

View reviewed changes

Comment thread docs/design/cli/high-level-design.md Outdated

ronlv10 previously approved these changes Jul 1, 2026

View reviewed changes

CR fixes

e5e4b7c

rogirun dismissed ronlv10’s stale review via e5e4b7c July 2, 2026 07:03

Isan-Rivkin requested changes Jul 2, 2026

View reviewed changes

		baseChildren, err := buildComponentNodes(ctx, children, factory)
		if err != nil {

Uh oh!

Conversation

rogirun commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related issue(s)

Checklist

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ronlv10 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Isan-Rivkin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rogirun commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading