Skip to content

feat(tree): add WorkloadTree data model and builder#87

Open
rogirun wants to merge 22 commits into
mainfrom
feat/workload-tree-data-model
Open

feat(tree): add WorkloadTree data model and builder#87
rogirun wants to merge 22 commits into
mainfrom
feat/workload-tree-data-model

Conversation

@rogirun

@rogirun rogirun commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Implements the WorkloadTree data model and builder for the Karta library.

Adds pkg/tree/ with:

  • WorkloadTree — raw extraction result (component hierarchy, matched pods, scale, status)
  • WorkloadView — display layer with aggregated resource sums, ready counts, node lists
  • Build() — assembles a WorkloadTree from a Karta definition, ComponentFactory, and pod list
  • PodMatcher interface — decouples tree structure from pod matching strategy
  • KartaPodMatcher — default implementation using PodSelector definitions
  • NewWorkloadView() — computes display-layer values from a WorkloadTree

Also adds two exported accessor methods to StructureSummary (ChildrenOf, ParentOf) used by the tree builder.

Includes design document:

  • docs/design/workload-tree/design.md

Related issue(s)

Fixes #85

Checklist

  • All commits are signed off with DCO (git commit -s)
  • New/modified files have SPDX license and copyright headers
  • Documentation updated (if applicable)
  • Tests pass (make check)
  • No proprietary or internal information included

Summary by CodeRabbit

Release Notes

  • New Features

    • Added WorkloadTree API to inspect workload hierarchies with pod counts and resource aggregation across components and instances.
    • Introduced customizable pod-matching interface for flexible workload introspection.
    • Extended component structure queries with new parent/child relationship lookups.
  • Documentation

    • Updated quickstart guide with clearer execution instructions and improved example walkthroughs.
    • Enhanced design documentation for workload inspection schema.

rogirun and others added 6 commits June 10, 2026 10:09
Implements the WorkloadTree data model described in issue #85.
Adds pkg/tree/ with WorkloadTree (raw extraction result),
WorkloadView (display layer), Build() function, PodMatcher
interface, and KartaPodMatcher (default implementation).

Also adds two exported accessor methods to StructureSummary
(ChildrenOf, ParentOf) used by the tree builder.

Includes design documents at docs/design/workload-tree/design.md
and docs/design/workload-policy-controller/design.md.

Fixes #85

Signed-off-by: Roee Gil <roee.gil@run.ai>
Adds four integration-style tests that exercise Build() and
KartaPodMatcher against real Karta definitions from docs/samples/:
- PyTorchJob minimal (fixed replicas) and extended (elastic policy)
- RayCluster minimal (single worker group) and extended (multi-instance)

Each scenario has YAML fixtures for workload, pods, events, and expected
output under pkg/tree/testdata/.

Also extends the WorkloadTree model to surface data exposed by the
quickstart controller example:
- WorkloadTree.Events []corev1.Event (passed in alongside pods)
- WorkloadStatus.Conditions []resource.Condition (raw conditions from
  the workload status, already extracted by GetStatus())
- ComponentNode.HasPodDefinition bool (mirrors comp.HasPodDefinition())

Build() gains an events parameter; existing callers pass nil.

Refs #85

Signed-off-by: Roee Gil <roee.gil@run.ai>
Use the exact quickstart workload YAMLs as inputs so the tests verify
the same objects the quickstart example runs against. Extends
instanceExpectation with a Children field to handle the 3-level
LWS hierarchy (group -> leader/worker).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
This file belongs to a separate initiative and was accidentally included.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
Reflect fields and Build() signature added after the initial draft:
- WorkloadTree.Events
- WorkloadStatus.Conditions (with expanded WorkloadStatus block)
- ComponentNode.HasPodDefinition
- events parameter in Build()
- updated test spec count to 40

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
Replace direct resource package calls (GetStatus, GetScale,
GetPodTemplateSpec) in steps 1-3 with a single tree.Build() call and
depth-first traversal of WorkloadTree. Steps 4-5 (mutation) keep using
ComponentFactory directly.

Also fixes the run command — the quickstart has its own go.mod and must
be run with `go run .` from its own directory, not from the repo root.
Removes "controller" framing from comments and output strings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
@rogirun rogirun marked this pull request as ready for review June 10, 2026 13:37
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Introduces the pkg/tree package implementing a WorkloadTree data model with Build function, PodMatcher interface, KartaPodMatcher default implementation, and subtree pod-count/resource aggregation. Refactors the quickstart example to traverse the tree instead of querying per-component factory methods. Adds comprehensive Ginkgo tests and YAML fixtures for PyTorchJob, RayCluster, LeaderWorkerSet, and JobSet workloads. Updates CLI design documentation.

Changes

WorkloadTree feature implementation

Layer / File(s) Summary
WorkloadTree data model and StructureSummary accessors
pkg/tree/tree.go, pkg/instructions/summary.go
Defines WorkloadTree, WorkloadStatus, ComponentNode, InstanceNode, and Resources structs with cumulative subtree PodCount/Resources fields; adds ChildrenOf and ParentOf accessor methods to StructureSummary.
PodMatcher interface, KartaPodMatcher, and tests
pkg/tree/builder.go, pkg/tree/pod_matcher.go, pkg/tree/pod_matcher_mock.go, pkg/tree/pod_matcher_test.go, pkg/tree/suite_test.go
Defines PodMatcher interface for component membership, instance key derivation, and replica-group key extraction; implements KartaPodMatcher delegating to PodQuerier; provides GoMock mock and Ginkgo tests for all three methods.
Build function and recursive component/instance orchestration
pkg/tree/builder.go
Implements Build, which creates a StructureSummary, loads root matched status for phases, recursively claims pods per component via PodMatcher, groups claimed pods by instance+replica keys, ensures extracted instances exist even with no matched pods, and produces deterministically ordered instance nodes with nested child components.
Builder unit tests
pkg/tree/builder_test.go
Covers phase extraction, per-component pod assignment with sibling isolation, multi-instance routing, replicated instance groups, empty-pod component nodes with ExtractedInstance, and matcher error propagation using stubPodMatcher and errorMatcher test doubles.
Subtree pod count and resource aggregation
pkg/tree/totals.go, pkg/tree/totals_test.go
Implements populateTotals rolling up CPU/memory/GPU resource totals and pod counts from pod templates through instances to tree root; applies replica scaling per instance with GPU fallback from limits when requests are zero.
Fixture-backed end-to-end workload validation
pkg/tree/yaml_examples_test.go, pkg/tree/testdata/pytorch/*, pkg/tree/testdata/raycluster/*, pkg/tree/testdata/lws/*, pkg/tree/testdata/jobset/*
Adds YAML expectation structs and per-workload scenarios (PyTorchJob minimal/extended, RayCluster minimal/extended, LeaderWorkerSet, JobSet) that call Build with KartaPodMatcher and assert output matches expected fixture files.
Quickstart migration to tree-based traversal
docs/examples/quickstart/main.go, docs/examples/quickstart/README.md
Refactors quickstart run() steps 1–3 to build and traverse WorkloadTree for phase/replica/resource inspection; removes per-component factory reads; updates usage text, import, and README walkthrough.
CLI high-level design doc updates
docs/design/cli/high-level-design.md
Adds PodCount and Resources to WorkloadTree, ComponentNode, and InstanceNode schema; defines Resources type; clarifies totals are cumulative desired-scale values not live pod counts.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant Build
  participant StructureSummary
  participant ComponentFactory
  participant PodMatcher
  participant populateTotals

  Caller->>Build: Build(ctx, karta, factory, pods, matcher)
  Build->>StructureSummary: NewStructureSummary(karta)
  Build->>ComponentFactory: GetRootComponent().GetStatus()
  ComponentFactory-->>Build: matched status phases
  loop for each child component
    Build->>ComponentFactory: claimPods via PodMatcher.MatchComponent
    PodMatcher-->>Build: claimed pods
    Build->>PodMatcher: MatchInstance + ExtractReplicaKey per pod
    PodMatcher-->>Build: instance+replica keys
    Build->>ComponentFactory: GetExtractedInstances, GetScale
    ComponentFactory-->>Build: extracted instances + scale
    Build->>Build: build InstanceNode(s), recurse children
  end
  Build->>populateTotals: populateTotals(tree)
  populateTotals-->>Build: PodCount + Resources rolled up
  Build-->>Caller: *WorkloadTree
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • run-ai/karta#31: Directly implements the WorkloadTree design introduced in docs/design/cli/high-level-design.md, which is updated in this PR to reflect the new PodCount/Resources totals fields.
  • run-ai/karta#84: Introduced the quickstart controller example in docs/examples/quickstart/main.go that this PR refactors to use pkg/tree.WorkloadTree traversal.

Suggested labels

enhancement

Suggested reviewers

  • ronlv10
  • AviadHayumi
  • shaked-bouktus
  • Isan-Rivkin

🐰 A tree of pods now blooms with care,
Each replica key claimed with flair!
GPU totals roll up the height,
From leaf to root — a glorious sight.
The WorkloadTree stands tall and bright! 🌳

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(tree): add WorkloadTree data model and builder' accurately describes the primary changes—introducing the WorkloadTree data model and builder functionality in the tree package.
Linked Issues check ✅ Passed The pull request implements the WorkloadTree data model and builder as required by issue #85, including the core data structures, Build function, PodMatcher interface, and KartaPodMatcher implementation.
Out of Scope Changes check ✅ Passed All changes directly support the WorkloadTree implementation: new tree package files, StructureSummary accessor methods, quickstart documentation updates, and comprehensive test coverage are all in scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/workload-tree-data-model

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/design/workload-tree/design.md (1)

16-108: ⚡ Quick win

Add language identifiers to fenced code blocks.

Nine fenced code blocks lack language identifiers (directory tree at line 16 and schema definitions at lines 33, 40, 46, 54, 77, 86, 96, 104). Adding text as the language specifier satisfies the linter and clarifies that these are language-agnostic schema representations.

📝 Example fix for the directory tree block
-```
+```text
 pkg/tree/
   tree.go           WorkloadTree, WorkloadStatus, ComponentNode, InstanceNode

Apply the same pattern to the eight schema blocks.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/workload-tree/design.md` around lines 16 - 108, Add the missing
fenced-block language identifiers by changing the nine unlabeled code fences in
the document to use "text" so the linter recognizes them; specifically update
the directory tree block and the schema blocks that describe WorkloadTree,
WorkloadStatus, ComponentNode, InstanceNode, WorkloadView, ComponentView,
InstanceView, and ResourceSummary so each opening triple-backtick is followed by
"text". Ensure every unlabeled block in the file is updated consistently.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tree/pod_matcher.go`:
- Around line 21-24: Change the constructor NewKartaPodMatcher to return the
PodMatcher interface instead of the concrete *KartaPodMatcher: update its
signature to return PodMatcher, have it still instantiate &KartaPodMatcher{}
internally, and adjust any callers or tests to accept the PodMatcher interface
where they currently expect *KartaPodMatcher; keep the concrete type
KartaPodMatcher unchanged.

In `@pkg/tree/view_builder_test.go`:
- Around line 53-73: The test is missing assertions for aggregated memory;
update the test (the It block using NewWorkloadView and variable view) to assert
view.Resources.MemoryBytes equals the combined memory of both pods and also
assert the child component's memory (view.Children[0].Resources.MemoryBytes)
matches that same total; for these pods (each with resource.MustParse("8Gi"))
add expectations for MemoryBytes == 17179869184 (int64) or compute 16Gi in bytes
to validate correct aggregation.

---

Nitpick comments:
In `@docs/design/workload-tree/design.md`:
- Around line 16-108: Add the missing fenced-block language identifiers by
changing the nine unlabeled code fences in the document to use "text" so the
linter recognizes them; specifically update the directory tree block and the
schema blocks that describe WorkloadTree, WorkloadStatus, ComponentNode,
InstanceNode, WorkloadView, ComponentView, InstanceView, and ResourceSummary so
each opening triple-backtick is followed by "text". Ensure every unlabeled block
in the file is updated consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 6560adf6-e900-45c5-a296-4cfb2aa809df

📥 Commits

Reviewing files that changed from the base of the PR and between 041e434 and 36c769c.

📒 Files selected for processing (37)
  • docs/design/workload-tree/design.md
  • docs/examples/quickstart/README.md
  • docs/examples/quickstart/main.go
  • pkg/instructions/summary.go
  • pkg/tree/builder.go
  • pkg/tree/builder_test.go
  • pkg/tree/pod_matcher.go
  • pkg/tree/pod_matcher_mock.go
  • pkg/tree/pod_matcher_test.go
  • pkg/tree/suite_test.go
  • pkg/tree/testdata/jobset/events.yaml
  • pkg/tree/testdata/jobset/expected.yaml
  • pkg/tree/testdata/jobset/pods.yaml
  • pkg/tree/testdata/lws/events.yaml
  • pkg/tree/testdata/lws/expected.yaml
  • pkg/tree/testdata/lws/pods.yaml
  • pkg/tree/testdata/pytorch/events_extended.yaml
  • pkg/tree/testdata/pytorch/events_minimal.yaml
  • pkg/tree/testdata/pytorch/expected_extended.yaml
  • pkg/tree/testdata/pytorch/expected_minimal.yaml
  • pkg/tree/testdata/pytorch/pods_extended.yaml
  • pkg/tree/testdata/pytorch/pods_minimal.yaml
  • pkg/tree/testdata/pytorch/workload_extended.yaml
  • pkg/tree/testdata/pytorch/workload_minimal.yaml
  • pkg/tree/testdata/raycluster/events_extended.yaml
  • pkg/tree/testdata/raycluster/events_minimal.yaml
  • pkg/tree/testdata/raycluster/expected_extended.yaml
  • pkg/tree/testdata/raycluster/expected_minimal.yaml
  • pkg/tree/testdata/raycluster/pods_extended.yaml
  • pkg/tree/testdata/raycluster/pods_minimal.yaml
  • pkg/tree/testdata/raycluster/workload_extended.yaml
  • pkg/tree/testdata/raycluster/workload_minimal.yaml
  • pkg/tree/tree.go
  • pkg/tree/view.go
  • pkg/tree/view_builder.go
  • pkg/tree/view_builder_test.go
  • pkg/tree/yaml_examples_test.go

Comment thread pkg/tree/pod_matcher.go Outdated
Comment thread pkg/tree/view_builder_test.go Outdated
Comment thread docs/examples/quickstart/main.go Outdated
rogirun and others added 2 commits June 10, 2026 17:29
- Return PodMatcher interface from NewKartaPodMatcher constructor
- Add missing MemoryBytes assertion in view_builder_test
- Add text language tag to bare fenced code blocks in design.md
- Revert "In production" to "In a real controller" in quickstart

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
Drop the WorkloadView display layer (view.go, view_builder.go and its
test) and remove the Events and Conditions fields from WorkloadTree.
The library now produces only the raw component hierarchy; display
concerns such as resource aggregation, ready counts, and event/condition
surfacing belong to consumers like the CLI.

Build() no longer takes an events parameter. Tests, test fixtures, and
the workload-tree and CLI design docs are updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/design/cli/high-level-design.md (1)

283-283: ⚠️ Potential issue | 🔴 Critical

Fix Build function signature in documentation to match actual implementation.

Line 283 documents the signature as func Build(ctx context.Context, ri *ResourceInterface, workload client.Object, pods []corev1.Pod, matcher PodMatcher), but the actual implementation in pkg/tree/builder.go uses func Build(ctx context.Context, karta *v1alpha1.Karta, factory *resource.ComponentFactory, pods []corev1.Pod, matcher PodMatcher). The documented parameters are incorrect and will mislead users. Update parameter names and types to match the actual function.

Additionally, remove bold formatting from lines 61-67 (Semantic role names, Pods grouped by their role, Normalized phase, Desired vs current replicas, Collapsed plumbing) and use inline code or plain text instead, as shown in the refactored display layer section.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/cli/high-level-design.md` at line 283, Update the Build function
signature in the documentation at line 283 to match the actual implementation in
pkg/tree/builder.go. Replace the incorrect parameters ri *ResourceInterface and
workload client.Object with the correct ones: karta *v1alpha1.Karta and factory
*resource.ComponentFactory. Additionally, remove the bold formatting from lines
61-67 (the list items covering Semantic role names, Pods grouped by their role,
Normalized phase, Desired vs current replicas, and Collapsed plumbing) and
replace it with inline code formatting or plain text to maintain consistent
styling with the refactored display layer section.
🧹 Nitpick comments (1)
docs/design/cli/high-level-design.md (1)

61-67: ⚡ Quick win

Remove bold emphasis from key concept names.

Lines 61–67 use bold (**Semantic role names**, **Pods grouped by their role**, **Normalized phase**, etc.) to highlight concepts for emphasis. This violates the Markdown style guideline. The refactored display layer section (lines 309–317) correctly avoids bold emphasis, using inline code and plain text instead.

Remove bold formatting from all concept labels in this subsection.

Proposed style fix for lines 61–67
- **Semantic role names** - Frontend / PrefillWorker / DecodeWorker come from the label mapping declared in the Karta YAML.
- **Pods grouped by their role, not by their owner** - pods playing different roles are shown as separate groups even when they share the same owning resource.
- **Normalized phase** - `Running` has the same meaning across PyTorchJob, RayCluster, Dynamo, etc.
- **Desired vs current replicas** - `3/4 replicas` shows the scale target vs. the actual pod count, per role.
- **Collapsed plumbing** - intermediate objects (DynamoComponentDeployment, LeaderWorkerSet) are optionally hidden to reduce noise while keeping the semantic hierarchy visible.
+ Semantic role names - Frontend / PrefillWorker / DecodeWorker come from the label mapping declared in the Karta YAML.
+ Pods grouped by their role, not by their owner - pods playing different roles are shown as separate groups even when they share the same owning resource.
+ Normalized phase - `Running` has the same meaning across PyTorchJob, RayCluster, Dynamo, etc.
+ Desired vs current replicas - `3/4 replicas` shows the scale target vs. the actual pod count, per role.
+ Collapsed plumbing - intermediate objects (DynamoComponentDeployment, LeaderWorkerSet) are optionally hidden to reduce noise while keeping the semantic hierarchy visible.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/cli/high-level-design.md` around lines 61 - 67, In the high-level
design section of the documentation, remove the bold emphasis (the `**` markers)
from all concept labels: "Semantic role names", "Pods grouped by their role",
"Normalized phase", "Desired vs current replicas", and "Collapsed plumbing".
Replace them with plain text format, following the style pattern used in the
refactored display layer section which uses inline code and plain text instead
of bold for consistency with the Markdown style guidelines.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/design/cli/high-level-design.md`:
- Line 283: Update the Build function signature in the documentation at line 283
to match the actual implementation in pkg/tree/builder.go. Replace the incorrect
parameters ri *ResourceInterface and workload client.Object with the correct
ones: karta *v1alpha1.Karta and factory *resource.ComponentFactory.
Additionally, remove the bold formatting from lines 61-67 (the list items
covering Semantic role names, Pods grouped by their role, Normalized phase,
Desired vs current replicas, and Collapsed plumbing) and replace it with inline
code formatting or plain text to maintain consistent styling with the refactored
display layer section.

---

Nitpick comments:
In `@docs/design/cli/high-level-design.md`:
- Around line 61-67: In the high-level design section of the documentation,
remove the bold emphasis (the `**` markers) from all concept labels: "Semantic
role names", "Pods grouped by their role", "Normalized phase", "Desired vs
current replicas", and "Collapsed plumbing". Replace them with plain text
format, following the style pattern used in the refactored display layer section
which uses inline code and plain text instead of bold for consistency with the
Markdown style guidelines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 144a31b4-e00c-46f3-bb0a-9e52dec27a51

📥 Commits

Reviewing files that changed from the base of the PR and between 67366c6 and f22b189.

📒 Files selected for processing (14)
  • docs/design/cli/high-level-design.md
  • docs/design/workload-tree/design.md
  • docs/examples/quickstart/README.md
  • docs/examples/quickstart/main.go
  • pkg/tree/builder.go
  • pkg/tree/builder_test.go
  • pkg/tree/testdata/jobset/expected.yaml
  • pkg/tree/testdata/lws/expected.yaml
  • pkg/tree/testdata/pytorch/expected_extended.yaml
  • pkg/tree/testdata/pytorch/expected_minimal.yaml
  • pkg/tree/testdata/raycluster/expected_extended.yaml
  • pkg/tree/testdata/raycluster/expected_minimal.yaml
  • pkg/tree/tree.go
  • pkg/tree/yaml_examples_test.go
💤 Files with no reviewable changes (7)
  • pkg/tree/testdata/jobset/expected.yaml
  • pkg/tree/testdata/raycluster/expected_minimal.yaml
  • pkg/tree/testdata/pytorch/expected_minimal.yaml
  • pkg/tree/testdata/pytorch/expected_extended.yaml
  • pkg/tree/testdata/raycluster/expected_extended.yaml
  • pkg/tree/testdata/lws/expected.yaml
  • pkg/tree/tree.go
✅ Files skipped from review due to trivial changes (2)
  • docs/design/workload-tree/design.md
  • docs/examples/quickstart/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/examples/quickstart/main.go
  • pkg/tree/builder_test.go

Add PodCount and Resources fields to WorkloadTree, ComponentNode, and
InstanceNode. Each carries the cumulative desired totals for its subtree:
pod count plus CPU (millicores), memory (bytes), and GPUs.

Totals are computed from the desired spec - each pod-bearing component's
replica count times its extracted pod template requests. CPU and memory
come from container requests; GPUs come from the nvidia.com/gpu request,
falling back to its limit when only the limit is set. GPUs is a float so
fractional allocations (MIG slices, GPU sharing) are preserved.

Add unit tests for the aggregation and extend the YAML example fixtures
to assert the totals at every node against real Karta definitions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tree/tree.go (1)

8-8: ⚠️ Potential issue | 🔴 Critical

Import k8s.io/apimachinery/pkg/runtime/schema and use schema.GroupVersionKind instead of metav1.GroupVersionKind.

GroupVersionKind is defined in k8s.io/apimachinery/pkg/runtime/schema, not pkg/apis/meta/v1. The current code at line 51 will fail to compile.

Fix
import (
	corev1 "k8s.io/api/core/v1"
-	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/runtime/schema"

	"github.com/run-ai/karta/pkg/resource"
)
type ComponentNode struct {
	Name string
	// Kind is nil for logical grouping components (those without a Kind in the definition).
-	Kind             *metav1.GroupVersionKind
+	Kind             *schema.GroupVersionKind
	HasPodDefinition bool
	PodCount         int
	Resources        Resources
	Instances        []InstanceNode
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tree/tree.go` at line 8, The code is using metav1.GroupVersionKind which
does not exist in that package. Add a new import statement for
k8s.io/apimachinery/pkg/runtime/schema (aliased as schema), then locate where
GroupVersionKind is used in the code (referenced at line 51) and change the
reference from metav1.GroupVersionKind to schema.GroupVersionKind to use the
correct package where this type is actually defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/tree/tree.go`:
- Line 8: The code is using metav1.GroupVersionKind which does not exist in that
package. Add a new import statement for k8s.io/apimachinery/pkg/runtime/schema
(aliased as schema), then locate where GroupVersionKind is used in the code
(referenced at line 51) and change the reference from metav1.GroupVersionKind to
schema.GroupVersionKind to use the correct package where this type is actually
defined.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9b2eb13f-7ff3-4d1b-acd6-378fbc610032

📥 Commits

Reviewing files that changed from the base of the PR and between f22b189 and 5430be0.

📒 Files selected for processing (12)
  • docs/design/workload-tree/design.md
  • pkg/tree/builder.go
  • pkg/tree/testdata/jobset/expected.yaml
  • pkg/tree/testdata/lws/expected.yaml
  • pkg/tree/testdata/pytorch/expected_extended.yaml
  • pkg/tree/testdata/pytorch/expected_minimal.yaml
  • pkg/tree/testdata/raycluster/expected_extended.yaml
  • pkg/tree/testdata/raycluster/expected_minimal.yaml
  • pkg/tree/totals.go
  • pkg/tree/totals_test.go
  • pkg/tree/tree.go
  • pkg/tree/yaml_examples_test.go
✅ Files skipped from review due to trivial changes (4)
  • pkg/tree/testdata/jobset/expected.yaml
  • pkg/tree/testdata/lws/expected.yaml
  • docs/design/workload-tree/design.md
  • pkg/tree/testdata/pytorch/expected_minimal.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/tree/testdata/raycluster/expected_minimal.yaml
  • pkg/tree/testdata/pytorch/expected_extended.yaml
  • pkg/tree/testdata/raycluster/expected_extended.yaml
  • pkg/tree/builder.go

Remove docs/design/workload-tree/design.md and revert the CLI
high-level-design.md to its prior form, then document the cumulative
PodCount and Resources subtree totals (CPU, memory, GPUs) on
WorkloadTree, ComponentNode, and InstanceNode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/design/cli/high-level-design.md (1)

301-302: ⚠️ Potential issue | 🟡 Minor

Update the Build() function signature to match the actual implementation. The second and third parameters are incorrect: the documented ri *ResourceInterface, workload client.Object should be karta *v1alpha1.Karta, factory *resource.ComponentFactory.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/cli/high-level-design.md` around lines 301 - 302, Update the
Build() function signature in the design documentation to match the actual
implementation. Replace the second and third parameters from `ri
*ResourceInterface, workload client.Object` with `karta *v1alpha1.Karta, factory
*resource.ComponentFactory` to accurately reflect what the function accepts in
the real codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@docs/design/cli/high-level-design.md`:
- Around line 301-302: Update the Build() function signature in the design
documentation to match the actual implementation. Replace the second and third
parameters from `ri *ResourceInterface, workload client.Object` with `karta
*v1alpha1.Karta, factory *resource.ComponentFactory` to accurately reflect what
the function accepts in the real codebase.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9cdfe695-6ecc-4bb0-bd2e-172dc5320b01

📥 Commits

Reviewing files that changed from the base of the PR and between 5430be0 and 24db4f0.

📒 Files selected for processing (1)
  • docs/design/cli/high-level-design.md

rogirun and others added 2 commits June 24, 2026 12:42
Drop live pods from the WorkloadTree output and replace the single
PodCount/Resources totals with min/max ranges driven by the Scale object.

Each pod-bearing instance contributes its min and max replica counts
(Scale.MinReplicas/MaxReplicas, falling back to Scale.Replicas, and 0 when
no scale is defined) times its per-pod requests. Totals are exposed as
PodCount{Min,Max} and ResourceRange{Min,Max}; fixed-replica components
collapse to equal bounds. Pods remain a Build input for instance and
replica-key discovery but are no longer stored on the tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
tree.Build no longer takes a pod list or PodMatcher. The builder derives
the desired structure entirely from the workload spec: each component
expands into instances from its InstanceIdPath, and a replica-selector
component collapses to a single instance. Per-replica and per-pod data is
layered on by live consumers via the pkg/resource pod querier, not the tree.

Removes the PodMatcher abstraction, its mock, and the pod test fixtures;
updates the quickstart example, the CLI high-level design, and the LWS
golden file (which now reflects a single collapsed group node).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
Comment thread pkg/tree/tree.go
Comment thread pkg/tree/tree.go Outdated
Comment thread pkg/tree/tree.go Outdated
Comment thread pkg/tree/tree.go Outdated
Comment thread pkg/tree/tree.go Outdated
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/builder_test.go Outdated
Comment thread pkg/tree/builder_test.go Outdated
rogirun and others added 3 commits June 28, 2026 15:46
Apply Ron Lev's review on PR #87:

- Build now takes only the ComponentFactory and derives the Karta from it
  via a new ComponentFactory.GetKarta(); rewrite the Build doc comment.
- Resources holds k8s resource.Quantity (CPU/Memory/GPU) instead of raw
  int64/float64, so units round-trip exactly and fractional GPU is preserved.
- WorkloadStatus is now optional (*WorkloadStatus); nil means no status was
  evaluated, e.g. an offline pre-submission tree.
- Totals derive pod-bearing from leaf instances rather than HasPodDefinition,
  and set ResourcesIncomplete when a node has desired pods whose per-pod
  resources cannot be resolved, instead of silently reporting zero.
- Rename effectiveReplicas to replicaBounds.
- Document why every component (including single-instance) carries an
  InstanceNode and why child components are built per instance.
- Assert exact pod counts and replica bounds in the builder tests; regenerate
  the golden fixtures for the Quantity serialization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
Verify fractional nvidia.com/gpu requests are preserved exactly through the
totals pipeline:

- Unit tests asserting exact Quantity values when scaling a fractional GPU by
  the replica count, summing fractions across components, ranging over the
  scale envelope, and reading a fractional GPU from the limit.
- A PyTorchJob golden fixture (master 0.25, worker 0.5 x 2) whose root total
  is 1.25 GPU, exercising fraction plus integer aggregation and serialization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
Address remaining review feedback on the workload tree builder:

- Add a childComponents helper that resolves a parent's child names to
  []*resource.Component, so buildComponentNodes and buildInstanceNodes work
  with components instead of raw name strings. Keeps the name resolution in the
  tree package so pkg/instructions gains no dependency on pkg/resource.
- Drop the unused karta parameter threaded through the build helpers.
- Trim the buildInstanceNodes doc comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Roee Gil <roee.gil@run.ai>
@rogirun rogirun requested a review from ronlv10 June 28, 2026 13:38

@ronlv10 ronlv10 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me - I think that the solution of the yamls is an overkill and would made us to update the yamls for any change and on the other hand is hard to verify the output (need to look on 3 files in order to verify

I suggest

  1. convert them to test if there are not cover - e.g, pytorch and dyanmo and check the tree
  2. maybe create a diagram/image and add to the doc (you can open an issue ) so we would be able to visuals in our docs how the tree looks like - I think it could bring value

Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you changed the deisgn doc too much here
it's fine to correction but for example you deleted the pod matcher - which still relevant to the CLI but we decided to split it differently

in the past we didn't really update design docs(before ai and pr that is a design doc)- I think it's fine to make correction after different decision but not to change entirely - for example it's fine to update what is workload tree post-discussion , that we added resource , etc

Comment thread docs/examples/quickstart/main.go Outdated
Comment thread docs/examples/quickstart/main.go Outdated
Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/totals_test.go Outdated
Comment thread pkg/tree/testdata/jobset/expected.yaml Outdated

@Isan-Rivkin Isan-Rivkin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, overall like but Im still in partial review will continue tomorrow.
Wanted to publish something not to delay the minor stuff.
Regarding the testing of expected yamls, didnt review as discussed

Comment thread pkg/tree/builder.go Outdated
instanceKeys = []string{""}
}
keys := append([]string(nil), instanceKeys...)
sort.Strings(keys)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sort?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taken from Map, using sort keep the sorted list aligned, instead of getting each call a different list arrangement

Comment thread pkg/tree/builder.go
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated
The tree data model now carries only cumulative pod counts, not resource
requests. Resources, ResourceRange, and ResourcesIncomplete are removed from
WorkloadTree, ComponentNode, and InstanceNode, along with the per-pod resource
extraction and aggregation logic in totals.go. Per-pod resource usage is left to
consumers that fetch live pods, matching the pod-matching step described in the
CLI design.

Child components are now resolved via ComponentFactory.GetChildComponentsOf,
built from the structure's OwnerRef relationships, removing the builder's
dependency on instructions.StructureSummary. Each instance's subtree is built
once and cloned per instance so per-node pod counts stay independent.

Removes the YAML golden-file tests and testdata that asserted resource totals.

Signed-off-by: Roee Gil <roee.gil@run.ai>
@rogirun rogirun force-pushed the feat/workload-tree-data-model branch from 152402b to d0d091f Compare July 1, 2026 10:32
rogirun added 3 commits July 1, 2026 13:57
ChildrenOf and ParentOf had zero callers after the tree builder switched to
ComponentFactory.GetChildComponentsOf. The backing childrenMap and parentMap
fields are retained; both are still used internally.

Signed-off-by: Roee Gil <roee.gil@run.ai>
@rogirun rogirun requested review from Isan-Rivkin and ronlv10 July 1, 2026 11:25
rogirun added 2 commits July 1, 2026 15:19
A pod-bearing leaf with no scale in the spec now contributes one desired pod
instead of zero, matching Kubernetes' default replica count for a workload that
declares none. replicaBounds returns 1,1 for a nil scale; the non-nil branch is
unchanged.

Signed-off-by: Roee Gil <roee.gil@run.ai>
Adds a Milvus fixture (status-only root with querynode, datanode, and a
scaleless proxy StatefulSet child) and builder tests covering component
discovery, per-component pod counts, the single-pod default for the scaleless
child, and root rollup.

Signed-off-by: Roee Gil <roee.gil@run.ai>
Comment thread docs/design/cli/high-level-design.md Outdated
ronlv10
ronlv10 previously approved these changes Jul 1, 2026

// GetChildComponentsOf retrieves the direct child components of the named parent,
// resolved from the structure's OwnerRef relationships.
func (f *ComponentFactory) GetChildComponentsOf(parentName string) ([]*Component, error) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why you put it here? I think it should be in the tree..

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The factory has the karta and the object. The tree comes afterwards

Comment thread pkg/tree/builder.go
Comment on lines +106 to +107
baseChildren, err := buildComponentNodes(ctx, children, factory)
if err != nil {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

baseChildren -> instanceChildren

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it returns a component child list

Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/builder.go Outdated
Comment thread pkg/tree/totals.go
// ranged over the scale envelope. A pod-bearing leaf counts its own replicas
// (defaulting to one when it declares no scale); a non-pod leaf counts nothing;
// a non-leaf rolls up its children.
func instancePodCount(children []ComponentNode, scale *resource.Scale, hasPodDefinition bool) PodCount {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're missing here a case where instance has pods and also children
not sure if it's a valid case but not sure it doesn't and we miss it here

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var total PodCount
if hasPodDefinition {
    minReplicas, maxReplicas := replicaBounds(scale)
    total = PodCount{Min: minReplicas, Max: maxReplicas}
}
for _, child := range children {
    total.Min += child.PodCount.Min
    total.Max += child.PodCount.Max
}
return total

probably something like this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic breaks in the case where the instance has a replica and also has children. We ignore the instance and replica count. This sums it up.

Comment thread pkg/tree/totals.go Outdated
Comment thread pkg/tree/builder_test.go
Comment thread test/types/milvus.go

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I think that we should have a builder that in the test you override what's need for the test and then when reading the test is clear and you don't need to jump to this file - but that's not the first case so I get it if it too much now

Comment thread .gitignore
*~
.vscode/
.idea
/docs/examples/quickstart/quickstart

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add to ignore: this is the binary created when running go build

Comment thread pkg/tree/totals.go
// replicaBounds returns the min and max desired replica counts for a scale.
// A nil scale defaults to one replica: assumes that a pod-bearing component that declares no
// scale still runs a single pod.
func replicaBounds(scale *resource.Scale) (minReplicas, maxReplicas int) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Min-only scale yields Max < Min, for example PodCount{Min:3, Max:0}.

  • suggest to add a test with min only
  • define the correct behavior, indicate it's optional with a pointer or something

Comment thread pkg/tree/builder.go
// its components, their instances, normalized status, and the desired pod counts
// ranged over the scale envelope. The factory carries the Karta definition and
// the workload object it was built from.
func Build(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build is getting Stack Overflow in case of a circular ownerRef.
We should handle this instead of crashing because this is user-input that can crash the cli.

  • Suggest to add a test case for this

Note: the validator handles this error cleanly NewKartaValidator().Validate() returns error: root component cannot have owner ref

apiVersion: run.ai/v1alpha1
kind: Karta
metadata:
  name: cyclic
spec:
  structureDefinition:
    rootComponent:
      name: root
      ownerRef: a
      kind:
        group: example.com
        version: v1
        kind: Example
    childComponents:
      - name: a
        ownerRef: root
        kind:
          group: example.com
          version: v1
          kind: ExampleChild

Comment thread test/types/milvus.go
Comment thread pkg/tree/builder.go
Name: component.Name(),
Kind: component.Kind(),
HasPodDefinition: component.HasPodDefinition(),
PodCount: sumNodes(instances),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, for a pod-bearing root with no children (batch-job.yaml, nimservice.yaml), the tree is empty and the workload's desired pod count is unrepresentable in any field. Intended?

For example below, tree pod count and children is 0.

karta:

apiVersion: run.ai/v1alpha1
kind: Karta
metadata:
  name: batch-v1-job
spec:
  structureDefinition:
    rootComponent:
      name: job
      kind:
        group: batch
        version: v1
        kind: Job
      scaleDefinition:
        replicasPath: .spec.parallelism // 1
      specDefinition:
        podTemplateSpecPath: .spec.template

and workload

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
  namespace: default
spec:
  parallelism: 5
  template:
    spec:
      containers:
        - name: pi
          image: perl:5.34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Data model: WorkloadTree

4 participants