[SDTEST-3824] Split planner from runner and make fast discovery fallback-only#73
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5f18e23896
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| planInfo, err := tr.planner.LoadPlan() | ||
| if err != nil { | ||
| return fmt.Errorf("failed to read parallel runners count from %s: %w", constants.ParallelRunnersOutputPath, err) | ||
| slog.Error("Test optimization plan is not available", "error", err) | ||
| return fmt.Errorf("test optimization plan is not available: %w", err) |
There was a problem hiding this comment.
Don't abort runs when only the plan cache is absent
In the existing-artifact path (parallel-runners.txt already present), a missing or corrupt .testoptimization/runner/cache/test_suite_durations.json now causes Run to return before reading test-files.txt or tests-split. That makes CI jobs fail to execute tests when they restore the runner file artifacts without the cache, even though the file lists are sufficient and DistributeTestFiles already has a default-weight fallback for an unavailable cache.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
this is actually what we want to do now because running tests without a correct plan is not wise - it is better to abort to signal that the setup is broken
E2E Test Report: SUCCESSTested by: Shepherd Agent (autonomous QA for Datadog Test Optimization) Pull Request
SummaryThis PR passes the requested E2E coverage. I verified the main behavioral fix and the related regression surfaces:
Results Matrix
Detailed Evidence1. Successful full discovery stays authoritativeI created a temporary Spree spec file containing only skipped examples:
The temporary mockdog scenario returned a stale backend duration row for that file alongside the normal helper specs:
Important log evidence from Artifact evidence: The skipped-only file was absent from 2. Fast discovery fallback still works when full discovery failsTo force a real full-discovery failure, I temporarily made the skipped-only Spree spec syntactically invalid and narrowed Important log evidence from Artifact evidence: This verifies the fallback side of the behavior: when full discovery cannot produce authoritative results, ddtest still plans from fast-discovered files and can attach matching backend durations for scheduling. 3. Real worker execution still worksI ran a full ddtest execution against Sidekiq's focused Important log evidence from Plan artifact: This covers the runner side of the planner/runner refactor, not just planning. 4. CI-node local worker subsplitting still worksI then reran Sidekiq in CI-node mode with Important log evidence from This specifically exercises the CI-node worker path mentioned in the PR summary. 5. Durations endpoint 404 is non-fatalI created a temporary mockdog scenario that returned Important log evidence from This confirms the requested 404 regression check: the endpoint error is logged, but planning still succeeds with default duration weighting. Issues FoundNo ddtest PR bugs were found. One environment issue was encountered and handled:
Temporary test fixtures/scenarios used for the Spree and 404 checks were removed after the runs. ConclusionThe PR behavior is verified end-to-end:
This E2E test was performed by Shepherd - autonomous QA agent for Datadog Test Optimization. |
What
internal/planner, leavinginternal/runnerfocused on executing plan artifacts.Why
A suite that is completely skipped by the test framework can still be found by filesystem glob discovery. Before this change, planning was too eager to merge fast-discovered files back into the runnable set after full discovery succeeded. That could cause files omitted by full discovery, for example an RSpec file containing only skipped examples, to appear in plan artifacts and receive default or backend durations.
Full discovery is the authoritative signal for runnable local tests whenever it succeeds. Fast discovery remains useful only as a safety fallback when full discovery fails or is cancelled.
E2E testing
Bug reproduction and verification:
xitexamples.spec/**/*_spec.rb. To reproduce the old failure more clearly, use a file that already has historical backend duration data, for example by first running it normally, then changing every example in the file to skipped.ddtest planwith test optimization enabled and successful full discovery..testoptimization/runner/test-filesor runner split artifacts, with either default or backend duration data.Full regression testing for both planning and running test execution must be run because this PR heavily refactors planner/runner ownership and artifact loading.
Check that there are no regressions when the durations backend API returns 404.