Skip to content

Add cross-setting validator to prevent balance factor zero crash loop#22347

Open
Dhanwani wants to merge 1 commit into
opensearch-project:mainfrom
Dhanwani:fix/balance-factor-zero-crash
Open

Add cross-setting validator to prevent balance factor zero crash loop#22347
Dhanwani wants to merge 1 commit into
opensearch-project:mainfrom
Dhanwani:fix/balance-factor-zero-crash

Conversation

@Dhanwani

@Dhanwani Dhanwani commented Jun 29, 2026

Copy link
Copy Markdown

Summary

Setting both cluster.routing.allocation.balance.shard and cluster.routing.allocation.balance.index to 0.0 causes an unrecoverable cluster-manager crash loop. Each setting independently allows 0.0 as minimum, but the WeightFunction requires their sum to be > 0. When both are set to zero, the cluster-manager crashes on every election attempt and becomes unrecoverable via the API.

Changes

  • Add IndexBalanceFactorValidator and ShardBalanceFactorValidator (same pattern as DiskThresholdSettings.LowDiskWatermarkValidator) that validate the sum of both balance factors > 0 before settings enter cluster state
  • Add unit tests for both valid and invalid setting combinations

Impact

  • Before: Setting both balance factors to 0 causes permanent cluster-manager crash loop requiring manual node restart to recover
  • After: The API rejects the invalid combination with a clear error message (HTTP 400)

Test Plan

  • Unit tests verify invalid combinations are rejected on construction and dynamic update
  • Unit tests verify valid combinations (one factor at 0 with the other > 0) are accepted
  • Manually tested on a 6-node cluster (3 master + 3 data nodes): confirmed the API returns HTTP 400 with clear error message when attempting to set both factors to 0
  • Confirmed existing clusters with bad values recover gracefully (defensive fallback resets to defaults)

Manual test output (post-fix):

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{
    "transient": {
      "cluster.routing.allocation.balance.shard": "0",
      "cluster.routing.allocation.balance.index": "0"
    }
  }'
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Balance factors [cluster.routing.allocation.balance.index] and [cluster.routing.allocation.balance.shard] must sum to a value greater than zero but was [0.0]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Balance factors [cluster.routing.allocation.balance.index] and [cluster.routing.allocation.balance.shard] must sum to a value greater than zero but was [0.0]"
  },
  "status": 400
}

Issues Resolved

Fixes #22305

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 78a5287)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to 78a5287

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Make zero-sum validation explicit and robust

Since both balance factors are independently constrained to be >= 0.0f, the
condition sum <= 0.0f is equivalent to "both are exactly zero", but floating-point
addition could produce a tiny positive value (e.g., 0.0f + Float.MIN_VALUE) that
still leads to a near-zero divisor and a crash loop. Validate each factor's
contribution explicitly (e.g., require sum > 0 strictly and additionally guard
against denormalized sums) or check indexBalance == 0.0f && shardBalance == 0.0f to
make intent unambiguous and robust.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [573-586]

 static void doValidateBalanceFactorSum(float indexBalance, float shardBalance) {
     float sum = indexBalance + shardBalance;
-    if (sum <= 0.0f) {
+    if (indexBalance == 0.0f && shardBalance == 0.0f) {
         throw new IllegalArgumentException(
             "Balance factors ["
                 + INDEX_BALANCE_FACTOR_SETTING.getKey()
                 + "] and ["
                 + SHARD_BALANCE_FACTOR_SETTING.getKey()
                 + "] must sum to a value greater than zero but was ["
                 + sum
                 + "]"
         );
     }
 }
Suggestion importance[1-10]: 3

__

Why: The original sum <= 0.0f check is already correct given both factors are constrained to >= 0.0f. The suggested change to explicit zero comparison is a minor stylistic preference with negligible practical impact, since both values cannot be negative.

Low

Previous suggestions

Suggestions up to commit bb4f758
CategorySuggestion                                                                                                                                    Impact
General
Handle NaN in fallback sum check

The defensive fallback silently overrides user-configured values, which can mask
configuration errors and cause confusing behavior (operators see their settings but
allocation uses different ones). Consider also accounting for negative sums or NaN,
and ensure the fallback path doesn't conflict with the user's intent—e.g., when only
one value is zero the validator already permits it, so this branch should only
trigger for sum<=0 from persisted invalid state, which is the intent; however make
sure Float.isNaN(sum) is also handled to prevent silent NaN propagation into
theta0/theta1.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [636-644]

-if (sum <= 0.0f) {
+if (sum <= 0.0f || Float.isNaN(sum)) {
     logger.warn(
         "Balance factors sum to [{}] which is invalid. Falling back to defaults: index=[0.55], shard=[0.45]",
         sum
     );
     indexBalance = 0.55f;
     shardBalance = 0.45f;
     sum = 1.0f;
 }
Suggestion importance[1-10]: 5

__

Why: Adding Float.isNaN(sum) check is a reasonable defensive improvement to prevent NaN propagation into theta0/theta1, though this is an edge case unlikely to occur given input validation on the settings.

Low
Possible issue
Avoid circular static initialization in validators

The two validators reference each other via settings(), which can cause a circular
dependency during static class initialization. Since each validator's settings()
returns a singleton list referring to the other Setting, ensure that the Setting
fields are initialized in an order that avoids null references—or use a lazy
iterator (e.g., build the list inside settings() rather than via a field) to defer
resolution until both Settings are fully initialized.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [100-117]

+public static final Setting<Float> INDEX_BALANCE_FACTOR_SETTING = Setting.floatSetting(
+    "cluster.routing.allocation.balance.index",
+    0.55f,
+    0.0f,
+    Float.MAX_VALUE,
+    new IndexBalanceFactorValidator(),
+    Property.Dynamic,
+    Property.NodeScope
+);
+public static final Setting<Float> SHARD_BALANCE_FACTOR_SETTING = Setting.floatSetting(
+    "cluster.routing.allocation.balance.shard",
+    0.45f,
+    0.0f,
+    Float.MAX_VALUE,
+    new ShardBalanceFactorValidator(),
+    Property.Dynamic,
+    Property.NodeScope
+);
 
-
Suggestion importance[1-10]: 4

__

Why: The concern about circular initialization is theoretically valid, but the improved_code is identical to existing_code, so no actual fix is provided. Also, the validators are instantiated at field init time and their settings() is called later, mitigating the issue.

Low
Suggestions up to commit b39b812
CategorySuggestion                                                                                                                                    Impact
General
Improve fallback log clarity for operators

The defensive fallback silently overrides user-configured values. If the cluster
state contains invalid values from an older version, the operator will see balancing
behave with hardcoded defaults but their setting values still report the invalid
configuration, causing confusing divergence. Consider also updating
this.indexBalance/this.shardBalance consistently (which the code does below) but
ensure this state mismatch with the actual Setting values is documented or surfaced
more visibly (e.g., via a cluster warning/deprecation log on each invocation
throttled appropriately).

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [636-644]

 if (sum <= 0.0f) {
     logger.warn(
-        "Balance factors sum to [{}] which is invalid. Falling back to defaults: index=[0.55], shard=[0.45]",
+        "Balance factors [index={}, shard={}] sum to [{}] which is invalid; "
+            + "falling back to defaults [index=0.55, shard=0.45]. Please update the cluster settings to valid values.",
+        indexBalance,
+        shardBalance,
         sum
     );
     indexBalance = 0.55f;
     shardBalance = 0.45f;
     sum = 1.0f;
 }
Suggestion importance[1-10]: 4

__

Why: The improved log message provides slightly more diagnostic information (individual factor values), which is helpful for operators, but is a minor readability/maintainability improvement.

Low
Verify upper bound choice avoids overflow

The upper bound was changed from the previous (likely unbounded) limit to
Float.MAX_VALUE. While Float.MAX_VALUE is effectively unbounded, supplying it
explicitly means any prior valid configuration with NaN or Infinity (if previously
accepted) will now be rejected. More importantly, ensure the same upper bound
semantics are intentional; consider whether a more reasonable practical upper bound
is appropriate, or document this explicitly to avoid behavior change surprises. Also
verify Float.MAX_VALUE does not cause overflow in sum = indexBalance + shardBalance
downstream.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [100-117]

+public static final Setting<Float> INDEX_BALANCE_FACTOR_SETTING = Setting.floatSetting(
+    "cluster.routing.allocation.balance.index",
+    0.55f,
+    0.0f,
+    Float.MAX_VALUE,
+    new IndexBalanceFactorValidator(),
+    Property.Dynamic,
+    Property.NodeScope
+);
 
-
Suggestion importance[1-10]: 2

__

Why: The suggestion is vague, only asks to "verify" without proposing a concrete change, and the improved_code is identical to the existing_code.

Low
Suggestions up to commit 46f7506
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix swapped default fallback values

The fallback warning message swaps the default values: the actual defaults declared
above are INDEX_BALANCE_FACTOR_SETTING=0.55f and SHARD_BALANCE_FACTOR_SETTING=0.45f,
but the log message and assignments use index=0.45 and shard=0.55. Align the
fallback assignments and the log message with the declared defaults to avoid
silently applying the wrong weights.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [637-643]

 logger.warn(
-    "Balance factors sum to [{}] which is invalid. Falling back to defaults: index=[0.45], shard=[0.55]",
+    "Balance factors sum to [{}] which is invalid. Falling back to defaults: index=[0.55], shard=[0.45]",
     sum
 );
-indexBalance = 0.45f;
-shardBalance = 0.55f;
+indexBalance = 0.55f;
+shardBalance = 0.45f;
 sum = 1.0f;
Suggestion importance[1-10]: 8

__

Why: Correctly identifies that the fallback values are swapped relative to the declared defaults (INDEX_BALANCE_FACTOR_SETTING=0.55f, SHARD_BALANCE_FACTOR_SETTING=0.45f), which would cause silently incorrect weights in the recovery path.

Medium
Suggestions up to commit 45799a7
CategorySuggestion                                                                                                                                    Impact
Possible issue
Preserve primary-shard constraint configuration

Removing the preferPrimaryShardBalance and preferPrimaryShardRebalance parameters
from WeightFunction causes the constructor to no longer invoke
updateAllocationConstraint/updateRebalanceConstraint for the primary-shard
constraints. This silently disables the primary-shard balance/rebalance features
even when the corresponding settings are enabled, which is a behavioral regression
unrelated to the bug fix. Restore the constraint wiring (either by passing the flags
through to WeightFunction or by applying them in updateWeightFunction).

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [345-352]

 private void updateWeightFunction() {
     weightFunction = new WeightFunction(
         this.indexBalanceFactor,
         this.shardBalanceFactor,
         this.preferPrimaryShardRebalanceBuffer,
-        this.primaryConstraintThreshold
+        this.primaryConstraintThreshold,
+        this.preferPrimaryShardBalance,
+        this.preferPrimaryShardRebalance
     );
 }
Suggestion importance[1-10]: 9

__

Why: Removing preferPrimaryShardBalance and preferPrimaryShardRebalance from the WeightFunction constructor call drops the calls to updateAllocationConstraint/updateRebalanceConstraint for primary-shard constraints, which is a significant behavioral regression unrelated to the bug fix.

High
Avoid unrelated remote-balancer gating change

Gating the RemoteShardsBalancer behind the SEARCHABLE_SNAPSHOT feature flag is
unrelated to this PR's stated purpose (cross-setting validation for balance factors)
and will break searchable-snapshot/remote shard allocation in deployments where the
feature flag is not explicitly enabled but searchable snapshot indices exist. Remove
this gating unless it is intentional and covered by separate review.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [420-425]

-if (FeatureFlags.isEnabled(FeatureFlags.SEARCHABLE_SNAPSHOT)) {
-    final ShardsBalancer remoteShardsBalancer = new RemoteShardsBalancer(logger, allocation);
-    remoteShardsBalancer.allocateUnassigned();
-    remoteShardsBalancer.moveShards();
-    remoteShardsBalancer.balance();
-}
+final ShardsBalancer remoteShardsBalancer = new RemoteShardsBalancer(logger, allocation);
+remoteShardsBalancer.allocateUnassigned();
+remoteShardsBalancer.moveShards();
+remoteShardsBalancer.balance();
Suggestion importance[1-10]: 8

__

Why: Gating RemoteShardsBalancer behind the SEARCHABLE_SNAPSHOT feature flag is unrelated to this PR's stated scope and would likely break remote shard allocation in deployments where the flag is not explicitly enabled.

Medium
Fix swapped fallback default values

The fallback default values (index=0.45, shard=0.55) are swapped relative to the
actual setting defaults (INDEX_BALANCE_FACTOR_SETTING=0.55f,
SHARD_BALANCE_FACTOR_SETTING=0.45f). This silently changes balancer behavior on
fallback. Use the documented defaults to match the declared setting defaults.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java [629-637]

 if (sum <= 0.0f) {
     logger.warn(
-        "Balance factors sum to [{}] which is invalid. Falling back to defaults: index=[0.45], shard=[0.55]",
+        "Balance factors sum to [{}] which is invalid. Falling back to defaults: index=[0.55], shard=[0.45]",
         sum
     );
-    indexBalance = 0.45f;
-    shardBalance = 0.55f;
+    indexBalance = 0.55f;
+    shardBalance = 0.45f;
     sum = 1.0f;
 }
Suggestion importance[1-10]: 8

__

Why: The fallback defaults are indeed swapped relative to the declared setting defaults (INDEX_BALANCE_FACTOR_SETTING=0.55f, SHARD_BALANCE_FACTOR_SETTING=0.45f), so the fallback behavior differs from the documented defaults.

Medium

@Dhanwani Dhanwani force-pushed the fix/balance-factor-zero-crash branch from 45799a7 to 46f7506 Compare June 29, 2026 14:23
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 46f7506

@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b39b812

@Dhanwani Dhanwani force-pushed the fix/balance-factor-zero-crash branch from b39b812 to bb4f758 Compare June 29, 2026 14:39
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit bb4f758

@Dhanwani Dhanwani marked this pull request as ready for review June 29, 2026 14:43
@Dhanwani Dhanwani requested a review from a team as a code owner June 29, 2026 14:43
@github-actions

Copy link
Copy Markdown
Contributor

❌ Gradle check result for bb4f758: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Setting both cluster.routing.allocation.balance.shard and
cluster.routing.allocation.balance.index to 0.0 causes an unrecoverable
cluster-manager crash loop. Each setting independently allows 0.0 as
minimum, but the WeightFunction requires their sum to be > 0.

This adds a Setting.Validator for both balance factor settings that
validates the sum > 0 constraint before the settings enter cluster state,
preventing the poison pill from being published.

Fixes opensearch-project#22305

Signed-off-by: Abhishek Dhanwani <f20170161h@alumni.bits-pilani.ac.in>
@Dhanwani Dhanwani force-pushed the fix/balance-factor-zero-crash branch from bb4f758 to 78a5287 Compare June 30, 2026 17:25
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 78a5287

@github-actions

Copy link
Copy Markdown
Contributor

❌ Gradle check result for 78a5287: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

*
* @opensearch.internal
*/
static final class IndexBalanceFactorValidator implements Setting.Validator<Float> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need individual classes for them, In the constructor we can pass the setting name to be fetched

@alchemist51

Copy link
Copy Markdown
Contributor

Spotless is failing for the new files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Setting balance factors to zero causes unrecoverable cluster-manager crash loop

4 participants