Describe the bug
Two stacked bugs in the REST response path combine to produce a silent 5-minute hang (zero bytes flushed to client, 504 from the load balancer) whenever a search response body reaches ~715 MB.
Bug A — integer overflow in BytesRestResponse(String) serialization path
BytesRestResponse(RestStatus, String contentType, String content) calls new BytesArray(content), which calls new BytesRef(text), which calls UnicodeUtil.maxUTF8Length(text.length()). That method returns Math.multiplyExact(utf16Length, MAX_UTF8_BYTES_PER_CHAR) where MAX_UTF8_BYTES_PER_CHAR == 3. The product overflows int once text.length() > Integer.MAX_VALUE / 3 = 715_827_882, regardless of content; for ASCII-dominant JSON (one byte per char) that corresponds to a ~715 MB body. The resulting ArithmeticException is thrown after the channel is half-closed, so no error response is flushed.
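The threshold can be checked with plain JDK arithmetic. This is a minimal sketch, assuming only that maxUTF8Length multiplies by 3 via Math.multiplyExact as described above; the class and method names are illustrative stand-ins, not Lucene's API.

```java
final class Utf8LengthOverflowDemo {
    // Assumed to match Lucene's UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR, per the description above.
    static final int MAX_UTF8_BYTES_PER_CHAR = 3;

    // Stand-in for UnicodeUtil.maxUTF8Length: worst-case UTF-8 size of a UTF-16 string.
    static int maxUtf8Length(int utf16Length) {
        return Math.multiplyExact(utf16Length, MAX_UTF8_BYTES_PER_CHAR);
    }

    // True when the worst-case size no longer fits in an int.
    static boolean overflows(int utf16Length) {
        try {
            maxUtf8Length(utf16Length);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int lastSafe = Integer.MAX_VALUE / MAX_UTF8_BYTES_PER_CHAR; // 715_827_882
        System.out.println(lastSafe + " overflows: " + overflows(lastSafe));
        System.out.println((lastSafe + 1) + " overflows: " + overflows(lastSafe + 1));
    }
}
```

The check depends only on the char count, which is why the reproducer further down needs no real 715 MB buffer.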
Upstream precedent: OS#1651 / PR #7963 fixed the same overflow on the request (ingest) side. The response side was never patched.
Bug B — non-idempotent close() in ResourceHandlingHttpChannel
RestController$ResourceHandlingHttpChannel.close() uses AtomicBoolean.compareAndSet(false, true) and throws IllegalStateException("Channel is already closed") when called a second time. When Bug A's exception is thrown mid-sendResponse, the first close() call half-closes the channel. RestActionListener.onFailure then tries to send the error response by calling sendResponse again, which re-enters close() and throws IllegalStateException, so the error response is never sent. This leaves the underlying Netty channel dangling until the upstream load balancer times out (e.g. 300 s).
Related component
Search
To Reproduce
Bug A
// Reproducer (no large buffer allocation: the overflow fires before any char access)
int overflowingLength = (Integer.MAX_VALUE / 3) + 1; // 715_827_883
CharSequence text = new CharSequence() {
    @Override public int length() { return overflowingLength; }
    @Override public char charAt(int i) { throw new AssertionError("not reached"); }
    @Override public CharSequence subSequence(int s, int e) { throw new UnsupportedOperationException(); }
};
assertThrows(ArithmeticException.class, () -> new BytesArray(new BytesRef(text)));
Bug B
// Proves that a second sendResponse throws instead of no-op'ing
restController.registerHandler(GET, "/repro-bug-b", (req, channel, c) -> {
    channel.sendResponse(new BytesRestResponse(OK, TEXT_CONTENT_TYPE, BytesArray.EMPTY));
    // second call mirrors RestActionListener.onFailure after Bug A fires
    assertThrows(IllegalStateException.class, () ->
        channel.sendResponse(new BytesRestResponse(INTERNAL_SERVER_ERROR, TEXT_CONTENT_TYPE, BytesArray.EMPTY)));
});
Both reproducer tests are included in the companion PR.
Expected behavior
Bug A: Pre-check the response size before crossing into Lucene's UnicodeUtil.maxUTF8Length. If (long) content.length() * 3 > Integer.MAX_VALUE, fail with a typed exception that maps to a clean HTTP 413/507 error response — not a raw ArithmeticException thrown mid-write. Mirror the pattern from PR #7963 (request side).
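One possible shape for the Bug A pre-check is sketched below. It widens to long before multiplying, so the comparison itself cannot overflow. ResponseSizeGuard, checkedMaxUtf8Length and ResponseTooLargeException are hypothetical names chosen for illustration; they are not existing OpenSearch classes, and this is not claimed to be the exact code used in PR #7963.

```java
final class ResponseSizeGuard {
    // Assumed to match Lucene's UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR.
    static final int MAX_UTF8_BYTES_PER_CHAR = 3;

    // Hypothetical typed exception; a real fix would map it to an HTTP error status.
    static final class ResponseTooLargeException extends RuntimeException {
        ResponseTooLargeException(String message) {
            super(message);
        }
    }

    // Returns the worst-case UTF-8 size, or fails with a typed exception
    // instead of the raw ArithmeticException from Math.multiplyExact.
    static int checkedMaxUtf8Length(int utf16Length) {
        long worstCase = (long) utf16Length * MAX_UTF8_BYTES_PER_CHAR;
        if (worstCase > Integer.MAX_VALUE) {
            throw new ResponseTooLargeException(
                "response of " + utf16Length + " chars may need " + worstCase
                    + " UTF-8 bytes, above the " + Integer.MAX_VALUE + " byte limit");
        }
        return (int) worstCase;
    }
}
```

Running the guard before the response object is constructed would let the failure surface while the channel can still carry an error body.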
Bug B: ResourceHandlingHttpChannel.close() should be idempotent — a second call should be a no-op, not throw IllegalStateException. This allows the error path in RestActionListener.onFailure to reach the client gracefully even if the success path already closed the channel.
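The difference between the current and the requested close() semantics can be modelled in isolation. This is a simplified sketch, not the RestController code: both channel classes are hypothetical, and the cleanups counter stands in for whatever resource release the real close() performs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class IdempotentCloseSketch {
    // Models the behaviour described in this issue: a second close() throws.
    static final class ThrowingChannel {
        private final AtomicBoolean closed = new AtomicBoolean();

        void close() {
            if (!closed.compareAndSet(false, true)) {
                throw new IllegalStateException("Channel is already closed");
            }
        }
    }

    // Models the requested behaviour: a second close() is a no-op.
    static final class IdempotentChannel {
        private final AtomicBoolean closed = new AtomicBoolean();
        final AtomicInteger cleanups = new AtomicInteger();

        void close() {
            // Only the caller that wins the CAS runs the cleanup, so it happens exactly once.
            if (closed.compareAndSet(false, true)) {
                cleanups.incrementAndGet();
            }
        }
    }
}
```

Keeping the compareAndSet and dropping only the throw preserves the run-once guarantee for cleanup while letting a later failure path proceed.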
Additional Details
Observed failure mode (production stack trace, OS 3.1):
[WARN ][r.suppressed] path: /<index>/_search
java.lang.ArithmeticException: integer overflow
at java.lang.Math.multiplyExact(Math.java:992)
at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:676)
at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:80)
at org.opensearch.core.common.bytes.BytesArray.<init>(BytesArray.java:56)
at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:89)
at ...HttpResponseChannel.sendResponse(HttpResponseAdapter.java:140)
at AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:773)
at ExpandSearchPhase.run(ExpandSearchPhase.java:132)
[ERROR][o.o.r.a.RestResponseListener] failed to send failure response
java.lang.IllegalStateException: Channel is already closed
at RestController$ResourceHandlingHttpChannel.close(RestController.java:648)
at RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:641)
at RestActionListener.onFailure(RestActionListener.java:88)
...
Suppressed: java.lang.ArithmeticException: integer overflow (Bug A, above)
Cliff value: utf16Length > Integer.MAX_VALUE / 3 = 715_827_882 (~715 MB ASCII JSON). Confirmed on a cluster with 74x m7i.4xlarge data nodes: responses <=650 MB succeed (~20 s), while responses >=700 MB hang for the full 300 s load-balancer timeout with zero bytes flushed to the client.
Version: OS 3.1 (stack trace); both code paths are present in current main.
Related issues: