Skip to content

[SPARK-56227][CORE] Fix GcmTransportCipher to correctly handle multiple messages per channel#55028

Draft
aajisaka wants to merge 1 commit intoapache:masterfrom
aajisaka:fix-rpc-encryption
Draft

[SPARK-56227][CORE] Fix GcmTransportCipher to correctly handle multiple messages per channel#55028
aajisaka wants to merge 1 commit intoapache:masterfrom
aajisaka:fix-rpc-encryption

Conversation

@aajisaka
Copy link
Member

What changes were proposed in this pull request?

This fixes two bugs in GcmTransportCipher introduced by SPARK-47172.

Bug 1: DecryptionHandler silently drops every message after the first.

AesGcmHkdfStreaming is a one-shot streaming primitive — each independently encrypted message carries its own random IV and requires a fresh StreamSegmentDecrypter. The DecryptionHandler never reset its per-message state (completed, decrypterInit, expectedLength, segmentNumber, etc.) nor replaced the single final StreamSegmentDecrypter instance between messages. After the first message was decoded, completed = true permanently, and all subsequent messages were silently dropped because both initalizeExpectedLength() and initalizeDecrypter() returned early as no-ops and the inner while loop never ran.

Fix: add resetForNextMessage() which clears all per-message fields and allocates a new StreamSegmentDecrypter; call it after each fully decoded message.

Bug 2: DecryptionHandler discards bytes from messages batched in the same channelRead() call.

Under shuffle load, TCP coalesces multiple encrypted messages into a single ByteBuf. The original code exited the decryption loop as soon as one message completed and released the buffer — including any trailing bytes belonging to subsequent messages. The next channelRead() then received bytes starting mid-stream of the second message, interpreted them as an 8-byte length header, and threw: IllegalStateException: Invalid expected ciphertext length.

Fix: wrap the decryption logic in an outer loop that continues consuming messages from the same buffer until either the buffer is exhausted or a partial message is encountered. resetForNextMessage() is called inside the loop immediately after each complete message while the buffer is still held.

Bug 3 (minor): EncryptionHandler shares working buffers across concurrent GcmEncryptedMessage instances.

plaintextBuffer and ciphertextBuffer were fields of EncryptionHandler passed into every GcmEncryptedMessage. The constructor's ciphertextBuffer.limit(0) call could corrupt an in-flight message's buffer state if Netty batched writes. Fix: move buffer ownership into GcmEncryptedMessage so each message allocates its own working buffers.

Without these fixes, enabling AES/GCM/NoPadding RPC encryption causes YARN executor containers to fail: the auth handshake succeeds but all post-auth RPC messages are dropped or corrupted, leaving the channel hung until YARN kills the container.

Why are the changes needed?

To successfully run Spark jobs on YARN with spark.network.crypto.cipher="AES/GCM/NoPadding"

Fixes #54999

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests:

  • testMultipleMessages: encrypts and decrypts two independent messages through the same handler pair with separate channelRead() calls.
  • testBatchedMessages: concatenates two ciphertexts into one ByteBuf and delivers them in a single channelRead() call, verifying both are decoded correctly.

Ported these changes to our Spark 3.4.x-based internal branch and ran multiple jobs in YARN cluster successfully.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

This fixes two bugs in `GcmTransportCipher` introduced by SPARK-47172.

Bug 1: `DecryptionHandler` silently drops every message after the first.

`AesGcmHkdfStreaming` is a one-shot streaming primitive — each independently
encrypted message carries its own random IV and requires a fresh
`StreamSegmentDecrypter`. The `DecryptionHandler` never reset its per-message
state (`completed`, `decrypterInit`, `expectedLength`, `segmentNumber`, etc.)
nor replaced the single `final StreamSegmentDecrypter` instance between
messages. After the first message was decoded, `completed = true` permanently,
and all subsequent messages were silently dropped because both `initalizeExpectedLength()`
and `initalizeDecrypter()` returned early as no-ops and the inner while loop
never ran.

Fix: add `resetForNextMessage()` which clears all per-message fields and
allocates a new `StreamSegmentDecrypter`; call it after each fully decoded
message.

Bug 2: `DecryptionHandler` discards bytes from messages batched in the same `channelRead()` call.

Under shuffle load, TCP coalesces multiple encrypted messages into a single
`ByteBuf`. The original code exited the decryption loop as soon as one message
completed and released the buffer — including any trailing bytes belonging to
subsequent messages. The next `channelRead()` then received bytes starting
mid-stream of the second message, interpreted them as an 8-byte length header,
and threw:

  `IllegalStateException: Invalid expected ciphertext length.`

Fix: wrap the decryption logic in an outer loop that continues consuming
messages from the same buffer until either the buffer is exhausted or a
partial message is encountered. `resetForNextMessage()` is called inside the
loop immediately after each complete message while the buffer is still held.

Bug 3 (minor): `EncryptionHandler` shares working buffers across concurrent `GcmEncryptedMessage` instances.

`plaintextBuffer` and `ciphertextBuffer` were fields of `EncryptionHandler`
passed into every `GcmEncryptedMessage`. The constructor's `ciphertextBuffer.limit(0)`
call could corrupt an in-flight message's buffer state if Netty batched writes.
Fix: move buffer ownership into `GcmEncryptedMessage` so each message allocates
its own working buffers.

Without these fixes, enabling `AES/GCM/NoPadding` RPC encryption causes YARN
executor containers to fail: the auth handshake succeeds but all post-auth RPC
messages are dropped or corrupted, leaving the channel hung until YARN kills the
container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aajisaka aajisaka changed the title Fix GcmTransportCipher to correctly handle multiple messages per channel [SPARK-56227] Fix GcmTransportCipher to correctly handle multiple messages per channel Mar 26, 2026
@aajisaka aajisaka changed the title [SPARK-56227] Fix GcmTransportCipher to correctly handle multiple messages per channel [SPARK-56227][CORE] Fix GcmTransportCipher to correctly handle multiple messages per channel Mar 26, 2026
@aajisaka aajisaka marked this pull request as draft March 26, 2026 04:35
@aajisaka
Copy link
Member Author

Converted to draft. We are seeing job failures in benchmarking test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AES-GCM for RPC encryption does not work on YARN

1 participant