Add MpscRingBuffer primitive for pre-allocated slot rings by dougqh · Pull Request #11492 · DataDog/dd-trace-java

dougqh · 2026-05-28T14:39:43Z

Summary

New datadog.trace.util.concurrent.MpscRingBuffer<T> — a bounded multi-producer / single-consumer ring buffer of pre-allocated, recyclable T instances. Producers and consumers mutate and read slots in place via callbacks; no allocation occurs per write/read after construction.

Lives in internal-api. No callers yet — this PR adds the primitive and tests/benches only. CSS integration to follow in a separate PR.

Motivation

Producer/consumer split in CSS v1.3.0 (#11381) allocates one SpanSnapshot per metrics-eligible span and hands the reference through an MpscArrayQueue. At typical heap sizes (≥256m) this is fine — the snapshots die young. At tight heap (we hit it at Xmx64m in spring-petclinic load testing) the in-flight snapshots overflow G1 survivor regions, triggering To-space Exhausted → fallback Full GC storms.

Moving from "reference passing of fresh objects" to "in-place mutation of pre-allocated slots" eliminates the per-publish allocation entirely. Smoke benchmarks (capacity=1024, 8 producers, 1 consumer on M-series Mac):

Benchmark	`MpscArrayQueue<Slot>` + per-publish `new Slot(...)`	`MpscRingBuffer<Slot>`	Ratio
`write_*_8p` (publish only)	399 M ops/s	768 M ops/s	1.92×
`e2e_*_8p` (8P→1C end-to-end)	514 M ops/s	846 M ops/s	1.65×

Allocation rate falls to zero on the ring path (vs sizeof(Slot) × producer ops/s on the queue path) — that's the heap-pressure side of the win, not visible in the throughput numbers above but the key reason this matters for tight-heap workloads.

API

Callback-style, with 0/1/2 context-object overloads. Contexts come before the slot, matching TagMap.forEach / Hashtable.forEach:

// 0 context
ring.tryWrite(slot -> { slot.value = 1; });
ring.drain(slot -> handle(slot));

// 1 context — static BiConsumer<C, T> avoids per-call lambda capture
static final BiConsumer<Span, Slot> FILL = (span, slot) -> slot.svc = span.getService();
ring.tryWrite(span, FILL);

// 2 contexts — TriConsumer<C1, C2, T>
static final TriConsumer<Span, Tag, Slot> FILL2 = (span, tag, slot) -> { ... };
ring.tryWrite(span, tag, FILL2);

tryWrite returns false when full (producer drops the value). drain returns the count processed.

Design

Producer cursor is CAS-claimed (AtomicLong.compareAndSet in a retry loop). Stale read of the consumer cursor is fine — a false "full" reading just causes a drop.
Per-slot publication is gated by an AtomicLongArray: a slot is published at sequence s iff sequences[s & mask] == s. This handles out-of-order publishes from concurrent producers — the consumer only advances over contiguous published slots.
Consumer cursor is volatile; written only by the consumer thread, read by producers to detect free space.
Capacity rounded up to power-of-two so mask = capacity - 1 works.

Test/bench coverage

MpscRingBufferTest — 12 JUnit 5 tests covering construction, FIFO order, capacity bounds, fill-drain-fill round trip, 0/1/2 context variants, and an 8-producer × 50,000-writes concurrency test (400K writes, ~150ms).
MpscRingBufferBenchmark (internal-api/src/jmh/...) — write_1p/8p/16p at varying producer counts + e2e_8p @Group (8 producers + 1 consumer). Parameterised on capacity.
RingVsQueueBenchmark (dd-trace-core/src/jmh/... — sits alongside the CSS benches, where jctools is already a dep) — head-to-head against MpscArrayQueue<Slot> + per-publish allocation. Numbers above come from this.

Test plan

:internal-api:test — MpscRingBufferTest passes (12/12 locally)
:internal-api:jmh runs MpscRingBufferBenchmark (smoke run passes)
:dd-trace-core:jmh runs RingVsQueueBenchmark (smoke run passes; 1.65×–1.92× ratio)
No callers yet — production code paths unchanged

🤖 Generated with Claude Code

Bounded multi-producer / single-consumer ring buffer of long-lived T instances. Producers mutate slots in place via callbacks; the consumer reads them the same way. No allocation per write/read after construction. BiConsumer/TriConsumer variants take context object(s) before the slot, matching the TagMap.forEach / Hashtable.forEach convention so callers can use static final non-capturing lambdas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@group

Five benchmarks: producer-only throughput at 1/8/16 threads with a background drainer, plus an e2e @group bench pairing 8 producers with 1 consumer for system throughput. Ring capacity is a @Param so runs can sweep capacities without recompiling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@group

Head-to-head benches in dd-trace-core's jmh source set (which already depends on jctools): MpscRingBuffer mutating pre-allocated slots vs MpscArrayQueue with a fresh Slot allocated per publish -- the latter mirrors the existing SpanSnapshot pattern in the CSS code. write_ring_8p / write_queue_8p compare publish cost with a background drainer; e2e_ring_8p / e2e_queue_8p use @group to pair 8 producers with 1 consumer for end-to-end throughput. Run with -prof gc to see per-op allocation rate where the ring's win shows up loudest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dd-octo-sts · 2026-05-28T15:06:56Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

Scenario	This PR	master	Change
insecure-bank / iast	14,046 ms	14,023 ms	+0.2%
insecure-bank / tracing	12,947 ms	13,020 ms	-0.6%
petclinic / appsec	16,569 ms	16,305 ms	+1.6%
petclinic / iast	16,456 ms	16,567 ms	-0.7%
petclinic / profiling	16,374 ms	16,517 ms	-0.9%
petclinic / tracing	14,975 ms	15,712 ms	-4.7%

Commit: 348ca2a3 · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

Spell out the contract that slot users rely on: plain (non-volatile) fields, no retention past handler return, slot-reference-not-shared. Producer fillers must not throw if possible -- and if they do, the slot is now published anyway (try/finally) so the consumer can't deadlock waiting for an unfinished sequence. Test covers the throw-then-recover path; the ring's cursors stay healthy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Saves one wrapper allocation per ring instance: producerCursor becomes a volatile long on the instance, paired with a static AtomicLongFieldUpdater for CAS. Same memory ordering as the prior AtomicLong (volatile field + field-updater CAS), but no per-instance wrapper object. publishedSequences stays AtomicLongArray -- the field updater approach doesn't apply to array element access. consumerCursor was already a plain volatile long with no wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related cache-line fixes for the producer hot path under heavy contention: 1. Stride publishedSequences by 8 longs (one cache line). Without this, adjacent logical slots share cache lines and concurrent producers writing nearby sequences ping-pong the same line between cores. The array grows by 8x but the upfront cost is bounded by the ring's capacity (e.g. 8 MB at the CSS default cap=131072). 2. Cache-line-pad the producerCursor and consumerCursor against each other using the standard Disruptor class-hierarchy pattern. Every consumer-side advance of consumerCursor would otherwise invalidate the line producers read for producerCursor (and vice versa). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A single CAS claims n contiguous slots, returning a Batch handle that the caller fills via fillAndPublish(slot...). Designed for callers whose work has a natural batch boundary (e.g. CSS publishing a trace's worth of metrics-eligible spans in one shot): cuts producer-cursor contention from O(N) CASes to O(1) per call. All-or-nothing: tryClaim(n) returns null if the ring can't fit the whole batch. The Batch is single-threaded (owned by the claiming thread), short-lived (scoped to one publish call), and has no thread-shutdown hazard -- the batch is fully consumed before returning. Filler-throw safety matches the existing tryWrite contract: the slot is published in a finally block so the consumer can advance, and the batch's published counter increments either way. Tests cover: requested size, capacity rejection, all-or-nothing, three filler overloads, over-publish IllegalStateException, throw recovery, and 8-producer concurrency (200 batches/thread x 16 size = 25600 items, single consumer sees every value exactly once). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Batch handle from tryClaim was supposed to be scalar-replaced by escape analysis, but JMH measurements showed it's not -- the inner-class implicit this$0 plus the CAS-retry inside tryClaim block scalarization on HotSpot. Result: ~24 bytes of Batch + cursor state allocated per publish on the hot path, ~50% throughput drop on single-element claims in CSS-style benches. Add three sequence-based primitives that callers manage directly: long tryClaimRange(int n) -> start sequence or -1L T slotAt(long seq) -> slot for that sequence void publish(long seq) -> release the slot to the consumer No per-call allocation, no callback dispatch. Callers handle the sequence arithmetic themselves and trade safety (forget to publish -> ring stuck) for hot-path predictability. The Batch API stays for safer use cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 348ca2a309

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-29T12:32:43Z

+      cursor = nextSeq;
+      count++;
+    }
+    if (count > 0) consumerCursor = cursor;


Release drained slots as each handler completes

When a drain processes a full ring with a non-trivial handler, producers keep seeing the old consumerCursor until the entire drain call returns, so tryWrite reports full and drops values even after earlier slots have already been handled. This is especially visible with capacity-sized drains or slow CSS aggregation callbacks; it also contradicts the class contract that a slot may be reclaimed once its handler returns. Advancing the consumer cursor after each successfully handled slot (or in smaller batches) would let producers reuse freed slots instead of dropping during the rest of the drain.

Useful? React with 👍 / 👎.

Mmmh, I believe there's a point here, consumerCursor is updated only after the whole drain loop.

AlexeyKuznetsov-DD · 2026-05-29T14:10:48Z

+// a 64-byte cache line. The HotSpot field-layout strategy preserves the declaration order across
+// the class hierarchy, so this pattern is reliable on all production JVMs we target.
+
+abstract class MpscRingBufferPad0 {


Just curious what Mpsc means here?

MPSC = multi-producer single-consumer

AlexeyKuznetsov-DD · 2026-05-29T14:13:55Z

+// producerCursor (and vice versa). Each padding class declares 7 longs (56 bytes); combined with
+// the cursor's own 8 bytes plus the JVM object header, each cursor + its surrounding pad fills
+// a 64-byte cache line. The HotSpot field-layout strategy preserves the declaration order across


Not sure if I got the math here: 56+8 is already 64 and plus the JVM object header should be more than 64 now. How this will be aligned with 64-byte cache line? Or am I missing something?

The idea is to make the cursor value to land within a cache line and the other cursor on another cacheline. That's why there's padding before (pad left), 8 bytes comes from the producerCursor.

Usually the CPU cacheline is 64B. But some processors go beyond that. I even read somewhere that on multi-cpu board, some cpu can have different cacheline sizes.
It's possibly not the intended production target, but Apple Silicon (ARM) has a 128 bytes cache line.

AlexeyKuznetsov-DD · 2026-05-29T14:16:34Z

+abstract class MpscRingBufferPad1 extends MpscRingBufferProducerCursor {
+  long p11, p12, p13, p14, p15, p16, p17;
+}
+
+abstract class MpscRingBufferConsumerCursor extends MpscRingBufferPad1 {
+  /** Highest sequence consumed. Only the consumer thread writes; producers read volatile. */
+  volatile long consumerCursor = -1L;
+}
+
+abstract class MpscRingBufferPad2 extends MpscRingBufferConsumerCursor {
+  long p21, p22, p23, p24, p25, p26, p27;


I'm eager to learn what is the logic behind this trick?

The aim of all the padding is to avoid "false sharing".
This way the consumer data is on separate cache line from the producer data.

This the drawing I have in mind when thinking about this problem

Pad0 Pad1 Pad2 Cacheline 0 Cache Line 1 Cache Line 3 | | | [----56B padding----][long value 8B][----56B padding----][long value 8B][----56B padding----] | | ProducerCursor ConsumerCursor

The start of the cache line is just an example, the cache line boundaries can start somewhere after, that is why there's pad1 so it prevent both cursors to land on the same cacheline, and that alos explains why there's padding on the right (pad2) to keep out following fields in the class. There's a notion of alignment but it's a slippery topic for me at this point.

Without the padding, that would look like this and that's where ther'es the false-sharing can happen, when two or more thread loads cache line 0, and one thread update a field this invalidates the cache for other threads, but as other threads need to update their cursor field they also invalidate the cache line, so the thread are actually contending over these.

Cache line 0 | [long value 8B][long value 8B][...]... | | | ProducerCursor ConsumerCursor other field

Doug may have a better wording than me.

bric3 · 2026-06-01T12:51:56Z

+  }
+
+  /** Mirrors the SpanSnapshot pattern: allocate a fresh instance per publish, offer it. */
+  @Threads(8)


question: Do 8 seems low for producers threads, or is enough to compare both MpscRingBuffer and MpscArrayQueue ?

bric3 · 2026-06-01T12:55:42Z

+    return ring.tryWrite(ts.counter++, FILLER);
+  }
+
+  @Threads(16)


question: Does it make sense to have higher producer threads ?

bric3 · 2026-06-01T13:34:28Z

+// producerCursor (and vice versa). Each padding class declares 7 longs (56 bytes); combined with
+// the cursor's own 8 bytes plus the JVM object header, each cursor + its surrounding pad fills
+// a 64-byte cache line. The HotSpot field-layout strategy preserves the declaration order across


The idea is to make the cursor value to land within a cache line and the other cursor on another cacheline. That's why there's padding before (pad left), 8 bytes comes from the producerCursor.

Usually the CPU cacheline is 64B. But some processors go beyond that. I even read somewhere that on multi-cpu board, some cpu can have different cacheline sizes.
It's possibly not the intended production target, but Apple Silicon (ARM) has a 128 bytes cache line.

bric3 · 2026-06-01T13:59:53Z

+abstract class MpscRingBufferPad1 extends MpscRingBufferProducerCursor {
+  long p11, p12, p13, p14, p15, p16, p17;
+}
+
+abstract class MpscRingBufferConsumerCursor extends MpscRingBufferPad1 {
+  /** Highest sequence consumed. Only the consumer thread writes; producers read volatile. */
+  volatile long consumerCursor = -1L;
+}
+
+abstract class MpscRingBufferPad2 extends MpscRingBufferConsumerCursor {
+  long p21, p22, p23, p24, p25, p26, p27;


This the drawing I have in mind when thinking about this problem

Pad0 Pad1 Pad2 Cacheline 0 Cache Line 1 Cache Line 3 | | | [----56B padding----][long value 8B][----56B padding----][long value 8B][----56B padding----] | | ProducerCursor ConsumerCursor

The start of the cache line is just an example, the cache line boundaries can start somewhere after, that is why there's pad1 so it prevent both cursors to land on the same cacheline, and that alos explains why there's padding on the right (pad2) to keep out following fields in the class. There's a notion of alignment but it's a slippery topic for me at this point.

Without the padding, that would look like this and that's where ther'es the false-sharing can happen, when two or more thread loads cache line 0, and one thread update a field this invalidates the cache for other threads, but as other threads need to update their cursor field they also invalidate the cache line, so the thread are actually contending over these.

Cache line 0 | [long value 8B][long value 8B][...]... | | | ProducerCursor ConsumerCursor other field

Doug may have a better wording than me.

bric3 · 2026-06-01T14:14:16Z

+  /**
+   * Cache line size in {@code long}-units. 64-byte cache lines on every common CPU we ship to (x86,
+   * ARM); 8 bytes per long. Each logical slot in {@link #publishedSequences} is spread out by this
+   * stride so adjacent logical sequences don't share a cache line and don't ping-pong between
+   * producer cores under heavy contention.
+   */
+  private static final int CACHE_LINE_LONGS = 8;


suggestion: ARM is not always guaranteed to be always 64. Also, unsure if that matter, and if I'm not mistaken but I think IBM has some hardware with big cachelines.

I believe we should follow JCTools padding, they are supporting 128 bytes cachelines.

bric3 · 2026-06-01T14:17:08Z

+  /** {@code true} if the slot was filled and published; {@code false} if the ring is full. */
+  public boolean tryWrite(final Consumer<? super T> filler) {
+    final long seq = claim();
+    if (seq < 0L) return false;
+    // publish in finally so a throwing filler doesn't leave the slot un-published -- the
+    // consumer would otherwise wait at that sequence forever. See class javadoc.
+    try {
+      filler.accept(slots[(int) (seq & mask)]);
+    } finally {
+      publish(seq);
+    }
+    return true;
+  }
+
+  public <C> boolean tryWrite(final C context, final BiConsumer<? super C, ? super T> filler) {
+    final long seq = claim();
+    if (seq < 0L) return false;
+    try {
+      filler.accept(context, slots[(int) (seq & mask)]);
+    } finally {
+      publish(seq);
+    }
+    return true;
+  }
+
+  public <C1, C2> boolean tryWrite(
+      final C1 context1,
+      final C2 context2,
+      final TriConsumer<? super C1, ? super C2, ? super T> filler) {
+    final long seq = claim();
+    if (seq < 0L) return false;
+    try {
+      filler.accept(context1, context2, slots[(int) (seq & mask)]);
+    } finally {
+      publish(seq);
+    }
+    return true;
+  }


suggestion: What if the producer thread is slow? If I get the code right, this blocks the ring from progressing, and may be just quickly fill up the remaining capacity if the producers are just slow ? I'm not sure how to account that properly, this also looks like a trade-off to do, and for CSS it's likely a safe one, but if there are other usages that is probably something o be aware of.

Also the other tryWrite may need a javadoc as well.

bric3 · 2026-06-01T14:21:43Z

+      cursor = nextSeq;
+      count++;
+    }
+    if (count > 0) consumerCursor = cursor;


Mmmh, I believe there's a point here, consumerCursor is updated only after the whole drain loop.

dougqh · 2026-06-01T16:41:20Z

Closing as part of shelving the CSS ring buffer direction. The MpscRingBuffer primitive itself is sound, but the CSS application of it (pre-allocated SpanSnapshot slots) exhibits a G1 mixed-collection pathology at loose heap that makes it strictly worse than #11500. See #11497 for the full explanation.

The MpscRingBuffer implementation in internal-api is preserved on dougqh/ring-buffer if it has future uses elsewhere.

dougqh and others added 3 commits May 28, 2026 09:54

dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes tag: ai generated Largely based on code generated by an AI or LLM labels May 28, 2026

dougqh and others added 3 commits May 28, 2026 11:17

dougqh mentioned this pull request May 28, 2026

Use MpscRingBuffer for CSS producer/consumer inbox #11497

Closed

4 tasks

dougqh and others added 3 commits May 28, 2026 16:17

Merge branch 'master' into dougqh/ring-buffer

348ca2a

dougqh marked this pull request as ready for review May 29, 2026 12:29

dougqh requested a review from a team as a code owner May 29, 2026 12:29

dougqh requested a review from ygree May 29, 2026 12:29

chatgpt-codex-connector Bot reviewed May 29, 2026

View reviewed changes

AlexeyKuznetsov-DD reviewed May 29, 2026

View reviewed changes

bric3 reviewed Jun 1, 2026

View reviewed changes

dougqh closed this Jun 1, 2026

Conversation

dougqh commented May 28, 2026

Summary

Motivation

API

Design

Test/bench coverage

Test plan

Uh oh!

dd-octo-sts Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟢 Java Benchmark SLOs — All performance SLOs passed

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexeyKuznetsov-DD May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AlexeyKuznetsov-DD May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dougqh commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dd-octo-sts Bot commented May 28, 2026 •

edited

Loading

AlexeyKuznetsov-DD May 29, 2026 •

edited

Loading

AlexeyKuznetsov-DD May 29, 2026 •

edited

Loading