kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION (fixes intermittent MSVC c2.dll deadlock) by jeffmcjunkin · Pull Request #126 · decompals/wibo

jeffmcjunkin · 2026-06-12T01:15:28Z

Fixes an intermittent (~1–2% of invocations) permanent deadlock when running the VS2013 MSVC compiler (cl.exe, multithreaded codegen in c2.dll) under wibo, traced to two thread-primitive divergences from Windows. With both fixes the hang rate went from ~1–2% to 0 in 5,100 stress runs, compiler output stays byte-identical to native Windows, and heavily contended workloads got ~6× faster as a side effect.

Bug 1: FLS values were process-global (`dll/kernel32/fibersapi.cpp`)

FlsAlloc/FlsGetValue/FlsSetValue kept one g_flsValues[] array shared by every thread. Fiber-local storage without fibers is thread-local on Windows (each thread is exactly one fiber).

This breaks msvcr120.dll thread creation: the VS2013 CRT stores its per-thread data block (_ptd) — which carries the _beginthreadex entry point and argument — via FlsSetValue, and the new thread's _threadstartex/_callthreadstartex re-read it via FlsGetValue. With one shared cell, two concurrently-starting threads overwrite each other's _ptd and can both start with the same argument.

Caught in an API trace of a live hang — c2.dll's codegen pool creates 4 workers back-to-back, each with its own auto-reset "go" event (7c/88/94/a0):

t251103 WaitForSingleObject h=88   <- worker
t251102 WaitForSingleObject h=88   <- a SECOND thread waits on the SAME go event
boss    SetEvent 7c, 88, 94, a0    <- 94 never gets a waiter, ever
boss    WaitForMultipleObjects({done events}, bWaitAll, INFINITE)
        -> hangs forever: the orphaned worker never runs, its done-event is never set

Fix: values move to a thread_local array (wibo maps guest threads 1:1 onto host threads); the index allocation map stays process-wide, guarded by a mutex. FLS callbacks are still not invoked (unchanged behavior; noted in the commit message).

Bug 2: CRITICAL_SECTION not faithful to the Windows state machine (`dll/kernel32/synchapi.cpp`)

Two defects:

Contended handoff was claim-by-race: waitForCriticalSection waited until OwningThread == 0 and then claimed with a plain store. Two waiters can both observe 0 after one Leave and both enter the section. The fingerprint — a free section left with LockCount == 0 instead of −1, created when the second "owner"'s Leave fails the ownership check and returns without decrementing — was captured with gdb in hung c2.dll processes.
Leave validated the caller; Windows doesn't. Real LeaveCriticalSection performs no ownership check — it unconditionally decrements RecursionCount/LockCount and releases exactly one waiter via a counted semaphore. wibo's silent bail-out strands the lock state for guest usage Windows tolerates. Instrumentation caught leave-not-owner cs=<c2 work-queue CS> tid=B owner=A lock=1 rec=1 in live runs, after which the work queue's drain condition became unsatisfiable → deadlock (~0.1–0.3% of compiles even with FLS fixed).

Fix mirrors the Windows protocol: LockSemaphore is modeled as a ticket count — each contended Leave posts exactly one ticket (InterlockedIncrement + WakeByAddressSingle), each blocked Enter consumes exactly one (CAS decrement, WaitOnAddress otherwise) and then owns by construction; Leave validates nothing. TryEnterCriticalSection can't steal while waiters exist because each waiter's LockCount increment persists until it owns and leaves.

Verification

Stress harness: VS2013 cl.exe compiling an 18 KB real C unit, 25 s timeout = hang.
- Before: 20–22 hangs / 1000 at 24-way parallelism (also reproduces ~2% fully sequential, so it's intra-process).
- After (both fixes): 0 hangs in 5,100 runs (2×1000 + 2000 @ P=24, 300 @ P=1, plus 800 runs on two other source units).
Output correctness: per-section COFF comparison of wibo-compiled objects vs native-Windows-compiled objects for three test units — .text$mn, .xdata, .pdata, .rdata, .drectve and all relocations byte-identical (only .debug$S differs, which embeds absolute source paths).
Throughput: 1000 parallel compiles dropped from 26 s to 4 s wall; 300 sequential from 131 s to 7 s — the old wait loop woke every waiter to race on each release.

Both changes are host-side only and were validated on x86 (i386 build) under WSL2/Ubuntu 24.04, 24 logical CPUs.

🤖 Generated with Claude Code

FlsAlloc/FlsGetValue/FlsSetValue stored values in a single process-global array, so every thread shared one cell per FLS index. Fiber-local storage without fibers is thread-local storage on Windows: each thread is exactly one fiber, and FlsGetValue must return the value the calling thread set. The shared cell breaks msvcr120.dll (VS2013 CRT) thread creation. The CRT stores its per-thread data block (_ptd) -- which carries the _beginthreadex entry point and argument -- via FlsSetValue, and the new thread's _threadstartex/_callthreadstartex re-read it via FlsGetValue. With a process-global cell, two concurrently-starting threads overwrite each other's _ptd and can both start with the same argument. Observed with VS2013 cl.exe: c2.dll's parallel codegen pool creates four worker threads back-to-back; ~1-2% of compiles deadlocked forever. An API trace of a hung process shows two workers waiting on the same per-worker dispatch event while another worker's event has a signal and no waiter: CreateEvent h=7c / 88 / 94 / a0 (per-worker "go" events) t251103 WaitForSingleObject h=88 <- worker 1 t251102 WaitForSingleObject h=88 <- different thread, SAME event boss SetEvent 7c, 88, 94, a0 <- 94 never gets a waiter boss WaitForMultipleObjects({done events}, bWaitAll, INFINITE) -> hangs forever: the orphaned worker never runs its work item, so its "done" event is never set. Fix: keep the index allocation map process-wide (FLS indices are process-wide on Windows) but store the values in a thread_local array; wibo maps guest threads 1:1 onto host threads, so thread_local is exactly per-guest-thread. New threads observe zero-initialized values, matching Windows. Index alloc/free is guarded by a mutex. Caveat (unchanged behavior): FLS destructor callbacks are still never invoked, and freeing then reallocating an index does not clear other threads' stale values for it. The VS2013 CRT allocates its index once per process, so neither is reachable for it. With this fix the cl.exe hang rate dropped from ~1-2% to ~0.1-0.3% over 1000-run stress batches; the remainder was a separate CRITICAL_SECTION issue fixed in the following commit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Two divergences from the real Windows state machine, both observed breaking VS2013 cl.exe (c2.dll multithreaded codegen): 1. Contended EnterCriticalSection waited until OwningThread was observed to be 0 and then claimed the section with a plain store. Two waiters can both observe 0 after a single Leave and both enter the critical section simultaneously. The fingerprint -- a free section left with LockCount == 0 instead of -1, created when the second "owner"'s Leave failed the ownership check and returned without decrementing -- was captured with gdb in hung c2.dll worker pools. 2. LeaveCriticalSection bailed out when the calling thread did not match OwningThread. Real Windows Leave performs no caller validation at all: it unconditionally decrements RecursionCount/LockCount and releases exactly one waiter; mutual exclusion is carried entirely by the LockCount/semaphore state machine. Guest lock usage that Windows tolerates therefore silently strands wibo's lock state instead. Instrumented runs caught exactly this on c2.dll's work-queue section: leave-not-owner cs=<work queue CS> tid=B owner=A lock=1 rec=1 after which the queue's "active worker" counter and the LockCount were each stranded one too high, the queue's drain condition became unsatisfiable, and the compiler deadlocked (~0.1-0.3% of compiles even after the FLS fix). Fix, mirroring the Windows protocol: - Model LockSemaphore as a ticket count: every contended Leave posts exactly one ticket (InterlockedIncrement + WakeByAddressSingle); every blocked Enter consumes exactly one ticket (CAS decrement, WaitOnAddress otherwise) and then owns the section by construction. There is no claim race on OwningThread. TryEnterCriticalSection cannot steal while waiters exist because each waiter's LockCount increment persists until it owns and leaves. - Leave validates nothing, like Windows: if --RecursionCount != 0, decrement LockCount; otherwise clear the owner, decrement, and release one waiter if the result is >= 0. Measured with VS2013 cl.exe under stress (18KB C unit, 25s timeout = hang): ~1-2% hangs before, 0 hangs in 5,100 runs with this commit plus the FLS fix (2x1000 + 2000 at 24-way parallelism, 300 sequential, 800 on two other source units). Compiler output stays byte-identical to native Windows. Side effect: heavily contended compiles got ~6x faster (1000 parallel compiles: 26s -> 4s wall) because the old wait loop woke every waiter to race on each release. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jeffmcjunkin · 2026-06-12T01:16:01Z

This was 1h25m with Fable, btw. I was pretty impressed.

Jeff McJunkin and others added 2 commits June 11, 2026 18:10

jeffmcjunkin mentioned this pull request Jun 12, 2026

Run the VS2013 (MSVC 18.00) cl.exe/ml64.exe toolchain under wibo, byte-identical to native #127

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION (fixes intermittent MSVC c2.dll deadlock)#126

kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION (fixes intermittent MSVC c2.dll deadlock)#126
jeffmcjunkin wants to merge 2 commits into
decompals:mainfrom
jeffmcjunkin:fix-fls-per-thread-and-cs-handoff

jeffmcjunkin commented Jun 12, 2026

Uh oh!

jeffmcjunkin commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeffmcjunkin commented Jun 12, 2026

Bug 1: FLS values were process-global (dll/kernel32/fibersapi.cpp)

Bug 2: CRITICAL_SECTION not faithful to the Windows state machine (dll/kernel32/synchapi.cpp)

Verification

Uh oh!

jeffmcjunkin commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bug 1: FLS values were process-global (`dll/kernel32/fibersapi.cpp`)

Bug 2: CRITICAL_SECTION not faithful to the Windows state machine (`dll/kernel32/synchapi.cpp`)