Skip to content

kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION (fixes intermittent MSVC c2.dll deadlock)#126

Open
jeffmcjunkin wants to merge 2 commits into
decompals:mainfrom
jeffmcjunkin:fix-fls-per-thread-and-cs-handoff
Open

kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION (fixes intermittent MSVC c2.dll deadlock)#126
jeffmcjunkin wants to merge 2 commits into
decompals:mainfrom
jeffmcjunkin:fix-fls-per-thread-and-cs-handoff

Conversation

@jeffmcjunkin

Copy link
Copy Markdown

Fixes an intermittent (~1–2% of invocations) permanent deadlock when running the VS2013 MSVC compiler (cl.exe, multithreaded codegen in c2.dll) under wibo, traced to two thread-primitive divergences from Windows. With both fixes the hang rate went from ~1–2% to 0 in 5,100 stress runs, compiler output stays byte-identical to native Windows, and heavily contended workloads got ~6× faster as a side effect.

Bug 1: FLS values were process-global (dll/kernel32/fibersapi.cpp)

FlsAlloc/FlsGetValue/FlsSetValue kept one g_flsValues[] array shared by every thread. Fiber-local storage without fibers is thread-local on Windows (each thread is exactly one fiber).

This breaks msvcr120.dll thread creation: the VS2013 CRT stores its per-thread data block (_ptd) — which carries the _beginthreadex entry point and argument — via FlsSetValue, and the new thread's _threadstartex/_callthreadstartex re-read it via FlsGetValue. With one shared cell, two concurrently-starting threads overwrite each other's _ptd and can both start with the same argument.

Caught in an API trace of a live hang — c2.dll's codegen pool creates 4 workers back-to-back, each with its own auto-reset "go" event (7c/88/94/a0):

t251103 WaitForSingleObject h=88   <- worker
t251102 WaitForSingleObject h=88   <- a SECOND thread waits on the SAME go event
boss    SetEvent 7c, 88, 94, a0    <- 94 never gets a waiter, ever
boss    WaitForMultipleObjects({done events}, bWaitAll, INFINITE)
        -> hangs forever: the orphaned worker never runs, its done-event is never set

Fix: values move to a thread_local array (wibo maps guest threads 1:1 onto host threads); the index allocation map stays process-wide, guarded by a mutex. FLS callbacks are still not invoked (unchanged behavior; noted in the commit message).

Bug 2: CRITICAL_SECTION not faithful to the Windows state machine (dll/kernel32/synchapi.cpp)

Two defects:

  1. Contended handoff was claim-by-race: waitForCriticalSection waited until OwningThread == 0 and then claimed with a plain store. Two waiters can both observe 0 after one Leave and both enter the section. The fingerprint — a free section left with LockCount == 0 instead of −1, created when the second "owner"'s Leave fails the ownership check and returns without decrementing — was captured with gdb in hung c2.dll processes.
  2. Leave validated the caller; Windows doesn't. Real LeaveCriticalSection performs no ownership check — it unconditionally decrements RecursionCount/LockCount and releases exactly one waiter via a counted semaphore. wibo's silent bail-out strands the lock state for guest usage Windows tolerates. Instrumentation caught leave-not-owner cs=<c2 work-queue CS> tid=B owner=A lock=1 rec=1 in live runs, after which the work queue's drain condition became unsatisfiable → deadlock (~0.1–0.3% of compiles even with FLS fixed).

Fix mirrors the Windows protocol: LockSemaphore is modeled as a ticket count — each contended Leave posts exactly one ticket (InterlockedIncrement + WakeByAddressSingle), each blocked Enter consumes exactly one (CAS decrement, WaitOnAddress otherwise) and then owns by construction; Leave validates nothing. TryEnterCriticalSection can't steal while waiters exist because each waiter's LockCount increment persists until it owns and leaves.

Verification

  • Stress harness: VS2013 cl.exe compiling an 18 KB real C unit, 25 s timeout = hang.
    • Before: 20–22 hangs / 1000 at 24-way parallelism (also reproduces ~2% fully sequential, so it's intra-process).
    • After (both fixes): 0 hangs in 5,100 runs (2×1000 + 2000 @ P=24, 300 @ P=1, plus 800 runs on two other source units).
  • Output correctness: per-section COFF comparison of wibo-compiled objects vs native-Windows-compiled objects for three test units — .text$mn, .xdata, .pdata, .rdata, .drectve and all relocations byte-identical (only .debug$S differs, which embeds absolute source paths).
  • Throughput: 1000 parallel compiles dropped from 26 s to 4 s wall; 300 sequential from 131 s to 7 s — the old wait loop woke every waiter to race on each release.

Both changes are host-side only and were validated on x86 (i386 build) under WSL2/Ubuntu 24.04, 24 logical CPUs.

🤖 Generated with Claude Code

Jeff McJunkin and others added 2 commits June 11, 2026 18:10
FlsAlloc/FlsGetValue/FlsSetValue stored values in a single process-global
array, so every thread shared one cell per FLS index. Fiber-local storage
without fibers is thread-local storage on Windows: each thread is exactly
one fiber, and FlsGetValue must return the value the calling thread set.

The shared cell breaks msvcr120.dll (VS2013 CRT) thread creation. The CRT
stores its per-thread data block (_ptd) -- which carries the
_beginthreadex entry point and argument -- via FlsSetValue, and the new
thread's _threadstartex/_callthreadstartex re-read it via FlsGetValue.
With a process-global cell, two concurrently-starting threads overwrite
each other's _ptd and can both start with the same argument.

Observed with VS2013 cl.exe: c2.dll's parallel codegen pool creates four
worker threads back-to-back; ~1-2% of compiles deadlocked forever. An API
trace of a hung process shows two workers waiting on the same per-worker
dispatch event while another worker's event has a signal and no waiter:

    CreateEvent h=7c / 88 / 94 / a0      (per-worker "go" events)
    t251103 WaitForSingleObject h=88     <- worker 1
    t251102 WaitForSingleObject h=88     <- different thread, SAME event
    boss SetEvent 7c, 88, 94, a0         <- 94 never gets a waiter
    boss WaitForMultipleObjects({done events}, bWaitAll, INFINITE)
      -> hangs forever: the orphaned worker never runs its work item,
         so its "done" event is never set.

Fix: keep the index allocation map process-wide (FLS indices are
process-wide on Windows) but store the values in a thread_local array;
wibo maps guest threads 1:1 onto host threads, so thread_local is exactly
per-guest-thread. New threads observe zero-initialized values, matching
Windows. Index alloc/free is guarded by a mutex.

Caveat (unchanged behavior): FLS destructor callbacks are still never
invoked, and freeing then reallocating an index does not clear other
threads' stale values for it. The VS2013 CRT allocates its index once per
process, so neither is reachable for it.

With this fix the cl.exe hang rate dropped from ~1-2% to ~0.1-0.3% over
1000-run stress batches; the remainder was a separate CRITICAL_SECTION
issue fixed in the following commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two divergences from the real Windows state machine, both observed
breaking VS2013 cl.exe (c2.dll multithreaded codegen):

1. Contended EnterCriticalSection waited until OwningThread was observed
   to be 0 and then claimed the section with a plain store. Two waiters
   can both observe 0 after a single Leave and both enter the critical
   section simultaneously. The fingerprint -- a free section left with
   LockCount == 0 instead of -1, created when the second "owner"'s Leave
   failed the ownership check and returned without decrementing -- was
   captured with gdb in hung c2.dll worker pools.

2. LeaveCriticalSection bailed out when the calling thread did not match
   OwningThread. Real Windows Leave performs no caller validation at all:
   it unconditionally decrements RecursionCount/LockCount and releases
   exactly one waiter; mutual exclusion is carried entirely by the
   LockCount/semaphore state machine. Guest lock usage that Windows
   tolerates therefore silently strands wibo's lock state instead.
   Instrumented runs caught exactly this on c2.dll's work-queue section:

       leave-not-owner cs=<work queue CS> tid=B owner=A lock=1 rec=1

   after which the queue's "active worker" counter and the LockCount were
   each stranded one too high, the queue's drain condition became
   unsatisfiable, and the compiler deadlocked (~0.1-0.3% of compiles even
   after the FLS fix).

Fix, mirroring the Windows protocol:

- Model LockSemaphore as a ticket count: every contended Leave posts
  exactly one ticket (InterlockedIncrement + WakeByAddressSingle); every
  blocked Enter consumes exactly one ticket (CAS decrement, WaitOnAddress
  otherwise) and then owns the section by construction. There is no claim
  race on OwningThread. TryEnterCriticalSection cannot steal while
  waiters exist because each waiter's LockCount increment persists until
  it owns and leaves.
- Leave validates nothing, like Windows: if --RecursionCount != 0,
  decrement LockCount; otherwise clear the owner, decrement, and release
  one waiter if the result is >= 0.

Measured with VS2013 cl.exe under stress (18KB C unit, 25s timeout =
hang): ~1-2% hangs before, 0 hangs in 5,100 runs with this commit plus
the FLS fix (2x1000 + 2000 at 24-way parallelism, 300 sequential, 800 on
two other source units). Compiler output stays byte-identical to native
Windows. Side effect: heavily contended compiles got ~6x faster (1000
parallel compiles: 26s -> 4s wall) because the old wait loop woke every
waiter to race on each release.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jeffmcjunkin

Copy link
Copy Markdown
Author

This was 1h25m with Fable, btw. I was pretty impressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant