kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION (fixes intermittent MSVC c2.dll deadlock)#126
Open
jeffmcjunkin wants to merge 2 commits into
Conversation
FlsAlloc/FlsGetValue/FlsSetValue stored values in a single process-global
array, so every thread shared one cell per FLS index. Fiber-local storage
without fibers is thread-local storage on Windows: each thread is exactly
one fiber, and FlsGetValue must return the value the calling thread set.
The shared cell breaks msvcr120.dll (VS2013 CRT) thread creation. The CRT
stores its per-thread data block (_ptd) -- which carries the
_beginthreadex entry point and argument -- via FlsSetValue, and the new
thread's _threadstartex/_callthreadstartex re-read it via FlsGetValue.
With a process-global cell, two concurrently-starting threads overwrite
each other's _ptd and can both start with the same argument.
Observed with VS2013 cl.exe: c2.dll's parallel codegen pool creates four
worker threads back-to-back; ~1-2% of compiles deadlocked forever. An API
trace of a hung process shows two workers waiting on the same per-worker
dispatch event while another worker's event has a signal and no waiter:
CreateEvent h=7c / 88 / 94 / a0 (per-worker "go" events)
t251103 WaitForSingleObject h=88 <- worker 1
t251102 WaitForSingleObject h=88 <- different thread, SAME event
boss SetEvent 7c, 88, 94, a0 <- 94 never gets a waiter
boss WaitForMultipleObjects({done events}, bWaitAll, INFINITE)
-> hangs forever: the orphaned worker never runs its work item,
so its "done" event is never set.
Fix: keep the index allocation map process-wide (FLS indices are
process-wide on Windows) but store the values in a thread_local array;
wibo maps guest threads 1:1 onto host threads, so thread_local is exactly
per-guest-thread. New threads observe zero-initialized values, matching
Windows. Index alloc/free is guarded by a mutex.
Caveat (unchanged behavior): FLS destructor callbacks are still never
invoked, and freeing then reallocating an index does not clear other
threads' stale values for it. The VS2013 CRT allocates its index once per
process, so neither is reachable for it.
With this fix the cl.exe hang rate dropped from ~1-2% to ~0.1-0.3% over
1000-run stress batches; the remainder was a separate CRITICAL_SECTION
issue fixed in the following commit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two divergences from the real Windows state machine, both observed
breaking VS2013 cl.exe (c2.dll multithreaded codegen):
1. Contended EnterCriticalSection waited until OwningThread was observed
to be 0 and then claimed the section with a plain store. Two waiters
can both observe 0 after a single Leave and both enter the critical
section simultaneously. The fingerprint -- a free section left with
LockCount == 0 instead of -1, created when the second "owner"'s Leave
failed the ownership check and returned without decrementing -- was
captured with gdb in hung c2.dll worker pools.
2. LeaveCriticalSection bailed out when the calling thread did not match
OwningThread. Real Windows Leave performs no caller validation at all:
it unconditionally decrements RecursionCount/LockCount and releases
exactly one waiter; mutual exclusion is carried entirely by the
LockCount/semaphore state machine. Guest lock usage that Windows
tolerates therefore silently strands wibo's lock state instead.
Instrumented runs caught exactly this on c2.dll's work-queue section:
leave-not-owner cs=<work queue CS> tid=B owner=A lock=1 rec=1
after which the queue's "active worker" counter and the LockCount were
each stranded one too high, the queue's drain condition became
unsatisfiable, and the compiler deadlocked (~0.1-0.3% of compiles even
after the FLS fix).
Fix, mirroring the Windows protocol:
- Model LockSemaphore as a ticket count: every contended Leave posts
exactly one ticket (InterlockedIncrement + WakeByAddressSingle); every
blocked Enter consumes exactly one ticket (CAS decrement, WaitOnAddress
otherwise) and then owns the section by construction. There is no claim
race on OwningThread. TryEnterCriticalSection cannot steal while
waiters exist because each waiter's LockCount increment persists until
it owns and leaves.
- Leave validates nothing, like Windows: if --RecursionCount != 0,
decrement LockCount; otherwise clear the owner, decrement, and release
one waiter if the result is >= 0.
Measured with VS2013 cl.exe under stress (18KB C unit, 25s timeout =
hang): ~1-2% hangs before, 0 hangs in 5,100 runs with this commit plus
the FLS fix (2x1000 + 2000 at 24-way parallelism, 300 sequential, 800 on
two other source units). Compiler output stays byte-identical to native
Windows. Side effect: heavily contended compiles got ~6x faster (1000
parallel compiles: 26s -> 4s wall) because the old wait loop woke every
waiter to race on each release.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Author
|
This was 1h25m with Fable, btw. I was pretty impressed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes an intermittent (~1–2% of invocations) permanent deadlock when running the VS2013 MSVC compiler (
cl.exe, multithreaded codegen inc2.dll) under wibo, traced to two thread-primitive divergences from Windows. With both fixes the hang rate went from ~1–2% to 0 in 5,100 stress runs, compiler output stays byte-identical to native Windows, and heavily contended workloads got ~6× faster as a side effect.Bug 1: FLS values were process-global (
dll/kernel32/fibersapi.cpp)FlsAlloc/FlsGetValue/FlsSetValuekept oneg_flsValues[]array shared by every thread. Fiber-local storage without fibers is thread-local on Windows (each thread is exactly one fiber).This breaks msvcr120.dll thread creation: the VS2013 CRT stores its per-thread data block (
_ptd) — which carries the_beginthreadexentry point and argument — viaFlsSetValue, and the new thread's_threadstartex/_callthreadstartexre-read it viaFlsGetValue. With one shared cell, two concurrently-starting threads overwrite each other's_ptdand can both start with the same argument.Caught in an API trace of a live hang — c2.dll's codegen pool creates 4 workers back-to-back, each with its own auto-reset "go" event (
7c/88/94/a0):Fix: values move to a
thread_localarray (wibo maps guest threads 1:1 onto host threads); the index allocation map stays process-wide, guarded by a mutex. FLS callbacks are still not invoked (unchanged behavior; noted in the commit message).Bug 2: CRITICAL_SECTION not faithful to the Windows state machine (
dll/kernel32/synchapi.cpp)Two defects:
waitForCriticalSectionwaited untilOwningThread == 0and then claimed with a plain store. Two waiters can both observe 0 after one Leave and both enter the section. The fingerprint — a free section left withLockCount == 0instead of −1, created when the second "owner"'s Leave fails the ownership check and returns without decrementing — was captured with gdb in hung c2.dll processes.LeaveCriticalSectionperforms no ownership check — it unconditionally decrementsRecursionCount/LockCountand releases exactly one waiter via a counted semaphore. wibo's silent bail-out strands the lock state for guest usage Windows tolerates. Instrumentation caughtleave-not-owner cs=<c2 work-queue CS> tid=B owner=A lock=1 rec=1in live runs, after which the work queue's drain condition became unsatisfiable → deadlock (~0.1–0.3% of compiles even with FLS fixed).Fix mirrors the Windows protocol:
LockSemaphoreis modeled as a ticket count — each contended Leave posts exactly one ticket (InterlockedIncrement+WakeByAddressSingle), each blocked Enter consumes exactly one (CAS decrement,WaitOnAddressotherwise) and then owns by construction; Leave validates nothing.TryEnterCriticalSectioncan't steal while waiters exist because each waiter'sLockCountincrement persists until it owns and leaves.Verification
cl.execompiling an 18 KB real C unit, 25 s timeout = hang..text$mn,.xdata,.pdata,.rdata,.drectveand all relocations byte-identical (only.debug$Sdiffers, which embeds absolute source paths).Both changes are host-side only and were validated on x86 (i386 build) under WSL2/Ubuntu 24.04, 24 logical CPUs.
🤖 Generated with Claude Code