[PATCH 0/5] fixing thread races in linux-leaks CI

Jeff King <peff@xxxxxxxx> · Sun, 29 Dec 2024 23:23:25 -0500

This series should fix the races we see in linux-leaks CI jobs. Or at
least some of them. We tried previously in 993d38a066 (index-pack: spawn
threads atomically, 2024-01-05), but it wasn't enough. This series
covers index-pack as well as git-grep, which are the two I've seen in
practice. We may run into more, but the general pattern should be
applicable if we do.

This series takes the minimal approach, and only provides a knob to use
pthread_barrier_t if your system supports it (and then turns it on for
our CI jobs). The obvious alternative is to actually implement wrappers
for other systems, and then use it everywhere. That doesn't buy us much
for this problem, but it's possible we'd find a use for barriers
elsewhere.

The even farther alternatives are:

  1. Do nothing. This is arguably an LSan bug, which should be doing its
     own locking to synchronize the at-exit leak check with the
     per-thread setup code. I don't think we've even reported it there,
     so one option would be to work with them. Even if we do that, it
     might be nice to remove the false positive pain in the meantime
     with a series like this (and certainly patches 1 and 2 here are
     worth applying independently).

  2. Change our posture on thread exit. In index-pack, the issue is that
     one worker thread calls exit() via die(), taking down the whole
     program (including other worker threads in unknown states). If it
     installed a die() handler that called pthread_exit() instead, then
     the main thread could see that failure and cancel/join the
     remaining threads.

     But I suspect it's not so simple:

       a. We still want the error in the worker thread to cancel ongoing
	  work immediately. Right now index-pack just called
	  pthread_join() in sequence to wait for all threads to finish.
	  But instead we'd need to use some synchronization primitives
	  to report the error. Probably not too hard, but new
	  potentially tricky code.

       b. In the git-grep case, it's actually the main thread that calls
	  die(). So the rule is not really "threads should install a die
	  handler that calls pthread_exit()", but _any_ die() call when
	  threads are established would need to cleanly ask all threads
	  to stop (whether by signaling them via the work queue, or just
	  hitting them with pthread_cancel; and that's assuming that
	  there's no race between LSan's setup code and cancel).

       c. When we call die(), all bets are off about what state various
          data structures are in. For example, a thread could be holding
	  a lock, and if it's cancelled by another thread, that lock
	  will remain. So it's not safe to do any real work in the other
	  threads, unless we start setting up pthread_cleanup handlers,
	  etc. That seems like a recipe for obscure, racy deadlocks.

     I think that may be an appealing direction in the long run,
     especially since die() itself is not thread-safe. But coupled with
     libification, we probably want less "threads can die() cleanly" and
     more "threads pass errors up the stack and report the problem back
     to the work queue". Either way, though, it's a much bigger approach
     change than I think we want just to try to address LSan races.

So I think we want something like this (either this, or the variant
where we really implement barriers everywhere) in the near-term.

  [1/5]: test-lib: use individual lsan dir for --stress runs
  [2/5]: Revert "index-pack: spawn threads atomically"
  [3/5]: thread-utils: introduce optional barrier type
  [4/5]: index-pack: work around LSan threading race with barrier
  [5/5]: grep: work around LSan threading race with barrier

 Makefile             |  7 +++++++
 builtin/grep.c       |  8 ++++++++
 builtin/index-pack.c |  8 ++++++--
 ci/lib.sh            |  1 +
 t/test-lib.sh        |  2 +-
 thread-utils.h       | 17 +++++++++++++++++
 6 files changed, 40 insertions(+), 3 deletions(-)