fsmonitor deadlock / macOS CI hangs

I did some more digging on the hangs we sometimes see when running the
test suite on macOS. I'm cc-ing Patrick as somebody who dug into this
before, and Johannes as the only still-active person mentioned in the
relevant code.

For those just joining, you can reproduce the issue by running t9211
with --stress on macOS. Some earlier notes are here:

  https://lore.kernel.org/git/20240517081132.GA1517321@xxxxxxxxxxxxxxxxxxxxxxx/

but the gist of it is that we end up with Git processes waiting to read
from fsmonitor, but fsmonitor hanging.

Since then, I swapped out gdb for lldb, which let me see a bit more of
who is doing what.

The test is waiting on a "scalar" process to exit. But that's just in
wait_or_whine() waiting for git-fetch to exit. That git-fetch is in turn
waiting to read() from fsmonitor after sending a command. Here's the
backtrace:

    * frame #0: 0x00007ff81afdbf7e libsystem_kernel.dylib`read + 10
      frame #1: 0x000000010a4fa6ae git`xread(fd=8, buf=0x00007ff7b5d82188, len=4) at wrapper.c:231:8
      frame #2: 0x000000010a4fa8de git`read_in_full(fd=8, buf=0x00007ff7b5d82188, count=4) at wrapper.c:289:20
      frame #3: 0x000000010a417684 git`get_packet_data(fd=8, src_buf=0x0000000000000000, src_size=0x0000000000000000, dst=0x00007ff7b5d82188, size=4, options=9) at pkt-line.c:355:9
      frame #4: 0x000000010a417173 git`packet_read_with_status(fd=8, src_buffer=0x0000000000000000, src_len=0x0000000000000000, buffer="", size=65517, pktlen=0x00007ff7b5d821e4, options=9) at pkt-line.c:421:6
      frame #5: 0x000000010a4178bb git`packet_read(fd=8, buffer="", size=65517, options=9) at pkt-line.c:519:2
      frame #6: 0x000000010a417a07 git`read_packetized_to_strbuf(fd_in=8, sb_out=0x00007ff7b5d82380, options=9) at pkt-line.c:554:16
      frame #7: 0x000000010a508779 git`ipc_client_send_command_to_connection(connection=0x0000600001718000, message="1727469801286015000", message_len=19, answer=0x00007ff7b5d82380) at ipc-unix-socket.c:210:6
      frame #8: 0x000000010a377553 git`fsmonitor_ipc__send_query(since_token="1727469801286015000", answer=0x00007ff7b5d82380) at fsmonitor-ipc.c:94:9
      frame #9: 0x000000010a376149 git`refresh_fsmonitor(istate=0x0000600002910300) at fsmonitor.c:536:20
      frame #10: 0x000000010a376be7 git`add_fsmonitor(istate=0x0000600002910300) at fsmonitor.c:778:3
      frame #11: 0x000000010a376e19 git`tweak_fsmonitor(istate=0x0000600002910300) at fsmonitor.c:818:3
      frame #12: 0x000000010a430840 git`post_read_index_from(istate=0x0000600002910300) at read-cache.c:1968:2

In fsmonitor we have a bunch of threads. The main thread is waiting on
worker threads, which I think is expected:

  * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
      frame #0: 0x00007ff81afdccce libsystem_kernel.dylib`__ulock_wait + 10
      frame #1: 0x00007ff81b01b9b9 libsystem_pthread.dylib`_pthread_join + 348
    * frame #2: 0x000000010e5ec2b1 git`ipc_server_await(server_data=0x0000600002790000) at ipc-unix-socket.c:971:2
      frame #3: 0x000000010e2d3a3f git`fsmonitor_run_daemon_1(state=0x00007ff7b1ca0420) at fsmonitor--daemon.c:1251:2
      frame #4: 0x000000010e2d35c5 git`fsmonitor_run_daemon at fsmonitor--daemon.c:1399:8

Most of the other threads are just waiting for clients:

  * thread #2
    * frame #0: 0x00007ff81afe2242 libsystem_kernel.dylib`poll + 10
      frame #1: 0x000000010e5ec664 git`accept_thread__wait_for_connection(accept_thread_data=0x000060000129c000) at ipc-unix-socket.c:653:12
      frame #2: 0x000000010e5ebf2a git`accept_thread_proc(_accept_thread_data=0x000060000129c000) at ipc-unix-socket.c:728:19
      frame #3: 0x00007ff81b01a1d3 libsystem_pthread.dylib`_pthread_start + 125
      frame #4: 0x00007ff81b015bd3 libsystem_pthread.dylib`thread_start + 15

which I think is again expected. But not this one, which should be
serving our command:

  * thread #3
    * frame #0: 0x00007ff81afde08e libsystem_kernel.dylib`__psynch_cvwait + 10
      frame #1: 0x00007ff81b01a758 libsystem_pthread.dylib`_pthread_cond_wait + 1242
      frame #2: 0x000000010e2d4a90 git`with_lock__wait_for_cookie(state=0x00007ff7b1ca0420) at fsmonitor--daemon.c:209:3
      frame #3: 0x000000010e2d40f2 git`do_handle_client(state=0x00007ff7b1ca0420, command="1727469801286015000", reply=(git`do_io_reply_callback at ipc-unix-socket.c:431), reply_data=0x0000700007d3af38) at fsmonitor--daemon.c:764:19
      frame #4: 0x000000010e2d3c60 git`handle_client(data=0x00007ff7b1ca0420, command="1727469801286015000", command_len=19, reply=(git`do_io_reply_callback at ipc-unix-socket.c:431), reply_data=0x0000700007d3af38) at fsmonitor--daemon.c:984:11
      frame #5: 0x000000010e5ecaa7 git`worker_thread__do_io(worker_thread_data=0x0000600001c9c000, fd=23) at ipc-unix-socket.c:531:9
      frame #6: 0x000000010e5ec056 git`worker_thread_proc(_worker_thread_data=0x0000600001c9c000) at ipc-unix-socket.c:605:9
      frame #7: 0x00007ff81b01a1d3 libsystem_pthread.dylib`_pthread_start + 125

We got that same command from git-fetch, but we're hanging on a pthread
operation. It's this one, and note the suspicious comment:

          /*
           * Technically, this is an infinite wait (well, unless another
           * thread sends us an abort).  I'd like to change this to
           * use `pthread_cond_timedwait()` and return an error/timeout
           * and let the caller do the trivial response thing, but we
           * don't have that routine in our thread-utils.
           *
           * After extensive beta testing I'm not really worried about
           * this.  Also note that the above open() and unlink() calls
           * will cause at least two FS events on that path, so the odds
           * of getting stuck are pretty slim.
           */
          while (cookie->result == FCIR_INIT)
                  pthread_cond_wait(&state->cookies_cond,
                                    &state->main_lock);
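
For what it's worth, the timed wait that comment wishes for would look
roughly like this (purely a sketch against the surrounding context; the
60-second bound and the bail-out handling are mine, and as the comment
says we'd first want a pthread_cond_timedwait() wrapper in
thread-utils):

        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 60; /* arbitrary bound, for illustration only */

        while (cookie->result == FCIR_INIT) {
                if (pthread_cond_timedwait(&state->cookies_cond,
                                           &state->main_lock,
                                           &deadline) == ETIMEDOUT) {
                        /* give up and let the caller send the
                         * "trivial response" the comment mentions */
                        break;
                }
        }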

So we're waiting for somebody to trigger the cookies_cond. Who's
supposed to do that? It's done in fsmonitor_publish(), which is
triggered when we actually see filesystem events. The main listening
loop is also waiting on a pthread cond:

  * thread #11
    * frame #0: 0x00007ff81afde08e libsystem_kernel.dylib`__psynch_cvwait + 10
      frame #1: 0x00007ff81b01a758 libsystem_pthread.dylib`_pthread_cond_wait + 1242
      frame #2: 0x000000010e5fab6c git`fsm_listen__loop(state=0x00007ff7b1ca0420) at fsm-listen-darwin.c:520:2
      frame #3: 0x000000010e2d3d76 git`fsm_listen__thread_proc(_state=0x00007ff7b1ca0420) at fsmonitor--daemon.c:1176:2
      frame #4: 0x00007ff81b01a1d3 libsystem_pthread.dylib`_pthread_start + 125
      frame #5: 0x00007ff81b015bd3 libsystem_pthread.dylib`thread_start + 15

which is in this code:

        FSEventStreamSetDispatchQueue(data->stream, data->dq);
        data->stream_scheduled = 1;

        if (!FSEventStreamStart(data->stream)) {
                error(_("Failed to start the FSEventStream"));
                goto force_error_stop_without_loop;
        }
        data->stream_started = 1;

        pthread_mutex_lock(&data->dq_lock);
        pthread_cond_wait(&data->dq_finished, &data->dq_lock);
        pthread_mutex_unlock(&data->dq_lock);

So if I understand fsmonitor correctly, this is expected: we're waiting
for anybody to signal dq_finished to tell us it's time to shut down. And
the real work is happening via that FSEventStreamStart(), which will
asynchronously trigger our callback when something happens in the
working tree.
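
For anyone who (like me) hasn't used the FSEvents API: it is entirely
callback-driven, and a minimal stand-alone watcher looks something like
this (the watched path, latency, and flags here are invented for
illustration, not what fsmonitor actually uses):

        #include <CoreServices/CoreServices.h>
        #include <dispatch/dispatch.h>

        static void cb(ConstFSEventStreamRef stream, void *ctx,
                       size_t num_events, void *paths,
                       const FSEventStreamEventFlags flags[],
                       const FSEventStreamEventId ids[])
        {
                /* fsmonitor's callback batches up the paths and
                 * eventually calls fsmonitor_publish() */
        }

        int main(void)
        {
                CFStringRef dir = CFSTR("/tmp/watched");
                CFArrayRef paths = CFArrayCreate(NULL, (const void **)&dir,
                                                 1, &kCFTypeArrayCallBacks);
                FSEventStreamRef stream =
                        FSEventStreamCreate(NULL, cb, NULL, paths,
                                            kFSEventStreamEventIdSinceNow,
                                            0.1, /* latency, in seconds */
                                            kFSEventStreamCreateFlagFileEvents);
                dispatch_queue_t dq = dispatch_queue_create("watcher", NULL);

                FSEventStreamSetDispatchQueue(stream, dq);
                if (!FSEventStreamStart(stream))
                        return 1;

                /* Start() returns immediately; events arrive
                 * asynchronously on the dispatch queue, so this thread is
                 * free to block until somebody tells it to shut down. */
                dispatch_main();
        }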

In which case I don't see any particular deadlock. It's just that
somehow establishing our cookie-wait in the thread servicing the client
raced with there being an actual event to report.
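
For reference, the wake-up that thread #3 is waiting for is (roughly;
I'm paraphrasing from memory rather than quoting the code) of this
shape, run from fsmonitor_publish() once the relevant filesystem events
come in:

        pthread_mutex_lock(&state->main_lock);
        /* ...for each cookie satisfied by this batch of events... */
        cookie->result = FCIR_SEEN;
        pthread_cond_broadcast(&state->cookies_cond);
        pthread_mutex_unlock(&state->main_lock);

So absent an abort, if that publish never happens, nothing wakes the
waiter.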

I _think_ I un-stuck things at one point by just touching files in the
working tree, but I'm not 100% sure (I was doing a lot of things, and
it's hard to tell exactly when they got unstuck). But that would lend
support to that line of thinking.

I notice there's a similar hang and fix here:

  https://lore.kernel.org/git/pull.1802.git.1727577690390.gitgitgadget@xxxxxxxxx/

but I don't think it's quite the same thing. In our repo there are no
submodules (in fact there are no working tree files at all!).

So I'm not sure where to go from here. It seems like a race, but if so
it happened before I attached the debugger. And I'm not at all familiar
with this code, let alone its timing assumptions. I tried turning on
trace2 logging and re-running (since there are some trace2 event calls
in the code that might help us see it more in action), but doing so
seems to make the race go away. :(

The hanging wait comes from b05880d357 (fsmonitor--daemon: use a cookie
file to sync with file system, 2022-03-25). My understanding is that the
author has retired, but Johannes is listed as a co-author. So any
insight is appreciated. :)

-Peff



