Re: fsmonitor deadlock / macOS CI hangs

Jeff King <peff@xxxxxxxx> · Wed, 2 Oct 2024 02:25:39 -0400

On Wed, Oct 02, 2024 at 10:46:03AM +0900, Koji Nakamaru wrote:

> Perhaps I found the cause. fsmonitor_run_daemon_1() starts the fsevent
> Normally FSEventStreamStart() is called before
> with_lock__wait_for_cookie() creates a cookie file, but this is not
> guaranteed. We can reproduce the issue easily if we modify
> fsm_listen__loop() as below:
> 
>   --- a/compat/fsmonitor/fsm-listen-darwin.c
>   +++ b/compat/fsmonitor/fsm-listen-darwin.c
>   @@ -510,6 +510,7 @@ void fsm_listen__loop(struct
> fsmonitor_daemon_state *state)
>           FSEventStreamSetDispatchQueue(data->stream, data->dq);
>           data->stream_scheduled = 1;
> 
>   +       sleep(1);
>           if (!FSEventStreamStart(data->stream)) {
>                   error(_("Failed to start the FSEventStream"));
>                   goto force_error_stop_without_loop;

Ah, good catch. Yes, with that sleep it is much easier to create the
situation in the debugger.

So if I understand (and bear with me if you already know this, but I am
just figuring out how fsmonitor is supposed to work), the expected
order of events is:

  - the thread serving the client creates a cookie file (something like
    ".git/fsmonitor--daemon/cookies/1234-0", based on the pid) and then
    waits for the pthread_cond to be triggered

  - the thread listening for fs events should see that cookie file
    creation, which gets us to with_lock__mark_cookies_seen(), which
    then triggers the cond

But with the sleep, the file creation event happens before the fs event
is actually listening, so we never hear it. We can "unstick" it by
doing:

  # where "1234-0" is the unique cookie file, which you can get by
  # looking at the output from GIT_TRACE_FSMONITOR=1
  echo foo >trash directory.t9211-scalar-clone/cloned/src/.git/fsmonitor--daemon/cookies/1234-0

which then triggers us to mark the cookies seen. I have no clue why
there is this elaborate communication for two threads which are already
sharing a memory space, but let's assume for a moment that it's
required.

The solution then is to avoid letting any client threads run (or at
least create cookies) until we know we're listening for fs events.
Naively I would say we should just have the fs event thread signal us
that it's ready before we spawn any of the client threads. But it looks
like we explicitly start those threads first in fsmonitor_run_daemon_1():

       /*
         * Start the IPC thread pool before the we've started the file
         * system event listener thread so that we have the IPC handle
         * before we need it.
         */
        if (ipc_server_run_async(&state->ipc_server_data,
                                 state->path_ipc.buf, &ipc_opts,
                                 handle_client, state))
                return error_errno(
                        _("could not start IPC thread pool on '%s'"),
                        state->path_ipc.buf);

So I guess we need to let the ipc server initialize itself and then wait
to be told to start serving requests. And then the fs event thread can
trigger that "go ahead" once the fs event stream is started. Which is a
little tricky, but possible with an extra mutex.

But hmm. I wonder if we can hack around it like this:

diff --git a/compat/fsmonitor/fsm-listen-darwin.c b/compat/fsmonitor/fsm-listen-darwin.c
index 2fc6744..6b06faf 100644
--- a/compat/fsmonitor/fsm-listen-darwin.c
+++ b/compat/fsmonitor/fsm-listen-darwin.c
@@ -510,11 +510,17 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
	FSEventStreamSetDispatchQueue(data->stream, data->dq);
	data->stream_scheduled = 1;

+	sleep(1);
	if (!FSEventStreamStart(data->stream)) {
		error(_("Failed to start the FSEventStream"));
		goto force_error_stop_without_loop;
	}
	data->stream_started = 1;
+	/*
+	 * In case any clients showed up before we initialized the event stream,
+	 * abort all in-progress ops and start fresh.
+	 */
+	fsmonitor_force_resync(state);

	pthread_mutex_lock(&data->dq_lock);
	pthread_cond_wait(&data->dq_finished, &data->dq_lock);


The forced resync aborts the cookies, which triggers the waiting
clients. I'm not sure how "bad" it is to do a resync like this. It's
something that can happen in the normal course of events, so in theory
it's just non-optimal and nobody will do the wrong thing. And this would
only happen once at the start of fsmonitor, so in most cases nobody
would have connected yet, and we wouldn't have processed any events.

So I dunno. It does make the hang go away in this case.

-Peff