Re: [PATCH 19/23] fsmonitor--daemon: use a cookie file to sync with file system

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 27 Apr 2021 10:23:54 -0400

On 4/1/2021 11:41 AM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
> 
> Teach fsmonitor--daemon client threads to create a cookie file
> inside the .git directory and then wait until FS events for the
> cookie are observed by the FS listener thread.
> 
> This helps address the racy nature of file system events by
> blocking the client response until the kernel has drained any
> event backlog.

This description matches my expectation of the cookie file,
which furthers my confusion about GIT_TEST_FSMONITOR_CLIENT_DELAY.

> +enum fsmonitor_cookie_item_result {
> +	FCIR_ERROR = -1, /* could not create cookie file ? */
> +	FCIR_INIT = 0,
> +	FCIR_SEEN,
> +	FCIR_ABORT,
> +};
> +
> +struct fsmonitor_cookie_item {
> +	struct hashmap_entry entry;
> +	const char *name;
> +	enum fsmonitor_cookie_item_result result;
> +};
> +
> +static int cookies_cmp(const void *data, const struct hashmap_entry *he1,
> +		     const struct hashmap_entry *he2, const void *keydata)

I'm interested to see why a hashset is necessary.

> +static enum fsmonitor_cookie_item_result fsmonitor_wait_for_cookie(
> +	struct fsmonitor_daemon_state *state)
> +{
> +	int fd;
> +	struct fsmonitor_cookie_item cookie;
> +	struct strbuf cookie_pathname = STRBUF_INIT;
> +	struct strbuf cookie_filename = STRBUF_INIT;
> +	const char *slash;
> +	int my_cookie_seq;
> +
> +	pthread_mutex_lock(&state->main_lock);

Hm. We are entering a locked region. I hope this is only for the
cookie write and not the entire waiting period.

> +	my_cookie_seq = state->cookie_seq++;
> +
> +	strbuf_addbuf(&cookie_pathname, &state->path_cookie_prefix);
> +	strbuf_addf(&cookie_pathname, "%i-%i", getpid(), my_cookie_seq);
> +
> +	slash = find_last_dir_sep(cookie_pathname.buf);
> +	if (slash)
> +		strbuf_addstr(&cookie_filename, slash + 1);
> +	else
> +		strbuf_addbuf(&cookie_filename, &cookie_pathname);

This business about the slash-or-not-slash is good defensive
programming. I imagine the only possible way for there to not
be a slash is if the Git process is running with the .git
directory as its working directory?

> +	cookie.name = strbuf_detach(&cookie_filename, NULL);
> +	cookie.result = FCIR_INIT;
> +	// TODO should we have case-insenstive hash (and in cookie_cmp()) ??

This TODO comment should be cleaned up. Doesn't match C-style, either.

As for the question, I believe that we can limit ourselves to names that
don't need case-insensitive hashes and trust that the filesystem will not
change the case. Using lowercase letters should help with this.

> +	hashmap_entry_init(&cookie.entry, strhash(cookie.name));
> +
> +	/*
> +	 * Warning: we are putting the address of a stack variable into a
> +	 * global hashmap.  This feels dodgy.  We must ensure that we remove
> +	 * it before this thread and stack frame returns.
> +	 */
> +	hashmap_add(&state->cookies, &cookie.entry);

I saw this warning and thought about avoiding it by using the heap, but
even with a heap pointer we need to be careful to remove the result
before returning and stopping the thread.

However, there is likely a higher potential of a bug leading to a
security issue through an error causing stack corruption and unsafe
code execution. Perhaps it is worth converting to using heap data here.

> +	trace_printf_key(&trace_fsmonitor, "cookie-wait: '%s' '%s'",
> +			 cookie.name, cookie_pathname.buf);
> +
> +	/*
> +	 * Create the cookie file on disk and then wait for a notification
> +	 * that the listener thread has seen it.
> +	 */
> +	fd = open(cookie_pathname.buf, O_WRONLY | O_CREAT | O_EXCL, 0600);
> +	if (fd >= 0) {
> +		close(fd);
> +		unlink_or_warn(cookie_pathname.buf);

Interesting that we are ignoring the warning here. Is it possible that
these cookie files will continue to grow if this unlink fails?

> +
> +		while (cookie.result == FCIR_INIT)
> +			pthread_cond_wait(&state->cookies_cond,
> +					  &state->main_lock);

Ok, we are waiting here for another thread to signal that the cookie
file has been found in the events. What happens if the event gets lost?
I'll look for a later signal that cookie.result can change based on a
timeout, too.

> +
> +		hashmap_remove(&state->cookies, &cookie.entry, NULL);
> +	} else {
> +		error_errno(_("could not create fsmonitor cookie '%s'"),
> +			    cookie.name);
> +
> +		cookie.result = FCIR_ERROR;
> +		hashmap_remove(&state->cookies, &cookie.entry, NULL);
> +	}

Both blocks here remove the cookie entry, so move it to the end of the
method with the other cleanups.

> +
> +	pthread_mutex_unlock(&state->main_lock);

Hm. We are locking the main state throughout this process. I suppose that
the listener thread could be watching multiple repos and updating them
while we wait here for one repo to update. This is a larger lock window
than I was hoping for, but I don't currently see how to reduce it safely.

> +
> +	free((char*)cookie.name);
> +	strbuf_release(&cookie_pathname);
> +	return cookie.result;

Remove the cookie from the hashset along with these lines.

> +}
> +
> +/*
> + * Mark these cookies as _SEEN and wake up the corresponding client threads.
> + */
> +static void fsmonitor_cookie_mark_seen(struct fsmonitor_daemon_state *state,
> +				       const struct string_list *cookie_names)
> +{
> +	/* assert state->main_lock */

I'm now confused what this is trying to document. The 'state' should be
locked by another thread while we are waiting for a cookie response, so
this method is updating the cookie as seen from a different thread that
doesn't have the lock.

...
> +/*
> + * Set _ABORT on all pending cookies and wake up all client threads.
> + */
> +static void fsmonitor_cookie_abort_all(struct fsmonitor_daemon_state *state)
...

> + * [2] Some of those lost events may have been for cookie files.  We
> + *     should assume the worst and abort them rather letting them starve.
> + *
>   * If there are no readers of the the current token data series, we
>   * can free it now.  Otherwise, let the last reader free it.  Either
>   * way, the old token data series is no longer associated with our
> @@ -454,6 +600,8 @@ void fsmonitor_force_resync(struct fsmonitor_daemon_state *state)
>  			 state->current_token_data->token_id.buf,
>  			 new_one->token_id.buf);
>  
> +	fsmonitor_cookie_abort_all(state);
> +

I see we abort here if we force a resync. I lost the detail of whether
this is triggered by a timeout, too.

> @@ -654,6 +803,39 @@ static int do_handle_client(struct fsmonitor_daemon_state *state,
>  		goto send_trivial_response;
>  	}
>  
> +	pthread_mutex_unlock(&state->main_lock);
> +
> +	/*
> +	 * Write a cookie file inside the directory being watched in an
> +	 * effort to flush out existing filesystem events that we actually
> +	 * care about.  Suspend this client thread until we see the filesystem
> +	 * events for this cookie file.
> +	 */
> +	cookie_result = fsmonitor_wait_for_cookie(state);

Odd that we unlock before calling this method, then just take the lock
again inside of it.

> +	if (cookie_result != FCIR_SEEN) {
> +		error(_("fsmonitor: cookie_result '%d' != SEEN"),
> +		      cookie_result);
> +		result = 0;
> +		goto send_trivial_response;
> +	}
> +
> +	pthread_mutex_lock(&state->main_lock);
> +
> +	if (strcmp(requested_token_id.buf,
> +		   state->current_token_data->token_id.buf)) {
> +		/*
> +		 * Ack! The listener thread lost sync with the filesystem
> +		 * and created a new token while we were waiting for the
> +		 * cookie file to be created!  Just give up.
> +		 */
> +		pthread_mutex_unlock(&state->main_lock);
> +
> +		trace_printf_key(&trace_fsmonitor,
> +				 "lost filesystem sync");
> +		result = 0;
> +		goto send_trivial_response;
> +	}
> +
>  	/*
>  	 * We're going to hold onto a pointer to the current
>  	 * token-data while we walk the list of batches of files.
> @@ -982,6 +1164,9 @@ void fsmonitor_publish(struct fsmonitor_daemon_state *state,
>  		}
>  	}
>  
> +	if (cookie_names->nr)
> +		fsmonitor_cookie_mark_seen(state, cookie_names);
> +

I was confused as to what updates 'cookie_names', but it appears that
these are updated in the platform-specific code. That seems to happen
in later patches.

Thanks,
-Stolee