Re: [PATCH v3 6/7] builtin/maintenance: add a `--detach` flag

Patrick Steinhardt <ps@xxxxxx> · Mon, 19 Aug 2024 08:17:28 +0200

On Sat, Aug 17, 2024 at 03:09:24AM -0400, Jeff King wrote:
> On Fri, Aug 16, 2024 at 12:45:15PM +0200, Patrick Steinhardt wrote:
> 
> > +test_expect_success '--no-detach causes maintenance to not run in background' '
> > [...]
> > +		# We have no better way to check whether or not the task ran in
> > +		# the background than to verify whether it output anything. The
> > +		# next testcase checks the reverse, making this somewhat safer.
> > +		git maintenance run --no-detach >out 2>&1 &&
> > +		test_line_count = 1 out
> > [...]
> > +test_expect_success '--detach causes maintenance to run in background' '
> > +	test_when_finished "rm -rf repo" &&
> > +	git init repo &&
> > +	(
> > +		cd repo &&
> > +
> > +		test_commit something &&
> > +		git config set maintenance.gc.enabled false &&
> > +		git config set maintenance.loose-objects.enabled true &&
> > +		git config set maintenance.loose-objects.auto 1 &&
> > +		git config set maintenance.incremental-repack.enabled true &&
> > +
> > +		git maintenance run --detach >out 2>&1 &&
> > +		test_must_be_empty out
> > +	)
> > +'
> 
> This second test seems to fail racily (or maybe always? see below). In
> CI on Windows, I saw:
> 
>   'out' is not empty, it contains:
>   fc9fea69579f349e3b02e3264cffbef03e4b1852
> 
> That would make sense to me if the detached process still held the
> original stdout/stderr channel open (in which case we'd racily see the
> same line as in the no-detach case). But we do appear to call
> daemonize(), which closes both.
> 
> Curiously, the code in gc.c does this:
> 
>           /* Failure to daemonize is ok, we'll continue in foreground. */
>           if (opts->detach > 0)
>                   daemonize();
> 
> and the only way for daemonize to fail is if NO_POSIX_GOODIES is set.
> Which I'd expect on Windows. But then I'd expect this test to _always_
> fail on Windows. Does it? If so, should it be marked with !MINGW?
> 
> While investigating that, I ran it with --stress locally (on Linux) and
> got some odd (and definitely racy) results. The test itself passes, but
> the "rm -rf repo" in the test_when_finished sometimes fails with:
> 
>   rm: cannot remove 'repo/.git/objects': Directory not empty
> 
> or similar (sometimes it's another directory like 'repo/.git'). My guess
> is that the background process is still running and creating files in
> the repository, racing with rm's call to rmdir().
> 
> Even if we remove the test_when_finished, it would mean that the final
> cleanup after test_done might similarly fail, leaving a crufty trash
> directory. I think to make this robust, we'd need some way of detecting
> when the background process has finished. I don't think we report the
> pid anywhere, and the daemonize() call means it won't even be in the
> same process group. Maybe we could spin looking for the incremental pack
> it will create (and timeout after N seconds)? That feels pretty hacky,
> but I can't think of anything better.

Oh, good catch indeed, I can reproduce this on Linux with `--stress`.

I think we can use the same workaround as we do in t6500, namely open a
new file descriptor, inherit it and then wait for the child process to
close it.

Thanks!

Patrick