Re: [PATCH v3 7/7] run-command: fix detaching when running auto maintenance

Patrick Steinhardt <ps@xxxxxx> · Mon, 19 Aug 2024 11:04:42 +0200

On Mon, Aug 19, 2024 at 04:46:14AM -0400, Jeff King wrote:
> On Mon, Aug 19, 2024 at 08:17:22AM +0200, Patrick Steinhardt wrote:
> 
> > >   2. Having racy background maintenance doesn't seem great for test
> > >      robustness. At the very least, it might subject us to the "rm"
> > >      problems mentioned elsewhere, where we fail to clean up. Annotating
> > >      individual "git gc" or "git maintenance" calls with an extra
> > >      descriptor isn't too bad, but in this case it's all happening under
> > >      the hood via fetch. Is it a potential problem for every script,
> > >      then? If so, should we disable background detaching for all test
> > >      repos, and then let the few that want to test it turn it back on?
> > 
> > Might be a good idea to set `maintenance.autoDetach=false` globally,
> > yes. The only downside is of course that it wouldn't cause us to detect
> > failures like the above, where the concurrency itself causes failure.
> > 
> > Anyway, for now I'll:
> > 
> >   - Send a patch to fix the race in t7900.
> > 
> >   - Investigate the reftable concurrency issue.
> > 
> >   - _Not_ send a patch that sets `maintenance.autoDetach=false`.
> 
> That sounds like a good direction. I do suspect there are at least _two_
> races in t7900:
> 
>   1. the detached maintenance that we run explicitly, which causes the
>      "rm" cleanup to fail
> 
>   2. whatever earlier test is kicking off detached maintenance via "git
>      fetch", which is causing the reftable concurrency issue.
> 
> Fixing (1) should be easy (and it looks like you've already sent a
> series). Fixing the reftable code will stop us from segfaulting for (2),
> but I wonder if that detached maintenance might cause similar "rm" style
> problems elsewhere.

It certainly might. The only reason why I don't want to send that patch
now is that it feels a bit too reactionary. We haven't had issues in the
past with it to the best of my knowledge, even though it is an issue in
theory.

We can certainly revisit that though if we see that it indeed is a more
widespread issue.

> > The last one requires a bit more discussion first, and we have been
> > running with `gc.autoDetach=true` implicitly in the past. Thinking a bit
> > more about it, the reason why the above bug triggers now is that
> > git-gc(1) itself runs git-pack-refs(1), but does that _synchronously_
> > before detaching itself. Now we detach at a higher level in the
> > hierarchy, which means that the previously-synchronous part now runs
> > asynchronously, as well.
> 
> That makes sense. I guess we've perhaps been doing background gc for a
> long time, then, just not in the refs? In practice most repos in the
> test suite aren't big enough to trigger auto-gc anyway, so it may only
> affect a handful of scripts.

Yup. We should expect this to work just fine, because it can trigger
regardless of whether or not we run auto-compaction concurrently or not.
After all, the reftable backend even performs auto-compaction after
every write to the table, so any two concurrent writes may hit the bug
without even invoking git-maintenance(1) at all.

> Once the reftable issue is fixed, it's possible that the lingering
> detached processes don't cause a problem in practice (because they're
> not really writing much, and/or have finished by the time anybody else
> gets to cleanup), and we can just live with them. But I'm worried that
> sounds like wishful thinking. ;)

Could certainly be, yeah. I just want to focus on fixing the immediate
issues before we jump to conclusions too fast.

Patrick