On Mon, Aug 19, 2024 at 08:17:22AM +0200, Patrick Steinhardt wrote: > > 2. Having racy background maintenance doesn't seem great for test > > robustness. At the very least, it might subject us to the "rm" > > problems mentioned elsewhere, where we fail to clean up. Annotating > > individual "git gc" or "git maintenance" calls with an extra > > descriptor isn't too bad, but in this case it's all happening under > > the hood via fetch. Is it a potential problem for every script, > > then? If so, should we disable background detaching for all test > > repos, and then let the few that want to test it turn it back on? > > Might be a good idea to set `maintenance.autoDetach=false` globally, > yes. The only downside is of course that it wouldn't cause us to detect > failures like the above, where the concurrency itself causes failure. > > Anyway, for now I'll: > > - Send a patch to fix the race in t7900. > > - Investigate the reftable concurrency issue. > > - _Not_ send a patch that sets `maintenance.autoDetach=false`. That sounds like a good direction. I do suspect there are at least _two_ races in t7900: 1. the detached maintenance that we run explicitly, which causes the "rm" cleanup to fail 2. whatever earlier test is kicking off detached maintenance via "git fetch", which is causing the reftable concurrency issue. Fixing (1) should be easy (and it looks like you've already sent a series). Fixing the reftable code will stop us from segfaulting for (2), but I wonder if that detached maintenance might cause similar "rm" style problems elsewhere. > The last one requires a bit more discussion first, and we have been > running with `gc.autoDetach=true` implicitly in the past. Thinking a bit > more about it, the reason why the above bug triggers now is that > git-gc(1) itself runs git-pack-refs(1), but does that _synchronously_ > before detaching itself. Now we detach at a higher level in the > hierarchy, which means that the previously-synchronous part now runs > asynchronously, as well. That makes sense. I guess we've perhaps been doing background gc for a long time, then, just not in the refs? In practice most repos in the test suite aren't big enough to trigger auto-gc anyway, so it may only affect a handful of scripts. Once the reftable issue is fixed, it's possible that the lingering detached processes don't cause a problem in practice (because they're not really writing much, and/or have finished by the time anybody else gets to cleanup), and we can just live with them. But I'm worried that sounds like wishful thinking. ;) -Peff