On Sat, Aug 17, 2024 at 08:14:24AM -0400, Jeff King wrote:
> On Fri, Aug 16, 2024 at 12:45:17PM +0200, Patrick Steinhardt wrote:
>
> > Fix this bug by asking git-gc(1) to not detach when it is being invoked
> > via git-maintenance(1). Instead, git-maintenance(1) now respects a new
> > config "maintenance.autoDetach", the equivalent of "gc.autoDetach", and
> > detaches itself into the background when running as part of our auto
> > maintenance. This should continue to behave the same for all users who
> > only use the git-gc(1) task. For others though, it means that we now
> > properly perform all tasks in the background. The default behaviour of
> > git-maintenance(1) when executed by the user does not change: it will
> > remain in the foreground unless they pass the `--detach` option.
>
> This patch seems to cause segfaults in t5616 when combined with the
> reftable backend. Try this:
>
>   GIT_TEST_DEFAULT_REF_FORMAT=reftable ./t5616-partial-clone.sh --run=1-16 --stress
>
> which fails for me within a few runs. Bisecting leads to 98077d06b2
> (run-command: fix detaching when running auto maintenance, 2024-08-16).
> It doesn't trigger with the files ref backend.
>
> Compiling with ASan gets me a stack trace like this:
>
>   + git -c protocol.version=0 -C pc1 fetch --filter=blob:limit=29999 --refetch origin
>   AddressSanitizer:DEADLYSIGNAL
>   =================================================================
>   ==657994==ERROR: AddressSanitizer: SEGV on unknown address 0x7fa0f0ec6089 (pc 0x55f23e52ddf9 bp 0x7ffe7bfa1700 sp 0x7ffe7bfa1700 T0)
>   ==657994==The signal is caused by a READ memory access.
>       #0 0x55f23e52ddf9 in get_var_int reftable/record.c:29
>       #1 0x55f23e53295e in reftable_decode_keylen reftable/record.c:170
>       #2 0x55f23e532cc0 in reftable_decode_key reftable/record.c:194
>       #3 0x55f23e54e72e in block_iter_next reftable/block.c:398
>       #4 0x55f23e5573dc in table_iter_next_in_block reftable/reader.c:240
>       #5 0x55f23e5573dc in table_iter_next reftable/reader.c:355
>       #6 0x55f23e5573dc in table_iter_next reftable/reader.c:339
>       #7 0x55f23e551283 in merged_iter_advance_subiter reftable/merged.c:69
>       #8 0x55f23e55169e in merged_iter_next_entry reftable/merged.c:123
>       #9 0x55f23e55169e in merged_iter_next_void reftable/merged.c:172
>       #10 0x55f23e537625 in reftable_iterator_next_ref reftable/generic.c:175
>       #11 0x55f23e2cf9c6 in reftable_ref_iterator_advance refs/reftable-backend.c:464
>       #12 0x55f23e2d996e in ref_iterator_advance refs/iterator.c:13
>       #13 0x55f23e2d996e in do_for_each_ref_iterator refs/iterator.c:452
>       #14 0x55f23dca6767 in get_ref_map builtin/fetch.c:623
>       #15 0x55f23dca6767 in do_fetch builtin/fetch.c:1659
>       #16 0x55f23dca6767 in fetch_one builtin/fetch.c:2133
>       #17 0x55f23dca6767 in cmd_fetch builtin/fetch.c:2432
>       #18 0x55f23dba7764 in run_builtin git.c:484
>       #19 0x55f23dba7764 in handle_builtin git.c:741
>       #20 0x55f23dbab61e in run_argv git.c:805
>       #21 0x55f23dbab61e in cmd_main git.c:1000
>       #22 0x55f23dba4781 in main common-main.c:64
>       #23 0x7fa0f063fc89 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>       #24 0x7fa0f063fd44 in __libc_start_main_impl ../csu/libc-start.c:360
>       #25 0x55f23dba6ad0 in _start (git+0xadfad0) (BuildId: 803b2b7f59beb03d7849fb8294a8e2145dd4aa27)
>
> My guess based on what I'm seeing and what the patch does is that now
> maintenance from a previous command is running in the background while
> our foreground git-fetch runs, and it somehow confuses things (perhaps
> by trying to compact reftables or something?).
>
> So I think there are two problems:
>
>   1. The reftable code needs to be more robust against whatever race is
>      happening. I didn't dig further, but I'm sure it would be possible
>      to produce a coredump.

Yes, it certainly has to be robust against this. It's also where the
recent reftable compaction fixes [1] came from. In theory, it should
work alright with one process compacting the stack while another
process is reading it. In practice, the backend is still in its
infancy, so I'm not entirely surprised that there are concurrency bugs.
I will investigate this issue, thanks a lot for the backtrace.

[1]: https://lore.kernel.org/git/cover.1722435214.git.ps@xxxxxx/

>   2. Having racy background maintenance doesn't seem great for test
>      robustness. At the very least, it might subject us to the "rm"
>      problems mentioned elsewhere, where we fail to clean up. Annotating
>      individual "git gc" or "git maintenance" calls with an extra
>      descriptor isn't too bad, but in this case it's all happening under
>      the hood via fetch. Is it a potential problem for every script,
>      then? If so, should we disable background detaching for all test
>      repos, and then let the few that want to test it turn it back on?

Might be a good idea to set `maintenance.autoDetach=false` globally,
yes. The only downside is of course that it wouldn't cause us to detect
failures like the above, where the concurrency itself causes the
failure.
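For illustration, a minimal sketch of what that could look like (the
config name is the one introduced by the patch under discussion; where
exactly we'd hook this into the test setup is up for debate):

    # In the common test setup, default all test repos to synchronous
    # maintenance. The tests run with a private $HOME, so this does not
    # leak outside of the test environment:
    git config --global maintenance.autoDetach false

    # An individual test that wants to exercise detaching could then
    # opt back in, e.g. via the test_config helper:
    test_config maintenance.autoDetach true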
Anyway, for now I'll:

  - Send a patch to fix the race in t7900.

  - Investigate the reftable concurrency issue.

  - _Not_ send a patch that sets `maintenance.autoDetach=false`.

The last one requires a bit more discussion first, and we have been
running with `gc.autoDetach=true` implicitly in the past.

Thinking a bit more about it, the reason why the above bug triggers now
is that git-gc(1) itself runs git-pack-refs(1), but does so
_synchronously_ before detaching itself. Now we detach at a higher
level in the hierarchy, which means that the previously-synchronous
part runs asynchronously, as well. I cannot think of a reason why we
shouldn't do this, as the ref backends should handle it gracefully.
The fact that the reftable backend doesn't is a separate, preexisting
bug.
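To make the sequencing explicit, here is my mental model of the two
process trees (illustrative only, not the exact command lines):

    # Before 98077d06b2: git-gc(1) packs refs before daemonizing, so
    # pack-refs is still synchronous with the command that triggered it.
    fetch
    `- gc --auto                (foreground)
       |- pack-refs             (foreground)
       `- daemonize()           (only the remaining work is backgrounded)

    # After: git-maintenance(1) detaches as a whole, so everything it
    # runs races with whatever command the user runs next.
    fetch
    `- maintenance run --auto   (detaches itself)
       `- gc                    (background, told not to detach again)
          `- pack-refs          (background, may rewrite the reftable
                                 stack while a reader is iterating it)

Patrick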