On Wed, Mar 13 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
>> I'm still working on fixing a race condition I encountered in "gc"
>> recently, but am not 100% sure of the fix. So I thought I'd send a
>> braindump of what I have so far in case it jolts any memories.
>>
>> The problem is that when we "gc" we run "reflog expire --all". This
>> iterates over the reflogs, and takes a *.lock for each reference.
>>
>> It'll fail intermittently in two ways:
>>
>>  1. If something is concurrently committing to the repo it'll fail
>>     because we for a tiny amount of time hold a HEAD.lock file, so HEAD
>>     can't be updated.
>>
>>  2. On a repository that's just being "git fetch"'d by some concurrent
>>     process the "gc" will fail, because the ref's SHA1 has changed,
>>     which we inspect as we acquire the lock.
>
> Both sounds very much expected and expectable outcome.  I am not
> sure how they need to be called bugs.

Let's leave aside that I started the subject with "BUG:" and let me
rephrase.

I was under the impression that git-gc was supposed to support
operating on a repository that's concurrently being modified, as long
as you don't set the likes of gc.pruneExpire too aggressively.

Running a "gc" in a loop without "git reflog expire --all", and
watching the repository being GC'd with:

    fswatch -l 0.1 -t -r . 2>&1 | grep lock

we only create .git/MERGE_RR.lock, .git/gc.pid.lock and
.git/packed-refs.lock. These are all things that would only cause
another concurrent GC to fail, not a normal git command.

So the only reason a concurrent commit (case #1) fails is because of
the refs being locked during the reflog iteration, and similarly "gc"
itself will fail due to a concurrently updating ref (case #2).
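
To make case #1 concrete, here is a minimal sketch of the general
"create <name>.lock exclusively" pattern. It is not git's actual
lockfile code; take_lock() and drop_lock() are hypothetical names made
up for illustration. The point is only that a second taker fails for
as long as the first one holds the file:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take "<path>.lock" by creating it
 * exclusively. If the lockfile already exists, open() fails with
 * EEXIST. Returns an open fd on success, -1 on failure with errno set.
 * The lockfile's name is written to lock_path.
 */
static int take_lock(const char *path, char *lock_path, size_t len)
{
	int n = snprintf(lock_path, len, "%s.lock", path);

	if (n < 0 || (size_t)n >= len) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
}

/* Hypothetical helper: release the lock by closing and removing the file. */
static void drop_lock(int fd, const char *lock_path)
{
	close(fd);
	unlink(lock_path);
}
```

With this model, a "reflog expire" that holds the lock for HEAD, even
briefly, makes a concurrent take_lock() on the same path fail, which is
the shape of the failure a concurrent commit sees.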

It seems that first of all we need this; I'll submit it as a separate
patch sometime soon:

    diff --git a/builtin/gc.c b/builtin/gc.c
    index 020f725acc..ae488646e1 100644
    --- a/builtin/gc.c
    +++ b/builtin/gc.c
    @@ -127,6 +127,12 @@ static void gc_config(void)
     			pack_refs = git_config_bool("gc.packrefs", value);
     	}

    +	if (!git_config_get_value("gc.reflogexpire", &value) && value &&
    +	    !strcmp(value, "never") &&
    +	    !git_config_get_value("gc.reflogexpireunreachable", &value) && value &&
    +	    !strcmp(value, "never"))
    +		prune_reflogs = 0;
    +
     	git_config_get_int("gc.aggressivewindow", &aggressive_window);
     	git_config_get_int("gc.aggressivedepth", &aggressive_depth);
     	git_config_get_int("gc.auto", &gc_auto_threshold);

I.e. now even if your gc.* config says you don't want the reflogs
touched, we still pointlessly iterate over all of them. The case I'm
running into (a variant of #2) is one solved by that patch, i.e. I'm
fine with "gc" just having the reflogs kept forever as a workaround in
this case.

Something like that should have been added back in 62aad1849f ("gc
--auto: do not lock refs in the background", 2014-05-25), i.e. now the
"prune_reflogs" variable is never used, it's just cargo-culted from a
copy/pasting of the "pack_refs" code.

In other "gc" phases in "pack-objects" and "prune" we also look at the
reflogs.
This obviously bad patch ignores them entirely:

    diff --git a/builtin/prune.c b/builtin/prune.c
    index 97613eccb5..bccee7813e 100644
    --- a/builtin/prune.c
    +++ b/builtin/prune.c
    @@ -41,7 +41,7 @@ static void perform_reachability_traversal(struct rev_info *revs)

     	if (show_progress)
     		progress = start_delayed_progress(_("Checking connectivity"), 0);
    -	mark_reachable_objects(revs, 1, expire, progress);
    +	mark_reachable_objects(revs, 0, expire, progress);
     	stop_progress(&progress);
     	initialized = 1;
     }
    diff --git a/builtin/repack.c b/builtin/repack.c
    index 67f8978043..618ffbfe0a 100644
    --- a/builtin/repack.c
    +++ b/builtin/repack.c
    @@ -364,7 +364,6 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
     				 keep_pack_list.items[i].string);
     	argv_array_push(&cmd.args, "--non-empty");
     	argv_array_push(&cmd.args, "--all");
    -	argv_array_push(&cmd.args, "--reflog");
     	argv_array_push(&cmd.args, "--indexed-objects");
     	if (repository_format_partial_clone)
     		argv_array_push(&cmd.args, "--exclude-promisor-objects");

I'm just including that as illustration that add_reflogs_to_pending()
in revision.c during "gc" already iterates over the reflogs without
locking anything, but of course it's just reading them.

So one thing that would mitigate things a lot is if
files_reflog_expire() and its call to expire_reflog_ent() via
refs_for_each_reflog_ent() would lazily acquire the lock on the ref.

Digging a bit further that's actually what we're doing now since
4ff0f01cb7 ("refs: retry acquiring reference locks for 100ms",
2017-08-21). But this runs into the logic we've had for a long time, or
since your bda3a31cc7 ("reflog-expire: Avoid creating new files in a
directory inside readdir(3) loop", 2008-01-25) where we first loop over
all the refs in the process of finding the reflogs, and then will try
to lock those refs at those expected SHA-1s. If they've changed in the
meantime we error out and don't clean up the lockfile.
So just this fixes that:

    diff --git a/refs/files-backend.c b/refs/files-backend.c
    index ef053f716c..b6576f28b9 100644
    --- a/refs/files-backend.c
    +++ b/refs/files-backend.c
    @@ -3037,7 +3037,7 @@ static int files_reflog_expire(struct ref_store *ref_store,
     	 * reference itself, plus we might need to update the
     	 * reference if --updateref was specified:
     	 */
    -	lock = lock_ref_oid_basic(refs, refname, oid,
    +	lock = lock_ref_oid_basic(refs, refname, NULL,
     				  NULL, NULL, REF_NO_DEREF,
     				  &type, &err);
     	if (!lock) {

Which seems sensible to me. We'll still get the lock, we just don't
assert that the refname we use to get the lock must be at that
SHA-1. We'll still use it for the purposes of expiry.

But maybe I've missed some caveat in reflog_expiry_prepare() and
friends and we really do need the reflog at that OID, then this would
suck less:

    diff --git a/builtin/reflog.c b/builtin/reflog.c
    index 4d3430900d..4bb272fdc8 100644
    --- a/builtin/reflog.c
    +++ b/builtin/reflog.c
    @@ -625,12 +625,16 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
     		free_worktrees(worktrees);
     		for (i = 0; i < collected.nr; i++) {
     			struct collected_reflog *e = collected.e[i];
    +			int st;
     			set_reflog_expiry_param(&cb.cmd, explicit_expiry, e->reflog);
    -			status |= reflog_expire(e->reflog, &e->oid, flags,
    -						reflog_expiry_prepare,
    -						should_expire_reflog_ent,
    -						reflog_expiry_cleanup,
    -						&cb);
    +			st = reflog_expire(e->reflog, &e->oid, flags,
    +					   reflog_expiry_prepare,
    +					   should_expire_reflog_ent,
    +					   reflog_expiry_cleanup,
    +					   &cb);
    +			if (st == -2)
    +				continue;
    +			status |= st;
     			free(e);
     		}
     		free(collected.e);
    diff --git a/refs/files-backend.c b/refs/files-backend.c
    index ef053f716c..8b0b6b7b85 100644
    --- a/refs/files-backend.c
    +++ b/refs/files-backend.c
    @@ -3041,6 +3041,11 @@ static int files_reflog_expire(struct ref_store *ref_store,
     				  NULL, NULL, REF_NO_DEREF,
     				  &type, &err);
     	if (!lock) {
    +		if (errno == EBUSY) {
    +			warning("cannot lock ref '%s': %s. Skipping!", refname, err.buf);
    +			strbuf_release(&err);
    +			return -2;
    +		}
     		error("cannot lock ref '%s': %s", refname, err.buf);
     		strbuf_release(&err);
     		return -1;

I.e. we just detect the EBUSY that verify_lock() sets if the OID
doesn't match, and don't prune that reflog. As seen above
"pack-objects" and "prune" will still iterate over the same logs later
for the purposes of reachability, so this shouldn't get us into a
corrupt state due to throwing away objects referenced in those logs,
we'll just prune fewer things than we could have.

So I think I'll use the first patch noted above as a hack to solve the
narrow problem I have now, but any comments on the above most
welcome. I'm not very familiar with the ref code in case that wasn't
obvious already.

B.t.w. the mention of f3b661f766 ("expire_reflog(): use a lock_file for
rewriting the reflog file", 2014-12-12) upthread is irrelevant. That's
a commit where we use the lockfile code to write out the *new* reflog,
which is unrelated to all of this.
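
For anyone who wants to poke at the control flow of the skip-on-EBUSY
variant without building git, here is a toy standalone model of it. It
is only a sketch under loud assumptions: struct toy_ref and all the
toy_*() functions are invented for illustration, OIDs are plain
strings, and nothing is actually locked or expired. It only mirrors
the "-2 means skipped, don't fold it into status" idea:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Invented for illustration; not a git data structure. */
struct toy_ref {
	const char *name;
	const char *current_oid;   /* value the ref has right now */
	const char *collected_oid; /* value seen while enumerating reflogs */
};

/*
 * Stand-in for "lock the ref and verify it is still at the expected
 * OID". Returns 0 on success, or -1 with errno = EBUSY if the ref
 * moved since it was collected.
 */
static int toy_lock_ref(const struct toy_ref *ref)
{
	if (strcmp(ref->current_oid, ref->collected_oid)) {
		errno = EBUSY;
		return -1;
	}
	return 0;
}

/* Returns 0 on success, -2 if the ref was skipped, -1 on other errors. */
static int toy_reflog_expire(const struct toy_ref *ref)
{
	if (toy_lock_ref(ref) < 0) {
		if (errno == EBUSY) {
			fprintf(stderr, "warning: cannot lock ref '%s'. Skipping!\n",
				ref->name);
			return -2;
		}
		fprintf(stderr, "error: cannot lock ref '%s'\n", ref->name);
		return -1;
	}
	/* A real implementation would expire reflog entries here. */
	return 0;
}

/* Expire all refs; skipped refs are counted but do not affect status. */
static int toy_expire_all(const struct toy_ref *refs, size_t nr,
			  size_t *skipped)
{
	int status = 0;
	size_t i;

	*skipped = 0;
	for (i = 0; i < nr; i++) {
		int st = toy_reflog_expire(&refs[i]);

		if (st == -2) {
			(*skipped)++;
			continue;
		}
		status |= st;
	}
	return status;
}
```

Feeding it one ref whose two OIDs match and one whose OIDs differ
should give an overall status of 0 with one ref counted as skipped,
i.e. a concurrently updated ref no longer fails the whole run.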