Continue the work started in c45af94dbc ("gc: run pre-detach operations under lock", 2017-07-11). Now we'll take the lock before we print out "Auto packing the repository[...]", and we'll optimistically tolerate a locking failure at that point instead of dying. This (mostly) solves two issues: 1) Fetching in parallel from multiple remotes (or other parallel/concurrent "gc --auto") would, when the need_to_gc() heuristic fired, emit a bunch of "Auto packing the repository in background" messages. See [1] for a description of a "pfetch" alias that can trigger such an issue. Now we'll (usually) only print one of these messages. 2) Because we'd call hold_lock_file_for_update() with "LOCK_DIE_ON_ERROR" such concurrent "gc" would error out with "Another git process seems to be running[...]". Now we'll (usually) avoid that by saying that "gc --auto" locking isn't so important. We tolerate a lock failure on "gc --auto" with the assumption that a "gc --auto" runs frequently enough that any one such invocation can fail, whereas without "--auto" we'll error on failure to acquire the lock Why "mostly" and "usually"? There's still unaddressed caveats with these two, respectively: 1) The gc.pid lock is only held while the "git gc" runs. Thus a "pfetch" operation as described in [1] might start N jobs, one of which runs "gc", finishes, and then "gc --auto" finds it needs to run again in the context of a single (from the user's perspective) git command. I don't think that's worth dealing with. The cases where we'll find we need to take action "gc --auto" right after it's finished are rare (see e.g. [2]). 2) There's still the unresolved race condition noted in c45af94dbc where we "hand off" the lock to the child process we fork under gc.autoDetach=true. It means we might start another "gc --auto" if we're so unlucky as to make the check and get the lock in the time the "real" earlier "gc --auto" parent/child process takes to "unlock() && fork() && lock()". Fixing that is outside the scope of this change. It's fixable by refactoring the lockfile.c code and daemonize() so that we'd e.g.: 1. parent: lock() in parent, write PID to gc.pid 2. parent: fork() the child process 3. child: spin for X amount of time waiting until gc.pid lists our PID, not our parent's 4. parent: amend the gc.pid lock to s/PARENT_PID/CHILD_PID/ (in-place rename(gc.pid.new, gc.pid)), fsync(gc.pid) and exit() 5. child: notices updated gc.pid, proceeds with its gc. Does delete_tempfile(gc.pid) before exit() 6. parent: Once lock is handed over (child trusted to poll the lock) exit() without delete_tempfile(gc.pid) Using some cross-process lock handover like that we could guarantee that there's never a point at which an outside process could get the gc.pid lock, that the PID written in the file always referred to a live process, and with #3 reasonably guarantee (aside from pathological scheduling shenanigans) that we don't blindly hand the PID off to a process that's already finished with its gc. Testing for this new behavior is hard. 1. https://public-inbox.org/git/20170712200054.mxcabiyttijpbkbb@xxxxxxxxxxxxxxxxxxxxx/ (should be https://public-inbox.org/git/87bmopzbqx.fsf@xxxxxxxxx/ but vger seems to not have gotten a copy) 2. https://public-inbox.org/git/87fu6bmr0j.fsf@xxxxxxxxxxxxxxxxxxx/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> --- This patch is part of a WIP branch I have that's a bit of a mess. It's more-gc-detach-under-lock on github.com/avar/git.git. It doesn't apply on master because it relies on some previous test work, but for RFC purposes I figured it was better to send it stand-alone. But I think this sort of approach is better than Duy's proposed patch, because... On Wed, Jun 19 2019, Nguyễn Thái Ngọc Duy wrote: > So let's try to avoid that. We should only need one 'gc' run after all > objects and references are added anyway. Add a new option --no-auto-gc > that will be used by those n-1 processes. 'gc --auto' will always run on > the main fetch process (*). > > (*) even if we fetch remotes in parallel at some point in future, this > should still be fine because we should "join" all those processes > before this step. This is what I'm trying to fix in my version of this patch. This is only true for yours if you assume that the user is going to be invoking "fetch" in a single terminal window, IOW that we have an implicit global mutex of one top-level git command at a time. Wheras mine fixes e.g. the same issue for: parallel 'git fetch {}' ::: $(git remote) Ditto for you running a "git" command and your editor running a "fetch" at the same time. A similar "one terminal" assumption was made when changing the auto-detach behavior in your 62aad1849f ("gc --auto: do not lock refs in the background", 2014-05-25). To be clear, I think that (and your patch here) would (mostly) be an improvement. I just wonder if we can do better. I got stuck with this WIP series of mine because while it's mostly ready sans the 'Why "mostly" and "usually"' caveats mentioned in my commit message, I wondered if we couldn't do much better by: a) Do some version of reverting your 62aad1849f, this would entirely get rid of this need for handing off a lock to a child (noted above), but as-is would have e.g. "commit" run into a *.lock. Is that mitigated by 4ff0f01cb7 ("refs: retry acquiring reference locks for 100ms", 2017-08-21)? Or could we have "gc" run in some "lock-less" mode where it does all the work of expiring a ref/reflog at a given OID, and if it changed just tries again without needing to *.lock it except for the brief period of changing the already prepared file? b) If we couldn't do "a" for whatever reason implement some "locking priority" where a not-gc command could create a file (which "gc" would poll) saying "I want it for real work, release it!", or alternatively "gc" would create "*.locked-by-gc" in addition to "*.lock" files, and if the "*.locked-by-gc" was seen when the 100ms retry would kick in, we'd make that say 10 seconds instead of 100 ms. builtin/gc.c | 19 ++++++++++++------- lockfile.c | 2 ++ lockfile.h | 8 +++++++- t/t6500-gc.sh | 8 ++++---- 4 files changed, 25 insertions(+), 12 deletions(-) diff --git a/builtin/gc.c b/builtin/gc.c index d12316fa48..2cd5803e86 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -391,7 +391,7 @@ static int need_to_gc(void) } /* return NULL on success, else hostname running the gc */ -static const char *lock_repo_for_gc(int force, pid_t* ret_pid) +static const char *lock_repo_for_gc(int force, pid_t* ret_pid, int die_on_error) { struct lock_file lock = LOCK_INIT; char my_host[HOST_NAME_MAX + 1]; @@ -401,13 +401,17 @@ static const char *lock_repo_for_gc(int force, pid_t* ret_pid) FILE *fp; int fd; char *pidfile_path; + int flags = die_on_error ? LOCK_DIE_ON_ERROR : LOCK_QUIET_ON_ERROR; if (xgethostname(my_host, sizeof(my_host))) xsnprintf(my_host, sizeof(my_host), "unknown"); pidfile_path = git_pathdup("gc.pid"); - fd = hold_lock_file_for_update(&lock, pidfile_path, - LOCK_DIE_ON_ERROR); + fd = hold_lock_file_for_update(&lock, pidfile_path, flags); + if (fd < 0) { + assert(!die_on_error); /* should die in unable_to_lock_die() */ + return pidfile_path; + } if (!force) { static char locking_host[HOST_NAME_MAX + 1]; static char *scan_fmt; @@ -533,7 +537,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix) int auto_gc = 0; int quiet = -1; int force = 0; - const char *name; + const char *name = NULL; pid_t pid; int daemonized = 0; int keep_base_pack = -1; @@ -599,6 +603,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix) */ if (!need_to_gc()) return 0; + if (lock_repo_for_gc(force, &pid, 0)) + return 0; if (!quiet) { if (detach_auto) fprintf(stderr, _("Auto packing the repository in background for optimum performance.\n")); @@ -616,8 +622,6 @@ int cmd_gc(int argc, const char **argv, const char *prefix) /* Last gc --auto failed. Skip this one. */ return 0; - if (lock_repo_for_gc(force, &pid)) - return 0; gc_before_repack(); /* dies on failure */ git_test_sleep("GIT_TEST_GC_SLEEP_PRE_FORK"); delete_tempfile(&pidfile); @@ -644,7 +648,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix) string_list_clear(&keep_pack, 0); } - name = lock_repo_for_gc(force, &pid); + if (!pidfile) + name = lock_repo_for_gc(force, &pid, 1); if (name) { if (auto_gc) return 0; /* be quiet on --auto */ diff --git a/lockfile.c b/lockfile.c index 8e8ab4f29f..2a86d3a656 100644 --- a/lockfile.c +++ b/lockfile.c @@ -173,6 +173,8 @@ int hold_lock_file_for_update_timeout(struct lock_file *lk, const char *path, int flags, long timeout_ms) { int fd = lock_file_timeout(lk, path, flags, timeout_ms); + if (flags & LOCK_QUIET_ON_ERROR) + return fd; if (fd < 0) { if (flags & LOCK_DIE_ON_ERROR) unable_to_lock_die(path, errno); diff --git a/lockfile.h b/lockfile.h index 9843053ce8..dc366d70cc 100644 --- a/lockfile.h +++ b/lockfile.h @@ -135,10 +135,16 @@ struct lock_file { /* * ... this flag can be passed instead to return -1 and give the usual - * error message upon an error. + * error message upon an error, or ... */ #define LOCK_REPORT_ON_ERROR 4 +/* + * ... just be quiet and let the caller handle it by checking the + * return return value. + */ +#define LOCK_QUIET_ON_ERROR 8 + /* * Usually symbolic links in the destination path are resolved. This * means that (1) the lockfile is created by adding ".lock" to the diff --git a/t/t6500-gc.sh b/t/t6500-gc.sh index 363ca18980..4dea98f1c3 100755 --- a/t/t6500-gc.sh +++ b/t/t6500-gc.sh @@ -175,14 +175,14 @@ test_racy_gc_auto () { " } -test_racy_gc_auto failure false N/A N/A N/A +test_racy_gc_auto success false N/A N/A N/A for fork_works in true false do for sleep_before_fork in 0 1 do for sleep_before_fork_no_lock in 0 1 do - test_racy_gc_auto failure true $sleep_before_fork $sleep_before_fork_no_lock $fork_works + test_racy_gc_auto success true $sleep_before_fork $sleep_before_fork_no_lock $fork_works done done done @@ -204,8 +204,8 @@ test_racy_faked_gc_auto () { test_line_count = 0 errors " } -test_racy_faked_gc_auto failure true -test_racy_faked_gc_auto failure false +test_racy_faked_gc_auto success true +test_racy_faked_gc_auto success false run_and_wait_for_auto_gc () { # We read stdout from gc for the side effect of waiting until the -- 2.22.0.rc1.257.g3120a18244