On Tue, Mar 19 2019, Michael Haggerty wrote:

> Thanks for your work and for your thorough explanation of the change!

Hi. Yes, thanks a lot for the feedback. Just hadn't gotten around to
looping back to this yet & digging into the issue you raised.

> On 3/15/19 4:59 PM, Ævar Arnfjörð Bjarmason wrote:
>> During reflog expiry, the cmd_reflog_expire() function first iterates
>> over all reflogs in logs/*, and then one-by-one acquires the lock for
>> each one to expire its reflog by getting a *.lock file on the
>> corresponding loose ref[1] (even if the actual ref is packed).
>>
>> This lock is needed, but what isn't needed is locking the loose ref as
>> a function of the OID we found from that first iteration. By the time
>> we get around to re-visiting the reference some of the OIDs may have
>> changed.
>
> Instead of "what isn't needed is locking the loose ref as a function of
> the OID we found from that first iteration", I suggest "what isn't
> needed is to insist that the reference still has the OID that we found
> in that first iteration".
>
>> Thus the verify_lock() function called by the lock_ref_oid_basic()
>> function being changed here would fail with e.g. "ref '%s' is at %s
>> but expected %s" if the repository was being updated concurrent to the
>> "reflog expire".
>>
>> By not passing the OID to it we'll try to lock the reference
>> regardless of it last known OID. Locking as a function of the OID
>
> s/it/its/
>
>> would make "reflog expire" exit with a non-zero exit status under such
>> contention, which in turn meant that a "gc" command (which expires
>> reflogs before forking to the background) would encounter a hard
>> error.
>
> The last sentence seems mostly redundant with the previous paragraph.
>
>> This behavior of considering the OID when locking has been here ever
>> since "reflog expire" was initially implemented in 4264dc15e1 ("git
>> reflog expire", 2006-12-19).
>> As seen in that simpler initial version
>> of the code we subsequently use the OID to inform the expiry (and
>> still do), but never needed to use it to lock the reference associated
>> with the reflog.
>>
>> By locking the reference without considering what OID we last saw it
>> at, we won't encounter user-visible contention to the extent that
>> core.filesRefLockTimeout mitigates it. See 4ff0f01cb7 ("refs: retry
>> acquiring reference locks for 100ms", 2017-08-21).
>>
>> Unfortunately this sort of probabilistic contention is hard to turn
>> into a test. I've tested this by running the following three subshells
>> in concurrent terminals:
>>
>>     (
>>         cd /tmp &&
>>         rm -rf git &&
>>         git init git &&
>>         cd git &&
>>         while true
>>         do
>>             head -c 10 /dev/urandom | hexdump >out &&
>>             git add out &&
>>             git commit -m"out"
>>         done
>>     )
>>
>>     (
>>         cd /tmp &&
>>         rm -rf git-clone &&
>>         git clone file:///tmp/git git-clone &&
>>         cd git-clone &&
>>         while git pull
>>         do
>>             date
>>         done
>>     )
>>
>>     (
>>         cd /tmp/git-clone &&
>>         while git reflog expire --all
>>         do
>>             date
>>         done
>>     )
>>
>> Before this change the "reflog expire" would fail really quickly with
>> a "but expected" error. After this change both the "pull" and "reflog
>> expire" will run for a while, but eventually fail because I get
>> unlucky with core.filesRefLockTimeout (the "reflog expire" is in a
>> really tight loop). That can be resolved by being more generous with
>> higher values of core.filesRefLockTimeout than the 100ms default.
>>
>> As noted in the commentary being added here we also need to handle the
>> case of references being racily deleted, that can be tested by adding
>> this to the above:
>>
>>     (
>>         cd /tmp/git-clone &&
>>         while git branch topic master && git branch -D topic
>>         do
>>             date
>>         done
>>     )
>>
>> We could change lock_ref_oid_basic() to always pass down
>> RESOLVE_REF_READING to refs_resolve_ref_unsafe() and then
>> files_reflog_expire() to detect the "is it deleted?" state.
>> But let's
>> not bother, in the event of such a race we're going to redundantly
>> create a lock on the deleted reference, and shortly afterwards handle
>> that case and others with the refs_reflog_exists() check.
>>
>> 1. https://public-inbox.org/git/54857871.5090805@xxxxxxxxxxxx/
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
>> ---
>>  refs/files-backend.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/refs/files-backend.c b/refs/files-backend.c
>> index ef053f716c3..c7ed1792b3b 100644
>> --- a/refs/files-backend.c
>> +++ b/refs/files-backend.c
>> @@ -3036,8 +3036,14 @@ static int files_reflog_expire(struct ref_store *ref_store,
>>  	 * The reflog file is locked by holding the lock on the
>>  	 * reference itself, plus we might need to update the
>>  	 * reference if --updateref was specified:
>> +	 *
>> +	 * We don't pass down the oid here because we'd like to be
>> +	 * tolerant to the OID of the ref having changed, and to
>> +	 * gracefully handle the case where it's been deleted (see oid
>> +	 * -> mustexist -> RESOLVE_REF_READING in
>> +	 * lock_ref_oid_basic()) ...
>>  	 */
>> -	lock = lock_ref_oid_basic(refs, refname, oid,
>> +	lock = lock_ref_oid_basic(refs, refname, NULL,
>>  				  NULL, NULL, REF_NO_DEREF,
>>  				  &type, &err);
>
> This seems totally reasonable. But then later, where `oid` is passed to
> `(*prepare_fn)()`, I think you must pass `&(lock->old_oid)` instead,
> since we no longer have a guarantee that `oid` reflects the correct
> state of the reference. And after that, there is no need for this
> function to take an `oid` parameter at all (which also makes sense from
> an abstract point of view). Which means that the signatures of
> `refs_reflog_expire()`, `reflog_expire()`, `packed_reflog_expire()`, and
> `reflog_expire_fn` can also be changed, along with callers.
>
> I haven't had time yet to inspect those callers to see whether they
> might actually care that the `oid` that they used to pass to
> `reflog_expire()` isn't necessarily the one that gets passed back to
> their callbacks, but following the trail that I just outlined should
> make it possible to determine that.
>
>>  	if (!lock) {
>> @@ -3045,6 +3051,13 @@ static int files_reflog_expire(struct ref_store *ref_store,
>>  		strbuf_release(&err);
>>  		return -1;
>>  	}
>> +	/*
>> +	 * When refs are deleted their reflog is deleted before the
>> +	 * loose ref is deleted. This catches that case, i.e. when
>> +	 * racing against a ref deletion lock_ref_oid_basic() will
>> +	 * have acquired a lock on the now-deleted ref, but here's
>> +	 * where we find out it has no reflog anymore.
>> +	 */
>>  	if (!refs_reflog_exists(ref_store, refname)) {
>>  		unlock_ref(lock);
>>  		return 0;
>>
>
> Cheers,
> Michael
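
For illustration only, here is a rough, self-contained C sketch of the
flow discussed above: take the lock without insisting on a previously
seen OID, return quietly if the reflog has gone away in the meantime,
and hand the callback the OID recorded under the lock rather than the
one from the earlier iteration. This is a toy model under stated
assumptions. Every toy_* name is hypothetical, and none of it is git's
actual refs API or the code in refs/files-backend.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model; none of these types or functions exist in git. */
struct toy_ref {
	const char *name;
	char oid[41];     /* current value of the ref */
	bool has_reflog;  /* false once the ref has been racily deleted */
	bool locked;
};

struct toy_lock {
	struct toy_ref *ref;
	char old_oid[41]; /* value observed at the moment the lock was taken */
};

/* Receives the OID read under the lock, not the one from an earlier scan. */
typedef void toy_prepare_fn(const char *refname, const char *oid, void *cb_data);

/*
 * Take the lock. A non-NULL expected_oid insists that the ref still has
 * that value (the old behaviour); NULL locks regardless of the value.
 */
static int toy_lock_ref(struct toy_ref *ref, const char *expected_oid,
			struct toy_lock *lock)
{
	if (ref->locked)
		return -1;
	if (expected_oid && strcmp(ref->oid, expected_oid))
		return -1; /* models "ref is at X but expected Y" */
	ref->locked = true;
	lock->ref = ref;
	strcpy(lock->old_oid, ref->oid);
	return 0;
}

static void toy_unlock_ref(struct toy_lock *lock)
{
	lock->ref->locked = false;
}

/* Expire without an oid parameter: the lock is the source of truth. */
static int toy_reflog_expire(struct toy_ref *ref, toy_prepare_fn *prepare,
			     void *cb_data)
{
	struct toy_lock lock;

	if (toy_lock_ref(ref, NULL, &lock) < 0)
		return -1;
	if (!ref->has_reflog) {
		/* Lost a race against a deletion: nothing to expire. */
		toy_unlock_ref(&lock);
		return 0;
	}
	prepare(ref->name, lock.old_oid, cb_data);
	/* ... the actual expiry of reflog entries would happen here ... */
	toy_unlock_ref(&lock);
	return 0;
}

/* Example callback: remember which OID we were handed. */
static void toy_record_oid(const char *refname, const char *oid, void *cb_data)
{
	(void)refname;
	strcpy(cb_data, oid);
}
```

In this model, calling toy_lock_ref() with a stale expected OID fails,
while toy_reflog_expire() succeeds and passes the callback whatever
value the ref had when the lock was acquired.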