Re: [PATCH v3 8/8] reflog expire: don't assert the OID when locking refs

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 21 Mar 2019 15:10:08 +0100

On Tue, Mar 19 2019, Michael Haggerty wrote:

> Thanks for your work and for your thorough explanation of the change!

Hi. Yes, thanks a lot for the feedback. Just hadn't gotten around to
looping back to this yet & digging into the issue you raised.

> On 3/15/19 4:59 PM, Ævar Arnfjörð Bjarmason wrote:
>> During reflog expiry, the cmd_reflog_expire() function first iterates
>> over all reflogs in logs/*, and then one-by-one acquires the lock for
>> each one to expire its reflog by getting a *.lock file on the
>> corresponding loose ref[1] (even if the actual ref is packed).
>>
>> This lock is needed, but what isn't needed is locking the loose ref as
>> a function of the OID we found from that first iteration. By the time
>> we get around to re-visiting the reference some of the OIDs may have
>> changed.
>
> Instead of "what isn't needed is locking the loose ref as a function of
> the OID we found from that first iteration", I suggest "what isn't
> needed is to insist that the reference still has the OID that we found
> in that first iteration".
>
>> Thus the verify_lock() function called by the lock_ref_oid_basic()
>> function being changed here would fail with e.g. "ref '%s' is at %s
>> but expected %s" if the repository was being updated concurrent to the
>> "reflog expire".
>>
>> By not passing the OID to it we'll try to lock the reference
>> regardless of it last known OID. Locking as a function of the OID
>
> s/it/its/
>
>> would make "reflog expire" exit with a non-zero exit status under such
>> contention, which in turn meant that a "gc" command (which expires
>> reflogs before forking to the background) would encounter a hard
>> error.
>
> The last sentence seems mostly redundant with the previous paragraph.
>
>> This behavior of considering the OID when locking has been here ever
>> since "reflog expire" was initially implemented in 4264dc15e1 ("git
>> reflog expire", 2006-12-19). As seen in that simpler initial version
>> of the code we subsequently use the OID to inform the expiry (and
>> still do), but never needed to use it to lock the reference associated
>> with the reflog.
>>
>> By locking the reference without considering what OID we last saw it
>> at, we won't encounter user-visible contention to the extent that
>> core.filesRefLockTimeout mitigates it. See 4ff0f01cb7 ("refs: retry
>> acquiring reference locks for 100ms", 2017-08-21).
>>
>> Unfortunately this sort of probabilistic contention is hard to turn
>> into a test. I've tested this by running the following three subshells
>> in concurrent terminals:
>>
>>     (
>>         cd /tmp &&
>>         rm -rf git &&
>>         git init git &&
>>         cd git &&
>>         while true
>>         do
>>             head -c 10 /dev/urandom | hexdump >out &&
>>             git add out &&
>>             git commit -m"out"
>>         done
>>     )
>>
>>     (
>>         cd /tmp &&
>>         rm -rf git-clone &&
>>         git clone file:///tmp/git git-clone &&
>>         cd git-clone &&
>>         while git pull
>>         do
>>             date
>>         done
>>     )
>>
>>     (
>>         cd /tmp/git-clone &&
>>         while git reflog expire --all
>>         do
>>             date
>>         done
>>     )
>>
>> Before this change the "reflog expire" would fail really quickly with
>> a "but expected" error. After this change both the "pull" and "reflog
>> expire" will run for a while, but eventually fail because I get
>> unlucky with core.filesRefLockTimeout (the "reflog expire" is in a
>> really tight loop). That can be resolved by being more generous with
>> higher values of core.filesRefLockTimeout than the 100ms default.
>>
>> As noted in the commentary being added here we also need to handle the
>> case of references being racily deleted, that can be tested by adding
>> this to the above:
>>
>>     (
>>         cd /tmp/git-clone &&
>>         while git branch topic master && git branch -D topic
>>         do
>>             date
>>         done
>>     )
>>
>> We could change lock_ref_oid_basic() to always pass down
>> RESOLVE_REF_READING to refs_resolve_ref_unsafe() and then
>> files_reflog_expire() to detect the "is it deleted?" state. But let's
>> not bother, in the event of such a race we're going to redundantly
>> create a lock on the deleted reference, and shortly afterwards handle
>> that case and others with the refs_reflog_exists() check.
>>
>> 1. https://public-inbox.org/git/54857871.5090805@xxxxxxxxxxxx/
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
>> ---
>>  refs/files-backend.c | 15 ++++++++++++++-
>>  1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/refs/files-backend.c b/refs/files-backend.c
>> index ef053f716c3..c7ed1792b3b 100644
>> --- a/refs/files-backend.c
>> +++ b/refs/files-backend.c
>> @@ -3036,8 +3036,14 @@ static int files_reflog_expire(struct ref_store *ref_store,
>>  	 * The reflog file is locked by holding the lock on the
>>  	 * reference itself, plus we might need to update the
>>  	 * reference if --updateref was specified:
>> +	 *
>> +	 * We don't pass down the oid here because we'd like to be
>> +	 * tolerant to the OID of the ref having changed, and to
>> +	 * gracefully handle the case where it's been deleted (see oid
>> +	 * -> mustexist -> RESOLVE_REF_READING in
>> +	 * lock_ref_oid_basic()) ...
>>  	 */
>> -	lock = lock_ref_oid_basic(refs, refname, oid,
>> +	lock = lock_ref_oid_basic(refs, refname, NULL,
>>  				  NULL, NULL, REF_NO_DEREF,
>>  				  &type, &err);
>
> This seems totally reasonable. But then later, where `oid` is passed to
> `(*prepare_fn)()`, I think you must pass `&(lock->old_oid)` instead,
> since we no longer have a guarantee that `oid` reflects the correct
> state of the reference. And after that, there is no need for this
> function to take an `oid` parameter at all (which also makes sense from
> an abstract point of view). Which means that the signatures of
> `refs_reflog_expire()`, `reflog_expire()`, `packed_reflog_expire()`, and
> `reflog_expire_fn` can also be changed, along with callers.
>
> I haven't had time yet to inspect those callers to see whether they
> might actually care that the `oid` that they used to pass to
> `reflog_expire()` isn't necessarily the one that gets passed back to
> their callbacks, but following the trail that I just outlined should
> make it possible to determine that.
>
>>  	if (!lock) {
>> @@ -3045,6 +3051,13 @@ static int files_reflog_expire(struct ref_store *ref_store,
>>  		strbuf_release(&err);
>>  		return -1;
>>  	}
>> +	/*
>> +	 * When refs are deleted their reflog is deleted before the
>> +	 * loose ref is deleted. This catches that case, i.e. when
>> +	 * racing against a ref deletion lock_ref_oid_basic() will
>> +	 * have acquired a lock on the now-deleted ref, but here's
>> +	 * where we find out it has no reflog anymore.
>> +	 */
>>  	if (!refs_reflog_exists(ref_store, refname)) {
>>  		unlock_ref(lock);
>>  		return 0;
>>
>
> Cheers,
> Michael