Re: [PATCH] Retry acquiring reference locks for 100ms

On Wed, Aug 23, 2017 at 11:55 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Michael Haggerty <mhagger@xxxxxxxxxxxx> writes:
>
>> The philosophy of reference locking has been, "if another process is
>> changing a reference, then whatever I'm trying to do to it will
>> probably fail anyway because my old-SHA-1 value is probably no longer
>> current". But this argument falls down if the other process has locked
>> the reference to do something that doesn't actually change the value
>> of the reference, such as `pack-refs` or `reflog expire`. There
>> actually *is* a decent chance that a planned reference update will
>> still be able to go through after the other process has released the
>> lock.
>
> The reason why these 'read-only' operations take locks is because
> they want to ensure that other people will not mess with the
> anchoring points of the history they base their operation on while
> they do their work, right?

In the case of `pack-refs`, after it makes packed versions of the
loose references, it needs to lock each loose reference before pruning
it, so that it can verify in a non-racy way that the loose reference
still has the same value as the one it just packed.
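
To make that concrete, here is a minimal standalone sketch of the
pattern (plain POSIX file calls, not git's actual internals; the
function name and paths are made up for illustration):

    /*
     * Illustrative sketch only: prune a loose ref after packing it,
     * but only while holding its lock and only if it still has the
     * value we just wrote to packed-refs.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int prune_loose_ref(const char *ref_path, const char *lock_path,
                               const char *packed_value)
    {
        char buf[64];
        ssize_t len;
        int lock_fd, ref_fd, ret = -1;

        /* O_CREAT|O_EXCL fails if another process already holds the lock. */
        lock_fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
        if (lock_fd < 0)
            return -1;              /* lock busy: leave the loose ref alone */

        ref_fd = open(ref_path, O_RDONLY);
        if (ref_fd >= 0) {
            len = read(ref_fd, buf, sizeof(buf) - 1);
            close(ref_fd);
            if (len > 0) {
                buf[len] = '\0';
                /* Prune only if the value is still the one we packed. */
                if (!strncmp(buf, packed_value, strlen(packed_value)))
                    ret = unlink(ref_path);
            }
        }

        close(lock_fd);
        unlink(lock_path);          /* release the lock */
        return ret;
    }

If another process holds the lock, or the loose value has changed
since packing, the loose file is simply left alone.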

In the case of `reflog expire`, it locks the reference because that
implies a lock on the reflog file, which it needs to rewrite. (Reflog
files don't have their own locks.) Otherwise it could inadvertently
overwrite a reflog entry added by another process while the file is
being rewritten.
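
Schematically, and again only as an illustration rather than git's
real code (should_expire() and the temp-file handling are
hypothetical), the rewrite under the ref lock looks something like:

    #include <stdio.h>

    /*
     * Illustrative sketch only: rewrite the reflog while the ref lock
     * is held, dropping entries that should expire.
     */
    static int rewrite_reflog(const char *log_path, const char *tmp_path,
                              int (*should_expire)(const char *line))
    {
        char line[1024];
        FILE *in, *out;

        in = fopen(log_path, "r");
        if (!in)
            return -1;
        out = fopen(tmp_path, "w");
        if (!out) {
            fclose(in);
            return -1;
        }

        /* Copy over only the entries we want to keep. */
        while (fgets(line, sizeof(line), in))
            if (!should_expire(line))
                fputs(line, out);

        fclose(in);
        fclose(out);

        /* Atomically replace the old reflog with the filtered one. */
        return rename(tmp_path, log_path);
    }

Since appending to the reflog also requires the ref lock, nothing can
be added to the file between the read and the rename.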

>> So when trying to lock an individual reference (e.g., when creating
>> "refs/heads/master.lock"), if it is already locked, then retry the
>> lock acquisition for approximately 100 ms before giving up. This
>> should eliminate some unnecessary lock conflicts without wasting a lot
>> of time.
>>
>> Add a configuration setting, `core.filesRefLockTimeout`, to allow this
>> setting to be tweaked.
>
> I suspect that this came from real-life needs of a server operator.
> What numbers should I be asking to justify this change? ;-)
>
> "Without this change, 0.4% of pushes used to fail due to losing the
> race against periodic GC, but with this, the rate went down to 0.2%,
> which is 50% improvement!" or something like that?

We've had a patch like this deployed to our servers for quite some
time, so I don't remember the exact figures. But I think we were
seeing maybe 10-50 such errors every day across our whole
infrastructure (literally on the order of one in a million updates).
That number went basically to zero after retries were added.

It's also not particularly serious when the race happens: the
reference update is rejected, but it is rejected cleanly.

So it's definitely a rare race, and probably only of any interest at
all on a high-traffic Git server. OTOH the cure is pretty simple, so
it seems worth fixing.
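
For what it's worth, the retry being proposed is conceptually just a
bounded loop around the O_CREAT|O_EXCL attempt. A hedged sketch (not
the actual lockfile code, which uses a backoff schedule; this just
polls every 10 ms):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Keep retrying the exclusive create for roughly timeout_ms
     * milliseconds before giving up.
     */
    static int lock_with_retry(const char *lock_path, long timeout_ms)
    {
        long waited_ms = 0;

        for (;;) {
            int fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0666);
            if (fd >= 0)
                return fd;          /* got the lock */
            if (errno != EEXIST || waited_ms >= timeout_ms)
                return -1;          /* hard error or timed out */
            usleep(10 * 1000);      /* wait 10 ms, then try again */
            waited_ms += 10;
        }
    }

With the patch, the timeout defaults to 100 ms, and a busier server
could raise it with something like `git config core.filesRefLockTimeout
200` (the value is in milliseconds).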

Michael


