On Wed, Aug 23, 2017 at 11:55 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Michael Haggerty <mhagger@xxxxxxxxxxxx> writes:
>
>> The philosophy of reference locking has been, "if another process is
>> changing a reference, then whatever I'm trying to do to it will
>> probably fail anyway because my old-SHA-1 value is probably no longer
>> current". But this argument falls down if the other process has locked
>> the reference to do something that doesn't actually change the value
>> of the reference, such as `pack-refs` or `reflog expire`. There
>> actually *is* a decent chance that a planned reference update will
>> still be able to go through after the other process has released the
>> lock.
>
> The reason why these 'read-only' operations take locks is because
> they want to ensure that other people will not mess with the
> anchoring points of the history they base their operation on while
> they do their work, right?

In the case of `pack-refs`, after it makes packed versions of the loose
references, it needs to lock each loose reference before pruning it, so
that it can verify in a non-racy way that the loose reference still has
the same value as the one it just packed.

In the case of `reflog expire`, it locks the reference because that
implies a lock on the reflog file, which it needs to rewrite. (Reflog
files don't have their own locks.) Otherwise it could inadvertently
overwrite a new reflog entry that is added by another process while it
is rewriting the file.

>> So when trying to lock an individual reference (e.g., when creating
>> "refs/heads/master.lock"), if it is already locked, then retry the
>> lock acquisition for approximately 100 ms before giving up. This
>> should eliminate some unnecessary lock conflicts without wasting a lot
>> of time.
>>
>> Add a configuration setting, `core.filesRefLockTimeout`, to allow this
>> setting to be tweaked.
>
> I suspect that this came from real-life needs of a server operator.
> What numbers should I be asking to justify this change? ;-)
>
> "Without this change, 0.4% of pushes used to fail due to losing the
> race against periodic GC, but with this, the rate went down to 0.2%,
> which is 50% improvement!" or something like that?

We've had a patch like this deployed to our servers for quite some
time, so I don't remember too accurately. But I think we were seeing
maybe 10-50 such errors every day across our whole infrastructure
(we're talking like literally one in a million updates). That number
went basically to zero after retries were added.

It's also not particularly serious when the race happens: the reference
update is rejected, but it is rejected cleanly.

So it's definitely a rare race, and probably only of any interest at
all on a high-traffic Git server. OTOH the cure is pretty simple, so it
seems worth fixing.

Michael
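
P.S. For a server operator who wants to experiment with this knob, a
minimal sketch of how it might be tuned, assuming (as the ~100 ms
default in the patch description suggests) that the value is given in
milliseconds; the 300 here is just an illustrative number, not a
recommendation:

    # Retry a contended per-reference lock for up to ~300 ms instead
    # of the ~100 ms default before giving up (example value only).
    git config core.filesRefLockTimeout 300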