Re: [PATCH v4 0/3] Make update refs more atomic

On 04/16/2014 09:31 PM, Junio C Hamano wrote:
> Ronnie Sahlberg <sahlberg@xxxxxxxxxx> writes:
> 
>> Currently any locking of refs in a transaction only happens during the commit
>> phase. I think it would be useful to have a mechanism where you could
>> optionally take out locks for the involved refs early during the transaction.
>> So that simple callers could continue using
>> ref_transaction_begin()
>> ref_transaction_create|update|delete()*
>> ref_transaction_commit()
>>
>> but a caller such as walker_fetch() could opt to do
>> ref_transaction_begin()
>> ref_transaction_lock_ref()*
>> ...do stuff...
>> ref_transaction_create|update|delete()*
>> ref_transaction_commit()
>>
>> In this second case ref_transaction_commit() would only take out any locks that
>> are missing during the "lock the refs" loop.
>>
>> Suggestion 1: Add a ref_transaction_lock_ref() to allow locking a ref
>> early during a transaction.
> 
> Hmph.
> 
> I am not sure if that is the right way to go, or instead change all
> create/update/delete to take locks without adding a new primitive.

Junio's suggestion seems like a good idea to me.  Obviously, as soon as
we take out a lock we could also do any applicable old_sha1 check and
possibly fail fast.
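
For concreteness, here is roughly what an eagerly-locking update
primitive could look like.  This is only a sketch: lock_ref() is a
made-up placeholder (not the real refs.c locking function), and the
ref_update/ref_lock fields are assumed.

    /*
     * Sketch only: lock_ref() and the struct fields are placeholders,
     * not the real refs.c API.
     */
    int ref_transaction_update(struct ref_transaction *transaction,
                               const char *refname,
                               const unsigned char *new_sha1,
                               const unsigned char *old_sha1,
                               int flags, int have_old)
    {
            struct ref_update *update = add_update(transaction, refname);

            /* Take the lock now instead of waiting for commit time... */
            update->lock = lock_ref(refname);
            if (!update->lock)
                    return error("could not lock ref %s", refname);

            /* ...so an old_sha1 mismatch can fail fast, right here. */
            if (have_old && hashcmp(update->lock->old_sha1, old_sha1))
                    return error("ref %s did not have expected value",
                                 refname);

            hashcpy(update->new_sha1, new_sha1);
            update->flags = flags;
            update->have_old = have_old;
            return 0;
    }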

Does a "verify" operation require holding a lock?  If not, when is the
verification done--early, or during the commit, or both?  (I realize
that we don't yet *have* a verify operation at the API level, but we
should!)
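
If we do add one, I imagine the signature would be something like the
following (purely illustrative):

    /*
     * Illustrative sketch: assert that refname currently has the
     * value old_sha1, without scheduling any write.  Whether this
     * needs a lock, and whether the check runs now or at commit time
     * (or both), is exactly the open question above.
     */
    int ref_transaction_verify(struct ref_transaction *transaction,
                               const char *refname,
                               const unsigned char *old_sha1);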

We also need to think about how long we have to hold the packed-refs
lock.

Finally, we shouldn't forget that currently the reflog files are locked
by holding the lock on the corresponding loose reference file.  Do we
need to integrate these files into our system any more than they
currently are?

[By the way, I noticed the other day that the command

    git reflog expire --stale-fix --expire-unreachable=now --all

can hold loose reference locks for a *long* time (like 10s of minutes),
especially in weird cases like the repository having 9000 packfiles for
some reason or another :-)  The command execution time grows strongly
with the length of the reference's log, or maybe even the square of the
length, assuming the history is roughly linear and reachability is
computed separately for each SHA-1.  This is just an empirical
observation so far; I haven't looked into the code yet.]

>> A second idea is to change the signatures for
>> ref_transaction_create|update|delete()
>> slightly and allow them to return errors early.
>> We can check for things like add_update() failing, check that the
>> ref-name looks sane, check some of the flags (e.g. if has_old==true
>> then the old sha1 should not be NULL or 0{40}), etc.
>>
>> Additionally, for robustness, if any of these functions detects an
>> error we can flag this in the transaction structure and take action
>> during ref_transaction_commit(), i.e. if a ref_transaction_update()
>> had a hard failure, do not allow ref_transaction_commit() to succeed.
>>
>> Suggestion 2: Change ref_transaction_create|delete|update() to return an int.
>> All callers that use these functions should check the return value
>> for errors.
> 
> I think that is a very sensible thing to do.
> 
> The details of determining "this cannot possibly succeed" may change
> (for example, if we have them take locks at the point of
> create/delete/update, a failure to lock may count as an early
> error).
> 
> Is there any reason why this should be conditional (i.e. you said
> "allow them to", implying that the early failure is optional)?

Also a good suggestion.  We should make it clear in the documentation
that the create/delete/update functions are not *obliged* to return an
error (for example, when the current value of the reference does not
agree with old_sha1), because future alternate ref backends might not
be able to make such checks until the commit phase.  For
example, checking old_sha1 might involve a round-trip to a database or
remote repository, in which case it might be preferable to check all
values in a single round-trip upon commit.

So, callers might be informed early of problems, or they might only
learn about problems when they try to commit the transaction.  They have
to be able to handle either type of error reporting.
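
Concretely, a careful caller would end up looking something like the
following (signatures simplified; ref_transaction_rollback() is an
assumed name for whatever aborts a transaction):

    struct ref_transaction *t = ref_transaction_begin();

    if (ref_transaction_update(t, "refs/heads/master",
                               new_sha1, old_sha1, 0, 1)) {
            /* Early report: bad refname, lock failure, stale
             * old_sha1, etc. noticed by the backend right away. */
            ref_transaction_rollback(t);
            return -1;
    }
    if (ref_transaction_commit(t)) {
            /* Late report: a backend that batches its checks into
             * a single round-trip may only fail here. */
            ref_transaction_rollback(t);
            return -1;
    }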

So then the question arises (and maybe this is what Ronnie was getting
at by suggesting that the checks might be conditional): are callers
*obliged* to check the return values from create/delete/update, or are
they allowed to just keep throwing everything into the transaction,
ignoring errors, and only check the result of ref_transaction_commit()?

I don't feel strongly one way or the other about this question.  It
might be nice to be able to write callers sloppily, but it would cost a
bit more code in the reference backends.  Though maybe it wouldn't even
be much extra code, given that we would probably want to put consistency
checks in there anyway.
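
The backend half of the "sloppy caller" style really is small;
something like this (the ->error field and record_error() are names I
just invented for the sketch):

    /* Remember the first hard failure and refuse to commit. */
    struct ref_transaction {
            struct ref_update **updates;
            size_t nr, alloc;
            int error;              /* first hard failure seen, or 0 */
    };

    static int record_error(struct ref_transaction *t, int err)
    {
            if (!t->error)
                    t->error = err; /* flag it in the transaction... */
            return err;             /* ...and report it, if anyone looks */
    }

    int ref_transaction_commit(struct ref_transaction *t)
    {
            if (t->error)
                    return t->error; /* sloppy callers still fail safely */
            /* ... lock, check, and write the refs as we do today ... */
            return 0;
    }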

>> Suggestion 3: Remove the qsort and check for duplicates in
>> ref_transaction_commit().
>> Since we are already taking out a lock for each ref we are updating
>> during the transaction, any duplicate refs will fail the second
>> attempt to lock the same ref, which implicitly makes sure that a
>> transaction will not change the same ref twice.
> 
> I do not know if I care about the implementation detail of "do we
> have a unique set of update requests?".  While I do not see a strong
> need for one transaction to touch the same ref twice (e.g. create to
> point at commit A and update it to point at commit B), I do not see
> why we should forbid such a use in the future.

I agree.  For example, we might very well want to allow multiple updates
to a single reference, as long as they are mutually consistent, to be
coalesced into a single update.  I also expect that the error message in
the current code is more illuminating than the generic error message
that would result from the inability to lock a reference (because it is
already locked by an earlier update in the same transaction!).  Let's
leave this aspect the way it is now and revisit the topic later when we
have learned more.

I see, though, that early locking will be a little tricky to implement.
Currently we sort all of the updates and look for duplicates
*before* we acquire any locks.  But if we want to acquire locks
immediately, then (since the list isn't sorted yet) it will be expensive
to look for duplicates before attempting to lock.  I can see two
possibilities:

1. Use a different data structure, like a hash map, for storing updates,
so that we have a cheap way to look up whether there is already an
update involving a reference.

2. Try to lock the reference, and if the lock fails *then* look back
through the list to see if we're the ones holding the lock already
(sketched below).
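
For what it's worth, the second possibility might look roughly like
this (lock_ref() and the struct fields are again invented for the
sketch):

    /* Does an update *earlier* in the transaction already cover the
     * same refname as u? */
    static struct ref_update *find_earlier_update(struct ref_transaction *t,
                                                  struct ref_update *u)
    {
            size_t i;
            for (i = 0; i < t->nr && t->updates[i] != u; i++)
                    if (!strcmp(t->updates[i]->refname, u->refname))
                            return t->updates[i];
            return NULL;
    }

    static int lock_for_update(struct ref_transaction *t,
                               struct ref_update *u)
    {
            u->lock = lock_ref(u->refname);
            if (u->lock)
                    return 0;       /* got the lock: the common case */

            /* Locking failed: is the "conflicting" holder actually an
             * earlier update in this very transaction? */
            if (find_earlier_update(t, u))
                    return error("multiple updates for ref %s",
                                 u->refname);
            return error("could not lock ref %s", u->refname);
    }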

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
