Re: [PATCH 00/17] Remove assumptions about refname lifetimes

Jeff King <peff@xxxxxxxx> · Mon, 20 May 2013 12:59:42 -0400

On Mon, May 20, 2013 at 09:37:30AM -0700, Junio C Hamano wrote:

> Johan Herland <johan@xxxxxxxxxxx> writes:
> 
> > For server-class installations we need ref storage that can be read
> > (and updated?) atomically, and the current system of loose + packed
> > files won't work since reading (and updating) more than a single file
> > is not an atomic operation. Trivially, one could resolve this by
> > dropping loose refs, and always using a single packed-refs file, but
> > that would make it prohibitively expensive to update refs (the entire
> > packed-refs file must be rewritten for every update).
> >
> > Now, observe that we don't have these race conditions in the object
> > database, because it is an add-only immutable data store.
> >
> > What if we stored the refs as a tree object in the object database,
> > referenced by a single (loose) ref?
> 
> What is the cost of updating a single branch with that scheme?

Yeah, I had the same thoughts as you here; I don't think this is really
much better than updating a single packed-refs file. You may only
have to touch log(N) of the entries to update the trees along the path
to your ref (assuming you have a bushy refs/ tree; in practice, of
course, you have a lot of entries in refs/heads/, so you do not quite
get log behavior). So algorithmically it may be slightly better, but you
pay a much higher constant, in that you are creating many tree objects
along the path (and reading them on lookup).

But more importantly, it introduces contention between two unrelated
refs that are being updated. Even if we reconcile the differences
automatically (e.g., with a "merge-and-retry" strategy), that is going
to be a serious performance regression for a busy repository, as we
repeatedly try to reconcile the serialized updates to the refs/ root
tree.

Any transactional solution needs to have the equivalent of ref-level
locking (i.e., row-level locking in a relational setting). It is OK for
two simultaneous updates to different rows to happen in random order; we
don't care about that level of atomicity. The important thing is that
the _readers_ see transactions in consistent order; if we know that
update B started after update A finished, then we must never see update
A without update B reflected. And I think that is more or less a solved
problem in the database world.

That is what prevents:

  git update-ref A B
  git update-ref -d B

from creating a view where the reader sees neither A nor B (which is
what can happen with the current filesystem view).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html