Re: RefTree: Alternate ref backend

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Tue, 22 Dec 2015 18:17:28 +0100

On 12/22/2015 05:11 PM, Shawn Pearce wrote:
> On Tue, Dec 22, 2015 at 7:41 AM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>> On 12/17/2015 10:02 PM, Shawn Pearce wrote:
>>> I started playing around with the idea of storing references directly
>>> in Git. Exploiting the GITLINK tree entry, we can associate a name to
>>> any SHA-1.
>> [...]
>> I'm curious why you decided to store all of the references in a single
>> list, similar to the packed-refs file. [...]
> 
> I did use tree objects for each directory component. The ls-tree I
> showed was an ls-tree -r.

Silly me. Of course that was clear from your post, and I just overlooked it.

>> Another reason that I find a hierarchical layout intriguing would be
>> that one could imagine using the SHA-1s of reference namespace subtrees
>> to speed up the negotiation phase of "git fetch". [...]
> 
> Yes. Martin Fick and I were discussing a strategy like this at the
> Gerrit User Summit in November. I totally forgot about it when I
> started this thread, but I'm glad you independently proposed it. Maybe
> its not a horrible idea!  :)
> 
> One problem is clients don't mirror the heads tree exactly; they add
> in HEAD as a symbolic reference in a way that the remote peer doesn't
> have. Minor detail.

Yes, and this is a side effect of leaving out a layer of the remote
reference namespace in the local refs/remotes layout. Naively one would
expect

    refs/remotes/origin/HEAD
    refs/remotes/origin/refs/heads/master

etc. But we store branches into the main "refs/remotes/origin/"
namespace, leaving no reserved space for the remote "HEAD" (not to
mention other namespaces that might appear on the remote, such as
"refs/changes/*", "refs/pull/*", a separate record of the remote's
"refs/tags/*", etc).

Maybe that is why my gut reaction to your proposal to elide the "refs"
part of the reference hierarchy and store "HEAD" as (effectively)
"refs/..HEAD" was negative, even though I can't think of any practical
objections.

At a deeper level, the "refs/" part of reference names is actually
pretty useless in general. I suppose it originated in the practice of
storing loose references under "refs/" to keep them separate from other
metadata in $GIT_DIR. But really, aside from slightly helping
disambiguate references from paths in the command line, what is it good
for? Would we really be worse off if references' full names were

    HEAD
    heads/master
    tags/v1.0.0
    remotes/origin/master (or remotes/origin/heads/master)

etc? This notation is already recognized in most places (though not in
"update-ref"). I think your decision to elide "refs/" in the reftree
hierarchy is a reflection of its uselessness. In any case, your decision
is much less questionable than the decision to mash "refs/heads/*" all
the way up to the top level like we do in "refs/remotes/".

> Martin and I were really thinking about server-server negotiation more
> than client-server. [...]

Yes, that's also an interesting application.

>> There are a lot of "if"s in that last paragraph, and maybe it's not
>> workable. For example, if I'm not pruning on fetch, then my reference
>> tree won't be identical to one that was ever present on the server and
>> this technique wouldn't necessarily help. But if, for example, we change
>> the default to pruning, or perhaps record some extra reftree SHA-1's,
>> then in most cases I would expect that this trick could reduce the
>> effort of negotiation to negligible in most cases, and reduce the time
>> of the whole fetch to negligible in the case that the clone is already
>> up-to-date.
> 
> Right, maybe the client just remember's the server's reftree SHA-1 and
> offers it back on reconnection. The server can then diff between the
> two reftrees and shows the client only refs that were modified that
> the client cares about.

The client not only has to remember the server's reftree, but also must
verify that it still has all of the objects implied by that reftree, in
case a reference somehow got deleted under "refs/remotes/origin/*". At
that point, there is no special reason to use a SHA-1 in the
negotiation; any unique token generated by the server would suffice if
the server can connect it back to a set of references that was sent to
the client in the past.

The advantage of using hierarchical reftree SHA-1s in the negotiation is
that they can be used to name part of the reftree. For example, if I
fetch "refs/heads/*" from a remote but not "refs/changes/*", then what
do I report as my "have-tree"? I can't claim to have *all* of the
references that the remote had at the time. But with SHA-1s I can say
that I have the reftree that corresponds to my
"refs/remotes/origin/heads/", which the remote can notice is identical
to an old reftree that it happened to have for "refs/heads/" (without
even caring what path it represented). Bingo, we've just agreed about a
big part of the reference namespace without having to agree about the
whole namespace.

In practice, in my first "haves" announcement I would probably list a
few "famous" namespaces in the hope that one or more of them are
recognized by the server:

    have-tree <SHA-1 for "refs/">
    have-tree <SHA-1 for "refs/heads/">
    have-tree <SHA-1 for "refs/tags/">
    have-tree <SHA-1 for "refs/remotes/origin/heads/">
    have-tree <SHA-1 for "refs/remotes/other/heads/">

> [...]
> FWIW, JGit is able to scan the canonical trees out of a pack file and
> inflate them in approximately the same time it takes to scan the
> packed-refs file for some 70k references. So we don't really slow down
> much to use this. And there's huge gains to be had by taking advantage
> of the tree structure and only inflating the components you need to
> answer a particular read.

Yes, that's another nice aspect of the design.

I do worry a bit that the hierarchical storage only helps if people
shard their reference namespace reasonably. Somebody who stores 100k
references in a single reference "directory" (imagine a
"refs/ci-tests/*") is going to suffer from expensive reference update
performance. But I guess they will suffer from poor performance within
Git as well, and that will probably encourage them to improve their
practices :-) I suppose this is not really much different than people
who store 100k files within a single directory of their working tree.

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html