Re: reftable [v4]: new ref storage format

Jeff King <peff@xxxxxxxx> · Wed, 2 Aug 2017 05:28:46 -0400

On Tue, Aug 01, 2017 at 07:38:37PM -0700, Shawn Pearce wrote:

> > OBJS blocks can also be
> > unbounded in size if very many references point at the same object,
> > thought that is perhaps only a theoretical problem.
> 
> Gah, I missed that in reftable. The block id pointer list could cause
> a single object id to exceed what fits in a block, and that will cause
> the writer to fail unless its caller sets the block size larger. I
> basically assumed this overflow condition is very unlikely, as its not
> common to have a huge number of refs pointing to the same object.

It's actually quite common for us, as we have big shared-object repos
that contain a copy of the refs of all of their child repos (for
reachability during packing, etc). So tags, where the value is the same
in each fork, you have one ref per fork pointing to it.

Just peeking at torvalds/linux, we have some objects with ~35K refs
pointing to them (e.g., the v2.6.11 tag).

> > Peff and I discussed off-list whether the lookup-by-SHA-1 feature is
> > so important in the first place. Currently, all references must be
> > scanned for the advertisement anyway,
> 
> Not really. You can hide refs and allow-tip-sha1 so clients can fetch
> a ref even if it wasn't in the advertisement. We really want to use
> that wire protocol capability with Gerrit Code Review to hide the
> refs/changes/ namespace from the advertisement, but allow clients to
> fetch any of those refs if they send its current SHA-1 in a want line
> anyway.
> 
> So a server could scan only the refs/{heads,tags}/ prefixes for the
> advertisement, and then leverage the lookup-by-SHA1 to verify other
> SHA-1s sent by the client.

Yeah, that makes sense (though I hope in the end that strategy will go
away in favor of a better protocol, as getting the sha1 out-of-band has
obvious UX complexities).

> > OTOH a mythical protocol v2 might reduce the need to scan the
> > references for advertisement, so maybe this optimization will be more
> > helpful in the future?
> 
> Yes, I'm hopeful we can get a v2 protocol built on the work Jonathan
> Tan is doing, and switch the advertisement around to "client speaks
> first", so that we can be smarter on the server about which refs are
> read and sent. That is a long way off, lets say 3-5 years before its
> really common in clients.

I was actually planning to spend some time on this in the next month or
two. I don't think it needs to be that complicated. We don't need a
whole protocol revamp. We just need a way to get a few bits from the
client before the advertisement, and from there we can bootstrap any
more radical protocol changes we want.

I know it will take a while before it's something we can expect in
clients, but it's definitely worth planning around. And sometimes a
feature like this can drive upgrades, if it's something that produces an
immediate and obvious benefit to the client.

> Servers today don't update HEAD reflog when a branch is pushed. I
> think trying to record that is overkill, you have the reflog data in
> the ref itself that the client sent the command to modify.

I think they do, at least for C git:

  $ git init --bare dst.git
  $ git -C dst.git config core.logallrefupdates
  $ git push dst.git
  ...
  To dst.git
   * [new branch]      master -> master
  $ find dst.git/logs -type f | xargs wc -l
    1 dst.git/logs/refs/heads/master
    1 dst.git/logs/HEAD

The special logic for "see if we're updating the ref that HEAD points
to" is deep in the ref machinery, so it gets triggered for all updates,
including pushes.

I agree it's not actually that interesting for a bare repo, where HEAD
isn't that meaningful (and doesn't tend to change a lot anyway).

> > That's what I was thinking. But I've yet to hear anybody complain
> > about missing reflogs for symrefs if the underlying reference is
> > updated directly, so maybe we should just leave that "problem"
> > unsolved. It is certainly simpler and less brittle not to have to keep
> > backreferences like these in sync with the forward references.
> 
> Yup, that is my take on it, and why I didn't try to put this into
> reftable drafts, even though it was discussed between us on the list
> in earlier messages.

Yeah, I'd agree. It might be worth doing a better job of showing the
before/after destinations in the reflog when updating a symbolic ref,
which would let you reconstruct the state from the pointed-to reflogs if
you cared to. But that's orthogonal to the storage format (you can do it
already if you bother to pass a good message to "symbolic-ref -m").

-Peff