Re: RFC: Another proposed hash function transition plan

Jeff King <peff@xxxxxxxx> · Mon, 6 Mar 2017 04:43:34 -0500

On Mon, Mar 06, 2017 at 10:29:33AM +0100, ankostis wrote:

> On 5 March 2017 at 12:02, David Lang <david@xxxxxxx> wrote:
> >> Translation table
> >> ~~~~~~~~~~~~~~~~~
> >> A fast bidirectional mapping between sha1-names and sha256-names of
> >> all local objects in the repository is kept on disk. The exact format
> >> of that mapping is to be determined.
> >>
> >> All operations that make new objects (e.g., "git commit") add the new
> >> objects to the translation table.
> >
> >
> > This seems like a rather nontrivial thing to design. It will need to hold
> > millions of mappings, and be quickly searchable from either direction
> > (sha1->new and new->sha1) while still being fairly fast to insert new
> > records into.
> >
> > For Linux, just the list of hashes recording the commits is going to be in
> > the millions, while the list of hashes of individual files for all those
> > commits is going to be substantially larger.
> 
> Apologies if it is a stupid idea, but could we avoid the mappings-table
> just by hard-linking to the same object from both (or more) hashes?
> So instead of creating a text-db format, just use the filesystem.

No, for a few reasons:

  1. Most of these objects will not be in the filesystem at all, but
     rather in a packfile.

  2. It's not just a different hash over the same bytes. The sha256-name
     is taken over the sha256-content (which refers to other objects
     using sha256). So they really are different objects. You probably
     wouldn't keep the sha1 version around separately, but rather
     generate it on the fly during a push to a sha1 server.

  3. You really need to be able to take a sha256 name and convert it to
     a sha1 and vice versa. Hardlinks don't help with that, because they
     only point in one direction. That gets you to the same _content_,
     but not the other name (and I guess this is where your "look up the
     name and then compute the other digest" comes in, but that's
     probably too expensive to be workable).
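
To make point 2 concrete, here is a minimal Python sketch (not Git's
code). It assumes the usual object framing of "<type> <size>\0" followed
by the content, and tree entries of "<mode> <name>\0" followed by the raw
binary name of the entry. The helpers object_name() and tree_content()
are hypothetical names made up for this illustration.

```python
import hashlib

def object_name(algo, obj_type, content):
    # Hash "<type> <size>\0" + content, the way object names are derived.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.new(algo, header + content).digest()

def tree_content(algo, entries):
    # Each entry embeds the raw name of the blob it points to, so the
    # bytes of the tree depend on which hash function names the blob.
    out = b""
    for mode, name, blob in sorted(entries, key=lambda e: e[1]):
        out += mode + b" " + name + b"\0" + object_name(algo, "blob", blob)
    return out

entries = [(b"100644", b"hello.txt", b"hello\n")]
sha1_tree = tree_content("sha1", entries)
sha256_tree = tree_content("sha256", entries)

# Same file, same mode, same filename, yet the two trees are different
# byte strings of different lengths.
print(len(sha1_tree), len(sha256_tree))
print(sha1_tree == sha256_tree)
```

The blob bytes are identical under both hashes, but the tree that refers
to the blob is not, so a hardlink from one name to the other would point
at the wrong content for anything above the blob level.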

I do think updating the mapping could potentially be deferred until
interacting with a sha1 server. But because it needs to be generated in
reverse-topological order, it's conceptually easier to do it one object
at a time.
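
The ordering constraint can be sketched like this, again in Python and
under loud assumptions: the store is a plain dict, the serialization
(payload, NUL, then raw child names) is a toy format rather than Git's,
and name(), add_object() and translate() are hypothetical helpers.

```python
import hashlib

def name(algo, payload, child_names):
    # Toy serialization: payload, NUL, then the raw names of the children.
    return hashlib.new(algo, payload + b"\0" + b"".join(child_names)).digest()

def add_object(store, payload, children=()):
    # Store an object under its sha256 name; children are sha256 names.
    key = name("sha256", payload, children)
    store[key] = (payload, tuple(children))
    return key

def translate(store, sha256_name, to_sha1, to_sha256):
    # The sha1 name of an object needs the sha1 names of everything it
    # refers to, so children are translated before their parent.
    if sha256_name in to_sha1:
        return to_sha1[sha256_name]
    payload, children = store[sha256_name]
    sha1_children = [translate(store, c, to_sha1, to_sha256) for c in children]
    sha1_name = name("sha1", payload, sha1_children)
    to_sha1[sha256_name] = sha1_name
    to_sha256[sha1_name] = sha256_name
    return sha1_name

store = {}
blob = add_object(store, b"hello\n")
tree = add_object(store, b"tree", [blob])
commit = add_object(store, b"commit", [tree])

to_sha1, to_sha256 = {}, {}
translate(store, commit, to_sha1, to_sha256)

# Entries land in the table blob first, then tree, then commit.
print([k == v for k, v in zip(to_sha1, [blob, tree, commit])])
```

Recursion keeps the sketch short; a real history is deep enough that
this would need an explicit stack instead.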

-Peff