Re: RFC v3: Another proposed hash function transition plan

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 10 Mar 2017 11:55:24 -0800

Jeff King wrote:
> On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

>>> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
>>> read the SHA-3.
>>> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
>>> translate offset to SHA-1.
>>
>> Thanks for this suggestion.  I was initially vaguely nervous about
>> lookup times in an idx-style file, but as you say, object reads from a
>> packfile already have to deal with this kind of lookup and work fine.
>
> Not exactly. The "reverse .idx" step has to build the reverse mapping on
> the fly, and it's non-trivial.

Sure.  To be clear, I was handwaving over that since adding an on-disk
reverse .idx is a relatively small change.

[...]
> So I think it's solvable, but I suspect we would want an extension to
> the .idx format to store the mapping array, in order to keep it log-n.

i.e., this.
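
To make the two lookups concrete, here is a rough sketch in Python of
how they could stay log-n once the offset-sorted mapping is stored
rather than rebuilt on the fly.  Everything in it is an assumption for
illustration: ToyPack, its four in-memory tables, and the _find helper
are hypothetical stand-ins, not the real .idx or proposed .msha1
layout.  A real format would more likely store index positions than
copies of the object names.

```python
import bisect


def _find(table, key):
    """Binary search a sorted list of (key, value) pairs; None if absent."""
    # (key,) sorts just before any (key, value), so this lands on the match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


class ToyPack:
    """Toy model of one pack whose objects have both a SHA-1 and a SHA-3 name."""

    def __init__(self, objects):
        # objects: iterable of (offset, sha1_hex, sha3_hex) triples.
        objs = list(objects)
        # Stand-in for .idx: SHA-3 name -> pack offset, sorted by name.
        self.idx = sorted((sha3, off) for off, _, sha3 in objs)
        # Stand-in for .msha1: SHA-1 name -> pack offset, sorted by name.
        self.msha1 = sorted((sha1, off) for off, sha1, _ in objs)
        # Hypothetical stored reverse mappings, sorted by offset, so that
        # no reverse index has to be built at lookup time.
        self.rev_idx = sorted((off, sha3) for off, _, sha3 in objs)
        self.rev_msha1 = sorted((off, sha1) for off, sha1, _ in objs)

    def sha1_to_sha3(self, sha1):
        # Look up SHA-1 in .msha1, then map the offset back through .idx.
        off = _find(self.msha1, sha1)
        return None if off is None else _find(self.rev_idx, off)

    def sha3_to_sha1(self, sha3):
        # Look up SHA-3 in .idx, then map the offset back through .msha1.
        off = _find(self.idx, sha3)
        return None if off is None else _find(self.rev_msha1, off)
```

Each translation is two binary searches, so the cost per pack grows
with the log of the number of objects in it.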

The loose object side is the more worrying bit, since we currently don't
have any practical bound on the number of loose objects.

One way to deal with that is to disallow loose objects completely.
Use packfiles for new objects, batching the objects produced by a
single process into a single packfile.  Teach "git gc --auto" a
behavior similar to Martin Fick's "git exproll" to combine packfiles
between full gcs to maintain reasonable performance.  For unreachable
objects, instead of using loose objects, use "unreachable garbage"
packs explicitly labeled as such, with similar semantics to what
JGit's DfsRepository backend uses (described in the discussion at
https://git.eclipse.org/r/89455).
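
As a rough illustration of the "combine packfiles between full gcs"
step, here is one possible size-tiered policy sketched in Python.  It
is not the algorithm "git exproll" uses and not anything "git gc
--auto" does today; the function name packs_to_combine, the ratio
parameter, and the rule itself are assumptions made up for this
example.

```python
def packs_to_combine(pack_sizes, ratio=2):
    """Pick the smallest packs to merge into one pack; [] if none needed.

    Hypothetical rule: every pack should be at least `ratio` times as
    large as all smaller packs combined.  The smallest packs, up to and
    including the last one that breaks the rule, get merged.
    """
    sizes = sorted(pack_sizes)
    total = 0  # combined size of the packs smaller than the current one
    cut = 0    # how many of the smallest packs to merge
    for i, size in enumerate(sizes):
        if i > 0 and size < ratio * total:
            cut = i + 1
        total += size
    return sizes[:cut]
```

With many small per-process packs next to one big pack, for example
sizes [1, 1, 1, 100], only the three small packs are selected and the
big pack is left alone, which keeps the pack count roughly logarithmic
in the repository size without rewriting everything each time.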

That's a direction that I want in the long term anyway.  I was hoping
not to couple such changes with the hash transition, but it might be
one of the simpler ways to go.

Jonathan