Re: RFC v3: Another proposed hash function transition plan

Jonathan Nieder <jrnieder@xxxxxxxxx> · Thu, 9 Mar 2017 12:24:08 -0800

Hi,

Shawn Pearce wrote:
> On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:

>> Alongside the packfile, a sha3 repository stores a bidirectional
>> mapping between sha3 and sha1 object names. The mapping is generated
>> locally and can be verified using "git fsck". Object lookups use this
>> mapping to allow naming objects using their sha1 and sha3 names
>> interchangeably.
>
> I saw some discussion about using LevelDB for this mapping table. I
> think any existing database may be overkill.
>
> For packs, you may be able to simplify by having only one file
> (pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table
> in v2 is unnecessary, but you need the 64 bit offset support.
>
> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> read the SHA-3.
> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> translate offset to SHA-1.

Thanks for this suggestion.  I was initially vaguely nervous about
lookup times in an idx-style file, but as you say, object reads from a
packfile already have to deal with this kind of lookup and work fine.
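
To make the two translation paths concrete, here is a rough sketch in
Python. It is a toy in-memory model, not Git's on-disk format: the
class name PackMapping and its methods are made up for illustration,
and sorted lists stand in for the .idx file, the proposed .msha1 file,
and their reverse (offset-sorted) views.

```python
import bisect


class PackMapping:
    """Toy stand-in for pack-*.idx plus the proposed pack-*.msha1."""

    def __init__(self, objects):
        # objects: iterable of (sha3_hex, sha1_hex, pack_offset) tuples.
        objects = list(objects)
        # Name-sorted tables, like an idx file: name -> pack offset.
        self.idx = sorted((sha3, off) for sha3, _, off in objects)
        self.msha1 = sorted((sha1, off) for _, sha1, off in objects)
        # Offset-sorted tables, like a reverse index: pack offset -> name.
        self.rev_idx = sorted((off, sha3) for sha3, _, off in objects)
        self.rev_msha1 = sorted((off, sha1) for _, sha1, off in objects)

    @staticmethod
    def _lookup(table, key):
        # Binary search on the first tuple element.
        i = bisect.bisect_left(table, (key,))
        if i < len(table) and table[i][0] == key:
            return table[i][1]
        return None

    def sha1_to_sha3(self, sha1):
        # Look up SHA-1 in .msha1, then reverse .idx to find the SHA-3.
        off = self._lookup(self.msha1, sha1)
        if off is None:
            return None
        return self._lookup(self.rev_idx, off)

    def sha3_to_sha1(self, sha3):
        # Look up SHA-3 in .idx, then reverse .msha1 to find the SHA-1.
        off = self._lookup(self.idx, sha3)
        if off is None:
            return None
        return self._lookup(self.rev_msha1, off)
```

Each translation is two binary searches, which is the same kind of
work an ordinary object read from a packfile already does.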

> For loose objects, the loose object directories should have only
> O(4000) entries before auto gc is strongly encouraging
> packing/pruning. With 256 shards, each given directory has O(16) loose
> objects in it. When writing a SHA-3 loose object, Git could also
> append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
> GC/prune rewrites to remove entries. With O(16) objects in a
> directory, these files should only have O(16) entries in them.

Insertion time is what worries me.  When writing a small number of
objects using a command like "git commit", I don't want to have to
regenerate an entire idx file.  I don't want to move the pain to
O(loose objects) work at read time, either --- some people disable
auto gc, and others have a large number of loose objects due to gc
ejecting unreachable objects.
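
For reference, a minimal sketch of the append-a-line scheme quoted
above, again in Python. The function names are hypothetical, and I am
assuming that ${first_byte} means the first byte of the sha3 name,
since that is how the loose objects themselves would be sharded in a
sha3 repository.

```python
import os


def record_loose_mapping(objects_dir, sha3_hex, sha1_hex):
    """Append "$sha3 $sha1\\n" to objects/<first byte of sha3>/sha1."""
    shard = os.path.join(objects_dir, sha3_hex[:2])
    os.makedirs(shard, exist_ok=True)
    # Append only: writing one object never rewrites existing entries.
    with open(os.path.join(shard, "sha1"), "a") as f:
        f.write(f"{sha3_hex} {sha1_hex}\n")


def sha1_for_loose(objects_dir, sha3_hex):
    """Linear scan of one shard's mapping file; None if not found."""
    path = os.path.join(objects_dir, sha3_hex[:2], "sha1")
    try:
        with open(path) as f:
            for line in f:
                sha3, sha1 = line.split()
                if sha3 == sha3_hex:
                    return sha1
    except FileNotFoundError:
        pass
    return None
```

Insertion is cheap here, but each lookup scans a whole file, so the
read cost grows with the number of loose objects in the shard. Under
the sharding assumption above, going from sha1 to sha3 would also
have to scan every shard, since the sha1 name does not say which
directory to look in.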

But some kind of simplification along these lines should be possible.
I'll experiment.

Jonathan