Re: RFC v3: Another proposed hash function transition plan

Shawn Pearce <spearce@xxxxxxxxxxx> · Thu, 9 Mar 2017 11:14:12 -0800

On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
> Linus Torvalds wrote:
>> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
>
>>> This document is still in flux but I thought it best to send it out
>>> early to start getting feedback.
>>
>> This actually looks very reasonable if you can implement it cleanly
>> enough.
>
> Thanks for the kind words on what had quite a few flaws still.  Here's
> a new draft.  I think the next version will be a patch against
> Documentation/technical/.

FWIW, I like this approach.

> Alongside the packfile, a sha3 repository stores a bidirectional
> mapping between sha3 and sha1 object names. The mapping is generated
> locally and can be verified using "git fsck". Object lookups use this
> mapping to allow naming objects using either their sha1 and sha3 names
> interchangeably.

I saw some discussion about using LevelDB for this mapping table. I
think any existing database may be overkill.

For packs, you may be able to simplify by having only one file
(pack-*.msha1) that maps SHA-1 to pack offset, essentially in the idx
v2 format. The CRC32 table in v2 is unnecessary, but you need the
64-bit offset support.

SHA-1 to SHA-3: look up the SHA-1 in .msha1 to get the pack offset,
then reverse the .idx (offset to name) to read the SHA-3 at that
offset.
SHA-3 to SHA-1: look up the SHA-3 in .idx to get the pack offset, then
reverse the .msha1 file to translate that offset to the SHA-1.
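
As a rough illustration of the two lookups above (not real Git code):
the sketch below keeps both tables in memory, sorted by name the way a
pack index is. PackMaps and its field names are hypothetical, and the
dicts used for the "reverse" step are an assumption standing in for
whatever offset-sorted view an implementation would build.

```python
import bisect


class PackMaps:
    """In-memory stand-in for one pack's .idx (SHA-3) and .msha1 (SHA-1)."""

    def __init__(self, objects):
        # objects: iterable of (sha3_hex, sha1_hex, pack_offset) triples.
        objects = list(objects)
        # Each table is sorted by object name, like a pack index.
        self.idx = sorted((s3, off) for s3, _, off in objects)
        self.msha1 = sorted((s1, off) for _, s1, off in objects)
        # Reverse views (offset -> name); an assumption of this sketch.
        self.idx_by_offset = {off: s3 for s3, off in self.idx}
        self.msha1_by_offset = {off: s1 for s1, off in self.msha1}

    @staticmethod
    def _offset_of(table, name):
        # Binary search over the name-sorted table.
        i = bisect.bisect_left(table, (name,))
        if i < len(table) and table[i][0] == name:
            return table[i][1]
        return None

    def sha1_to_sha3(self, sha1):
        # SHA-1 -> offset via .msha1, then offset -> SHA-3 via reversed .idx.
        off = self._offset_of(self.msha1, sha1)
        return None if off is None else self.idx_by_offset[off]

    def sha3_to_sha1(self, sha3):
        # SHA-3 -> offset via .idx, then offset -> SHA-1 via reversed .msha1.
        off = self._offset_of(self.idx, sha3)
        return None if off is None else self.msha1_by_offset[off]
```

Both directions cost one binary search plus one reverse lookup, and
the pack offset is the only thing the two tables need to share.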

For loose objects, the loose object directories should have only
O(4000) entries before auto gc starts strongly encouraging
packing/pruning. With 256 shards, each directory has O(16) loose
objects in it. When writing a SHA-3 loose object, Git could also
append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
GC/prune rewrites to remove stale entries. With O(16) objects in a
directory, these files should only have O(16) entries in them.
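
A minimal sketch of the append step described above, in Python for
brevity. The function name is hypothetical, hex names are assumed, and
locking and fsync are left out entirely.

```python
import os


def record_loose_mapping(objects_dir, sha3_hex, sha1_hex):
    """Append one "$sha3 $sha1" line to the shard's mapping file."""
    # Shard by the first byte (two hex digits) of the SHA-3 name.
    shard = os.path.join(objects_dir, sha3_hex[:2])
    os.makedirs(shard, exist_ok=True)
    # The file is literally named "sha1", next to the loose objects.
    with open(os.path.join(shard, "sha1"), "a", encoding="ascii") as f:
        f.write(f"{sha3_hex} {sha1_hex}\n")
```

Because the file is append-only between GC runs, a writer never has to
read or rewrite existing entries.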

SHA-3 to SHA-1: open objects/${sha3_first_byte}/sha1 and scan until a
match is found.
SHA-1 to SHA-3: brute-force read all 256 files. Callers performing
this mapping may load all 256 files into a table in memory.
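
The two loose-object lookups could look roughly like this (Python
sketch, hypothetical function names, assuming the "$sha3 $sha1\n" line
format above and that a missing shard file just means no entries):

```python
import os


def loose_sha3_to_sha1(objects_dir, sha3_hex):
    """Scan the single shard file chosen by the SHA-3's first byte."""
    path = os.path.join(objects_dir, sha3_hex[:2], "sha1")
    try:
        with open(path, encoding="ascii") as f:
            for line in f:
                fields = line.split()
                if len(fields) == 2 and fields[0] == sha3_hex:
                    return fields[1]
    except FileNotFoundError:
        pass
    return None


def load_loose_sha1_to_sha3(objects_dir):
    """Brute force: read all 256 shard files into one in-memory table."""
    table = {}
    for shard in range(256):
        path = os.path.join(objects_dir, f"{shard:02x}", "sha1")
        try:
            with open(path, encoding="ascii") as f:
                for line in f:
                    fields = line.split()
                    if len(fields) == 2:
                        sha3, sha1 = fields
                        table[sha1] = sha3
        except FileNotFoundError:
            continue
    return table
```

With O(16) entries per file the linear scan is cheap, and the full
table is only a few thousand entries.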