Hi,

Shawn Pearce wrote:
> On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:

>> Alongside the packfile, a sha3 repository stores a bidirectional
>> mapping between sha3 and sha1 object names. The mapping is generated
>> locally and can be verified using "git fsck". Object lookups use this
>> mapping to allow naming objects using either their sha1 and sha3 names
>> interchangeably.
>
> I saw some discussion about using LevelDB for this mapping table. I
> think any existing database may be overkill.
>
> For packs, you may be able to simplify by having only one file
> (pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table
> in v2 is unnecessary, but you need the 64 bit offset support.
>
> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> read the SHA-3.
> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> translate offset to SHA-1.

Thanks for this suggestion. I was initially vaguely nervous about
lookup times in an idx-style file, but as you say, object reads from a
packfile already have to deal with this kind of lookup and work fine.

> For loose objects, the loose object directories should have only
> O(4000) entries before auto gc is strongly encouraging
> packing/pruning. With 256 shards, each given directory has O(16) loose
> objects in it. When writing a SHA-3 loose object, Git could also
> append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which
> GC/prune rewrites to remove entries. With O(16) objects in a
> directory, these files should only have O(16) entries in them.

Insertion time is what worries me. When writing a small number of
objects using a command like "git commit", I don't want to have to
regenerate an entire idx file. I don't want to move the pain to
O(loose objects) work at read time, either --- some people disable
auto gc, and others have a large number of loose objects due to gc
ejecting unreachable objects.
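
For concreteness, here is a rough sketch of the pack-side translation
quoted above, written in Python rather than anything resembling Git's C
code. NameIndex, offset_of, name_at and translate are names made up for
illustration; the tables live in memory instead of in the on-disk idx v2
layout, and the demo names are hashes of raw bytes, not real Git object
names. It only shows that two name-to-offset tables plus their reverse
(offset-to-name) views are enough to translate in both directions.

```python
import bisect
import hashlib


class NameIndex:
    """Toy stand-in for an idx-style file: sorted (name, offset) pairs.

    Hypothetical helper; it does not read or write Git's real idx format.
    """

    def __init__(self, entries):
        # Forward table, sorted by object name (like the idx name table).
        self._by_name = sorted(entries)
        self._names = [name for name, _ in self._by_name]
        # Reverse table, sorted by pack offset (the "reverse index").
        self._by_offset = sorted((off, name) for name, off in self._by_name)
        self._offsets = [off for off, _ in self._by_offset]

    def offset_of(self, name):
        """Binary-search the name table; return the pack offset or None."""
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            return self._by_name[i][1]
        return None

    def name_at(self, offset):
        """Binary-search the reverse table; return the name or None."""
        i = bisect.bisect_left(self._offsets, offset)
        if i < len(self._offsets) and self._offsets[i] == offset:
            return self._by_offset[i][1]
        return None


def translate(name, src, dst):
    """Map a name in one hash to the other via the shared pack offset."""
    offset = src.offset_of(name)
    return None if offset is None else dst.name_at(offset)


# Made-up pack contents: three objects at arbitrary offsets. The last
# offset does not fit in 32 bits, hence the need for 64 bit offsets.
payloads = [b"alpha", b"beta", b"gamma"]
offsets = [12, 4096, 2**32 + 7]

idx = NameIndex(
    (hashlib.sha3_256(p).digest(), o) for p, o in zip(payloads, offsets)
)
msha1 = NameIndex(
    (hashlib.sha1(p).digest(), o) for p, o in zip(payloads, offsets)
)

sha1_name = hashlib.sha1(b"beta").digest()
sha3_name = translate(sha1_name, msha1, idx)   # SHA-1 to SHA-3
round_trip = translate(sha3_name, idx, msha1)  # SHA-3 back to SHA-1
```

Each direction costs two binary searches, the same kind of lookup a
packfile read already performs. The sketch says nothing about insertion
cost, since both tables are built once from a complete list of entries.
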
But some kind of simplification along these lines should be possible.
I'll experiment.

Jonathan