On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote: > > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to > > read the SHA-3. > > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to > > translate offset to SHA-1. > > Thanks for this suggestion. I was initially vaguely nervous about > lookup times in an idx-style file, but as you say, object reads from a > packfile already have to deal with this kind of lookup and work fine. Not exactly. The "reverse .idx" step has to build the reverse mapping on the fly, and it's non-trivial. For instance, try: sha1=$(git rev-parse HEAD) time echo $sha1 | git cat-file --batch-check='%(objectsize)' time echo $sha1 | git cat-file --batch-check='%(objectsize:disk)' on a large repo (where HEAD is in a big pack). The on-disk size is conceptually simpler, as we only need to look at the offset of the object versus the offset of the object after it. But in practice it takes much longer, because it has to build the revindex on the fly (I get 7ms versus 179ms on linux.git). The effort is linear in the number of objects (we create the revindex with a radix sort). The reachability bitmaps suffer from this, too, as they need the revindex to know which object is at which bit position. At GitHub we added an extension to the .bitmap files that stores this "bit cache". Here are timings before and after on linux.git: $ time git rev-list --use-bitmap-index --count master 659371 real 0m0.182s user 0m0.136s sys 0m0.044s $ time git.gh rev-list --use-bitmap-index --count master 659371 real 0m0.016s user 0m0.008s sys 0m0.004s It's not a full revindex, but it's enough for bitmap use. You can also use it to generate the revindex slightly more quickly, because you can skip the sorting step (you just insert the entries in the correct order by walking the bit cache and dereferencing the offsets from the .idx portion). So it's still linear, but with a smaller constant factor. I think for the purposes here, though, we don't actually care about the offsets. For the cost of one uint32_t per object, you can keep a list mapping positions in the sha1 index into the sha3 index. So then you do the log-n binary search to find the sha1, a constant-time lookup in the mapping array, and that gives you the position in the sha3 index, from which you can then access the sha3 (or the actual pack offset, for that matter). So I think it's solvable, but I suspect we would want an extension to the .idx format to store the mapping array, in order to keep it log-n. -Peff