Re: RFC v3: Another proposed hash function transition plan

Jeff King <peff@xxxxxxxx> · Fri, 10 Mar 2017 14:38:35 -0500

On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

> > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
> > read the SHA-3.
> > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
> > translate offset to SHA-1.
> 
> Thanks for this suggestion.  I was initially vaguely nervous about
> lookup times in an idx-style file, but as you say, object reads from a
> packfile already have to deal with this kind of lookup and work fine.

Not exactly. The "reverse .idx" step has to build the reverse mapping on
the fly, and it's non-trivial. For instance, try:

  sha1=$(git rev-parse HEAD)
  time echo $sha1 | git cat-file --batch-check='%(objectsize)'
  time echo $sha1 | git cat-file --batch-check='%(objectsize:disk)'

on a large repo (where HEAD is in a big pack). The on-disk size is
conceptually simpler, as we only need to look at the offset of the
object versus the offset of the object after it. But in practice it
takes much longer, because it has to build the revindex on the fly (I
get 7ms versus 179ms on linux.git).

The effort is linear in the number of objects (we create the revindex
with a radix sort).

The reachability bitmaps suffer from this, too, as they need the
revindex to know which object is at which bit position. At GitHub we
added an extension to the .bitmap files that stores this "bit cache".
Here are timings before and after on linux.git:

  $ time git rev-list --use-bitmap-index --count master
  659371

  real	0m0.182s
  user	0m0.136s
  sys	0m0.044s

  $ time git.gh rev-list --use-bitmap-index --count master
  659371

  real	0m0.016s
  user	0m0.008s
  sys	0m0.004s

It's not a full revindex, but it's enough for bitmap use. You can also
use it to generate the revindex slightly more quickly, because you can
skip the sorting step (you just insert the entries in the correct order
by walking the bit cache and dereferencing the offsets from the .idx
portion). So it's still linear, but with a smaller constant factor.

I think for the purposes here, though, we don't actually care about the
offsets. For the cost of one uint32_t per object, you can keep a list
mapping positions in the sha1 index into the sha3 index. So then you do
the log-n binary search to find the sha1, a constant-time lookup in the
mapping array, and that gives you the position in the sha3 index, from
which you can then access the sha3 (or the actual pack offset, for that
matter).

So I think it's solvable, but I suspect we would want an extension to
the .idx format to store the mapping array, in order to keep it log-n.

-Peff