Re: [PATCH 0/6] PATH WALK I: The path-walk API

Jeff King <peff@xxxxxxxx> · Mon, 4 Nov 2024 12:25:33 -0500

On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:

> > I was discussing this a bit off-list with Peff (who I hope will join the
> > thread and share his own thoughts), but I wonder if it was a mistake to
> > discard your '--full-name-hash' idea (or something similar, which I'll
> > discuss in a bit below) from earlier.
> 
> I'd be happy to resurrect that series, adding in the learnings from
> working on the path-walk feature. It helps that the current series adds
> the path-walk API and has no conflicting changes in the pack-objects or
> repack builtins. (I can handle those conflicts as things merge down.)

Adding my two cents, the discussion we had came after reading this post:

  https://www.jonathancreamer.com/how-we-shrunk-our-git-repo-size-by-94-percent/

I think a few of the low-level details in there are confusing, but it
seemed to me that most of the improvement he mentions is just about
finding better delta candidates. And it seems obvious that our current
pack_name_hash() is pretty rudimentary as context-sensitive hashing
goes and won't do well for long paths with similar endings.

So just swapping that out for something better seems like an easy thing
to do regardless of whether we pursue --path-walk. It doesn't
drastically change how we choose delta pairs so it's not much code and
it shouldn't conflict with other features. And the risk of making
anything worse should be pretty low.

I wouldn't at all be surprised if --path-walk can do better, but if we
do the easy thing first then I think it gives us a better idea of the
cost/benefit it's providing.

I suspect there's room for both in the long run. You seem to be focused
on push size and cost, whereas I think Taylor and I are more interested
in overall repo size and cost of serving bitmapped fetches.

> When the optimization idea is to focus on the full path and not care
> about the "locality" of the path name by its later bits, storing the
> full name in a list and storing an index into that list would have a
> very similar effect.
> 
> I'd be interested to explore the idea of storing the full path name.
> Based on my exploration with the 'test-tool name-hash' test helper in
> that series, I'm not sure that we will make significant gains by doing
> so. Worth trying.

The way I look at it is a possible continuum. We want to use pathnames
as a way to sort delta candidates near each other, since we expect them
to have high locality with respect to delta-able contents. The current
name_hash uses a very small bit of that path information and throws away
most of it. The other extreme end is holding the whole path. We may want
to end up in the middle for two reasons:

  1. Dealing with whole paths might be costly (though I'm not yet
     convinced of that; outside of pathological cases, the number of
     paths in a repo tends to pale in comparison to the number of
     objects, and the per-object costs dominate during repacking).

  2. It's possible that over-emphasizing the path might be a slightly
     worse heuristic (and I think this is a potential danger of
     --path-walk, too). We still want to find candidate pairs that were
     copied or renamed, for example, or that substantially share content
     found in different parts of the tree.

So it would be interesting to be able to see the performance of various
points on that line, from full path down to partial paths down to longer
hashes down to the current hash. The true extreme end of course is no
path info at all, but I think we know that sucks; that's why we
implemented the bitmap name-hash extension in the first place.

> I don't know too much about SimHash or MinHash, but based on what I
> could gather from some initial reading I'm not sure that they would be
> effective without increasing the hash length. We'd also get a different
> kind of locality, such as the appearance of a common word would be more
> likely to affect the locality than the end of the path.

Good point. This is all heuristics, of course, but I suspect that the
order of the path is important, and that foo/bar.c and bar/foo.c are
unlikely to be good matches.

> > I realize that this is taking us back to an idea you've already
> > presented to the list, but I think (to me, at least) the benefit and
> > simplicity of that approach has only become clear to me in hindsight
> > when seeing some alternatives. I would like to apologize for the time
> > you spent reworking this series back and forth to have the response be
> > "maybe we should have just done the first thing you suggested". Like I
> > said, I think to me it was really only clear in hindsight.
> 
> I always assumed that we'd come back to it eventually. There is also the
> extra bit about making the change to the name-hash compatible with the
> way name-hashes are stored in the reachability bitmaps. That will need
> some work before it is ready for prime time.

Having worked on that feature of bitmaps, I'm not too worried about it.
I think we'd just need to:

  - introduce a new bitmap ext with a flag (HASH_CACHE_V2 or something,
    either with the new hash, or with a "version" byte at the start for
    extensibility).

  - when bitmaps are not in use, we're free to use whichever hash we
    want internally. If the new hash is consistently better, we'd
    probably just enable it by default.

  - when packing using on-disk bitmaps, use internally whichever format
    the on-disk file provided. Technically the format could even provide
    both (in which case we'd prefer the new hash), but I don't see much
    point.

  - when writing bitmaps, use whichever hash the command-line options
    asked for. There's an off chance somebody might want to generate a
    .bitmap file whose hashes will be understood by an older version of
    git, in which case they'd use --no-full-name-hash or whatever while
    repacking.

If we're considering full paths, then that is potentially a bit more
involved, just because we'd want the format to avoid repeating duplicate
paths for each object (plus they're no longer fixed-size). So probably
an extension with packed NUL-terminated path strings, plus a set of
fixed-length offsets into that block, one per object.

> I disagree that all environments will prefer the --full-name-hash. I'm
> currently repeating the performance tests right now, and I've added one.
> The issues are:
> 
>  1. The --full-name-hash approach sometimes leads to a larger pack when
>     using "git push" on the client, especially when the name-hash is
>     already effective for compressing across paths.

That's interesting. I wonder which cases get worse, and if a larger
window size might help. I.e., presumably we are pushing the candidates
further away in the sorted delta list.

>  2. A depth 1 shallow clone cannot use previous versions of a path, so
>     those situations will want to use the normal name hash. This can be
>     accomplished simply by disabling the --full-name-hash option when
>     the --shallow option is present; a more detailed version could be
>     used to check for a large depth before disabling it. This case also
>     disables bitmaps, so that isn't something to worry about.

I'm not sure why a larger hash would be worse in a shallow clone. As you
note, with only one version of each path the name-similarity heuristic
is not likely to buy you much. But I'd have thought that would be true
for the existing name hash as well as a longer one. Maybe this is the
"over-emphasizing" case.

-Peff