Re: [PATCH 0/6] PATH WALK I: The path-walk API

Jeff King <peff@xxxxxxxx> · Mon, 11 Nov 2024 17:04:47 -0500

On Fri, Nov 08, 2024 at 10:17:24AM -0500, Derrick Stolee wrote:

> > > I'm not sure why a larger hash would be worse in a shallow clone. As you
> > > note, with only one version of each path the name-similarity heuristic
> > > is not likely to buy you much. But I'd have thought that would be true
> > > for the existing name hash as well as a longer one. Maybe this is the
> > > "over-emphasizing" case.
> 
> I'm confused by your wording of "larger hash" because the hash size
> is the exact same: 32 bits. It's just that the --full-name-hash option
> has fewer collisions by abandoning the hope of locality.
> 
> In a depth 1 shallow clone, there are no repeated paths, so any hash
> collisions are true collisions instead of good candidates for deltas.
> The full name hash is essentially random, so the delta compression
> algorithm basically says:
> 
>   1. Sort by type.
>   2. Within each type, sort the objects randomly.
> 
> With that sort, the delta compression scan is less effective than the
> standard name hash.

Ah, OK. I'm sorry, I had not really investigated the full-name-hash and
misunderstood what it was doing. I thought we were using a larger hash
in order to give more locality hints. I.e., to let us distinguish
"foo/long-path.c" from "bar/long-path.c", but still retain some locality
between them.

But it is throwing away all locality outside of the exact name. So yes,
it would never find a rename from "foo/" to "bar/", as those mean the
name-hash is effectively random.

So I guess getting back to what Taylor and I had talked about off-list:
we were wondering if there was a way to provide a better "slider" for
locality as part of the normal delta candidate sorting process. I think
you could store the full pathname and then doing a sort based on the
reverse of the string (so "foo.c" and "bar.c" would compare "c", then
".", etc). And that would let you tell the difference between
"foo/long-path.c" and "bar/long-path.c" (preferring the identical
filenames over the merely similar ones), but still sort them together
versus "some-unrelated-path.c".

And what I wondered (and what I had initially thought full-name-hash was
doing) was whether we could store some fixed-size hash that produces a
similar distribution (modulo collisions) to what that reverse-name sort
would do.

-Peff