On Fri, Nov 08, 2024 at 10:17:24AM -0500, Derrick Stolee wrote: > > > I'm not sure why a larger hash would be worse in a shallow clone. As you > > > note, with only one version of each path the name-similarity heuristic > > > is not likely to buy you much. But I'd have thought that would be true > > > for the existing name hash as well as a longer one. Maybe this is the > > > "over-emphasizing" case. > > I'm confused by your wording of "larger hash" because the hash size > is the exact same: 32 bits. It's just that the --full-name-hash option > has fewer collisions by abandoning the hope of locality. > > In a depth 1 shallow clone, there are no repeated paths, so any hash > collisions are true collisions instead of good candidates for deltas. > The full name hash is essentially random, so the delta compression > algorithm basically says: > > 1. Sort by type. > 2. Within each type, sort the objects randomly. > > With that sort, the delta compression scan is less effective than the > standard name hash. Ah, OK. I'm sorry, I had not really investigated the full-name-hash and misunderstood what it was doing. I thought we were using a larger hash in order to give more locality hints. I.e., to let us distinguish "foo/long-path.c" from "bar/long-path.c", but still retain some locality between them. But it is throwing away all locality outside of the exact name. So yes, it would never find a rename from "foo/" to "bar/", as those mean the name-hash is effectively random. So I guess getting back to what Taylor and I had talked about off-list: we were wondering if there was a way to provide a better "slider" for locality as part of the normal delta candidate sorting process. I think you could store the full pathname and then doing a sort based on the reverse of the string (so "foo.c" and "bar.c" would compare "c", then ".", etc). And that would let you tell the difference between "foo/long-path.c" and "bar/long-path.c" (preferring the identical filenames over the merely similar ones), but still sort them together versus "some-unrelated-path.c". And what I wondered (and what I had initially thought full-name-hash was doing) was whether we could store some fixed-size hash that produces a similar distribution (modulo collisions) to what that reverse-name sort would do. -Peff