Re: [PATCH 0/4] pack-objects: create new name-hash algorithm

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 09 Sep 2024 11:06:27 -0700

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> One way to find some improvement in these repositories is to increase the
> window size, which was an initial indicator that the delta compression could
> be improved, but was not a clear indicator. After some digging (and
> prototyping some analysis tools) the main discovery was that the current
> name-hash algorithm only considers the last 16 characters in the path name
> and has some naturally-occurring collisions within that scope.

Yes, as I explained in the other message, this "collision" is an
integral part of the design to allow us gather candidates together
that may yield good deltas among them.  In addition, header files
whose names end with ".h" tend to share a bit comment at the
beginning of them in many projects, and the proximity (not
"collision") of the hash value is used to make them delta candidates
with each other.

I do agree that considering files at the same path from different
(but close-by) revisions as the prime candidates is very important,
but if you spread the "collissions" very thin by using "uniform
distribution", I am afraid that you'd end up discarding anything but
the blobs at the same path, which may go too far.  Having name hash
value that are close by no longer has any meaning in such a system.

I hope you can find a solution that strikes a good balance at the
end of the series (I saw only the first step), but I suspect an easy
way to avoid the downsides you observed is to use both.  Compare
with a handful of blobs taken from nearby commits (the original
object order is roughly in traversal order, and you can take
advantage of that fact) from exactly the same path (using your
"uniform distribution") before comparing with the blobs with close
value (of the current function) like the current implementation
does, may go a long way.