Re: [PATCH 0/4] pack-objects: create new name-hash algorithm

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 10 Sep 2024 17:09:22 -0400

On 9/10/24 4:36 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>
>>> The thing that surprised me is just how effective this is for the
>>> creation of large pack-files that include many versions of most
>>> files. The cross-path deltas have less of an effect here, and the
>>> benefits of avoiding name-hash collisions can be overwhelming in
>>> many cases.
>>
>> Yes, "make sure we notice a file F moving from directory A to B" is
>> inherently optimized for a short span of history, i.e. a smallish push
>> rather than a whole history clone, where the definition of
>> "smallish" is that even if you create optimal delta chains, the
>> length of these delta chains will not exceed the "--depth" option.
>>
>> If the history you are pushing modified A/F twice, renamed it to B/F
>> (with or without modification at the same time), then modified B/F
>> twice more, you'd want to pack the 5-commit segment and having to
>> artificially cut the delta chain that can contain all of these 5
>> blobs into two at the renaming commit is a huge loss.
>
> Which actually leads me to suspect that we probably do not even have
> to expose the --full-name-hash option to the end users in "git repack".
>
> If we are doing an incremental repack that would fit within the depth
> setting, it is likely that we would be better off without the
> full-name-hash optimization, and if we are doing "repack -a" for the
> whole repository, especially with "-f", it would make sense to do the
> full-name-hash optimization.

Depending on how much we learn from others testing the --full-name-hash
option, I could see the potential that -a could imply --full-name-hash.
I hesitate to introduce that in the first release with this option,
though.

> If we can tell how large a chunk of history we are packing before we
> actually start calling builtin/pack-objects.c:add_object_entry(), we
> probably should be able to even select between with and without
> full-name-hash automatically, but I do not think we know the object
> count before we finish calling add_object_entry(), so unless we are
> willing to compute and keep both while reading and pick between the
> two after we finish reading the list of objects, or something, it
> will require major surgery to do so, I am afraid.

It's also possible that we could check the list of paths at HEAD to
see how many collisions the default name-hash gives. In cases like
the Git repository, there are very few collisions and thus we don't
need to use --full-name-hash. Restricting to just HEAD (or the
default ref) is not a complete analysis, but might be a good
heuristic.
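
A rough sketch of what such a check could look like, in Python rather
than C for brevity. It assumes the default name-hash is the
shift-and-add hash that skips whitespace and is dominated by the last
sixteen or so characters of the path; the transcription below should
match it for ASCII paths, but treat that as an assumption. The helper
names (name_hash, collision_stats, suggests_full_name_hash) and the
0.25 threshold are hypothetical and not part of this series.

```python
from collections import defaultdict

# Bytes skipped by the hash (assumption: space, tab, LF, CR).
_SPACE = b" \t\n\r"


def name_hash(path: str) -> int:
    """Shift-and-add hash over the path bytes, truncated to 32 bits."""
    h = 0
    for c in path.encode("utf-8"):
        if c in _SPACE:
            continue
        # Older characters are shifted toward the low bits, so the tail
        # of the path dominates the resulting value.
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


def collision_stats(paths):
    """Return (paths that share a hash with another path, total paths)."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[name_hash(p)].append(p)
    colliding = sum(len(b) for b in buckets.values() if len(b) > 1)
    return colliding, len(paths)


def suggests_full_name_hash(paths, threshold=0.25):
    """Hypothetical heuristic: many collisions at the tip => full hash."""
    colliding, total = collision_stats(paths)
    return total > 0 and colliding / total > threshold


example_paths = [
    "Makefile",
    "README.md",
    "docs/en/installation/index.md",
    "docs/fr/installation/index.md",
]
print(collision_stats(example_paths))
print(suggests_full_name_hash(example_paths))
```

The two docs/ paths differ only in a directory component far from the
end of the path, so they land in the same bucket. In practice the
path list could come from something like
"git ls-tree -r --name-only HEAD", and the threshold would need to be
tuned against real repositories.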

Thanks,
-Stolee