Re: [PATCH 0/6] PATH WALK I: The path-walk API

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 04 Nov 2024 16:11:55 -0800

Jeff King <peff@xxxxxxxx> writes:

> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
>> I disagree that all environments will prefer the --full-name-hash. I'm
>> currently repeating the performance tests right now, and I've added one.
>> The issues are:
>> 
>>  1. The --full-name-hash approach sometimes leads to a larger pack when
>>     using "git push" on the client, especially when the name-hash is
>>     already effective for compressing across paths.
>
> That's interesting. I wonder which cases get worse, and if a larger
> window size might help. I.e., presumably we are pushing the candidates
> further away in the sorted delta list.
>
>>  2. A depth 1 shallow clone cannot use previous versions of a path, so
>>     those situations will want to use the normal name hash. This can be
>>     accomplished simply by disabling the --full-name-hash option when
>>     the --shallow option is present; a more detailed version could be
>>     used to check for a large depth before disabling it. This case also
>>     disables bitmaps, so that isn't something to worry about.
>
> I'm not sure why a larger hash would be worse in a shallow clone. As you
> note, with only one version of each path the name-similarity heuristic
> is not likely to buy you much. But I'd have thought that would be true
> for the existing name hash as well as a longer one. Maybe this is the
> "over-emphasizing" case.

I too am curious to hear Derrick explain the above points and what
was learned from the performance tests.  The original hash was
designed to place files that are renamed across directories closer
to each other in the list sorted by the name hash, so a/Makefile and
b/Makefile would likely be treated as delta-base candidates while
foo/bar.c and bar/foo.c are treated as unrelated things.  A push
of a handful of commits that rename paths would likely place the
rename source of older commits and rename destination of newer
commits into the same delta chain, even with a smaller delta window.

In such a history, a hash that is distributed uniformly without
regard to renames is likely to split them into two distinct delta
chains, leading to less optimal delta-base selection.
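
To make the locality argument concrete, here is a minimal sketch,
assuming the original hash weighs the trailing characters of the path
most heavily, as pack_name_hash() does.  The names name_hash_sketch(),
full_name_hash_standin() and hash_distance() are made up for this
illustration, and the second function is plain FNV-1a used as a
stand-in for "a hash of the whole path"; it is not the function
proposed in the series.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of the original name hash: every new character pushes the
 * earlier ones two bits to the right, so the last characters of the
 * path dominate the value and the leading directory barely matters.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}

/*
 * Stand-in for a full-name hash (assumption: FNV-1a, 32-bit).  Every
 * byte of the path affects the whole value, so paths that merely look
 * similar get unrelated values.
 */
static uint32_t full_name_hash_standin(const char *name)
{
	uint32_t hash = 2166136261u;	/* FNV offset basis */
	unsigned char c;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		hash ^= c;
		hash *= 16777619u;	/* FNV prime */
	}
	return hash;
}

/* Absolute difference, i.e. how far apart two entries would sort. */
static uint32_t hash_distance(uint32_t a, uint32_t b)
{
	return a > b ? a - b : b - a;
}
```

With the sketch of the original hash, a/Makefile and b/Makefile end up
only 64 apart out of a 32-bit range, while foo/bar.c and bar/foo.c are
several hundred thousand apart.  The stand-in gives no such guarantee
for either pair.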

A whole-repository packing, or a large push or fetch, of the same
history with renamed files is affected a lot less by such negative
effects of full-name hash.  When generating a pack with more commits
than the "--window", use of the original hash would mean blobs from
paths that share similar names (e.g., "Makefile"s everywhere in the
directory hierarchy) are placed close to each other, but full-name
hash will likely group the blobs from exactly the same path and
nothing else together, and the resulting delta-chain for identical
(and not similar) paths would be sufficiently long.  A long delta
chain has to be broken into multiple chains _anyway_ due to finite
"--depth" setting, so placing blobs from each path into its own
(initial) delta chain, completely ignoring renamed paths, would
likely give us a long enough (initial) delta chain to be split at
the depth limit.

With full-name hash, that would lead to a good delta-base selection
quite efficiently, even with a smaller window size.

I think a full-name hash forces a single-commit pack of a wide tree
to give up on deltified blobs, but with the original hash, at least
similar and common files (e.g. Makefile and COPYING) would sit close
together in the delta queue and can be deltified with each other,
which may be where the inefficiency comes from when full-name hash
is used.
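
The single-commit case can be sketched in the same way.  The code
below is a simplified model and not what pack-objects does: it sorts
paths by a name hash alone (the real sort also looks at object type
and size) and counts how many direct neighbours share a basename.
The hash is repeated so the block stands alone; all function and type
names are hypothetical.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t (*hash_fn)(const char *);

struct entry {
	const char *path;
	uint32_t hash;
};

/* Same sketch of the original name hash: trailing characters dominate. */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}

/* Order by hash, falling back to the path to keep the sort stable. */
static int cmp_entry(const void *a, const void *b)
{
	const struct entry *x = a, *y = b;

	if (x->hash != y->hash)
		return x->hash < y->hash ? -1 : 1;
	return strcmp(x->path, y->path);
}

static const char *base_name(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Sort the paths by fn() and count neighbouring pairs with the same
 * basename, a crude proxy for "a plausible delta base is nearby".
 */
static size_t adjacent_same_basename(const char **paths, size_t n, hash_fn fn)
{
	struct entry *e = malloc(n * sizeof(*e));
	size_t i, count = 0;

	if (!e)
		return 0;
	for (i = 0; i < n; i++) {
		e[i].path = paths[i];
		e[i].hash = fn(paths[i]);
	}
	qsort(e, n, sizeof(*e), cmp_entry);
	for (i = 1; i < n; i++)
		if (!strcmp(base_name(e[i - 1].path), base_name(e[i].path)))
			count++;
	free(e);
	return count;
}
```

For the six paths {a,b,c}/Makefile and {a,b,c}/COPYING, the sketch of
the original hash puts the three COPYING entries next to each other,
followed by the three Makefile entries, giving four such pairs.
Passing a full-path hash as fn would show how many of those pairs
survive.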