Re: [PATCH 0/6] PATH WALK I: The path-walk API

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 8 Nov 2024 10:17:24 -0500

On 11/4/24 7:11 PM, Junio C Hamano wrote:
Jeff King <peff@xxxxxxxx> writes:

On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
I disagree that all environments will prefer the --full-name-hash. I'm
currently repeating the performance tests right now, and I've added one.
The issues are:

  1. The --full-name-hash approach sometimes leads to a larger pack when
     using "git push" on the client, especially when the name-hash is
     already effective for compressing across paths.

That's interesting. I wonder which cases get worse, and if a larger
window size might help. I.e., presumably we are pushing the candidates
further away in the sorted delta list.

I think the cases that make things get worse with --full-name-hash are:

  1. The presence of renames, partitioning objects that used to fit into
     the same bucket (in the case of directory renames).

  2. Some standard kinds of files may appear several times across the
     tree but do not change very often and are similar across path.

  3. Common patterns across similar file types, such as similar includes
     in .c and .h files or other kinds of boilerplate in different
     languages.

A larger window size will always expand the range of possible deltas, at
a cost to the time taken to compute the deltas. My first experiment in
these repositories was to increase the window size to 250 (from the default
10). This caused a very slow repack, but the repositories shrunk.

For example, the big Javascript monorepo that repacked to ~100 GB with
default settings would repack to ~30 GB with --window=250. This was an
indicator that delta compression would work if we can find the right pairs
to use for deltas.

The point of the two strategies (--full-name-hash and --path-walk) is
about putting objects close to each other in better ways than the name
hash sort.

  2. A depth 1 shallow clone cannot use previous versions of a path, so
     those situations will want to use the normal name hash. This can be
     accomplished simply by disabling the --full-name-hash option when
     the --shallow option is present; a more detailed version could be
     used to check for a large depth before disabling it. This case also
     disables bitmaps, so that isn't something to worry about.

I'm not sure why a larger hash would be worse in a shallow clone. As you
note, with only one version of each path the name-similarity heuristic
is not likely to buy you much. But I'd have thought that would be true
for the existing name hash as well as a longer one. Maybe this is the
"over-emphasizing" case.

I'm confused by your wording of "larger hash" because the hash size
is the exact same: 32 bits. It's just that the --full-name-hash option
has fewer collisions by abandoning the hope of locality.

In a depth 1 shallow clone, there are no repeated paths, so any hash
collisions are true collisions instead of good candidates for deltas.
The full name hash is essentially random, so the delta compression
algorithm basically says:

  1. Sort by type.
  2. Within each type, sort the objects randomly.

With that sort, the delta compression scan is less effective than the
standard name hash.

I too am curious to hear Derrick explain the above points and what
was learned from the performance tests.  The original hash was
designed to place files that are renamed across directories closer
to each other in the list sorted by the name hash, so a/Makefile and
b/Makefile would likely be treated as delta-base candidates while
foo/bar.c and bar/foo.c are treated as unrelated things.  A push
of a handful of commits that rename paths would likely place the
rename source of older commits and rename destination of newer
commits into the same delta chain, even with a smaller delta window.

In such a history, uniformly-distributed-without-regard-to-renames
hash is likely to make them into two distinct delta chains, leading
to less optimal delta-base selection.

Yes. This is the downside of the --full-name-hash compared to the
standard name hash. When repacking an entire repository, the effect
of these renames is typically not important in the long run as it's
basically a single break in the delta chain. The downside comes in
when doing a small fetch or push where the rename has more impact.

The --path-walk approach does not suffer from this problem because
it has a second pass that sorts by the name hash and looks for
better deltas than the ones that already exist. Thus, it gets the
best of both worlds.

The performance impact of the two passes of the --path-walk
approach is interesting, as you'd typically expect this to always
be slower. However:

 1. The delta compression within each batch only compares the
    objects within that batch. We do not compare these objects to
    unrelated objects, which can be expensive and wasteful. This
    also means that small batches may even be smaller than the
    delta window, reducing the number of comparisons.

 2. In the second pass, the delta calculation can short-circuit if
    the computed delta would be larger than the current-best delta.
    Thus, the good deltas from the first pass make the second pass
    faster.

A whole-repository packing, or a large push or fetch, of the same
history with renamed files are affected a lot less by such negative
effects of full-name hash.  When generating a pack with more commits
than the "--window", use of the original hash would mean blobs from
paths that share similar names (e.g., "Makefile"s everywhere in the
directory hierarchy) are placed close to each other, but full-name
hash will likely group the blobs from exactly the same path and
nothing else together, and the resulting delta-chain for identical
(and not similar) paths would be sufficiently long.  A long delta
chain has to be broken into multiple chains _anyway_ due to finite
"--depth" setting, so placing blobs from each path into its own
(initial) delta chain, completely ignoring renamed paths, would
likely to give us long enough (initial) delta chain to be split at
the depth limit.

It would lead to a good delta-base selection with smaller window
size quite efficiently with full-name hash.>
I think a full-name hash forces a single-commit pack of a wide tree
to give up on deltified blobs, but with the original hash, at least
similar and common files (e.g. Makefile and COPYING) would sit close
together in the delta queue and can be deltified with each other,
which may be where the inefficiency comes from when full-name hash
is used.

Yes, this is a good summary of why this works for the data
efficiency in long histories. Your earlier observations are why
the full-name hash has demonstrated issues with smaller time scales.

These numbers are carefully detailed in the performance tests in
the refreshed series [1]. The series also has a way to disable the
full-name hash when serving a shallow clone for this reason.

[1] https://lore.kernel.org/git/pull.1823.git.1730775907.gitgitgadget@xxxxxxxxx/

Thanks,
-Stolee