Jeff King <peff@xxxxxxxx> writes:

> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
>> I disagree that all environments will prefer the --full-name-hash. I'm
>> currently repeating the performance tests right now, and I've added one.
>> The issues are:
>>
>> 1. The --full-name-hash approach sometimes leads to a larger pack when
>>    using "git push" on the client, especially when the name-hash is
>>    already effective for compressing across paths.
>
> That's interesting. I wonder which cases get worse, and if a larger
> window size might help. I.e., presumably we are pushing the candidates
> further away in the sorted delta list.
>
>> 2. A depth 1 shallow clone cannot use previous versions of a path, so
>>    those situations will want to use the normal name hash. This can be
>>    accomplished simply by disabling the --full-name-hash option when
>>    the --shallow option is present; a more detailed version could be
>>    used to check for a large depth before disabling it. This case also
>>    disables bitmaps, so that isn't something to worry about.
>
> I'm not sure why a larger hash would be worse in a shallow clone. As you
> note, with only one version of each path the name-similarity heuristic
> is not likely to buy you much. But I'd have thought that would be true
> for the existing name hash as well as a longer one. Maybe this is the
> "over-emphasizing" case.

I too am curious to hear Derrick explain the above points and what
was learned from the performance tests.

The original hash was designed to place files that are renamed
across directories closer to each other in the list sorted by the
name hash, so a/Makefile and b/Makefile would likely be treated as
delta-base candidates while foo/bar.c and bar/foo.c are treated as
unrelated things.  A push of a handful of commits that rename paths
would likely place the rename source of older commits and rename
destination of newer commits into the same delta chain, even with a
smaller delta window.
In such a history, a uniformly-distributed-without-regard-to-renames
hash is likely to split them into two distinct delta chains, leading
to less optimal delta-base selection.

A whole-repository packing, or a large push or fetch, of the same
history with renamed files is affected a lot less by such negative
effects of the full-name hash.  When generating a pack with more
commits than the "--window", use of the original hash would mean
blobs from paths that share similar names (e.g., "Makefile"s
everywhere in the directory hierarchy) are placed close to each
other, but the full-name hash will likely group the blobs from
exactly the same path and nothing else together, and the resulting
delta chain for identical (and not merely similar) paths would be
sufficiently long.  A long delta chain has to be broken into multiple
chains _anyway_ due to the finite "--depth" setting, so placing blobs
from each path into its own (initial) delta chain, completely
ignoring renamed paths, would likely give us a long enough (initial)
delta chain to be split at the depth limit.  With the full-name hash,
that would lead to a good delta-base selection quite efficiently,
even with a smaller window size.

I think a full-name hash forces a single-commit pack of a wide tree
to give up on deltified blobs, but with the original hash, at least
similar and common files (e.g. Makefile and COPYING) would sit close
together in the delta queue and can be deltified with each other,
which may be where the inefficiency comes from when the full-name
hash is used.
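
To make the a/Makefile vs. foo/bar.c point above concrete, here is a
rough Python sketch, not the actual C code.  pack_name_hash() below
mimics the shift-and-add scheme of the original name hash (treat the
exact constants as an assumption), while full_name_hash_stand_in() is
purely hypothetical: it uses CRC-32 as a placeholder for "some hash
that spreads whole paths evenly" and is not what --full-name-hash
computes.  PATHS and sorted_by() are made up for illustration.

```python
import zlib


def pack_name_hash(name: str) -> int:
    """Sketch of the original name hash: later characters weigh the most."""
    h = 0
    for c in name.encode():
        if c in b" \t\n\v\f\r":  # whitespace is skipped
            continue
        # Shift older characters down two bits and put the new one in
        # the top byte, so the tail of the path dominates the value.
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


def full_name_hash_stand_in(name: str) -> int:
    """NOT git's full-name hash; CRC-32 is only a placeholder for a
    hash that distributes whole paths without regard to their tails."""
    return zlib.crc32(name.encode())


def sorted_by(hash_fn, paths):
    """Order paths the way a hash-sorted delta candidate list would."""
    return sorted(paths, key=hash_fn)


# A hypothetical single-commit "wide tree": one blob per path.
PATHS = [
    "a/Makefile", "b/Makefile", "c/Makefile",
    "a/main.c", "b/util.c", "c/parse.c",
    "COPYING", "README",
]

if __name__ == "__main__":
    for label, fn in (("name hash", pack_name_hash),
                      ("stand-in full-name hash", full_name_hash_stand_in)):
        print(label)
        for path in sorted_by(fn, PATHS):
            print(f"  {fn(path):08x}  {path}")
```

Under the first ordering the three Makefile entries come out adjacent
(a/Makefile and b/Makefile differ by only 64 in hash value), so even
a small window sees them as delta-base candidates for each other.
Under the stand-in ordering there is no reason for them to be
neighbours, which is the single-commit, wide-tree situation described
above.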