On 12/2/24 10:23 PM, Junio C Hamano wrote:
"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
This series creates a mechanism to select alternative name hashes using a
new --name-hash-version=<n> option. The versions are:
1. Version 1 is the default name hash that already exists. This option
focuses on the final bytes of the path to maximize locality for
cross-path deltas.
2. Version 2 is the new path-component hash function suggested by Jonathan
Tan in the previous version (with some modifications). This hash
function essentially computes the v1 name hash of each path component
and then overlays those hashes with a shift to make the parent
directories contribute less to the final hash, but enough to break many
collisions that exist in v1.
3. Version 3 is the hash function that I submitted under the
--full-name-hash feature in the previous versions. This uses a
pseudorandom hash procedure to minimize collisions but at the expense of
losing on locality. This version is implemented in the final patch of
the series mostly for comparison purposes, as it is unlikely to be
selected as a valuable hash function over v2. The final patch could be
omitted from the merged version.
See the patches themselves for detailed results in the p5313-pack-objects.sh
performance test and the p5314-name-hash.sh test that demonstrates how many
collisions occur with each hash function.
These do not sound like versions but more like variants to me,
especially if one is expected to perform better than another in some
cases and worse in some other cases. Is it expected that JTan's hash
to perform better than the original and current hash in almost all
cases (I would not be surprised at all if that were the case)?
There are some cases, such as the Linux kernel repo, that have slightly
worse compression using JTan's hash. But the naming conventions in that
repo are such that the v1 name hash was already pretty effective for
that repo. The Git repository has similar issues. See Patch 5 for
detailed analysis of these scenarios using the p5313-pack-objects.sh
test script.
It may be possible to adapt some of the collision rate analysis from
the test helper in Patch 6 to create a tool that recommends or
dynamically selects the hash function that works best for the repo.
Such a feature should be delayed until this code is exercised in more
places.
Thanks,
-Stolee