Re: [PATCH v2 0/8] pack-objects: Create an alternative name hash algorithm (recreated)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 12/2/24 10:23 PM, Junio C Hamano wrote:
"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

This series creates a mechanism to select alternative name hashes using a
new --name-hash-version=<n> option. The versions are:

  1. Version 1 is the default name hash that already exists. This option
     focuses on the final bytes of the path to maximize locality for
     cross-path deltas.

  2. Version 2 is the new path-component hash function suggested by Jonathan
     Tan in the previous version (with some modifications). This hash
     function essentially computes the v1 name hash of each path component
     and then overlays those hashes with a shift to make the parent
     directories contribute less to the final hash, but enough to break many
     collisions that exist in v1.

  3. Version 3 is the hash function that I submitted under the
     --full-name-hash feature in the previous versions. This uses a
     pseudorandom hash procedure to minimize collisions but at the expense of
     losing on locality. This version is implemented in the final patch of
     the series mostly for comparison purposes, as it is unlikely to be
     selected as a valuable hash function over v2. The final patch could be
     omitted from the merged version.

See the patches themselves for detailed results in the p5313-pack-objects.sh
performance test and the p5314-name-hash.sh test that demonstrates how many
collisions occur with each hash function.

These do not sound like versions but more like variants to me,
especially if one is expected to perform better than another in some
cases and worse in some other cases.  Is it expected that JTan's hash
to perform better than the original and current hash in almost all
cases (I would not be surprised at all if that were the case)?

There are some cases, such as the Linux kernel repo, that have slightly
worse compression using JTan's hash. But the naming conventions in that
repo are such that the v1 name hash was already pretty effective for
that repo. The Git repository has similar issues. See Patch 5 for
detailed analysis of these scenarios using the p5313-pack-objects.sh
test script.

It may be possible to adapt some of the collision rate analysis from
the test helper in Patch 6 to create a tool that recommends or
dynamically selects the hash function that works best for the repo.
Such a feature should be delayed until this code is exercised in more
places.

Thanks,
-Stolee






[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux