Re: [PATCH v3 0/8] pack-objects: Create an alternative name hash algorithm (recreated)

On 12/20/24 12:19 PM, Derrick Stolee via GitGitGadget wrote:
This is a recreation of the topic in [1] that was closed. (I force-pushed my
branch and GitHub won't let me reopen the PR for GitGitGadget to create this
as v3.)

[1]
https://lore.kernel.org/git/pull.1785.v2.git.1726692381.gitgitgadget@xxxxxxxxx/

I've been focused recently on understanding and mitigating the growth of a
few internal repositories. Some of these are growing much larger than
expected for the number of contributors, and there are multiple aspects to
why this growth is so large.

The main issue plaguing these repositories is that deltas are not being
computed against objects that appear at the same path. The size of these
files at tip is one aspect of the growth, but the changes to these files
are reasonable and should result in good delta compression. However, Git
is not discovering the connections across different versions of the same
file.

This series creates a mechanism to select alternative name hashes using a
new --name-hash-version=<n> option. The versions are:

  1. Version 1 is the default name hash that already exists. This option
     focuses on the final bytes of the path to maximize locality for
     cross-path deltas.

  2. Version 2 is the new path-component hash function suggested by Jonathan
     Tan in the previous version (with some modifications). This hash
     function essentially computes the v1 name hash of each path component
     and then overlays those hashes with a shift to make the parent
     directories contribute less to the final hash, but enough to break many
     collisions that exist in v1.

  3. Version 3 is the hash function that I submitted under the
     --full-name-hash feature in the previous versions. This uses a
     pseudorandom hash procedure to minimize collisions, at the expense of
     locality. This version is implemented in the final patch of the
     series mostly for comparison purposes, as it is unlikely to be
     selected as a valuable hash function over v2. The final patch could be
     omitted from the merged version.

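
To make the first two versions in the list above concrete, here is a
minimal Python sketch. It is not the code from the patches:
name_hash_v1 follows the "final bytes count most" idea described for
version 1, and name_hash_v2_sketch only illustrates the "hash each path
component, then overlay with a shift" idea described for version 2. The
function names, the whitespace handling, and the parent_shift value of 6
are assumptions made for illustration.

```python
WHITESPACE = b" \t\n\r\v\f"


def name_hash_v1(path: str) -> int:
    """Hash weighted toward the final bytes of the path (32-bit result)."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in WHITESPACE:
            continue  # assumption: whitespace does not contribute
        # Older bytes slide toward the low bits; the newest byte lands on top.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


def name_hash_v2_sketch(path: str, parent_shift: int = 6) -> int:
    """Overlay per-component v1 hashes so parent directories count less."""
    combined = 0
    for component in path.split("/"):
        # Shift down what the parents contributed, then mix in this component.
        combined = (combined >> parent_shift) ^ name_hash_v1(component)
    return combined


if __name__ == "__main__":
    # Hypothetical paths, chosen only to compare the two sketches side by side.
    for p in ("src/a/Makefile", "src/b/Makefile", "docs/Makefile"):
        print(f"{p:16} v1={name_hash_v1(p):08x} v2={name_hash_v2_sketch(p):08x}")
```

In this sketch the top six bits of the v2 value come only from the final
path component (with parent_shift=6), so files with the same basename
still land near each other, while the shifted parent hashes perturb the
lower bits and separate paths that differ only in their directories.
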
This series has been at this version for a while. I'm pretty sure that this
is the most promising direction we have at the moment for improving delta
compression for many users.

The only decision point that I think remains is whether or not to include
the last patch (--name-hash-version=3). I would be happy either way.

Thanks,
-Stolee



