On 12/20/24 12:19 PM, Derrick Stolee via GitGitGadget wrote:
This is a recreation of the topic in [1] that was closed. (I force-pushed my branch and GitHub won't let me reopen the PR for GitGitGadget to create this as v3.)

[1] https://lore.kernel.org/git/pull.1785.v2.git.1726692381.gitgitgadget@xxxxxxxxx/

I've been focused recently on understanding and mitigating the growth of a few internal repositories. Some of these are growing much larger than expected for the number of contributors, and there are multiple aspects to why this growth is so large.
The main issue plaguing these repositories is that deltas are not being computed against objects that appear at the same path. While the size of these files at tip is one aspect of growth that would prevent this issue, the changes to these files are reasonable and should result in good delta compression. However, Git is not discovering the connections across different versions of the same file.
This series creates a mechanism to select alternative name hashes using a new --name-hash-version=<n> option. The versions are:

 1. Version 1 is the default name hash that already exists. This option focuses on the final bytes of the path to maximize locality for cross-path deltas.

 2. Version 2 is the new path-component hash function suggested by Jonathan Tan in the previous version (with some modifications). This hash function essentially computes the v1 name hash of each path component and then overlays those hashes with a shift to make the parent directories contribute less to the final hash, but enough to break many collisions that exist in v1.

 3. Version 3 is the hash function that I submitted under the --full-name-hash feature in the previous versions. This uses a pseudorandom hash procedure to minimize collisions but at the expense of losing on locality. This version is implemented in the final patch of the series mostly for comparison purposes, as it is unlikely to be selected as a valuable hash function over v2. The final patch could be omitted from the merged version.
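To make the difference between versions 1 and 2 concrete, here is a minimal sketch of the two ideas as described above. It is not the code from the patches: the function names name_hash_v1_sketch() and name_hash_v2_sketch() are hypothetical, the shift amounts (2, 24 and 6) are illustrative assumptions, and the "modifications" mentioned for version 2 are left out.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of the v1 idea (hypothetical name, assumed shift amounts):
 * every non-whitespace byte is placed in the top bits and older bytes
 * are shifted toward the bottom, so the final bytes of the path
 * dominate the result.
 */
static uint32_t name_hash_v1_sketch(const char *path)
{
	uint32_t hash = 0;
	unsigned char c;

	while ((c = (unsigned char)*path++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}

/*
 * Sketch of the v2 idea (hypothetical name, assumed shift amounts):
 * hash each path component the v1 way, and at every '/' fold the
 * finished component into 'base' with an extra right shift, so parent
 * directories contribute fewer bits than the file name itself.
 */
static uint32_t name_hash_v2_sketch(const char *path)
{
	uint32_t hash = 0, base = 0;
	unsigned char c;

	while ((c = (unsigned char)*path++) != 0) {
		if (isspace(c))
			continue;
		if (c == '/') {
			/* Component finished: overlay it, then start over. */
			base = (base >> 6) ^ hash;
			hash = 0;
		} else {
			hash = (hash >> 2) + ((uint32_t)c << 24);
		}
	}
	return (base >> 6) ^ hash;
}
```

With these definitions, "a/package.json" and "b/package.json" get the same value from the v1 sketch, because the leading directory byte is shifted out by the time the file name has been consumed, while the v2 sketch keeps them apart. A path with no '/' hashes identically under both sketches.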
This series has been at this version for a while. I'm pretty sure that this is the most promising direction we have at the moment for improving delta compression for many users. The only decision point that I think remains is whether or not to include the last patch (--name-hash-version=3); I would be happy either way.

Thanks,
-Stolee