On Fri, Dec 20, 2024 at 05:19:48PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <stolee@xxxxxxxxx> > > The previous change introduced a new pack_name_hash_v2() function that > intends to satisfy much of the hash locality features of the existing > pack_name_hash() function while also distinguishing paths with similar > final components of their paths. > > This change adds a new --name-hash-version option for 'git pack-objects' > to allow users to select their preferred function version. This use of > an integer version allows for future expansion and a direct way to later > store a name hash version in the .bitmap format. > > For now, let's consider how effective this mechanism is when repacking a > repository with different name hash versions. Specifically, we will > execute 'git pack-objects' the same way a 'git repack -adf' process > would, except we include --name-hash-version=<n> for testing. > > On the Git repository, we do not expect much difference. All path names > are short. This is backed by our results: > > | Stage | Pack Size | Repack Time | > |-----------------------|-----------|-------------| > | After clone | 260 MB | N/A | > | --name-hash-version=1 | 127 MB | 129s | > | --name-hash-version=2 | 127 MB | 112s | > > This example demonstrates how there is some natural overhead coming from > the cloned copy because the server is hosting many forks and has not > optimized for exactly this set of reachable objects. But the full repack > has similar characteristics for both versions. > > Let's consider some repositories that are hitting too many collisions > with version 1. First, let's explore the kinds of paths that are > commonly causing these collisions: > > * "/CHANGELOG.json" is 15 characters, and is created by the beachball > [1] tool. Only the final character of the parent directory can > differentiate different versions of this file, but also only the two > most-significant digits. If that character is a letter, then this is > always a collision. Similar issues occur with the similar > "/CHANGELOG.md" path, though there is more opportunity for > differences In the parent directory. > > * Localization files frequently have common filenames but > differentiates via parent directories. In C#, the name > "/strings.resx.lcl" is used for these localization files and they > will all collide in name-hash. > > [1] https://github.com/microsoft/beachball > > I've come across many other examples where some internal tool uses a > common name across multiple directories and is causing Git to repack > poorly due to name-hash collisions. > > One open-source example is the fluentui [2] repo, which uses beachball > to generate CHANGELOG.json and CHANGELOG.md files, and these files have > very poor delta characteristics when comparing against versions across > parent directories. > > | Stage | Pack Size | Repack Time | > |-----------------------|-----------|-------------| > | After clone | 694 MB | N/A | > | --name-hash-version=1 | 438 MB | 728s | > | --name-hash-version=2 | 168 MB | 142s | > > [2] https://github.com/microsoft/fluentui > > In this example, we see significant gains in the compressed packfile > size as well as the time taken to compute the packfile. > > Using a collection of repositories that use the beachball tool, I was > able to make similar comparisions with dramatic results. While the > fluentui repo is public, the others are private so cannot be shared for > reproduction. The results are so significant that I find it important to > share here: > > | Repo | --name-hash-version=1 | --name-hash-version=2 | > |----------|-----------------------|-----------------------| > | fluentui | 440 MB | 161 MB | > | Repo B | 6,248 MB | 856 MB | > | Repo C | 37,278 MB | 6,755 MB | > | Repo D | 131,204 MB | 7,463 MB | > > Future changes could include making --name-hash-version implied by a config > value or even implied by default during a full repack. > > It is important to point out that the name hash value is stored in the > .bitmap file format, so we must force --name-hash-version=1 when bitmaps > are being read or written. Later, the bitmap format could be updated to > be aware of the name hash version so deltas can be quickly computed > across the bitmapped/not-bitmapped boundary. > > Signed-off-by: Derrick Stolee <stolee@xxxxxxxxx> > --- > Documentation/git-pack-objects.txt | 32 ++++++++++++++++++- > builtin/pack-objects.c | 49 +++++++++++++++++++++++++++--- > t/t5300-pack-object.sh | 31 +++++++++++++++++++ > 3 files changed, 106 insertions(+), 6 deletions(-) > > diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt > index e32404c6aae..7f69ae4855f 100644 > --- a/Documentation/git-pack-objects.txt > +++ b/Documentation/git-pack-objects.txt > @@ -15,7 +15,8 @@ SYNOPSIS > [--revs [--unpacked | --all]] [--keep-pack=<pack-name>] > [--cruft] [--cruft-expiration=<time>] > [--stdout [--filter=<filter-spec>] | <base-name>] > - [--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list> > + [--shallow] [--keep-true-parents] [--[no-]sparse] > + [--name-hash-version=<n>] < <object-list> > > > DESCRIPTION > @@ -345,6 +346,35 @@ raise an error. > Restrict delta matches based on "islands". See DELTA ISLANDS > below. > > +--name-hash-version=<n>:: > + While performing delta compression, Git groups objects that may be > + similar based on heuristics using the path to that object. While > + grouping objects by an exact path match is good for paths with > + many versions, there are benefits for finding delta pairs across > + different full paths. Git collects objects by type and then by a > + "name hash" of the path and then by size, hoping to group objects > + that will compress well together. > ++ > +The default name hash version is `1`, which prioritizes hash locality by > +considering the final bytes of the path as providing the maximum magnitude > +to the hash function. This version excels at distinguishing short paths > +and finding renames across directories. However, the hash function depends > +primarily on the final 16 bytes of the path. If there are many paths in > +the repo that have the same final 16 bytes and differ only by parent > +directory, then this name-hash may lead to too many collisions and cause > +poor results. At the moment, this version is required when writing > +reachability bitmap files with `--write-bitmap-index`. > ++ > +The name hash version `2` has similar locality features as version `1`, > +except it considers each path component separately and overlays the hashes > +with a shift. This still prioritizes the final bytes of the path, but also > +"salts" the lower bits of the hash using the parent directory names. This > +method allows for some of the locality benefits of version `1` while > +breaking most of the collisions from a similarly-named file appearing in > +many different directories. At the moment, this version is not allowed > +when writing reachability bitmap files with `--write-bitmap-index` and it > +will be automatically changed to version `1`. > + > > DELTA ISLANDS > ------------- > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 08007142671..90ea19417bc 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -266,6 +266,28 @@ struct configured_exclusion { > static struct oidmap configured_exclusions; > > static struct oidset excluded_by_config; > +static int name_hash_version = 1; > + > +static void validate_name_hash_version(void) > +{ > + if (name_hash_version < 1 || name_hash_version > 2) > + die(_("invalid --name-hash-version option: %d"), name_hash_version); > +} > + > +static inline uint32_t pack_name_hash_fn(const char *name) > +{ > + switch (name_hash_version) > + { This is definitely a nitpick, but the opening curly brace should appear on the preceding line. I don't think our CodingGuidelines explicitly say this. But in my head the convention is that only function bodies have their enclosing braces on their own line, as opposed to: static inline uint32_t pack_name_hash_fn(const char *name) { > + case 1: > + return pack_name_hash(name); > + > + case 2: > + return pack_name_hash_v2((const unsigned char *)name); > + > + default: > + BUG("invalid name-hash version: %d", name_hash_version); > + } > +} Otherwise this function looks very reasonable. > @@ -4576,6 +4609,12 @@ int cmd_pack_objects(int argc, > if (pack_to_stdout || !rev_list_all) > write_bitmap_index = 0; > > + validate_name_hash_version(); > + if (write_bitmap_index && name_hash_version != 1) { > + warning(_("currently, --write-bitmap-index requires --name-hash-version=1")); > + name_hash_version = 1; > + } > + Hmm. validate_name_hash_version() is its own function, which seems good, but then we do further validation on it here. I wonder if we should either move the latter step into validate_name_hash_version(), which would require us to pass a pointer to name_hash_version, or assign it the return value of validate_name_hash_version() (assuming it were changed to return the appropriate type instead of void). I think that we are probably pretty far into bike-shedding territory, but figured I'd share as it jumped out to me while reviewing. > if (use_delta_islands) > strvec_push(&rp, "--topo-order"); > > diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh > index 3b9dae331a5..4270eabe8b7 100755 > --- a/t/t5300-pack-object.sh > +++ b/t/t5300-pack-object.sh > @@ -674,4 +674,35 @@ do > ' > done > > +test_expect_success 'valid and invalid --name-hash-versions' ' > + # Valid values are hard to verify other than "do not fail". > + # Performance tests will be more valuable to validate these versions. > + for value in 1 2 > + do > + git pack-objects base --all --name-hash-version=$value || return 1 > + done && > + > + # Invalid values have clear post-conditions. > + for value in -1 0 3 > + do > + test_must_fail git pack-objects base --all --name-hash-version=$value 2>err && > + test_grep "invalid --name-hash-version option" err || return 1 > + done > +' Looks great, and thanks for handling a few nonsensical values in the latter loop. > +# The following test is not necessarily a permanent choice, but since we do not > +# have a "name hash version" bit in the .bitmap file format, we cannot write the > +# hash values into the .bitmap file without risking breakage later. > +# > +# TODO: Make these compatible in the future and replace this test with the > +# expected behavior when both are specified. > +test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompatible' ' > + git pack-objects base --all --name-hash-version=2 --write-bitmap-index 2>err && > + test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err && > + > + # --stdout option silently removes --write-bitmap-index > + git pack-objects --stdout --all --name-hash-version=2 --write-bitmap-index >out 2>err && > + ! test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err > +' This is grea, too, I appreciate having the reminder here for the future (and it'll feel so satisfying to change/remove this test when the time comes) ;-). Thanks, Taylor