Re: [PATCH v3 2/8] pack-objects: add --name-hash-version option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Dec 20, 2024 at 05:19:48PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@xxxxxxxxx>
>
> The previous change introduced a new pack_name_hash_v2() function that
> intends to satisfy much of the hash locality features of the existing
> pack_name_hash() function while also distinguishing paths with similar
> final components of their paths.
>
> This change adds a new --name-hash-version option for 'git pack-objects'
> to allow users to select their preferred function version. This use of
> an integer version allows for future expansion and a direct way to later
> store a name hash version in the .bitmap format.
>
> For now, let's consider how effective this mechanism is when repacking a
> repository with different name hash versions. Specifically, we will
> execute 'git pack-objects' the same way a 'git repack -adf' process
> would, except we include --name-hash-version=<n> for testing.
>
> On the Git repository, we do not expect much difference. All path names
> are short. This is backed by our results:
>
> | Stage                 | Pack Size | Repack Time |
> |-----------------------|-----------|-------------|
> | After clone           | 260 MB    | N/A         |
> | --name-hash-version=1 | 127 MB    | 129s        |
> | --name-hash-version=2 | 127 MB    | 112s        |
>
> This example demonstrates how there is some natural overhead coming from
> the cloned copy because the server is hosting many forks and has not
> optimized for exactly this set of reachable objects. But the full repack
> has similar characteristics for both versions.
>
> Let's consider some repositories that are hitting too many collisions
> with version 1. First, let's explore the kinds of paths that are
> commonly causing these collisions:
>
>  * "/CHANGELOG.json" is 15 characters, and is created by the beachball
>    [1] tool. Only the final character of the parent directory can
>    differentiate different versions of this file, but also only the two
>    most-significant digits. If that character is a letter, then this is
>    always a collision. Similar issues occur with the similar
>    "/CHANGELOG.md" path, though there is more opportunity for
>    differences In the parent directory.
>
>  * Localization files frequently have common filenames but
>    differentiates via parent directories. In C#, the name
>    "/strings.resx.lcl" is used for these localization files and they
>    will all collide in name-hash.
>
> [1] https://github.com/microsoft/beachball
>
> I've come across many other examples where some internal tool uses a
> common name across multiple directories and is causing Git to repack
> poorly due to name-hash collisions.
>
> One open-source example is the fluentui [2] repo, which  uses beachball
> to generate CHANGELOG.json and CHANGELOG.md files, and these files have
> very poor delta characteristics when comparing against versions across
> parent directories.
>
> | Stage                 | Pack Size | Repack Time |
> |-----------------------|-----------|-------------|
> | After clone           | 694 MB    | N/A         |
> | --name-hash-version=1 | 438 MB    | 728s        |
> | --name-hash-version=2 | 168 MB    | 142s        |
>
> [2] https://github.com/microsoft/fluentui
>
> In this example, we see significant gains in the compressed packfile
> size as well as the time taken to compute the packfile.
>
> Using a collection of repositories that use the beachball tool, I was
> able to make similar comparisions with dramatic results. While the
> fluentui repo is public, the others are private so cannot be shared for
> reproduction. The results are so significant that I find it important to
> share here:
>
> | Repo     | --name-hash-version=1 | --name-hash-version=2 |
> |----------|-----------------------|-----------------------|
> | fluentui |               440 MB  |               161 MB  |
> | Repo B   |             6,248 MB  |               856 MB  |
> | Repo C   |            37,278 MB  |             6,755 MB  |
> | Repo D   |           131,204 MB  |             7,463 MB  |
>
> Future changes could include making --name-hash-version implied by a config
> value or even implied by default during a full repack.
>
> It is important to point out that the name hash value is stored in the
> .bitmap file format, so we must force --name-hash-version=1 when bitmaps
> are being read or written. Later, the bitmap format could be updated to
> be aware of the name hash version so deltas can be quickly computed
> across the bitmapped/not-bitmapped boundary.
>
> Signed-off-by: Derrick Stolee <stolee@xxxxxxxxx>
> ---
>  Documentation/git-pack-objects.txt | 32 ++++++++++++++++++-
>  builtin/pack-objects.c             | 49 +++++++++++++++++++++++++++---
>  t/t5300-pack-object.sh             | 31 +++++++++++++++++++
>  3 files changed, 106 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
> index e32404c6aae..7f69ae4855f 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -15,7 +15,8 @@ SYNOPSIS
>  	[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
>  	[--cruft] [--cruft-expiration=<time>]
>  	[--stdout [--filter=<filter-spec>] | <base-name>]
> -	[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
> +	[--shallow] [--keep-true-parents] [--[no-]sparse]
> +	[--name-hash-version=<n>] < <object-list>
>
>
>  DESCRIPTION
> @@ -345,6 +346,35 @@ raise an error.
>  	Restrict delta matches based on "islands". See DELTA ISLANDS
>  	below.
>
> +--name-hash-version=<n>::
> +	While performing delta compression, Git groups objects that may be
> +	similar based on heuristics using the path to that object. While
> +	grouping objects by an exact path match is good for paths with
> +	many versions, there are benefits for finding delta pairs across
> +	different full paths. Git collects objects by type and then by a
> +	"name hash" of the path and then by size, hoping to group objects
> +	that will compress well together.
> ++
> +The default name hash version is `1`, which prioritizes hash locality by
> +considering the final bytes of the path as providing the maximum magnitude
> +to the hash function. This version excels at distinguishing short paths
> +and finding renames across directories. However, the hash function depends
> +primarily on the final 16 bytes of the path. If there are many paths in
> +the repo that have the same final 16 bytes and differ only by parent
> +directory, then this name-hash may lead to too many collisions and cause
> +poor results. At the moment, this version is required when writing
> +reachability bitmap files with `--write-bitmap-index`.
> ++
> +The name hash version `2` has similar locality features as version `1`,
> +except it considers each path component separately and overlays the hashes
> +with a shift. This still prioritizes the final bytes of the path, but also
> +"salts" the lower bits of the hash using the parent directory names. This
> +method allows for some of the locality benefits of version `1` while
> +breaking most of the collisions from a similarly-named file appearing in
> +many different directories. At the moment, this version is not allowed
> +when writing reachability bitmap files with `--write-bitmap-index` and it
> +will be automatically changed to version `1`.
> +
>
>  DELTA ISLANDS
>  -------------
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 08007142671..90ea19417bc 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -266,6 +266,28 @@ struct configured_exclusion {
>  static struct oidmap configured_exclusions;
>
>  static struct oidset excluded_by_config;
> +static int name_hash_version = 1;
> +
> +static void validate_name_hash_version(void)
> +{
> +	if (name_hash_version < 1 || name_hash_version > 2)
> +		die(_("invalid --name-hash-version option: %d"), name_hash_version);
> +}
> +
> +static inline uint32_t pack_name_hash_fn(const char *name)
> +{
> +	switch (name_hash_version)
> +	{

This is definitely a nitpick, but the opening curly brace should appear
on the preceding line. I don't think our CodingGuidelines explicitly say
this. But in my head the convention is that only function bodies have
their enclosing braces on their own line, as opposed to:

    static inline uint32_t pack_name_hash_fn(const char *name) {

> +	case 1:
> +		return pack_name_hash(name);
> +
> +	case 2:
> +		return pack_name_hash_v2((const unsigned char *)name);
> +
> +	default:
> +		BUG("invalid name-hash version: %d", name_hash_version);
> +	}
> +}

Otherwise this function looks very reasonable.

> @@ -4576,6 +4609,12 @@ int cmd_pack_objects(int argc,
>  	if (pack_to_stdout || !rev_list_all)
>  		write_bitmap_index = 0;
>
> +	validate_name_hash_version();
> +	if (write_bitmap_index && name_hash_version != 1) {
> +		warning(_("currently, --write-bitmap-index requires --name-hash-version=1"));
> +		name_hash_version = 1;
> +	}
> +

Hmm. validate_name_hash_version() is its own function, which seems good,
but then we do further validation on it here. I wonder if we should
either move the latter step into validate_name_hash_version(), which
would require us to pass a pointer to name_hash_version, or assign it
the return value of validate_name_hash_version() (assuming it were
changed to return the appropriate type instead of void).

I think that we are probably pretty far into bike-shedding territory,
but figured I'd share as it jumped out to me while reviewing.

>  	if (use_delta_islands)
>  		strvec_push(&rp, "--topo-order");
>
> diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
> index 3b9dae331a5..4270eabe8b7 100755
> --- a/t/t5300-pack-object.sh
> +++ b/t/t5300-pack-object.sh
> @@ -674,4 +674,35 @@ do
>  	'
>  done
>
> +test_expect_success 'valid and invalid --name-hash-versions' '
> +	# Valid values are hard to verify other than "do not fail".
> +	# Performance tests will be more valuable to validate these versions.
> +	for value in 1 2
> +	do
> +		git pack-objects base --all --name-hash-version=$value || return 1
> +	done &&
> +
> +	# Invalid values have clear post-conditions.
> +	for value in -1 0 3
> +	do
> +		test_must_fail git pack-objects base --all --name-hash-version=$value 2>err &&
> +		test_grep "invalid --name-hash-version option" err || return 1
> +	done
> +'

Looks great, and thanks for handling a few nonsensical values in the
latter loop.

> +# The following test is not necessarily a permanent choice, but since we do not
> +# have a "name hash version" bit in the .bitmap file format, we cannot write the
> +# hash values into the .bitmap file without risking breakage later.
> +#
> +# TODO: Make these compatible in the future and replace this test with the
> +# expected behavior when both are specified.
> +test_expect_success '--name-hash-version=2 and --write-bitmap-index are incompatible' '
> +	git pack-objects base --all --name-hash-version=2 --write-bitmap-index 2>err &&
> +	test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err &&
> +
> +	# --stdout option silently removes --write-bitmap-index
> +	git pack-objects --stdout --all --name-hash-version=2 --write-bitmap-index >out 2>err &&
> +	! test_grep "currently, --write-bitmap-index requires --name-hash-version=1" err
> +'

This is grea, too, I appreciate having the reminder here for the future
(and it'll feel so satisfying to change/remove this test when the time
comes) ;-).

Thanks,
Taylor




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux