Re: [PATCH 02/13] pack-objects: add --path-walk option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Mar 10, 2025 at 01:50:44AM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@xxxxxxxxx>
>
> In order to more easily compute delta bases among objects that appear at
> the exact same path, add a --path-walk option to 'git pack-objects'.
>
> This option will use the path-walk API instead of the object walk given
> by the revision machinery. Since objects will be provided in batches
> representing a common path, those objects can be tested for delta bases
> immediately instead of waiting for a sort of the full object list by
> name-hash. This has multiple benefits, including avoiding collisions by
> name-hash.
>
> The objects marked as UNINTERESTING are included in these batches, so we
> are guaranteeing some locality to find good delta bases.
>
> After the individual passes are done on a per-path basis, the default
> name-hash is used to find other opportunistic delta bases that did not
> match exactly by the full path name.
>
> The current implementation performs delta calculations while walking
> objects, which is not ideal for a few reasons. First, this will cause
> the "Enumerating objects" phase to be much longer than usual. Second, it
> does not take advantage of threading during the path-scoped delta
> calculations. Even with this lack of threading, the path-walk option is
> sometimes faster than the usual approach. Future changes will refactor
> this code to allow for threading, but that complexity is deferred until
> later to keep this patch as simple as possible.
>
> This new walk is incompatible with some features and is ignored by
> others:
>
>  * Object filters are not currently integrated with the path-walk API,
>    such as sparse-checkout or tree depth. A blobless packfile could be
>    integrated easily, but that is deferred for later.
>
>  * Server-focused features such as delta islands, shallow packs, and
>    using a bitmap index are incompatible with the path-walk API.
>
>  * The path walk API is only compatible with the --revs option, not
>    taking object lists or pack lists over stdin. These alternative ways
>    to specify the objects currently ignores the --path-walk option
>    without even a warning.
>
> Future changes will create performance tests that demonstrate the power
> of this approach.
>
> Signed-off-by: Derrick Stolee <stolee@xxxxxxxxx>
> ---
>  Documentation/git-pack-objects.adoc        |  13 +-
>  Documentation/technical/api-path-walk.adoc |   1 +
>  builtin/pack-objects.c                     | 147 +++++++++++++++++++--
>  t/t5300-pack-object.sh                     |  15 +++
>  4 files changed, 166 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc
> index 7f69ae4855f..7dbbe6d54d2 100644
> --- a/Documentation/git-pack-objects.adoc
> +++ b/Documentation/git-pack-objects.adoc
> @@ -16,7 +16,7 @@ SYNOPSIS
>  	[--cruft] [--cruft-expiration=<time>]
>  	[--stdout [--filter=<filter-spec>] | <base-name>]
>  	[--shallow] [--keep-true-parents] [--[no-]sparse]
> -	[--name-hash-version=<n>] < <object-list>
> +	[--name-hash-version=<n>] [--path-walk] < <object-list>
>
>
>  DESCRIPTION
> @@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed
>  when writing reachability bitmap files with `--write-bitmap-index` and it
>  will be automatically changed to version `1`.
>
> +--path-walk::
> +	By default, `git pack-objects` walks objects in an order that
> +	presents trees and blobs in an order unrelated to the path they
> +	appear relative to a commit's root tree. The `--path-walk` option
> +	enables a different walking algorithm that organizes trees and
> +	blobs by path. This has the potential to improve delta compression
> +	especially in the presence of filenames that cause collisions in
> +	Git's default name-hash algorithm. Due to changing how the objects
> +	are walked, this option is not compatible with `--delta-islands`,
> +	`--shallow`, or `--filter`.

I think from reading further below that this feature is somewhat
incompatible with --use-bitmap-index, at least in the sense that we
implicitly disable the latter whenever we see the former. Would that be
worth mentioning here?

> +static int add_objects_by_path(const char *path,
> +			       struct oid_array *oids,
> +			       enum object_type type,
> +			       void *data)
> +{
> +	struct object_entry **delta_list;
> +	size_t oe_start = to_pack.nr_objects;
> +	size_t oe_end;
> +	unsigned int sub_list_size;
> +	unsigned int *processed = data;
> +
> +	/*
> +	 * First, add all objects to the packing data, including the ones
> +	 * marked UNINTERESTING (translated to 'exclude') as they can be
> +	 * used as delta bases.
> +	 */
> +	for (size_t i = 0; i < oids->nr; i++) {
> +		int exclude;
> +		struct object_info oi = OBJECT_INFO_INIT;
> +		struct object_id *oid = &oids->oid[i];
> +
> +		/* Skip objects that do not exist locally. */
> +		if (exclude_promisor_objects &&
> +		    oid_object_info_extended(the_repository, oid, &oi,
> +					     OBJECT_INFO_FOR_PREFETCH) < 0)
> +			continue;
> +
> +		exclude = !is_oid_interesting(the_repository, oid);
> +
> +		if (exclude && !thin)
> +			continue;
> +
> +		add_object_entry(oid, type, path, exclude);
> +	}
> +
> +	oe_end = to_pack.nr_objects;
> +
> +	/* We can skip delta calculations if it is a no-op. */
> +	if (oe_end == oe_start || !window)
> +		return 0;
> +
> +	sub_list_size = 0;
> +	ALLOC_ARRAY(delta_list, oe_end - oe_start);

Makes sense, and seems all reasonable.

> +	for (size_t i = 0; i < oe_end - oe_start; i++) {
> +		struct object_entry *entry = to_pack.objects + oe_start + i;
> +
> +		if (!should_attempt_deltas(entry))
> +			continue;
> +
> +		delta_list[sub_list_size++] = entry;
> +	}
> +
> +	/*
> +	 * Find delta bases among this list of objects that all match the same
> +	 * path. This causes the delta compression to be interleaved in the
> +	 * object walk, which can lead to confusing progress indicators. This is
> +	 * also incompatible with threaded delta calculations. In the future,
> +	 * consider creating a list of regions in the full to_pack.objects array
> +	 * that could be picked up by the threaded delta computation.
> +	 */
> +	if (sub_list_size && window) {
> +		QSORT(delta_list, sub_list_size, type_size_sort);
> +		find_deltas(delta_list, &sub_list_size, window, depth, processed);
> +	}

Interesting. I like the "regions in to_pack.objects" idea as a way of
threading the delta selection process in the future.

Thanks,
Taylor




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux