On Mon, Mar 10, 2025 at 01:50:44AM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <stolee@xxxxxxxxx> > > In order to more easily compute delta bases among objects that appear at > the exact same path, add a --path-walk option to 'git pack-objects'. > > This option will use the path-walk API instead of the object walk given > by the revision machinery. Since objects will be provided in batches > representing a common path, those objects can be tested for delta bases > immediately instead of waiting for a sort of the full object list by > name-hash. This has multiple benefits, including avoiding collisions by > name-hash. > > The objects marked as UNINTERESTING are included in these batches, so we > are guaranteeing some locality to find good delta bases. > > After the individual passes are done on a per-path basis, the default > name-hash is used to find other opportunistic delta bases that did not > match exactly by the full path name. > > The current implementation performs delta calculations while walking > objects, which is not ideal for a few reasons. First, this will cause > the "Enumerating objects" phase to be much longer than usual. Second, it > does not take advantage of threading during the path-scoped delta > calculations. Even with this lack of threading, the path-walk option is > sometimes faster than the usual approach. Future changes will refactor > this code to allow for threading, but that complexity is deferred until > later to keep this patch as simple as possible. > > This new walk is incompatible with some features and is ignored by > others: > > * Object filters are not currently integrated with the path-walk API, > such as sparse-checkout or tree depth. A blobless packfile could be > integrated easily, but that is deferred for later. > > * Server-focused features such as delta islands, shallow packs, and > using a bitmap index are incompatible with the path-walk API. > > * The path walk API is only compatible with the --revs option, not > taking object lists or pack lists over stdin. These alternative ways > to specify the objects currently ignores the --path-walk option > without even a warning. > > Future changes will create performance tests that demonstrate the power > of this approach. > > Signed-off-by: Derrick Stolee <stolee@xxxxxxxxx> > --- > Documentation/git-pack-objects.adoc | 13 +- > Documentation/technical/api-path-walk.adoc | 1 + > builtin/pack-objects.c | 147 +++++++++++++++++++-- > t/t5300-pack-object.sh | 15 +++ > 4 files changed, 166 insertions(+), 10 deletions(-) > > diff --git a/Documentation/git-pack-objects.adoc b/Documentation/git-pack-objects.adoc > index 7f69ae4855f..7dbbe6d54d2 100644 > --- a/Documentation/git-pack-objects.adoc > +++ b/Documentation/git-pack-objects.adoc > @@ -16,7 +16,7 @@ SYNOPSIS > [--cruft] [--cruft-expiration=<time>] > [--stdout [--filter=<filter-spec>] | <base-name>] > [--shallow] [--keep-true-parents] [--[no-]sparse] > - [--name-hash-version=<n>] < <object-list> > + [--name-hash-version=<n>] [--path-walk] < <object-list> > > > DESCRIPTION > @@ -375,6 +375,17 @@ many different directories. At the moment, this version is not allowed > when writing reachability bitmap files with `--write-bitmap-index` and it > will be automatically changed to version `1`. > > +--path-walk:: > + By default, `git pack-objects` walks objects in an order that > + presents trees and blobs in an order unrelated to the path they > + appear relative to a commit's root tree. The `--path-walk` option > + enables a different walking algorithm that organizes trees and > + blobs by path. This has the potential to improve delta compression > + especially in the presence of filenames that cause collisions in > + Git's default name-hash algorithm. Due to changing how the objects > + are walked, this option is not compatible with `--delta-islands`, > + `--shallow`, or `--filter`. I think from reading further below that this feature is somewhat incompatible with --use-bitmap-index, at least in the sense that we implicitly disable the latter whenever we see the former. Would that be worth mentioning here? > +static int add_objects_by_path(const char *path, > + struct oid_array *oids, > + enum object_type type, > + void *data) > +{ > + struct object_entry **delta_list; > + size_t oe_start = to_pack.nr_objects; > + size_t oe_end; > + unsigned int sub_list_size; > + unsigned int *processed = data; > + > + /* > + * First, add all objects to the packing data, including the ones > + * marked UNINTERESTING (translated to 'exclude') as they can be > + * used as delta bases. > + */ > + for (size_t i = 0; i < oids->nr; i++) { > + int exclude; > + struct object_info oi = OBJECT_INFO_INIT; > + struct object_id *oid = &oids->oid[i]; > + > + /* Skip objects that do not exist locally. */ > + if (exclude_promisor_objects && > + oid_object_info_extended(the_repository, oid, &oi, > + OBJECT_INFO_FOR_PREFETCH) < 0) > + continue; > + > + exclude = !is_oid_interesting(the_repository, oid); > + > + if (exclude && !thin) > + continue; > + > + add_object_entry(oid, type, path, exclude); > + } > + > + oe_end = to_pack.nr_objects; > + > + /* We can skip delta calculations if it is a no-op. */ > + if (oe_end == oe_start || !window) > + return 0; > + > + sub_list_size = 0; > + ALLOC_ARRAY(delta_list, oe_end - oe_start); Makes sense, and seems all reasonable. > + for (size_t i = 0; i < oe_end - oe_start; i++) { > + struct object_entry *entry = to_pack.objects + oe_start + i; > + > + if (!should_attempt_deltas(entry)) > + continue; > + > + delta_list[sub_list_size++] = entry; > + } > + > + /* > + * Find delta bases among this list of objects that all match the same > + * path. This causes the delta compression to be interleaved in the > + * object walk, which can lead to confusing progress indicators. This is > + * also incompatible with threaded delta calculations. In the future, > + * consider creating a list of regions in the full to_pack.objects array > + * that could be picked up by the threaded delta computation. > + */ > + if (sub_list_size && window) { > + QSORT(delta_list, sub_list_size, type_size_sort); > + find_deltas(delta_list, &sub_list_size, window, depth, processed); > + } Interesting. I like the "regions in to_pack.objects" idea as a way of threading the delta selection process in the future. Thanks, Taylor