Re: [PATCH 5/5] index-pack: repack local links into promisor packs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Mostly looks good, just one point of confusion on my part, and one nit.
Thanks for the patch! (comments inline below)


On 2024.10.24 11:08, Jonathan Tan wrote:
> Teach index-pack to, when processing the objects in a pack with
> --promisor specified on the CLI, repack local objects (and the local
> objects that they refer to, recursively) referenced by these objects
> into promisor packs.
> 
> This prevents the situation in which, when fetching from a promisor
> remote, we end up with promisor objects (newly fetched) referring
> to non-promisor objects (locally created prior to the fetch). This
> situation may arise if the client had previously pushed objects to the
> remote, for example. One issue that arises in this situation is that,
> if the non-promisor objects become inaccessible except through promisor
> objects (for example, if the branch pointing to them has moved to
> point to the promisor object that refers to them), then GC will garbage
> collect them. There are other ways to solve this, but the simplest
> seems to be to enforce the invariant that we don't have promisor objects
> referring to non-promisor objects.
> 
> This repacking is done from index-pack to minimize the performance
> impact. During a fetch, the only time most objects are fully inflated
> in memory is when their object ID is computed, so we also scan the
> objects (to see which objects they refer to) during this time.
> 
> Also to minimize the performance impact, an object is calculated to be
> local if it's a loose object or present in a non-promisor pack. (If it's
> also in a promisor pack or referred to by an object in a promisor pack,
> it is technically already a promisor object. But a misidentification
> of a promisor object as a non-promisor object is relatively benign
> here - we will thus repack that promisor object into a promisor pack,
> duplicating it in the object store, but there is no correctness issue,
> just an issue of inefficiency.)
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
> ---
>  Documentation/git-index-pack.txt |   5 ++
>  builtin/index-pack.c             | 110 ++++++++++++++++++++++++++++++-
>  builtin/pack-objects.c           |  28 ++++++++
>  t/t5616-partial-clone.sh         |  30 +++++++++
>  4 files changed, 171 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
> index 5a20deefd5..4be09e58e7 100644
> --- a/Documentation/git-index-pack.txt
> +++ b/Documentation/git-index-pack.txt
> @@ -139,6 +139,11 @@ include::object-format-disclaimer.txt[]
>  	written. If a `<message>` is provided, then that content will be
>  	written to the .promisor file for future reference. See
>  	link:technical/partial-clone.html[partial clone] for more information.
> ++
> +Also, if there are objects in the given pack that references non-promisor
> +objects (in the repo), repacks those non-promisor objects into a promisor
> +pack. This avoids a situation in which a repo has non-promisor objects that are
> +accessible through promisor objects.
>  
>  NOTES
>  -----
> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index 9d23b41b3a..e4afd6725f 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> @@ -9,6 +9,7 @@
>  #include "csum-file.h"
>  #include "blob.h"
>  #include "commit.h"
> +#include "tag.h"
>  #include "tree.h"
>  #include "progress.h"
>  #include "fsck.h"
> @@ -20,9 +21,14 @@
>  #include "object-file.h"
>  #include "object-store-ll.h"
>  #include "oid-array.h"
> +#include "oidset.h"
> +#include "path.h"
>  #include "replace-object.h"
> +#include "tree-walk.h"
>  #include "promisor-remote.h"
> +#include "run-command.h"
>  #include "setup.h"
> +#include "strvec.h"
>  
>  static const char index_pack_usage[] =
>  "git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--[no-]rev-index] [--verify] [--strict[=<msg-id>=<severity>...]] [--fsck-objects[=<msg-id>=<severity>...]] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
> @@ -148,6 +154,13 @@ static uint32_t input_crc32;
>  static int input_fd, output_fd;
>  static const char *curr_pack;
>  
> +/*
> + * local_links is guarded by read_mutex, and record_local_links is read-only in
> + * a thread.
> + */
> +static struct oidset local_links = OIDSET_INIT;
> +static int record_local_links;
> +
>  static struct thread_local *thread_data;
>  static int nr_dispatched;
>  static int threads_active;
> @@ -799,6 +812,44 @@ static int check_collison(struct object_entry *entry)
>  	return 0;
>  }
>  
> +static void record_if_local_object(const struct object_id *oid)
> +{
> +	struct object_info info = OBJECT_INFO_INIT;
> +	if (oid_object_info_extended(the_repository, oid, &info, 0))
> +		/* Missing; assume it is a promisor object */
> +		return;
> +	if (info.whence == OI_PACKED && info.u.packed.pack->pack_promisor)
> +		return;
> +	oidset_insert(&local_links, oid);
> +}
> +
> +static void do_record_local_links(struct object *obj)
> +{
> +	if (obj->type == OBJ_TREE) {
> +		struct tree *tree = (struct tree *)obj;
> +		struct tree_desc desc;
> +		struct name_entry entry;
> +		if (init_tree_desc_gently(&desc, &tree->object.oid,
> +					  tree->buffer, tree->size, 0))

This part confused me for a bit, since I didn't see how simply casting
a `struct object *` to a `struct tree *` could possibly guarantee that
`tree->buffer` or `tree->size` pointed to valid memory. But the object
pointers in question have all been created via `parse_object_buffer` and
allocated as the appropriate blob/tree/commit/tag structs, and were
previously cast to `struct object *` by that function. So all looks good
here.


> +			/*
> +			 * Error messages are given when packs are
> +			 * verified, so do not print any here.
> +			 */
> +			return;
> +		while (tree_entry_gently(&desc, &entry))
> +			record_if_local_object(&entry.oid);
> +	} else if (obj->type == OBJ_COMMIT) {
> +		struct commit *commit = (struct commit *) obj;
> +		struct commit_list *parents = commit->parents;
> +
> +		for (; parents; parents = parents->next)
> +			record_if_local_object(&parents->item->object.oid);
> +	} else if (obj->type == OBJ_TAG) {
> +		struct tag *tag = (struct tag *) obj;
> +		record_if_local_object(get_tagged_oid(tag));
> +	}

We only ever `record_if_local_object()` on things referenced by `obj`
here but not on `obj` itself. We should only be calling this on objects
that are inside a promisor pack, so all good here I think.

> +}
> +
>  static void sha1_object(const void *data, struct object_entry *obj_entry,
>  			unsigned long size, enum object_type type,
>  			const struct object_id *oid)
> @@ -845,7 +896,7 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>  		free(has_data);
>  	}
>  
> -	if (strict || do_fsck_object) {
> +	if (strict || do_fsck_object || record_local_links) {
>  		read_lock();
>  		if (type == OBJ_BLOB) {
>  			struct blob *blob = lookup_blob(the_repository, oid);
> @@ -877,6 +928,8 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>  				die(_("fsck error in packed object"));
>  			if (strict && fsck_walk(obj, NULL, &fsck_options))
>  				die(_("Not all child objects of %s are reachable"), oid_to_hex(&obj->oid));
> +			if (record_local_links)
> +				do_record_local_links(obj);
>  
>  			if (obj->type == OBJ_TREE) {
>  				struct tree *item = (struct tree *) obj;
> @@ -1719,6 +1772,57 @@ static void show_pack_info(int stat_only)
>  	free(chain_histogram);
>  }
>  
> +static void repack_local_links(void)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +	FILE *out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct oidset_iter iter;
> +	struct object_id *oid;
> +	char *base_name;
> +
> +	if (!oidset_size(&local_links))
> +		return;
> +
> +	base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository));
> +
> +	strvec_push(&cmd.args, "pack-objects");
> +	strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort");
> +	strvec_push(&cmd.args, base_name);
> +	cmd.git_cmd = 1;
> +	cmd.in = -1;
> +	cmd.out = -1;
> +	if (start_command(&cmd))
> +		die(_("could not start pack-objects to repack local links"));
> +
> +	oidset_iter_init(&local_links, &iter);
> +	while ((oid = oidset_iter_next(&iter))) {
> +		if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
> +		    write_in_full(cmd.in, "\n", 1) < 0)
> +			die(_("failed to feed local object to pack-objects"));
> +	}
> +	close(cmd.in);
> +
> +	out = xfdopen(cmd.out, "r");
> +	while (strbuf_getline_lf(&line, out) != EOF) {
> +		unsigned char binary[GIT_MAX_RAWSZ];
> +		if (line.len != the_hash_algo->hexsz ||
> +		    !hex_to_bytes(binary, line.buf, line.len))
> +			die(_("index-pack: Expecting full hex object ID lines only from pack-objects."));

I'm not sure why we check the pack-objects output here, is this just to
detect errors? Could we instead just check the exit status of
pack-objects, and discard the output?


> +
> +		/*
> +		 * pack-objects creates the .pack and .idx files, but not the
> +		 * .promisor file. Create the .promisor file, which is empty.
> +		 */
> +		write_special_file("promisor", "", NULL, binary, NULL);
> +	}
> +
> +	fclose(out);
> +	if (finish_command(&cmd))
> +		die(_("could not finish pack-objects to repack local links"));
> +	strbuf_release(&line);
> +}
> +
>  int cmd_index_pack(int argc,
>  		   const char **argv,
>  		   const char *prefix,
> @@ -1794,7 +1898,7 @@ int cmd_index_pack(int argc,
>  			} else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) {
>  				; /* nothing to do */
>  			} else if (skip_to_optional_arg(arg, "--promisor", &promisor_msg)) {
> -				; /* already parsed */
> +				record_local_links = 1;
>  			} else if (starts_with(arg, "--threads=")) {
>  				char *end;
>  				nr_threads = strtoul(arg+10, &end, 0);
> @@ -1970,6 +2074,8 @@ int cmd_index_pack(int argc,
>  		free((void *) curr_index);
>  	free(curr_rev_index);
>  
> +	repack_local_links();
> +
>  	/*
>  	 * Let the caller know this pack is not self contained
>  	 */
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index e15fbaeb21..a565ab9b40 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -4310,6 +4310,18 @@ static int option_parse_cruft_expiration(const struct option *opt UNUSED,
>  	return 0;
>  }
>  
> +static int should_include_obj(struct object *obj, void *data UNUSED)
> +{
> +	struct object_info info = OBJECT_INFO_INIT;
> +	if (oid_object_info_extended(the_repository, &obj->oid, &info, 0))
> +		BUG("should_include_obj should only be called on existing objects");
> +	return info.whence != OI_PACKED || !info.u.packed.pack->pack_promisor;
> +}
> +
> +static int should_include(struct commit *commit, void *data) {
> +	return should_include_obj((struct object *) commit, data);
> +}
> +

Nit: these two functions could be named a bit more descriptively.


>  int cmd_pack_objects(int argc,
>  		     const char **argv,
>  		     const char *prefix,
> @@ -4326,6 +4338,7 @@ int cmd_pack_objects(int argc,
>  	struct list_objects_filter_options filter_options =
>  		LIST_OBJECTS_FILTER_INIT;
>  	int exclude_promisor_objects = 0;
> +	int exclude_promisor_objects_best_effort = 0;
>  
>  	struct option pack_objects_options[] = {
>  		OPT_CALLBACK_F('q', "quiet", &progress, NULL,
> @@ -4423,6 +4436,9 @@ int cmd_pack_objects(int argc,
>  		  option_parse_missing_action),
>  		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
>  			 N_("do not pack objects in promisor packfiles")),
> +		OPT_BOOL(0, "exclude-promisor-objects-best-effort",
> +			 &exclude_promisor_objects_best_effort,
> +			 N_("implies --missing=allow-any")),
>  		OPT_BOOL(0, "delta-islands", &use_delta_islands,
>  			 N_("respect islands during delta compression")),
>  		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
> @@ -4503,10 +4519,18 @@ int cmd_pack_objects(int argc,
>  		strvec_push(&rp, "--unpacked");
>  	}
>  
> +	if (exclude_promisor_objects && exclude_promisor_objects_best_effort)
> +		die(_("options '%s' and '%s' cannot be used together"),
> +		    "--exclude-promisor-objects", "--exclude-promisor-objects-best-effort");
>  	if (exclude_promisor_objects) {
>  		use_internal_rev_list = 1;
>  		fetch_if_missing = 0;
>  		strvec_push(&rp, "--exclude-promisor-objects");
> +	} else if (exclude_promisor_objects_best_effort) {
> +		use_internal_rev_list = 1;
> +		fetch_if_missing = 0;
> +		option_parse_missing_action(NULL, "allow-any", 0);
> +		/* revs configured below */
>  	}
>  	if (unpack_unreachable || keep_unreachable || pack_loose_unreachable)
>  		use_internal_rev_list = 1;
> @@ -4626,6 +4650,10 @@ int cmd_pack_objects(int argc,
>  
>  		repo_init_revisions(the_repository, &revs, NULL);
>  		list_objects_filter_copy(&revs.filter, &filter_options);
> +		if (exclude_promisor_objects_best_effort) {
> +			revs.include_check = should_include;
> +			revs.include_check_obj = should_include_obj;
> +		}
>  		get_object_list(&revs, rp.nr, rp.v);
>  		release_revisions(&revs);
>  	}
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index c53e93be2f..2e67f59f89 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -694,6 +694,36 @@ test_expect_success 'lazy-fetch in submodule succeeds' '
>  	git -C client restore --recurse-submodules --source=HEAD^ :/
>  '
>  
> +test_expect_success 'after fetching descendants of non-promisor commits, gc works' '
> +	# Setup
> +	git init full &&
> +	git -C full config uploadpack.allowfilter 1 &&
> + 	git -C full config uploadpack.allowanysha1inwant 1 &&
> +	touch full/foo &&
> +	git -C full add foo &&
> +	git -C full commit -m "commit 1" &&
> +	git -C full checkout --detach &&
> +
> +	# Partial clone and push commit to remote
> +	git clone "file://$(pwd)/full" --filter=blob:none partial &&
> +	echo "hello" > partial/foo &&
> +	git -C partial commit -a -m "commit 2" &&
> +	git -C partial push &&
> +
> +	# gc in partial repo
> +	git -C partial gc --prune=now &&
> +
> +	# Create another commit in normal repo
> +	git -C full checkout main &&
> +	echo " world" >> full/foo &&
> +	git -C full commit -a -m "commit 3" &&
> +
> +	# Pull from remote in partial repo, and run gc again
> +	git -C partial pull &&
> +	git -C partial gc --prune=now
> +'
> +
> +
>  . "$TEST_DIRECTORY"/lib-httpd.sh
>  start_httpd
>  
> -- 
> 2.47.0.163.g1226f6d8fa-goog
> 
> 




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux