Re: [PATCH v3 02/11] bundle: verify using check_connected()

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 31 Jan 2023 09:35:24 -0800

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> Thus, the code in verify_bundle() has previously had the additional
> check that all prerequisite commits are reachable from repository
> references. This is done via a revision walk from all references,
> stopping only if all prerequisite commits are discovered or all commits
> are walked. This uses a custom walk to verify_bundle().
>
> This check is more strict than what Git applies to fetched pack-files.
> In the fetch case, Git guarantees that the new references are closed
> under reachability by walking from the new references until walking
> commits that are reachable from repository refs. This is done through
> the well-used check_connected() method.
>
> To better align with the restrictions required by 'git fetch',
> reimplement this check in verify_bundle() to use check_connected(). This
> also simplifies the code significantly.

As I often say, breaking a repository faster is not the kind of
performance gain we want to have in Git, and I am in favor of this
iteration compared to the earlier version that mostly punted on
ensuring correctness (rather, it relied on the assumption that "most
of the time this should be OK").

>  bundle.c               | 75 ++++++++++++++++--------------------------
>  t/t6020-bundle-misc.sh |  8 ++---
>  2 files changed, 33 insertions(+), 50 deletions(-)

The diffstat is very pleasing to see.

Let me read the postimage aloud (preimage omitted).

> diff --git a/bundle.c b/bundle.c
> index 4ef7256aa11..76c3a904898 100644
> --- a/bundle.c
> +++ b/bundle.c
> @@ -187,6 +188,21 @@ static int list_refs(struct string_list *r, int argc, const char **argv)
>  /* Remember to update object flag allocation in object.h */
>  #define PREREQ_MARK (1u<<16)
>  
> +struct string_list_iterator {
> +	struct string_list *list;
> +	size_t cur;
> +};
> +
> +static const struct object_id *iterate_ref_map(void *cb_data)
> +{
> +	struct string_list_iterator *iter = cb_data;
> +
> +	if (iter->cur >= iter->list->nr)
> +		return NULL;
> +
> +	return iter->list->items[iter->cur++].util;
> +}

This is to let check_connected() collect all the prerequisite object
names.  OK.
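
For readers unfamiliar with the callback shape, here is a minimal,
self-contained sketch of the same "return the next item, or NULL when
done" pattern.  Every demo_* name is hypothetical and merely stands in
for Git's struct string_list, struct object_id and check_connected();
this is not the code in bundle.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Git's string_list types. */
struct demo_item {
	const char *string;
	void *util;		/* payload; an object id in the real code */
};

struct demo_list {
	struct demo_item *items;
	size_t nr;
};

struct demo_iter {
	struct demo_list *list;
	size_t cur;
};

/*
 * Same shape as iterate_ref_map(): hand out one payload per call
 * and return NULL once the list is exhausted.  Note that a NULL
 * payload would look the same as end-of-list to the caller.
 */
static const void *demo_next(void *cb_data)
{
	struct demo_iter *iter = cb_data;

	if (iter->cur >= iter->list->nr)
		return NULL;

	return iter->list->items[iter->cur++].util;
}

/*
 * Hypothetical consumer standing in for check_connected(): it
 * only drains the callback and counts what it was given.
 */
static size_t demo_drain(const void *(*fn)(void *), void *cb_data)
{
	size_t n = 0;

	assert(fn);
	while (fn(cb_data))
		n++;
	return n;
}
```

The iterator keeps its position in cb_data, so the consumer needs to
know nothing about how the prerequisites are stored.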

>  int verify_bundle(struct repository *r,
>  		  struct bundle_header *header,
>  		  enum verify_bundle_flags flags)
>  {
>  	/*
>  	 * Do fast check, then if any prereqs are missing then go line by line
>  	 * to be verbose about the errors
>  	 */
>  	struct string_list *p = &header->prerequisites;
> +	int i, ret = 0;
>  	const char *message = _("Repository lacks these prerequisite commits:");
> +	struct string_list_iterator iter = {
> +		.list = p,
> +	};
> +	struct check_connected_options opts = {
> +		.quiet = 1,
> +	};
>  
>  	if (!r || !r->objects || !r->objects->odb)
>  		return error(_("need a repository to verify a bundle"));
>  
>  	for (i = 0; i < p->nr; i++) {
>  		struct string_list_item *e = p->items + i;
>  		const char *name = e->string;
>  		struct object_id *oid = e->util;
>  		struct object *o = parse_object(r, oid);
> +		if (o)
>  			continue;
>  		ret++;
>  		if (flags & VERIFY_BUNDLE_QUIET)
>  			continue;
>  		if (ret == 1)
>  			error("%s", message);
>  		error("%s %s", oid_to_hex(oid), name);
>  	}
> +	if (ret)
>  		goto cleanup;

The "quick fail" logic as before.  Looking sensible.

>  
> +	if ((ret = check_connected(iterate_ref_map, &iter, &opts)))
> +		error(_("some prerequisite commits exist in the object store, "
> +			"but are not connected to the repository's history"));

And then we let check_connected() ensure that traversing from
these prerequisite objects down to the DAG formed by existing refs
will not die from missing objects.  Makes sense.

> +	/* TODO: preserve this verbose language. */

I am lost -- aren't we preserving the BUNDLE_VERBOSE code below?

>  	if (flags & VERIFY_BUNDLE_VERBOSE) {

> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
> index 38dbbf89155..7d40994991e 100755
> --- a/t/t6020-bundle-misc.sh
> +++ b/t/t6020-bundle-misc.sh
> @@ -595,14 +595,14 @@ test_expect_success 'verify catches unreachable, broken prerequisites' '
>  		# Verify should fail
>  		test_must_fail git bundle verify \
>  			../clone-from/tip.bundle 2>err &&
> -		grep "Could not read $BAD_OID" err &&
> -		grep "Failed to traverse parents of commit $TIP_OID" err &&
> +		grep "some prerequisite commits .* are not connected" err &&
> +		test_line_count = 1 err &&
>  
>  		# Unbundling should fail
>  		test_must_fail git bundle unbundle \
>  			../clone-from/tip.bundle 2>err &&
> -		grep "Could not read $BAD_OID" err &&
> -		grep "Failed to traverse parents of commit $TIP_OID" err
> +		grep "some prerequisite commits .* are not connected" err &&
> +		test_line_count = 1 err
>  	)
>  '

Especially with the new test added in the previous step, we know we
are not trading correctness off.  Excellent.

I wonder how much of a performance hit this version incurs over
the "not safe at all" version and over the "use custom and
stricter-than-needed" version, by the way?

Thanks.