Re: [PATCH v2 07/19] midx: introduce `bsearch_one_midx()`

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 17, 2024 at 05:12:16PM -0400, Taylor Blau wrote:

> The `bsearch_midx()` function will be extended in a following commit to
> search for the location of a given object ID across all MIDXs in a chain
> (or the single non-chain MIDX if no chain is available).
> 
> While most callers will naturally want to use the updated
> `bsearch_midx()` function, there are a handful of special cases that
> will want finer control and will only want to search through a single
> MIDX.
> 
> For instance, the object abbreviation code, which cares about object IDs
> near to where we'd expect to find a match in a MIDX. In that case, we
> want to look at the nearby matches in each layer of the MIDX chain, not
> just a single one).

Hmm. That seems like a weird thing for the object abbreviation code to
want, just because the layers of the midx are essentially random with
respect to object names. So you have to search each layer individually
when looking for a contiguous segment of hash names. But maybe that's
why you want the one_midx() variant. Let's see...

>  static void unique_in_midx(struct multi_pack_index *m,
>  			   struct disambiguate_state *ds)
>  {
> -	uint32_t num, i, first = 0;
> -	const struct object_id *current = NULL;
> -	int len = ds->len > ds->repo->hash_algo->hexsz ?
> -		ds->repo->hash_algo->hexsz : ds->len;
> -	num = m->num_objects;
> +	for (; m; m = m->base_midx) {
> +		uint32_t num, i, first = 0;
> +		const struct object_id *current = NULL;
> +		int len = ds->len > ds->repo->hash_algo->hexsz ?
> +			ds->repo->hash_algo->hexsz : ds->len;
>  
> -	if (!num)
> -		return;
> +		num = m->num_objects + m->num_objects_in_base;
>  
> -	bsearch_midx(&ds->bin_pfx, m, &first);
> +		if (!num)
> +			continue;
>  
> -	/*
> -	 * At this point, "first" is the location of the lowest object
> -	 * with an object name that could match "bin_pfx".  See if we have
> -	 * 0, 1 or more objects that actually match(es).
> -	 */
> -	for (i = first; i < num && !ds->ambiguous; i++) {
> -		struct object_id oid;
> -		current = nth_midxed_object_oid(&oid, m, i);
> -		if (!match_hash(len, ds->bin_pfx.hash, current->hash))
> -			break;
> -		update_candidates(ds, current);
> +		bsearch_one_midx(&ds->bin_pfx, m, &first);
> +
> +		/*
> +		 * At this point, "first" is the location of the lowest
> +		 * object with an object name that could match
> +		 * "bin_pfx".  See if we have 0, 1 or more objects that
> +		 * actually match(es).
> +		 */
> +		for (i = first; i < num && !ds->ambiguous; i++) {
> +			struct object_id oid;
> +			current = nth_midxed_object_oid(&oid, m, i);
> +			if (!match_hash(len, ds->bin_pfx.hash, current->hash))
> +				break;
> +			update_candidates(ds, current);
> +		}
>  	}
>  }

This is much easier to read with "-w", of course. So yeah, the gist of
it is that we're going to loop over items in the chain via the base_midx
pointer, and then search each individually. So that makes sense.

One thing that confused me, though, is setting "num". From the "-w"
diff:

  -       num = m->num_objects;
  +
  +               num = m->num_objects + m->num_objects_in_base;
  
                  if (!num)
  -               return;
  +                       continue;

Before we only had one midx, so that was our limit. But now we are
looking at "num" as a limit in the global size of the chained midx.
Which feels weird, since we're just considering a single layer here. We
seem to use "num" in two ways:

  - we return if it's 0 (or now continue to the next layer). But
    wouldn't we want to do that per-layer? I don't think it will produce
    wrong answers, but we're less likely to kick in this early return
    (though it's not clear to me when it would ever kick in really; a
    zero-length midx?).

  - later we loop "i" from "first", using "num" as a boundary. But this
    "i" is a global position, since that's what bsearch_one_midx()
    returns, and what nth_midxed_object_oid() expects.

    So I think it's correct, though it feels like bsearch_one_midx()
    should still return the position within that midx (and then
    bsearch_midx() could add back m->num_objects_in_base to get a global
    position). And then I guess likewise there would need to be a
    midx-local version of nth_midxed_object_oid().

    I'm not sure if that would make things simpler, or just add to the
    confusion, though. It's easy to get the global/local functions mixed
    up, since of course the global ones also have to take a "struct
    multi_pack_index". We could make them take a
    "multi_pack_index_chain" instead, but now all of the other code
    which wants to treat chains and single midx files the same has to
    care about the distinction (probably by wrapping the single midx in
    a one-entry chain struct).


> @@ -708,37 +712,40 @@ static int repo_extend_abbrev_len(struct repository *r UNUSED,
>  static void find_abbrev_len_for_midx(struct multi_pack_index *m,
>  				     struct min_abbrev_data *mad)

And likewise here (again, much more readable with "-w"). Interestingly,
in this one...

>  {
> -	int match = 0;
> -	uint32_t num, first = 0;
> -	struct object_id oid;
> -	const struct object_id *mad_oid;
> +	for (; m; m = m->base_midx) {
> +		int match = 0;
> +		uint32_t num, first = 0;
> +		struct object_id oid;
> +		const struct object_id *mad_oid;
>  
> -	if (!m->num_objects)
> -		return;
> +		if (!m->num_objects)
> +			continue;

We do the early return/continue directly on the layer's m->num_objects,
which makes sense.

> -	num = m->num_objects;
> -	mad_oid = mad->oid;
> -	match = bsearch_midx(mad_oid, m, &first);
> +		num = m->num_objects + m->num_objects_in_base;
> +		mad_oid = mad->oid;
> +		match = bsearch_one_midx(mad_oid, m, &first);

But then of course we go back to the same global "num", as we must.

So I think it's all correct, minus the early continue on "!num" in the
first function.

-Peff




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux