Re: [PATCH v2 09/16] refs/packed-backend.c: implement jump lists to avoid excluded pattern(s)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, May 15, 2023 at 03:23:33PM -0400, Taylor Blau wrote:
> When iterating through the `packed-refs` file in order to answer a query
> like:
> 
>     $ git for-each-ref --exclude=refs/__hidden__
> 
> it would be useful to avoid walking over all of the entries in
> `refs/__hidden__/*` when possible, since we know that the ref-filter
> code is going to throw them away anyways.
> 
> In certain circumstances, doing so is possible. The algorithm for doing
> so is as follows:
> 
>   - For each excluded pattern, find the first record that matches it,
>     and the first record that *doesn't* match it (i.e. the location
>     you'd next want to consider when excluding that pattern).
> 
>   - Sort the set of excluded regions from the previous step in ascending
>     order of the first location within the `packed-refs` file that
>     matches.
> 
>   - Clean up the results from the previous step: discard empty regions,
>     and combine adjacent regions.
> 
> Then when iterating through the `packed-refs` file, if `iter->pos` is
> ever contained in one of the regions from the previous steps, advance
> `iter->pos` past the end of that region, and continue enumeration.
> 
> Note that we only perform this optimization when none of the excluded
> pattern(s) have special meta-characters in them. For a pattern like
> "refs/foo[ac]", the excluded regions ("refs/fooa", "refs/fooc", and
> everything underneath them) are not connected. A future implementation
> that handles this case may split the character class (pretending as if
> two patterns were excluded: "refs/fooa", and "refs/fooc").
> 
> There are a few other gotchas worth considering. First, note that the
> jump list is sorted, so once we jump past a region, we can avoid
> considering it (or any regions preceding it) again. The member
> `jump_pos` is used to track the first next-possible region to jump
> through.
> 
> Second, note that the exclusion list is best-effort, since we do not
> handle loose references, and because of the meta-character issue above.
> 
> In repositories with a large number of hidden references, the speed-up
> can be significant. Tests here are done with a copy of linux.git with a
> reference "refs/pull/N" pointing at every commit, as in:
> 
>     $ git rev-list HEAD | awk '{ print "create refs/pull/" NR " " $0 }' |
>         git update-ref --stdin
>     $ git pack-refs --all
> 
> , it is significantly faster to have `for-each-ref` jump over the
> excluded references, as opposed to filtering them out after the fact:
> 
>     $ hyperfine \
>       'git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"' \
>       'git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"'
>     Benchmark 1: git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"
>       Time (mean ± σ):     802.7 ms ±   2.1 ms    [User: 691.6 ms, System: 147.0 ms]
>       Range (min … max):   800.0 ms … 807.7 ms    10 runs
> 
>     Benchmark 2: git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"
>       Time (mean ± σ):       4.7 ms ±   0.3 ms    [User: 0.7 ms, System: 4.0 ms]
>       Range (min … max):     4.3 ms …   6.7 ms    422 runs
> 
>     Summary
>       'git.compile for-each-ref --format="%(objectname) %(refname)" --exclude="refs/pull"' ran
>       172.03 ± 9.60 times faster than 'git for-each-ref --format="%(objectname) %(refname)" | grep -vE "^[0-9a-f]{40} refs/pull/"'
> 
> Using the jump list is fairly straightforward (see the changes to
> `refs/packed-backend.c::next_record()`), but constructing the list is
> not. To ensure that the construction is correct, add a new suite of
> tests in t1419 covering various corner cases (overlapping regions,
> partially overlapping regions, adjacent regions, etc.).
> 
> Co-authored-by: Jeff King <peff@xxxxxxxx>
> Signed-off-by: Jeff King <peff@xxxxxxxx>
> Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
> ---
>  ref-filter.c              |   5 +-
>  refs/packed-backend.c     | 166 ++++++++++++++++++++++++++++++++++++--
>  t/helper/test-ref-store.c |  10 +++
>  t/t1419-exclude-refs.sh   | 101 +++++++++++++++++++++++
>  4 files changed, 274 insertions(+), 8 deletions(-)
>  create mode 100755 t/t1419-exclude-refs.sh
> 
> diff --git a/ref-filter.c b/ref-filter.c
> index 717c3c4bcf..ddc7f5204f 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -2210,12 +2210,13 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
>  	if (!filter->name_patterns[0]) {
>  		/* no patterns; we have to look at everything */
>  		return refs_for_each_fullref_in(get_main_ref_store(the_repository),
> -						 "", NULL, cb, cb_data);
> +						 "", filter->exclude.v, cb, cb_data);
>  	}
>  
>  	return refs_for_each_fullref_in_prefixes(get_main_ref_store(the_repository),
>  						 NULL, filter->name_patterns,
> -						 NULL, cb, cb_data);
> +						 filter->exclude.v,
> +						 cb, cb_data);
>  }
>  
>  /*
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index 33639f73e1..67327e579c 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -303,7 +303,8 @@ static int cmp_packed_ref_records(const void *v1, const void *v2)
>   * Compare a snapshot record at `rec` to the specified NUL-terminated
>   * refname.
>   */
> -static int cmp_record_to_refname(const char *rec, const char *refname)
> +static int cmp_record_to_refname(const char *rec, const char *refname,
> +				 int start)
>  {
>  	const char *r1 = rec + the_hash_algo->hexsz + 1;
>  	const char *r2 = refname;
> @@ -312,7 +313,7 @@ static int cmp_record_to_refname(const char *rec, const char *refname)
>  		if (*r1 == '\n')
>  			return *r2 ? -1 : 0;
>  		if (!*r2)
> -			return 1;
> +			return start ? 1 : -1;
>  		if (*r1 != *r2)
>  			return (unsigned char)*r1 < (unsigned char)*r2 ? -1 : +1;
>  		r1++;
> @@ -528,7 +529,8 @@ static int load_contents(struct snapshot *snapshot)
>  }
>  
>  static const char *find_reference_location_1(struct snapshot *snapshot,
> -					     const char *refname, int mustexist)
> +					     const char *refname, int mustexist,
> +					     int start)
>  {
>  	/*
>  	 * This is not *quite* a garden-variety binary search, because
> @@ -558,7 +560,7 @@ static const char *find_reference_location_1(struct snapshot *snapshot,
>  
>  		mid = lo + (hi - lo) / 2;
>  		rec = find_start_of_record(lo, mid);
> -		cmp = cmp_record_to_refname(rec, refname);
> +		cmp = cmp_record_to_refname(rec, refname, start);
>  		if (cmp < 0) {
>  			lo = find_end_of_record(mid, hi);
>  		} else if (cmp > 0) {
> @@ -591,7 +593,22 @@ static const char *find_reference_location_1(struct snapshot *snapshot,
>  static const char *find_reference_location(struct snapshot *snapshot,
>  					   const char *refname, int mustexist)
>  {
> -	return find_reference_location_1(snapshot, refname, mustexist);
> +	return find_reference_location_1(snapshot, refname, mustexist, 1);
> +}
> +
> +/*
> + * Find the place in `snapshot->buf` after the end of the record for
> + * `refname`. In other words, find the location of first thing *after*
> + * `refname`.
> + *
> + * Other semantics are identical to the ones in
> + * `find_reference_location()`.
> + */
> +static const char *find_reference_location_end(struct snapshot *snapshot,
> +					       const char *refname,
> +					       int mustexist)
> +{
> +	return find_reference_location_1(snapshot, refname, mustexist, 0);
>  }
>  
>  /*
> @@ -785,6 +802,13 @@ struct packed_ref_iterator {
>  	/* The end of the part of the buffer that will be iterated over: */
>  	const char *eof;
>  
> +	struct jump_list_entry {
> +		const char *start;
> +		const char *end;
> +	} *jump;
> +	size_t jump_nr, jump_alloc;
> +	size_t jump_pos;
> 
Nit: I had some trouble with `jump_pos` given that it sounds so similar
to `iter->pos`, and thus you tend to think that they both apply to the
position in the packed-refs file. `jump_curr` or `jump_idx` might help
to avoid this confusion.

>  	/* Scratch space for current values: */
>  	struct object_id oid, peeled;
>  	struct strbuf refname_buf;
> @@ -802,14 +826,34 @@ struct packed_ref_iterator {
>   */
>  static int next_record(struct packed_ref_iterator *iter)
>  {
> -	const char *p = iter->pos, *eol;
> +	const char *p, *eol;
>  
>  	strbuf_reset(&iter->refname_buf);
>  
> +	/*
> +	 * If iter->pos is contained within a skipped region, jump past
> +	 * it.
> +	 *
> +	 * Note that each skipped region is considered at most once,
> +	 * since they are ordered based on their starting position.
> +	 */
> +	while (iter->jump_pos < iter->jump_nr) {
> +		struct jump_list_entry *curr = &iter->jump[iter->jump_pos];
> +		if (iter->pos < curr->start)
> +			break; /* not to the next jump yet */
> +
> +		iter->jump_pos++;
> +		if (iter->pos < curr->end) {
> +			iter->pos = curr->end;
> +			break;
> +		}
> +	}
> +
>  	if (iter->pos == iter->eof)
>  		return ITER_DONE;
>  
>  	iter->base.flags = REF_ISPACKED;
> +	p = iter->pos;
>  
>  	if (iter->eof - p < the_hash_algo->hexsz + 2 ||
>  	    parse_oid_hex(p, &iter->oid, &p) ||
> @@ -917,6 +961,7 @@ static int packed_ref_iterator_abort(struct ref_iterator *ref_iterator)
>  	int ok = ITER_DONE;
>  
>  	strbuf_release(&iter->refname_buf);
> +	free(iter->jump);
>  	release_snapshot(iter->snapshot);
>  	base_ref_iterator_free(ref_iterator);
>  	return ok;
> @@ -928,6 +973,112 @@ static struct ref_iterator_vtable packed_ref_iterator_vtable = {
>  	.abort = packed_ref_iterator_abort
>  };
>  
> +static int jump_list_entry_cmp(const void *va, const void *vb)
> +{
> +	const struct jump_list_entry *a = va;
> +	const struct jump_list_entry *b = vb;
> +
> +	if (a->start < b->start)
> +		return -1;
> +	if (a->start > b->start)
> +		return 1;
> +	return 0;
> +}
> +
> +static int has_glob_special(const char *str)
> +{
> +	const char *p;
> +	for (p = str; *p; p++) {
> +		if (is_glob_special(*p))
> +			return 1;
> +	}
> +	return 0;
> +}
> +
> +static const char *ptr_max(const char *x, const char *y)
> +{
> +	if (x > y)
> +		return x;
> +	return y;
> +}
> +
> +static void populate_excluded_jump_list(struct packed_ref_iterator *iter,
> +					struct snapshot *snapshot,
> +					const char **excluded_patterns)
> +{
> +	size_t i, j;
> +	const char **pattern;
> +	struct jump_list_entry *last_disjoint;
> +
> +	if (!excluded_patterns)
> +		return;
> +
> +	for (pattern = excluded_patterns; *pattern; pattern++) {
> +		struct jump_list_entry *e;
> +
> +		/*
> +		 * We can't feed any excludes with globs in them to the
> +		 * refs machinery.  It only understands prefix matching.
> +		 * We likewise can't even feed the string leading up to
> +		 * the first meta-character, as something like "foo[a]"
> +		 * should not exclude "foobar" (but the prefix "foo"
> +		 * would match that and mark it for exclusion).
> +		 */
> +		if (has_glob_special(*pattern))
> +			continue;
> +
> +		ALLOC_GROW(iter->jump, iter->jump_nr + 1, iter->jump_alloc);
> +
> +		e = &iter->jump[iter->jump_nr++];
> +		e->start = find_reference_location(snapshot, *pattern, 0);
> +		e->end = find_reference_location_end(snapshot, *pattern, 0);

Nit: we could detect the non-matching case here already, which would
allow us to skip an allocation. It's probably pre-mature optimization
though, so please feel free to ignore.

> +	}
> +
> +	if (!iter->jump_nr) {
> +		/*
> +		 * Every entry in exclude_patterns has a meta-character,
> +		 * nothing to do here.
> +		 */
> +		return;
> +	}
> +
> +	QSORT(iter->jump, iter->jump_nr, jump_list_entry_cmp);
> +
> +	/*
> +	 * As an optimization, merge adjacent entries in the jump list
> +	 * to jump forwards as far as possible when entering a skipped
> +	 * region.
> +	 *
> +	 * For example, if we have two skipped regions:
> +	 *
> +	 *	[[A, B], [B, C]]
> +	 *
> +	 * we want to combine that into a single entry jumping from A to
> +	 * C.
> +	 */
> +	last_disjoint = iter->jump;

Nit: if we initialized `j = 0`, then `last_disjoint` would always be
equal to `iter->jump[j]`. We could then declare the variable inside of
the loop to make it a bit easier to understand.

Patrick

> +	for (i = 1, j = 1; i < iter->jump_nr; i++) {
> +		struct jump_list_entry *ours = &iter->jump[i];
> +
> +		if (ours->start == ours->end) {
> +			/* ignore empty regions (no matching entries) */
> +			continue;
> +		} else if (ours->start <= last_disjoint->end) {
> +			/* overlapping regions extend the previous one */
> +			last_disjoint->end = ptr_max(last_disjoint->end, ours->end);
> +		} else {
> +			/* otherwise, insert a new region */
> +			iter->jump[j++] = *ours;
> +			last_disjoint = ours;
> +
> +		}
> +	}
> +
> +	iter->jump_nr = j;
> +	iter->jump_pos = 0;
> +}
> +
>  static struct ref_iterator *packed_ref_iterator_begin(
>  		struct ref_store *ref_store,
>  		const char *prefix, const char **exclude_patterns,
> @@ -963,6 +1114,9 @@ static struct ref_iterator *packed_ref_iterator_begin(
>  	ref_iterator = &iter->base;
>  	base_ref_iterator_init(ref_iterator, &packed_ref_iterator_vtable, 1);
>  
> +	if (exclude_patterns)
> +		populate_excluded_jump_list(iter, snapshot, exclude_patterns);
> +
>  	iter->snapshot = snapshot;
>  	acquire_snapshot(snapshot);
>  
> diff --git a/t/helper/test-ref-store.c b/t/helper/test-ref-store.c
> index 6d8f844e9c..2bff003f7c 100644
> --- a/t/helper/test-ref-store.c
> +++ b/t/helper/test-ref-store.c
> @@ -175,6 +175,15 @@ static int cmd_for_each_ref(struct ref_store *refs, const char **argv)
>  	return refs_for_each_ref_in(refs, prefix, each_ref, NULL);
>  }
>  
> +static int cmd_for_each_ref__exclude(struct ref_store *refs, const char **argv)
> +{
> +	const char *prefix = notnull(*argv++, "prefix");
> +	const char **exclude_patterns = argv;
> +
> +	return refs_for_each_fullref_in(refs, prefix, exclude_patterns, each_ref,
> +					NULL);
> +}
> +
>  static int cmd_resolve_ref(struct ref_store *refs, const char **argv)
>  {
>  	struct object_id oid = *null_oid();
> @@ -307,6 +316,7 @@ static struct command commands[] = {
>  	{ "delete-refs", cmd_delete_refs },
>  	{ "rename-ref", cmd_rename_ref },
>  	{ "for-each-ref", cmd_for_each_ref },
> +	{ "for-each-ref--exclude", cmd_for_each_ref__exclude },
>  	{ "resolve-ref", cmd_resolve_ref },
>  	{ "verify-ref", cmd_verify_ref },
>  	{ "for-each-reflog", cmd_for_each_reflog },
> diff --git a/t/t1419-exclude-refs.sh b/t/t1419-exclude-refs.sh
> new file mode 100755
> index 0000000000..bc534c8ea1
> --- /dev/null
> +++ b/t/t1419-exclude-refs.sh
> @@ -0,0 +1,101 @@
> +#!/bin/sh
> +
> +test_description='test exclude_patterns functionality in main ref store'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +TEST_PASSES_SANITIZE_LEAK=true
> +. ./test-lib.sh
> +
> +for_each_ref__exclude () {
> +	test-tool ref-store main for-each-ref--exclude "$@" >actual.raw
> +	cut -d ' ' -f 2 actual.raw
> +}
> +
> +for_each_ref () {
> +	git for-each-ref --format='%(refname)' "$@"
> +}
> +
> +test_expect_success 'setup' '
> +	test_commit --no-tag base &&
> +	base="$(git rev-parse HEAD)" &&
> +
> +	for name in foo bar baz quux
> +	do
> +		for i in 1 2 3
> +		do
> +			echo "create refs/heads/$name/$i $base" || return 1
> +		done || return 1
> +	done >in &&
> +	echo "delete refs/heads/main" >>in &&
> +
> +	git update-ref --stdin <in &&
> +	git pack-refs --all
> +'
> +
> +test_expect_success 'excluded region in middle' '
> +	for_each_ref__exclude refs/heads refs/heads/foo >actual &&
> +	for_each_ref refs/heads/bar refs/heads/baz refs/heads/quux >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'excluded region at beginning' '
> +	for_each_ref__exclude refs/heads refs/heads/bar >actual &&
> +	for_each_ref refs/heads/baz refs/heads/foo refs/heads/quux >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'excluded region at end' '
> +	for_each_ref__exclude refs/heads refs/heads/quux >actual &&
> +	for_each_ref refs/heads/foo refs/heads/bar refs/heads/baz >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'disjoint excluded regions' '
> +	for_each_ref__exclude refs/heads refs/heads/bar refs/heads/quux >actual &&
> +	for_each_ref refs/heads/baz refs/heads/foo >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'adjacent, non-overlapping excluded regions' '
> +	for_each_ref__exclude refs/heads refs/heads/bar refs/heads/baz >actual &&
> +	for_each_ref refs/heads/foo refs/heads/quux >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'overlapping excluded regions' '
> +	for_each_ref__exclude refs/heads refs/heads/ba refs/heads/baz >actual &&
> +	for_each_ref refs/heads/foo refs/heads/quux >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'several overlapping excluded regions' '
> +	for_each_ref__exclude refs/heads \
> +		refs/heads/bar refs/heads/baz refs/heads/foo >actual &&
> +	for_each_ref refs/heads/quux >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'non-matching excluded section' '
> +	for_each_ref__exclude refs/heads refs/heads/does/not/exist >actual &&
> +	for_each_ref >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'meta-characters are discarded' '
> +	for_each_ref__exclude refs/heads "refs/heads/ba*" >actual &&
> +	for_each_ref >expect &&
> +
> +	test_cmp expect actual
> +'
> +
> +test_done
> -- 
> 2.40.1.572.g5c4ab523ef
> 

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux