Re: [PATCH 09/15] refs/packed-backend.c: implement skip lists to avoid excluded pattern(s)

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 9 May 2023 16:55:55 -0400

On Tue, May 09, 2023 at 05:15:43PM +0200, Patrick Steinhardt wrote:
> On Mon, May 08, 2023 at 06:00:08PM -0400, Taylor Blau wrote:
>
> > Note that this optimization is only possible when none of the excluded
> > pattern(s) have special meta-characters in them. To see why this is the
> > case, consider the exclusion pattern "refs/foo[a]". In general, in order
> > to find the location of the first record that matches this pattern, we
> > could only consider up to the first meta-character, "refs/foo". But this
> > doesn't work, since the excluded region we'd come up with would include
> > "refs/foobar", even though it is not excluded.
>
> Is this generally true though? A naive implementation would iterate
> through all references and find the first reference that matches the
> exclusion regular exepression. From thereon we continue to iterate until
> we find the first entry that doesn't match. This may cause us to end up
> with a suboptimal skip list, but the skip list would still be valid.
>
> As I said, this implementation would be naive as we're now forced to
> iterate through all references starting at the beginning. I assume that
> your implementation will instead use a binary search to locate the first
> entry that matches the exclusion pattern and the last pattern. But the
> way this paragraph is formulated makes it sound like this is a general
> fact, even though it is a fact that derives from the implementation.
>
> I of course don't propose to change the algorithm here, but instead to
> clarify where this restriction actually comes from and why the tradeoff
> makes sense.

In the example you include, it's possible. But consider something like:

    $ git for-each-ref --exclude='refs/foo[ac]'

The region that matches that expression ("refs/fooa", "refs/fooc" and
everything underneath them) does not have to appear as a continuous
single region in the packed-refs file. If you have, say, "refs/foobar",
that will appear between the two regions you want to exclude.

So I think you *might* be able to do it in general, but at the very
least it would involve splitting each character class and finding the
start and end of any region(s) that it matches.

Even so, you'd have to try and match each entry as you determine the
width of the excluded region, at which point you're at par with
enumerating them anyway and having the caller discard any entries it
doesn't want.

> > +	for (pattern = excluded_patterns; *pattern; pattern++) {
> > +		struct skip_list_entry *e;
> > +
> > +		/*
> > +		 * We can't feed any excludes with globs in them to the
> > +		 * refs machinery.  It only understands prefix matching.
> > +		 * We likewise can't even feed the string leading up to
> > +		 * the first meta-character, as something like "foo[a]"
> > +		 * should not exclude "foobar" (but the prefix "foo"
> > +		 * would match that and mark it for exclusion).
> > +		 */
> > +		if (has_glob_special(*pattern))
> > +			continue;
> > +
> > +		ALLOC_GROW(iter->skip, iter->skip_nr + 1, iter->skip_alloc);
> > +
> > +		e = &iter->skip[iter->skip_nr++];
> > +		e->start = find_reference_location(snapshot, *pattern, 0);
> > +		e->end = find_reference_location_end(snapshot, *pattern, 0);
>
> One thing that makes this hard to reason about is that we don't
> explicitly handle the case where the pattern doesn't match at all. So
> you require a bunch of knowledge about what exactly the functions
> `find_reference_location()` and `find_reference_location_end()` do in
> that case in order to determine whether we will end up doing the right
> thing.
>
> Explicitly handling this would give us some benefits:
>
> - It makes the code more obvious.
>
> - We can avoid an extra skip list entry for every non-matching
>   pattern.
>
> - We wouldn't have to perform binary search over the snapshot twice.
>
> Might be I'm missing something though.

We handle it implicitly via the mustexist parameter that both
find_reference_location() and find_reference_location_end() take, which
returns the position that you would find a matching entry if one does
not already exist.

You're right that it would save you a second binary search, but that is
likely to be a vanishingly small cost compared to the actual traversal,
disk I/O, etc.

I could imagine modifying the signature of both of those functions to be
something like:

    int find_reference_location(struct snapshot *, const char *pattern,
                                int mustexist, char *out)

Which would propagate the result through `out`, and the return value
would be whether or not it found a matching entry. But that only really
makes sense when `mustexist` is set to 0, and it adds some verbosity
throughout.

I think that it's totally possible to avoid the second search, but I'm
not sure that the cost of additional complexity and verbosity is worth
the benefit of avoiding one binary search (which will likely be
resident, anyway).

Note though that if e->start == e->end here we will discard the empty
skip region below.

> > +	for (i = 1, j = 1; i < iter->skip_nr; i++) {
> > +		struct skip_list_entry *ours = &iter->skip[i];
> > +		struct skip_list_entry *prev = &iter->skip[i - 1];
> > +
> > +		if (ours->start == ours->end) {
> > +			/* ignore empty regions (no matching entries) */
> > +			continue;
> > +		} else if (prev->end >= ours->start) {
> > +			/* overlapping regions extend the previous one */
> > +			prev->end = ptr_max(prev->end, ours->end);
> > +		} else {
> > +			/* otherwise, insert a new region */
> > +			iter->skip[j++] = *ours;
> > +		}
> > +	}
>
> Mh. Does this do the right thing in case we have multiple consecutive
> overlapping skip list entries? We always end up adjusting the immediate
> predecessor as we use `i - 1` to find `prev`. Shouldn't we instead start
> with `j = 0` and assign `prev = &iter->skip[j]`?

Good catch. I think applying this on top should do the trick:

--- 8< ---

diff --git a/refs/packed-backend.c b/refs/packed-backend.c
index 137a4233f6..3b1337267a 100644
--- a/refs/packed-backend.c
+++ b/refs/packed-backend.c
@@ -1054,9 +1054,9 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter,
 	 * we want to combine that into a single entry jumping from A to
 	 * C.
 	 */
-	for (i = 1, j = 1; i < iter->skip_nr; i++) {
+	for (i = 0, j = 0; i < iter->skip_nr; i++) {
 		struct skip_list_entry *ours = &iter->skip[i];
-		struct skip_list_entry *prev = &iter->skip[i - 1];
+		struct skip_list_entry *prev = &iter->skip[j];

 		if (ours->start == ours->end) {
 			/* ignore empty regions (no matching entries) */
@@ -1066,7 +1066,7 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter,
 			prev->end = ptr_max(prev->end, ours->end);
 		} else {
 			/* otherwise, insert a new region */
-			iter->skip[j++] = *ours;
+			iter->skip[++j] = *ours;
 		}
 	}
--- >8 ---

> > +test_expect_success 'for_each_ref__exclude(refs/heads/foo/)' '
> > +	# region in middle
> > +	for_each_ref__exclude refs/heads refs/heads/foo >actual &&
> > +	for_each_ref refs/heads/bar refs/heads/baz refs/heads/quux >expect &&
> > +
> > +	test_cmp expect actual
> > +'
>
> Nit: it might be a bit more readable if we put the comment into the test
> description instead of having an opaque description that mentions ref
> names that don't have much of a meaning without reading the test itself.

Fair enough ;-).

Thanks,
Taylor