Re: [PATCH 09/15] refs/packed-backend.c: implement skip lists to avoid excluded pattern(s)

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 9 May 2023 22:30:35 -0400

On Tue, May 09, 2023 at 04:40:50PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > When iterating through the `packed-refs` file in order to answer a query
> > like:
> >
> >     $ git for-each-ref --exclude=refs/__hidden__
> >
> > it would be useful to avoid walking over all of the entries in
> > `refs/__hidden__/*` when possible, since we know that the ref-filter
> > code is going to throw them away anyways.
> >
> > In certain circumstances, doing so is possible. The algorithm for doing
> > so is as follows:
> >
> >   - For each excluded pattern, find the first record that matches it,
> >     and the first pattern that *doesn't* match it (i.e. the location
> >     you'd next want to consider when excluding that pattern).
>
> Do we find "record" and then "pattern", or is the latter a misspelt
> "record"?  I will assume it is the latter while reading the rest.
> Is the latter "the record that does not match the pattern, whose
> record number is the smallest but yet larger than the first record
> that matches the pattern?"  That is we assume that the refs in the
> packed refs file are sorted and can be partitioned into three by
> each pattern: (1) refs before the first matching record---they do
> not match the pattern. (2) refs at or after the first matching
> record that match.  (3) refs after (2) that do not match.
>
> I am not sure if that would work.  What if the pattern were refs/tags/[ad-z]*
> and there are packed tags refs/tags/{a,b,c,d,e}?  The pattern would partition
> the refs into [a](matches), [b,c](does not match), [d,e](matches).
>
> Perhaps I am grossly misunderstanding what the above explanation
> says.

Sorry, you're right to assume the latter. This should read "..., find
the record that matches it, and the first record that *doesn't* match."

In the example you give, we would ignore that whole pattern and
enumerate everything (including entries that match 'refs/tags/[ad-z]*'),
and the caller is expected to discard them.

> >   - Sort the patterns by their starting location within the
> >     `packed-refs` file.
>
> So the idea is that the patterns are sorted by the first record they
> match, and after sorting these patterns, refs that are between the
> beginning of the list of refs and the first record associated with
> the first pattern will not match _any_ pattern in that list?

My patch could use some clarification here, since it is much easier to
treat the patterns as unsorted, and then sort the beginning of the range
that they match.

> >   - Construct a skip list of regions by combining adjacent and
> >     overlapping regions from the previous step.
>
> "skip list of regions" -> "list of regions to skip", I guess.

Thanks, will update.

> > Using the skip list is fairly straightforward (see the changes to
> > `refs/packed-backend.c::next_record()`), but constructing the list is
> > not. To ensure that the construction is correct, add a new suite of
> > tests in t1419 covering various corner cases (overlapping regions,
> > partially overlapping regions, adjacent regions, etc.).
>
> Sounds good.  Does this actually use the skip list data structure,
> or do they happen to share the same two words in their names, but
> otherwise have nothing common with each other?  If the latter, we
> may want to revise the explanation, data type names, and variable
> names, to avoid confusion, as Chris pointed out earlier.

They have nothing to do with each other ;-). I made a note in this patch
in the revised version I'll send in the next day or two to note the
distinction. But I'm fine with renaming the whole concept to "jump list"
or something like that if you prefer.

> > +	for (pattern = excluded_patterns; *pattern; pattern++) {
> > +		struct skip_list_entry *e;
> > +
> > +		/*
> > +		 * We can't feed any excludes with globs in them to the
> > +		 * refs machinery.  It only understands prefix matching.
> > +		 * We likewise can't even feed the string leading up to
> > +		 * the first meta-character, as something like "foo[a]"
> > +		 * should not exclude "foobar" (but the prefix "foo"
> > +		 * would match that and mark it for exclusion).
> > +		 */
> > +		if (has_glob_special(*pattern))
> > +			continue;
>
> Hmph.  I would have expected that a set of patterns with any one
> pattern with glob would invalidate the whole skip optimization, but
> it is nice if we can salvage such a set and still optimize, if only
> for literal patterns.  Interesting.
>
> Ah, that is because we are dealing with ranges that cannot possibly
> match.  With the mention of "first record that matches" etc. in the
> earlier descriptoin, I somehow misled myself that we are dealing
> with ranges that have interesting records.  So, a pattern with glob
> does not contribute any range to be skipped, but that is OK.

Exactly right. This is all "best effort" anyway, since there are some
patterns that we cannot construct a skip list entry for in the general
case. So it's fine if we enumerate some references that match one or
more of the excluded patterns, because the caller is expected to drop
those results themselves.

> > +	/*
> > +	 * As an optimization, merge adjacent entries in the skip list
> > +	 * to jump forwards as far as possible when entering a skipped
> > +	 * region.
> > +	 *
> > +	 * For example, if we have two skipped regions:
> > +	 *
> > +	 *	[[A, B], [B, C]]
>
> I am confused.  The first pattern may never match records in [A..B]
> range, and the second pattern may never match records in [B..C]
> range, but what does it mean to combine these two ranges?

The patterns would match all records in their respective regions, but
since they are excluded we want to jump over those references instead of
iterating (and then discarding them later).

So if we have a jump from A->B, and another from B->C, it would be fine
to perform two jumps to get from A to C. But we can detect this case by
combining adjacent/overlapping regions.

> > +		if (ours->start == ours->end) {
> > +			/* ignore empty regions (no matching entries) */
> > +			continue;
> > +		} else if (prev->end >= ours->start) {
> > +			/* overlapping regions extend the previous one */
> > +			prev->end = ptr_max(prev->end, ours->end);
> > +		} else {
> > +			/* otherwise, insert a new region */
> > +			iter->skip[j++] = *ours;
> > +		}
>
> None of the {braces} seem needed, but OK.

Yeah. I added the braces here intentionally to match the
CodingGuidelines, which state that this case is an exception.

Thanks,
Taylor