On Wed, Jun 26, 2019 at 3:44 PM Taylor Blau <me@xxxxxxxxxxxx> wrote: > > Since cfe004a5a9 (ref-filter: limit traversal to prefix, 2017-05-22), > the ref-filter code has sought to limit the traversals to a prefix of > the given patterns. > > That code stopped short of handling more than one pattern, because it > means invoking 'for_each_ref_in' multiple times. If we're not careful > about which patterns overlap, we will output the same refs multiple > times. Right. > > For instance, consider the set of patterns 'refs/heads/a/*', > 'refs/heads/a/b/c', and 'refs/tags/v1.0.0'. If we naďvely ran: > > for_each_ref_in("refs/heads/a/*", ...); > for_each_ref_in("refs/heads/a/b/c", ...); > for_each_ref_in("refs/tags/v1.0.0", ...); > > we would see 'refs/heads/a/b/c' (and everything underneath it) twice. > Explaining why we ignored solving this before.. > Instead, we want to partition the patterns into disjoint sets, where we > know that no ref will be matched by any two patterns in different sets. > In the above, these are: > > - {'refs/heads/a/*', 'refs/heads/a/b/c'}, and > - {'refs/tags/v1.0.0'} Is this disjoint set calculation already existing, or did you have to add it in this patch? > > Given one of these disjoint sets, what is a suitable pattern to pass to > 'for_each_ref_in'? One approach is to compute the longest common prefix > over all elements in that disjoint set, and let the caller cull out the > refs they didn't want. Computing the longest prefix means that in most > cases, we won't match too many things the caller would like to ignore. > > The longest common prefixes of the above are: > > - {'refs/heads/a/*', 'refs/heads/a/b/c'} -> refs/heads/a/* > - {'refs/tags/v1.0.0'} -> refs/tags/v1.0.0 > > We instead invoke: > > for_each_ref_in("refs/heads/a/*", ...); > for_each_ref_in("refs/tags/v1.0.0", ...); > > Which provides us with the refs we were looking for with a minimal > amount of extra cruft, but never a duplicate of the ref we asked for. > > Implemented here is an algorithm which accomplishes the above, which > works as follows: > > 1. Lexicographically sort the given list of patterns. > > 2. Initialize 'prefix' to the empty string, where our goal is to > build each element in the above set of longest common prefixes. > > 3. Consider each pattern in the given set, and emit 'prefix' if it > reaches the end of a pattern, or touches a wildcard character. The > end of a string is treated as if it precedes a wildcard. (Note that > there is some room for future work to detect that, e.g., 'a?b' and > 'abc' are disjoint). Neat, so you're calculating disjoint patterns inline while also figuring out which one is the "best" for any given pattern set... Neat! > > 4. Otherwise, recurse on step (3) with the slice of the list > corresponding to our current prefix (i.e., the subset of patterns > that have our prefix as a literal string prefix.) > > This algorithm is 'O(kn + n log(n))', where 'k' is max(len(pattern)) for > each pattern in the list, and 'n' is len(patterns). > ok, so if we can assume that k is some relatively small constant number (since the maximum pattern length isn't likely to grow without bounds), this is O(n*log(n)) on the number of patterns, so we don't even approach n^2 even when we are given a large number of patterns. Nice! > By discovering this set of interesting patterns, we reduce the runtime > of multi-pattern 'git for-each-ref' (and other ref traversals) from > O(N) to O(n log(N)), where 'N' is the total number of packed references. So here, n is the number of patterns still? This seems like a pretty significant gane when we have a large number of packed references. > > Running 'git for-each-ref refs/tags/a refs/tags/b' on a repository with > 10,000,000 refs in 'refs/tags/huge-N', my best-of-five times drop from: > > real 0m5.805s > user 0m5.188s > sys 0m0.468s > > to: > > real 0m0.001s > user 0m0.000s > sys 0m0.000s > That's a pretty significant decrease! > On linux.git, the times to dig out two of the latest -rc tags drops from > 0.002s to 0.001s, so the change on repositories with fewer tags is much > less noticeable. > This explains why it might not have been done before.. many repositories wouldn't benefit much. That said, the patch description doesn't make it seem very complicated. I did run out of time reading the message, so I'll have to follow up reviewing the actual change below later. I think the description of the goal and solution is sound though. Thanks, Jake > Co-authored-by: Jeff King <peff@xxxxxxxx> > Signed-off-by: Jeff King <peff@xxxxxxxx> > Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx> > --- > ref-filter.c | 89 +++++++++++++++++++++++++++++------------ > t/t6300-for-each-ref.sh | 26 ++++++++++++ > 2 files changed, 89 insertions(+), 26 deletions(-) > > diff --git a/ref-filter.c b/ref-filter.c > index 8500671bc6..3e10fd647b 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -20,6 +20,7 @@ > #include "commit-slab.h" > #include "commit-graph.h" > #include "commit-reach.h" > +#include "argv-array.h" > > static struct ref_msg { > const char *gone; > @@ -1790,21 +1791,62 @@ static int filter_pattern_match(struct ref_filter *filter, const char *refname) > return match_pattern(filter, refname); > } > > -/* > - * Find the longest prefix of pattern we can pass to > - * `for_each_fullref_in()`, namely the part of pattern preceding the > - * first glob character. (Note that `for_each_fullref_in()` is > - * perfectly happy working with a prefix that doesn't end at a > - * pathname component boundary.) > - */ > -static void find_longest_prefix(struct strbuf *out, const char *pattern) > +static int qsort_strcmp(const void *va, const void *vb) > { > - const char *p; > + const char *a = *(const char **)va; > + const char *b = *(const char **)vb; > > - for (p = pattern; *p && !is_glob_special(*p); p++) > - ; > + return strcmp(a, b); > +} > > - strbuf_add(out, pattern, p - pattern); > +static void find_longest_prefixes_1(struct string_list *out, > + struct strbuf *prefix, > + const char **patterns, size_t nr) > +{ > + size_t i; > + > + for (i = 0; i < nr; i++) { > + char c = patterns[i][prefix->len]; > + if (!c || is_glob_special(c)) { > + string_list_append(out, prefix->buf); > + return; > + } > + } > + > + i = 0; > + while (i < nr) { > + size_t end; > + > + /* > + * Set "end" to the index of the element _after_ the last one > + * in our group. > + */ > + for (end = i + 1; end < nr; end++) { > + if (patterns[i][prefix->len] != patterns[end][prefix->len]) > + break; > + } > + > + strbuf_addch(prefix, patterns[i][prefix->len]); > + find_longest_prefixes_1(out, prefix, patterns + i, end - i); > + strbuf_setlen(prefix, prefix->len - 1); > + > + i = end; > + } > +} > + > +static void find_longest_prefixes(struct string_list *out, > + const char **patterns) > +{ > + struct argv_array sorted = ARGV_ARRAY_INIT; > + struct strbuf prefix = STRBUF_INIT; > + > + argv_array_pushv(&sorted, patterns); > + QSORT(sorted.argv, sorted.argc, qsort_strcmp); > + > + find_longest_prefixes_1(out, &prefix, sorted.argv, sorted.argc); > + > + argv_array_clear(&sorted); > + strbuf_release(&prefix); > } > > /* > @@ -1817,7 +1859,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, > void *cb_data, > int broken) > { > - struct strbuf prefix = STRBUF_INIT; > + struct string_list prefixes = STRING_LIST_INIT_DUP; > + struct string_list_item *prefix; > int ret; > > if (!filter->match_as_path) { > @@ -1843,21 +1886,15 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, > return for_each_fullref_in("", cb, cb_data, broken); > } > > - if (filter->name_patterns[1]) { > - /* > - * multiple patterns; in theory this could still work as long > - * as the patterns are disjoint. We'd just make multiple calls > - * to for_each_ref(). But if they're not disjoint, we'd end up > - * reporting the same ref multiple times. So let's punt on that > - * for now. > - */ > - return for_each_fullref_in("", cb, cb_data, broken); > + find_longest_prefixes(&prefixes, filter->name_patterns); > + > + for_each_string_list_item(prefix, &prefixes) { > + ret = for_each_fullref_in(prefix->string, cb, cb_data, broken); > + if (ret) > + break; > } > > - find_longest_prefix(&prefix, filter->name_patterns[0]); > - > - ret = for_each_fullref_in(prefix.buf, cb, cb_data, broken); > - strbuf_release(&prefix); > + string_list_clear(&prefixes, 0); > return ret; > } > > diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh > index d9235217fc..ab69aa176d 100755 > --- a/t/t6300-for-each-ref.sh > +++ b/t/t6300-for-each-ref.sh > @@ -345,6 +345,32 @@ test_expect_success 'Verify descending sort' ' > test_cmp expected actual > ' > > +cat >expected <<\EOF > +refs/tags/testtag > +refs/tags/testtag-2 > +EOF > + > +test_expect_success 'exercise patterns with prefixes' ' > + git tag testtag-2 && > + test_when_finished "git tag -d testtag-2" && > + git for-each-ref --format="%(refname)" \ > + refs/tags/testtag refs/tags/testtag-2 >actual && > + test_cmp expected actual > +' > + > +cat >expected <<\EOF > +refs/tags/testtag > +refs/tags/testtag-2 > +EOF > + > +test_expect_success 'exercise glob patterns with prefixes' ' > + git tag testtag-2 && > + test_when_finished "git tag -d testtag-2" && > + git for-each-ref --format="%(refname)" \ > + refs/tags/testtag "refs/tags/testtag-*" >actual && > + test_cmp expected actual > +' > + > cat >expected <<\EOF > 'refs/heads/master' > 'refs/remotes/origin/master' > -- > 2.21.0.203.g358da99528