Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@xxxxxx> wrote:
> >
> > Something I'm not entirely sure in this patch is how we implement the
> > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > is treated in the grep_tree() function). Currently, the patch looks for
> > an index entry that matches the path, and then checks its skip_worktree
>
> As you discuss below, checking the index is both wrong _and_ costly.
> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.

OK, makes sense. And your reply to Stolee, later in this thread, made
it clearer for me why checking the index is not only costly but also
wrong. Thanks for the great explanation! I will use the sparsity
patterns directly, in the next iteration.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 99e2685090..52ec72a036 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
> >                       const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                    int check_attr);
> > +                    int from_commit);
>
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means. Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
...
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
>
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Sure! The grep machinery can optionally look the .gitattributes file,
to see if a given path has a "diff" attribute assigned to it. This
attribute points to a diff driver in .gitconfig, which can specify
many things, such as whether the path should be treated as a binary or
not. The "check_attr" flag passed to grep_tree() tells the grep
machinery if it should perform this attribute lookup for the paths in
the given tree.

I decided to replace it with "from_commit" because the only times we
want an attribute lookup when grepping a tree, is when it comes from a
commit. I.e., when the tree is the root. (The reasoning goes in the
same lines as for why we only check sparsity patterns in git-grep for
commit-ish objects: we cannot check pattern matching for trees which
we are not sure to be rooted). Since "knowing if the tree is a root or
not" is useful in grep_tree() for both sparsity checks and attribute
checks, I thought we could use a single "from_commit" variable instead
of "check_attr" and "check_sparsity", which would always have matching
values. But on second thought, I could maybe rename the variable to
something as "is_root_tree" or add a comment explaining the usage of
"from_commit".

(I'm not a big fan of "is_root_tree", thought, because we could give a
root tree to grep_tree() but not really know it.)

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..fccf44e829
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
...
> > +test_expect_success 'setup' '
> > +       echo "text" >a &&
> > +       echo "text" >b &&
> > +       mkdir dir &&
> > +       echo "text" >dir/c &&
> > +       git add a b dir &&
> > +       git commit -m "initial commit" &&
> > +       git tag -am t-commit t-commit HEAD &&
> > +       tree=$(git rev-parse HEAD^{tree}) &&
> > +       git tag -am t-tree t-tree $tree &&
> > +       cat >.git/info/sparse-checkout <<-EOF &&
> > +       /*
> > +       !/b
> > +       !/dir
> > +       EOF
> > +       git sparse-checkout init &&
>
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

OK, I will make use of the builtin here. I will also use the cone mode
(and leave one test without it, as Stolee suggested later in this
thread).

> > +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>
> I think the test is fine but the title seems misleading.  "outside"
> and "inside" aren't defined because <tree-ish> isn't known to be
> rooted, meaning we have no way to apply the sparsity patterns.  So
> perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

Right! "should ignore sparsity patterns" is a much better name, thanks.

Thanks a lot for the thoughtful review and comments!



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux