Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Matheus,

On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
<matheus.bernardino@xxxxxx> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the oposity direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

Wahoo!  This is great.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)
>
> This is also what some users reported[1] they would want as the default
> behavior.
>
> Note: for `git grep $REVISION`, we will choose to honor the sparsity
> patterns only when $REVISION is a commit-ish object. The reason is that,

Makes sense.

> for a tree, we don't know whether it represents the root of a
> repository or a subtree. So we wouldn't be able to correctly match it
> against the sparsity patterns. E.g. suppose we have a repository with
> these two sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects (tags to commits should be fine).
>
> Finally, the old behavior is still desirable for some use cases. So the
> next patch will add an option to allow restoring it when needed.
>
> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@xxxxxxxxxxxxxx/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@xxxxxx>
> ---
>
> Something I'm not entirely sure in this patch is how we implement the
> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> is treated in the grep_tree() function). Currently, the patch looks for
> an index entry that matches the path, and then checks its skip_worktree

As you discuss below, checking the index is both wrong _and_ costly.
You should use the sparsity patterns; Stolee did a lot of work to make
those correspond to simple hashes you could check to determine whether
to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
mode but the non-cone sparsity patterns were a performance nightmare
waiting to rear its ugly head.  We should just try to encourage
everyone to move to cone mode, or accept the slowness they get without
it.

> bit. But this operation is perfomed in O(log(N)); N being the number of
> index entries. If there are many entries (and no so many sparsity
> patterns), maybe a better approach would be to try matching the path
> directly against the sparsity patterns. This would be O(M) in the number
> of patterns, and it could be done, in builtin/grep.c, with a function
> like the following:
>
> static struct pattern_list sparsity_patterns;
> static int sparsity_patterns_initialized = 0;
> static enum pattern_match_result path_matches_sparsity_patterns(
>                                         const char *path, int pathlen,
>                                         const char *basename,
>                                         struct repository *repo)
> {
>         int dtype = DT_UNKNOWN;
>
>         if (!sparsity_patterns_initialized) {
>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>                 int ret;
>
>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>                                                      &sparsity_patterns, NULL);
>                 free(sparse_file);
>
>                 if (ret < 0)
>                         die(_("failed to load sparse-checkout patterns"));
>                 sparsity_patterns_initialized = 1;
>         }
>
>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>                                          &sparsity_patterns, repo->index);
> }
>
> Also, if I understand correctly, the index doesn't hold paths to dirs,
> right? So even if a complete dir is excluded from sparse checkout, we
> still have to check all its subentries, only to discover that they
> should all be skipped from the search. However, if we were to check
> against the sparsity patterns directly (e.g. with the function above),
> we could skip such directories together with all their entries.
>
> Oh, and there is also the case of a commit whose tree paths are not in
> the index (maybe manually created objects?). For such commits, with the
> index lookup approach, we would have to fall back on ignoring the
> sparsity rules. I'm not sure if that would be OK, though.
>
> Any thoughts on these two approaches (looking up the skip_worktree bit
> in the index or directly matching against sparsity patterns), will be
> highly appreciated. (Note that it only concerns the `git grep
> <commit-ish>` case. The other cases already iterate thought the index, so
> there is no O(log(N)) extra complexity).
>
>  builtin/grep.c                   | 29 ++++++++---
>  t/t7011-skip-worktree-reading.sh |  9 ----
>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+), 15 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 99e2685090..52ec72a036 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int from_commit);

I'm not familiar with grep.c and have to admit I don't know what
"check_attr" means.  Slightly surprised to see you replace it, but
maybe reading the rest will explain...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +

Looks good for the case where we are grepping through what's cached.

>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {

I had the same change when I was trying to hack something like this
patch into place but only handled the worktree case before realized it
was a bit bigger job.

>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)
> +                    int from_commit)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> +       if (from_commit && repo_read_index(repo) < 0)
> +               die(_("index file corrupt"));
> +

As above, I don't think we should need to read the index.  We should
compare to sparsity patterns, which in the important case (cone mode)
simplifies to a hash lookup as we walk directories.

>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
>
> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (from_commit) {
> +                       int pos = index_name_pos(repo->index,
> +                                                base->buf + tn_len,
> +                                                base->len - tn_len);
> +                       if (pos >= 0 &&
> +                           ce_skip_worktree(repo->index->cache[pos])) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> -                                        check_attr ? base->buf + tn_len : NULL);
> +                                       from_commit ? base->buf + tn_len : NULL);

Sadly, this doesn't help me understand check_attr or from_commit.
Could you clue me in a bit?

>                 } else if (S_ISDIR(entry.mode)) {
>                         enum object_type type;
>                         struct tree_desc sub;
> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                                        from_commit);

Same.

>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..fccf44e829
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,88 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates the following dir structure:
> +.
> +| - a
> +| - b
> +| - dir
> +    | - c
> +
> +Only "a" should be present due to the sparse checkout patterns:
> +"/*", "!/b" and "!/dir".
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +       git add a b dir &&
> +       git commit -m "initial commit" &&
> +       git tag -am t-commit t-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am t-tree t-tree $tree &&
> +       cat >.git/info/sparse-checkout <<-EOF &&
> +       /*
> +       !/b
> +       !/dir
> +       EOF
> +       git sparse-checkout init &&

Using `git sparse-checkout init` but then manually writing to
.git/info/sparse-checkout?  Seems like it'd make more sense to use
`git sparse-checkout set` than writing the patterns directly yourself.
Also, would prefer to have the examples use cone mode (even if you
have to add subdirectories), as it makes the testcase a bit easier to
read and more performant, though neither is a big deal.

> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_file a
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_t-commit <<-EOF &&
> +       t-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" t-commit >actual_t-commit &&
> +       test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '

I think the test is fine but the title seems misleading.  "outside"
and "inside" aren't defined because <tree-ish> isn't known to be
rooted, meaning we have no way to apply the sparsity patterns.  So
perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_t-tree <<-EOF &&
> +       t-tree:a:text
> +       t-tree:b:text
> +       t-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" t-tree >actual_t-tree &&
> +       test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_done
> --
> 2.25.1



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux