Hi Matheus, On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares <matheus.bernardino@xxxxxx> wrote: > > One of the main uses for a sparse checkout is to allow users to focus on > the subset of files in a repository in which they are interested. But > git-grep currently ignores the sparsity patterns and report all matches > found outside this subset, which kind of goes in the oposity direction. > Let's fix that, making it honor the sparsity boundaries for every > grepping case: > > - git grep in worktree > - git grep --cached > - git grep $REVISION Wahoo! This is great. > - git grep --untracked and git grep --no-index (which already respect > sparse checkout boundaries) > > This is also what some users reported[1] they would want as the default > behavior. > > Note: for `git grep $REVISION`, we will choose to honor the sparsity > patterns only when $REVISION is a commit-ish object. The reason is that, Makes sense. > for a tree, we don't know whether it represents the root of a > repository or a subtree. So we wouldn't be able to correctly match it > against the sparsity patterns. E.g. suppose we have a repository with > these two sparsity rules: "/*" and "!/a"; and the following structure: > > / > | - a (file) > | - d (dir) > | - a (file) > > If `git grep $REVISION` were to honor the sparsity patterns for every > object type, when grepping the /d tree, we would wrongly ignore the /d/a > file. This happens because we wouldn't know it resides in /d and > therefore it would wrongly match the pattern "!/a". Furthermore, for a > search in a blob object, we wouldn't even have a path to check the > patterns against. So, let's ignore the sparsity patterns when grepping > non-commit-ish objects (tags to commits should be fine). > > Finally, the old behavior is still desirable for some use cases. So the > next patch will add an option to allow restoring it when needed. > > [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@xxxxxxxxxxxxxx/ > > Signed-off-by: Matheus Tavares <matheus.bernardino@xxxxxx> > --- > > Something I'm not entirely sure in this patch is how we implement the > mechanism to honor sparsity for the `git grep <commit-ish>` case (which > is treated in the grep_tree() function). Currently, the patch looks for > an index entry that matches the path, and then checks its skip_worktree As you discuss below, checking the index is both wrong _and_ costly. You should use the sparsity patterns; Stolee did a lot of work to make those correspond to simple hashes you could check to determine whether to even walk into a subdirectory. So, O(1). Yeah, that's "only" cone mode but the non-cone sparsity patterns were a performance nightmare waiting to rear its ugly head. We should just try to encourage everyone to move to cone mode, or accept the slowness they get without it. > bit. But this operation is perfomed in O(log(N)); N being the number of > index entries. If there are many entries (and no so many sparsity > patterns), maybe a better approach would be to try matching the path > directly against the sparsity patterns. This would be O(M) in the number > of patterns, and it could be done, in builtin/grep.c, with a function > like the following: > > static struct pattern_list sparsity_patterns; > static int sparsity_patterns_initialized = 0; > static enum pattern_match_result path_matches_sparsity_patterns( > const char *path, int pathlen, > const char *basename, > struct repository *repo) > { > int dtype = DT_UNKNOWN; > > if (!sparsity_patterns_initialized) { > char *sparse_file = git_pathdup("info/sparse-checkout"); > int ret; > > memset(&sparsity_patterns, 0, sizeof(sparsity_patterns)); > sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone; > ret = add_patterns_from_file_to_list(sparse_file, "", 0, > &sparsity_patterns, NULL); > free(sparse_file); > > if (ret < 0) > die(_("failed to load sparse-checkout patterns")); > sparsity_patterns_initialized = 1; > } > > return path_matches_pattern_list(path, pathlen, basename, &dtype, > &sparsity_patterns, repo->index); > } > > Also, if I understand correctly, the index doesn't hold paths to dirs, > right? So even if a complete dir is excluded from sparse checkout, we > still have to check all its subentries, only to discover that they > should all be skipped from the search. However, if we were to check > against the sparsity patterns directly (e.g. with the function above), > we could skip such directories together with all their entries. > > Oh, and there is also the case of a commit whose tree paths are not in > the index (maybe manually created objects?). For such commits, with the > index lookup approach, we would have to fall back on ignoring the > sparsity rules. I'm not sure if that would be OK, though. > > Any thoughts on these two approaches (looking up the skip_worktree bit > in the index or directly matching against sparsity patterns), will be > highly appreciated. (Note that it only concerns the `git grep > <commit-ish>` case. The other cases already iterate thought the index, so > there is no O(log(N)) extra complexity). > > builtin/grep.c | 29 ++++++++--- > t/t7011-skip-worktree-reading.sh | 9 ---- > t/t7817-grep-sparse-checkout.sh | 88 ++++++++++++++++++++++++++++++++ > 3 files changed, 111 insertions(+), 15 deletions(-) > create mode 100755 t/t7817-grep-sparse-checkout.sh > > diff --git a/builtin/grep.c b/builtin/grep.c > index 99e2685090..52ec72a036 100644 > --- a/builtin/grep.c > +++ b/builtin/grep.c > @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt, > const struct pathspec *pathspec, int cached); > static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > struct tree_desc *tree, struct strbuf *base, int tn_len, > - int check_attr); > + int from_commit); I'm not familiar with grep.c and have to admit I don't know what "check_attr" means. Slightly surprised to see you replace it, but maybe reading the rest will explain... > > static int grep_submodule(struct grep_opt *opt, > const struct pathspec *pathspec, > @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt, > > for (nr = 0; nr < repo->index->cache_nr; nr++) { > const struct cache_entry *ce = repo->index->cache[nr]; > + > + if (ce_skip_worktree(ce)) > + continue; > + Looks good for the case where we are grepping through what's cached. > strbuf_setlen(&name, name_base_len); > strbuf_addstr(&name, ce->name); > > @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt, > * cache entry are identical, even if worktree file has > * been modified, so use cache version instead > */ > - if (cached || (ce->ce_flags & CE_VALID) || > - ce_skip_worktree(ce)) { > + if (cached || (ce->ce_flags & CE_VALID)) { I had the same change when I was trying to hack something like this patch into place but only handled the worktree case before realized it was a bit bigger job. > if (ce_stage(ce) || ce_intent_to_add(ce)) > continue; > hit |= grep_oid(opt, &ce->oid, name.buf, > @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt, > > static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > struct tree_desc *tree, struct strbuf *base, int tn_len, > - int check_attr) > + int from_commit) > { > struct repository *repo = opt->repo; > int hit = 0; > @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > name_base_len = name.len; > } > > + if (from_commit && repo_read_index(repo) < 0) > + die(_("index file corrupt")); > + As above, I don't think we should need to read the index. We should compare to sparsity patterns, which in the important case (cone mode) simplifies to a hash lookup as we walk directories. > while (tree_entry(tree, &entry)) { > int te_len = tree_entry_len(&entry); > > @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > strbuf_add(base, entry.path, te_len); > > + if (from_commit) { > + int pos = index_name_pos(repo->index, > + base->buf + tn_len, > + base->len - tn_len); > + if (pos >= 0 && > + ce_skip_worktree(repo->index->cache[pos])) { > + strbuf_setlen(base, old_baselen); > + continue; > + } > + } > + > if (S_ISREG(entry.mode)) { > hit |= grep_oid(opt, &entry.oid, base->buf, tn_len, > - check_attr ? base->buf + tn_len : NULL); > + from_commit ? base->buf + tn_len : NULL); Sadly, this doesn't help me understand check_attr or from_commit. Could you clue me in a bit? > } else if (S_ISDIR(entry.mode)) { > enum object_type type; > struct tree_desc sub; > @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > strbuf_addch(base, '/'); > init_tree_desc(&sub, data, size); > hit |= grep_tree(opt, pathspec, &sub, base, tn_len, > - check_attr); > + from_commit); Same. > free(data); > } else if (recurse_submodules && S_ISGITLINK(entry.mode)) { > hit |= grep_submodule(opt, pathspec, &entry.oid, > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh > index 37525cae3a..26852586ac 100755 > --- a/t/t7011-skip-worktree-reading.sh > +++ b/t/t7011-skip-worktree-reading.sh > @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' ' > test -z "$(git ls-files -m)" > ' > > -test_expect_success 'grep with skip-worktree file' ' > - git update-index --no-skip-worktree 1 && > - echo test > 1 && > - git update-index 1 && > - git update-index --skip-worktree 1 && > - rm 1 && > - test "$(git grep --no-ext-grep test)" = "1:test" > -' > - > echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected > test_expect_success 'diff-index does not examine skip-worktree absent entries' ' > setup_absent && > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh > new file mode 100755 > index 0000000000..fccf44e829 > --- /dev/null > +++ b/t/t7817-grep-sparse-checkout.sh > @@ -0,0 +1,88 @@ > +#!/bin/sh > + > +test_description='grep in sparse checkout > + > +This test creates the following dir structure: > +. > +| - a > +| - b > +| - dir > + | - c > + > +Only "a" should be present due to the sparse checkout patterns: > +"/*", "!/b" and "!/dir". > +' > + > +. ./test-lib.sh > + > +test_expect_success 'setup' ' > + echo "text" >a && > + echo "text" >b && > + mkdir dir && > + echo "text" >dir/c && > + git add a b dir && > + git commit -m "initial commit" && > + git tag -am t-commit t-commit HEAD && > + tree=$(git rev-parse HEAD^{tree}) && > + git tag -am t-tree t-tree $tree && > + cat >.git/info/sparse-checkout <<-EOF && > + /* > + !/b > + !/dir > + EOF > + git sparse-checkout init && Using `git sparse-checkout init` but then manually writing to .git/info/sparse-checkout? Seems like it'd make more sense to use `git sparse-checkout set` than writing the patterns directly yourself. Also, would prefer to have the examples use cone mode (even if you have to add subdirectories), as it makes the testcase a bit easier to read and more performant, though neither is a big deal. > + test_path_is_missing b && > + test_path_is_missing dir && > + test_path_is_file a > +' > + > +test_expect_success 'grep in working tree should honor sparse checkout' ' > + cat >expect <<-EOF && > + a:text > + EOF > + git grep "text" >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'grep --cached should honor sparse checkout' ' > + cat >expect <<-EOF && > + a:text > + EOF > + git grep --cached "text" >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'grep <commit-ish> should honor sparse checkout' ' > + commit=$(git rev-parse HEAD) && > + cat >expect_commit <<-EOF && > + $commit:a:text > + EOF > + cat >expect_t-commit <<-EOF && > + t-commit:a:text > + EOF > + git grep "text" $commit >actual_commit && > + test_cmp expect_commit actual_commit && > + git grep "text" t-commit >actual_t-commit && > + test_cmp expect_t-commit actual_t-commit > +' > + > +test_expect_success 'grep <tree-ish> should search outside sparse checkout' ' I think the test is fine but the title seems misleading. "outside" and "inside" aren't defined because <tree-ish> isn't known to be rooted, meaning we have no way to apply the sparsity patterns. So perhaps just 'grep <tree-ish> should ignore sparsity patterns'? > + commit=$(git rev-parse HEAD) && > + tree=$(git rev-parse HEAD^{tree}) && > + cat >expect_tree <<-EOF && > + $tree:a:text > + $tree:b:text > + $tree:dir/c:text > + EOF > + cat >expect_t-tree <<-EOF && > + t-tree:a:text > + t-tree:b:text > + t-tree:dir/c:text > + EOF > + git grep "text" $tree >actual_tree && > + test_cmp expect_tree actual_tree && > + git grep "text" t-tree >actual_t-tree && > + test_cmp expect_t-tree actual_t-tree > +' > + > +test_done > -- > 2.25.1