One of the main uses for a sparse checkout is to allow users to focus on the subset of files in a repository in which they are interested. But git-grep currently ignores the sparsity patterns and report all matches found outside this subset, which kind of goes in the oposity direction. Let's fix that, making it honor the sparsity boundaries for every grepping case: - git grep in worktree - git grep --cached - git grep $REVISION - git grep --untracked and git grep --no-index (which already respect sparse checkout boundaries) This is also what some users reported[1] they would want as the default behavior. Note: for `git grep $REVISION`, we will choose to honor the sparsity patterns only when $REVISION is a commit-ish object. The reason is that, for a tree, we don't know whether it represents the root of a repository or a subtree. So we wouldn't be able to correctly match it against the sparsity patterns. E.g. suppose we have a repository with these two sparsity rules: "/*" and "!/a"; and the following structure: / | - a (file) | - d (dir) | - a (file) If `git grep $REVISION` were to honor the sparsity patterns for every object type, when grepping the /d tree, we would wrongly ignore the /d/a file. This happens because we wouldn't know it resides in /d and therefore it would wrongly match the pattern "!/a". Furthermore, for a search in a blob object, we wouldn't even have a path to check the patterns against. So, let's ignore the sparsity patterns when grepping non-commit-ish objects (tags to commits should be fine). Finally, the old behavior is still desirable for some use cases. So the next patch will add an option to allow restoring it when needed. [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@xxxxxxxxxxxxxx/ Signed-off-by: Matheus Tavares <matheus.bernardino@xxxxxx> --- Something I'm not entirely sure in this patch is how we implement the mechanism to honor sparsity for the `git grep <commit-ish>` case (which is treated in the grep_tree() function). Currently, the patch looks for an index entry that matches the path, and then checks its skip_worktree bit. But this operation is perfomed in O(log(N)); N being the number of index entries. If there are many entries (and no so many sparsity patterns), maybe a better approach would be to try matching the path directly against the sparsity patterns. This would be O(M) in the number of patterns, and it could be done, in builtin/grep.c, with a function like the following: static struct pattern_list sparsity_patterns; static int sparsity_patterns_initialized = 0; static enum pattern_match_result path_matches_sparsity_patterns( const char *path, int pathlen, const char *basename, struct repository *repo) { int dtype = DT_UNKNOWN; if (!sparsity_patterns_initialized) { char *sparse_file = git_pathdup("info/sparse-checkout"); int ret; memset(&sparsity_patterns, 0, sizeof(sparsity_patterns)); sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone; ret = add_patterns_from_file_to_list(sparse_file, "", 0, &sparsity_patterns, NULL); free(sparse_file); if (ret < 0) die(_("failed to load sparse-checkout patterns")); sparsity_patterns_initialized = 1; } return path_matches_pattern_list(path, pathlen, basename, &dtype, &sparsity_patterns, repo->index); } Also, if I understand correctly, the index doesn't hold paths to dirs, right? So even if a complete dir is excluded from sparse checkout, we still have to check all its subentries, only to discover that they should all be skipped from the search. However, if we were to check against the sparsity patterns directly (e.g. with the function above), we could skip such directories together with all their entries. Oh, and there is also the case of a commit whose tree paths are not in the index (maybe manually created objects?). For such commits, with the index lookup approach, we would have to fall back on ignoring the sparsity rules. I'm not sure if that would be OK, though. Any thoughts on these two approaches (looking up the skip_worktree bit in the index or directly matching against sparsity patterns), will be highly appreciated. (Note that it only concerns the `git grep <commit-ish>` case. The other cases already iterate thought the index, so there is no O(log(N)) extra complexity). builtin/grep.c | 29 ++++++++--- t/t7011-skip-worktree-reading.sh | 9 ---- t/t7817-grep-sparse-checkout.sh | 88 ++++++++++++++++++++++++++++++++ 3 files changed, 111 insertions(+), 15 deletions(-) create mode 100755 t/t7817-grep-sparse-checkout.sh diff --git a/builtin/grep.c b/builtin/grep.c index 99e2685090..52ec72a036 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int cached); static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, struct tree_desc *tree, struct strbuf *base, int tn_len, - int check_attr); + int from_commit); static int grep_submodule(struct grep_opt *opt, const struct pathspec *pathspec, @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt, for (nr = 0; nr < repo->index->cache_nr; nr++) { const struct cache_entry *ce = repo->index->cache[nr]; + + if (ce_skip_worktree(ce)) + continue; + strbuf_setlen(&name, name_base_len); strbuf_addstr(&name, ce->name); @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt, * cache entry are identical, even if worktree file has * been modified, so use cache version instead */ - if (cached || (ce->ce_flags & CE_VALID) || - ce_skip_worktree(ce)) { + if (cached || (ce->ce_flags & CE_VALID)) { if (ce_stage(ce) || ce_intent_to_add(ce)) continue; hit |= grep_oid(opt, &ce->oid, name.buf, @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt, static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, struct tree_desc *tree, struct strbuf *base, int tn_len, - int check_attr) + int from_commit) { struct repository *repo = opt->repo; int hit = 0; @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, name_base_len = name.len; } + if (from_commit && repo_read_index(repo) < 0) + die(_("index file corrupt")); + while (tree_entry(tree, &entry)) { int te_len = tree_entry_len(&entry); @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, strbuf_add(base, entry.path, te_len); + if (from_commit) { + int pos = index_name_pos(repo->index, + base->buf + tn_len, + base->len - tn_len); + if (pos >= 0 && + ce_skip_worktree(repo->index->cache[pos])) { + strbuf_setlen(base, old_baselen); + continue; + } + } + if (S_ISREG(entry.mode)) { hit |= grep_oid(opt, &entry.oid, base->buf, tn_len, - check_attr ? base->buf + tn_len : NULL); + from_commit ? base->buf + tn_len : NULL); } else if (S_ISDIR(entry.mode)) { enum object_type type; struct tree_desc sub; @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, strbuf_addch(base, '/'); init_tree_desc(&sub, data, size); hit |= grep_tree(opt, pathspec, &sub, base, tn_len, - check_attr); + from_commit); free(data); } else if (recurse_submodules && S_ISGITLINK(entry.mode)) { hit |= grep_submodule(opt, pathspec, &entry.oid, diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh index 37525cae3a..26852586ac 100755 --- a/t/t7011-skip-worktree-reading.sh +++ b/t/t7011-skip-worktree-reading.sh @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' ' test -z "$(git ls-files -m)" ' -test_expect_success 'grep with skip-worktree file' ' - git update-index --no-skip-worktree 1 && - echo test > 1 && - git update-index 1 && - git update-index --skip-worktree 1 && - rm 1 && - test "$(git grep --no-ext-grep test)" = "1:test" -' - echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected test_expect_success 'diff-index does not examine skip-worktree absent entries' ' setup_absent && diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh new file mode 100755 index 0000000000..fccf44e829 --- /dev/null +++ b/t/t7817-grep-sparse-checkout.sh @@ -0,0 +1,88 @@ +#!/bin/sh + +test_description='grep in sparse checkout + +This test creates the following dir structure: +. +| - a +| - b +| - dir + | - c + +Only "a" should be present due to the sparse checkout patterns: +"/*", "!/b" and "!/dir". +' + +. ./test-lib.sh + +test_expect_success 'setup' ' + echo "text" >a && + echo "text" >b && + mkdir dir && + echo "text" >dir/c && + git add a b dir && + git commit -m "initial commit" && + git tag -am t-commit t-commit HEAD && + tree=$(git rev-parse HEAD^{tree}) && + git tag -am t-tree t-tree $tree && + cat >.git/info/sparse-checkout <<-EOF && + /* + !/b + !/dir + EOF + git sparse-checkout init && + test_path_is_missing b && + test_path_is_missing dir && + test_path_is_file a +' + +test_expect_success 'grep in working tree should honor sparse checkout' ' + cat >expect <<-EOF && + a:text + EOF + git grep "text" >actual && + test_cmp expect actual +' + +test_expect_success 'grep --cached should honor sparse checkout' ' + cat >expect <<-EOF && + a:text + EOF + git grep --cached "text" >actual && + test_cmp expect actual +' + +test_expect_success 'grep <commit-ish> should honor sparse checkout' ' + commit=$(git rev-parse HEAD) && + cat >expect_commit <<-EOF && + $commit:a:text + EOF + cat >expect_t-commit <<-EOF && + t-commit:a:text + EOF + git grep "text" $commit >actual_commit && + test_cmp expect_commit actual_commit && + git grep "text" t-commit >actual_t-commit && + test_cmp expect_t-commit actual_t-commit +' + +test_expect_success 'grep <tree-ish> should search outside sparse checkout' ' + commit=$(git rev-parse HEAD) && + tree=$(git rev-parse HEAD^{tree}) && + cat >expect_tree <<-EOF && + $tree:a:text + $tree:b:text + $tree:dir/c:text + EOF + cat >expect_t-tree <<-EOF && + t-tree:a:text + t-tree:b:text + t-tree:dir/c:text + EOF + git grep "text" $tree >actual_tree && + test_cmp expect_tree actual_tree && + git grep "text" t-tree >actual_t-tree && + test_cmp expect_t-tree actual_t-tree +' + +test_done -- 2.25.1