On 3/24/2020 3:15 AM, Elijah Newren wrote: > Hi Matheus, > > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares > <matheus.bernardino@xxxxxx> wrote: >> >> One of the main uses for a sparse checkout is to allow users to focus on >> the subset of files in a repository in which they are interested. But >> git-grep currently ignores the sparsity patterns and report all matches >> found outside this subset, which kind of goes in the oposity direction. >> Let's fix that, making it honor the sparsity boundaries for every >> grepping case: >> >> - git grep in worktree >> - git grep --cached >> - git grep $REVISION > > Wahoo! This is great. I am also excited. Also thrilled to see the option to get the old behavior in the next patch. >> Something I'm not entirely sure in this patch is how we implement the >> mechanism to honor sparsity for the `git grep <commit-ish>` case (which >> is treated in the grep_tree() function). Currently, the patch looks for >> an index entry that matches the path, and then checks its skip_worktree > > As you discuss below, checking the index is both wrong _and_ costly. I'm not sure why checking the index is _wrong_, but I agree about the performance cost. > You should use the sparsity patterns; Stolee did a lot of work to make > those correspond to simple hashes you could check to determine whether > to even walk into a subdirectory. So, O(1). Yeah, that's "only" cone > mode but the non-cone sparsity patterns were a performance nightmare > waiting to rear its ugly head. We should just try to encourage > everyone to move to cone mode, or accept the slowness they get without > it. > >> bit. But this operation is perfomed in O(log(N)); N being the number of >> index entries. If there are many entries (and no so many sparsity >> patterns), maybe a better approach would be to try matching the path >> directly against the sparsity patterns. This would be O(M) in the number >> of patterns, and it could be done, in builtin/grep.c, with a function >> like the following: >> >> static struct pattern_list sparsity_patterns; >> static int sparsity_patterns_initialized = 0; >> static enum pattern_match_result path_matches_sparsity_patterns( >> const char *path, int pathlen, >> const char *basename, >> struct repository *repo) >> { >> int dtype = DT_UNKNOWN; >> >> if (!sparsity_patterns_initialized) { >> char *sparse_file = git_pathdup("info/sparse-checkout"); >> int ret; >> >> memset(&sparsity_patterns, 0, sizeof(sparsity_patterns)); >> sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone; >> ret = add_patterns_from_file_to_list(sparse_file, "", 0, >> &sparsity_patterns, NULL); >> free(sparse_file); >> >> if (ret < 0) >> die(_("failed to load sparse-checkout patterns")); >> sparsity_patterns_initialized = 1; >> } >> >> return path_matches_pattern_list(path, pathlen, basename, &dtype, >> &sparsity_patterns, repo->index); >> } >> >> Also, if I understand correctly, the index doesn't hold paths to dirs, >> right? So even if a complete dir is excluded from sparse checkout, we >> still have to check all its subentries, only to discover that they >> should all be skipped from the search. However, if we were to check >> against the sparsity patterns directly (e.g. with the function above), >> we could skip such directories together with all their entries. When in cone mode, we can check if a directory is one of these three modes: 1. Completely contained in the cone (recursive match) 2. Completely outside the cone 3. Neither. Keep matching subdirectories. (parent match) The clear_ce_flags() code in dir.c includes the matching algorithms for this. Hopefully you can re-use a lot of it. You may need to extract some methods to use them from the grep code. >> Oh, and there is also the case of a commit whose tree paths are not in >> the index (maybe manually created objects?). For such commits, with the >> index lookup approach, we would have to fall back on ignoring the >> sparsity rules. I'm not sure if that would be OK, though. >> >> Any thoughts on these two approaches (looking up the skip_worktree bit >> in the index or directly matching against sparsity patterns), will be >> highly appreciated. (Note that it only concerns the `git grep >> <commit-ish>` case. The other cases already iterate thought the index, so >> there is no O(log(N)) extra complexity). >> >> builtin/grep.c | 29 ++++++++--- >> t/t7011-skip-worktree-reading.sh | 9 ---- >> t/t7817-grep-sparse-checkout.sh | 88 ++++++++++++++++++++++++++++++++ >> 3 files changed, 111 insertions(+), 15 deletions(-) >> create mode 100755 t/t7817-grep-sparse-checkout.sh >> >> diff --git a/builtin/grep.c b/builtin/grep.c >> index 99e2685090..52ec72a036 100644 >> --- a/builtin/grep.c >> +++ b/builtin/grep.c >> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt, >> const struct pathspec *pathspec, int cached); >> static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, >> struct tree_desc *tree, struct strbuf *base, int tn_len, >> - int check_attr); >> + int from_commit); > > I'm not familiar with grep.c and have to admit I don't know what > "check_attr" means. Slightly surprised to see you replace it, but > maybe reading the rest will explain... > >> >> static int grep_submodule(struct grep_opt *opt, >> const struct pathspec *pathspec, >> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt, >> >> for (nr = 0; nr < repo->index->cache_nr; nr++) { >> const struct cache_entry *ce = repo->index->cache[nr]; >> + >> + if (ce_skip_worktree(ce)) >> + continue; >> + > > Looks good for the case where we are grepping through what's cached. > >> strbuf_setlen(&name, name_base_len); >> strbuf_addstr(&name, ce->name); >> >> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt, >> * cache entry are identical, even if worktree file has >> * been modified, so use cache version instead >> */ >> - if (cached || (ce->ce_flags & CE_VALID) || >> - ce_skip_worktree(ce)) { >> + if (cached || (ce->ce_flags & CE_VALID)) { > > I had the same change when I was trying to hack something like this > patch into place but only handled the worktree case before realized it > was a bit bigger job. > >> if (ce_stage(ce) || ce_intent_to_add(ce)) >> continue; >> hit |= grep_oid(opt, &ce->oid, name.buf, >> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt, >> >> static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, >> struct tree_desc *tree, struct strbuf *base, int tn_len, >> - int check_attr) >> + int from_commit) >> { >> struct repository *repo = opt->repo; >> int hit = 0; >> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, >> name_base_len = name.len; >> } >> >> + if (from_commit && repo_read_index(repo) < 0) >> + die(_("index file corrupt")); >> + > > As above, I don't think we should need to read the index. We should > compare to sparsity patterns, which in the important case (cone mode) > simplifies to a hash lookup as we walk directories. > >> while (tree_entry(tree, &entry)) { >> int te_len = tree_entry_len(&entry); >> >> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, >> >> strbuf_add(base, entry.path, te_len); >> >> + if (from_commit) { >> + int pos = index_name_pos(repo->index, >> + base->buf + tn_len, >> + base->len - tn_len); >> + if (pos >= 0 && >> + ce_skip_worktree(repo->index->cache[pos])) { >> + strbuf_setlen(base, old_baselen); >> + continue; >> + } >> + } >> + >> if (S_ISREG(entry.mode)) { >> hit |= grep_oid(opt, &entry.oid, base->buf, tn_len, >> - check_attr ? base->buf + tn_len : NULL); >> + from_commit ? base->buf + tn_len : NULL); > > Sadly, this doesn't help me understand check_attr or from_commit. > Could you clue me in a bit? Yeah, Elijah and I know the sparse-checkout code quite well, but are unfamiliar with grep. Let's all expand our knowledge! >> } else if (S_ISDIR(entry.mode)) { >> enum object_type type; >> struct tree_desc sub; >> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, >> strbuf_addch(base, '/'); >> init_tree_desc(&sub, data, size); >> hit |= grep_tree(opt, pathspec, &sub, base, tn_len, >> - check_attr); >> + from_commit); > > Same. > >> free(data); >> } else if (recurse_submodules && S_ISGITLINK(entry.mode)) { >> hit |= grep_submodule(opt, pathspec, &entry.oid, >> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh >> index 37525cae3a..26852586ac 100755 >> --- a/t/t7011-skip-worktree-reading.sh >> +++ b/t/t7011-skip-worktree-reading.sh >> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' ' >> test -z "$(git ls-files -m)" >> ' >> >> -test_expect_success 'grep with skip-worktree file' ' >> - git update-index --no-skip-worktree 1 && >> - echo test > 1 && >> - git update-index 1 && >> - git update-index --skip-worktree 1 && >> - rm 1 && >> - test "$(git grep --no-ext-grep test)" = "1:test" >> -' >> - >> echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected >> test_expect_success 'diff-index does not examine skip-worktree absent entries' ' >> setup_absent && >> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh >> new file mode 100755 >> index 0000000000..fccf44e829 >> --- /dev/null >> +++ b/t/t7817-grep-sparse-checkout.sh >> @@ -0,0 +1,88 @@ >> +#!/bin/sh >> + >> +test_description='grep in sparse checkout >> + >> +This test creates the following dir structure: >> +. >> +| - a >> +| - b >> +| - dir >> + | - c >> + >> +Only "a" should be present due to the sparse checkout patterns: >> +"/*", "!/b" and "!/dir". >> +' >> + >> +. ./test-lib.sh >> + >> +test_expect_success 'setup' ' >> + echo "text" >a && >> + echo "text" >b && >> + mkdir dir && >> + echo "text" >dir/c && >> + git add a b dir && >> + git commit -m "initial commit" && >> + git tag -am t-commit t-commit HEAD && >> + tree=$(git rev-parse HEAD^{tree}) && >> + git tag -am t-tree t-tree $tree && >> + cat >.git/info/sparse-checkout <<-EOF && >> + /* >> + !/b >> + !/dir >> + EOF >> + git sparse-checkout init && > > Using `git sparse-checkout init` but then manually writing to > .git/info/sparse-checkout? Seems like it'd make more sense to use > `git sparse-checkout set` than writing the patterns directly yourself. > Also, would prefer to have the examples use cone mode (even if you > have to add subdirectories), as it makes the testcase a bit easier to > read and more performant, though neither is a big deal. I agree that we should use the builtin so your test script is less brittle to potential back-end changes to sparse-checkout (none planned). I do recommend having at least one test with non-cone mode patterns, especially if you are checking the pattern-matching yourself instead of relying on the index. Thanks, -Stolee