On 9/1/2022 8:28 PM, Victoria Dye wrote: > Shaoxuan Yuan wrote: >> Before this patch, whenever --sparse is used, `git-grep` utilizes the >> ensure_full_index() method to expand the index and search all the >> entries. Because this method requires walking all the trees and >> constructing the index, it is the slow part within the whole command. >> >> To achieve better performance, this patch uses grep_tree() to search the >> sparse directory entries and get rid of the ensure_full_index() method. >> >> Why grep_tree() is a better choice over ensure_full_index()? >> >> 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks >> into every sparse-directory entry (represented by a tree) recursively >> when looping over the index, and the result of doing so matches the >> result of expanding the index. >> >> 2) grep_tree() utilizes pathspecs to limit the scope of searching. >> ensure_full_index() always expands the index when --sparse is used, >> that means it will always walk all the trees and blobs in the repo >> without caring if the user only wants a subset of the content, i.e. >> using a pathspec. On the other hand, grep_tree() will only search >> the contents that match the pathspec, and thus possibly walking fewer >> trees. >> >> 3) grep_tree() does not construct and copy back a new index, while >> ensure_full_index() does. This also saves some time. > > Would you mind adding some 'ensure_not_expanded' cases to 't1092' to codify > this (probably in the 'grep is not expanded' test created in patch 2)? If > I'm understanding this patch correctly, you've updated 'git grep' so that it > *never* needs to expand the index. In that case, it would be good to > exercise a bunch of 'git grep' options (pathspecs inside and outside the > sparse cone, wildcard pathspecs, etc.) to confirm that. Sure! >> >> ---------------- >> Performance test >> >> - Summary: >> >> p2000 tests demonstrate a ~91% execution time reduction for >> `git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking >> logic. >> >> Test HEAD~ HEAD >> --------------------------------------------------------------------------------------------------- >> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3) 0.11 0.09 (≈) >> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4) 0.08 0.09 (≈) >> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3) 0.44 0.04 (-90.9%) >> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4) 0.46 0.04 (-91.3%) > > These are fantastic results! > >> >> - Command used for testing: >> >> git grep --cached --sparse bogus -- f2/f1/f1/builtin/* >> >> The reason for specifying a pathspec is that, if we don't specify a >> pathspec, then grep_tree() will walk all the trees and blobs to find the >> pattern, and the time consumed doing so is not too different from using >> the original ensure_full_index() method, which also spends most of the >> time walking trees. However, when a pathspec is specified, this latest >> logic will only walk the area of trees enclosed by the pathspec, and the >> time consumed is reasonably a lot less. >> >> That is, if we don't specify a pathspec, the performance difference [1] >> is quite small: both methods walk all the trees and take generally same >> amount of time (even with the index construction time included for >> ensure_full_index()). > > This makes sense, thanks for the thorough explanation of the results. > >> >> [1] Performance test result without pathspec: >> >> Test HEAD~ HEAD >> ----------------------------------------------------------------------------- >> 2000.78: git grep --cached --sparse bogus (full-v3) 6.17 5.19 (≈) >> 2000.79: git grep --cached --sparse bogus (full-v4) 6.19 5.46 (≈) >> 2000.80: git grep --cached --sparse bogus (sparse-v3) 6.57 6.44 (≈) >> 2000.81: git grep --cached --sparse bogus (sparse-v4) 6.65 6.28 (≈) >> >> Suggested-by: Derrick Stolee <derrickstolee@xxxxxxxxxx> >> Helped-by: Derrick Stolee <derrickstolee@xxxxxxxxxx> >> Helped-by: Victoria Dye <vdye@xxxxxxxxxx> >> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@xxxxxxxxx> >> --- >> builtin/grep.c | 32 ++++++++++++++++++++++++++----- >> t/perf/p2000-sparse-operations.sh | 1 + >> 2 files changed, 28 insertions(+), 5 deletions(-) >> >> diff --git a/builtin/grep.c b/builtin/grep.c >> index a0b4dbc1dc..8c0edccd8e 100644 >> --- a/builtin/grep.c >> +++ b/builtin/grep.c >> @@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt, >> if (repo_read_index(repo) < 0) >> die(_("index file corrupt")); >> >> - if (grep_sparse) >> - ensure_full_index(repo->index); >> - >> for (nr = 0; nr < repo->index->cache_nr; nr++) { >> const struct cache_entry *ce = repo->index->cache[nr]; >> >> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt, >> >> strbuf_setlen(&name, name_base_len); >> strbuf_addstr(&name, ce->name); >> + if (S_ISSPARSEDIR(ce->ce_mode)) { >> + enum object_type type; >> + struct tree_desc tree; >> + void *data; >> + unsigned long size; >> + struct strbuf base = STRBUF_INIT; >> + >> + strbuf_addstr(&base, ce->name); >> + >> + data = read_object_file(&ce->oid, &type, &size); >> + init_tree_desc(&tree, data, size); >> >> - if (S_ISREG(ce->ce_mode) && >> + /* >> + * sneak in the ce_mode using check_attr parameter >> + */ >> + hit |= grep_tree(opt, pathspec, &tree, &base, >> + base.len, ce->ce_mode); >> + strbuf_release(&base); >> + free(data); >> + } else if (S_ISREG(ce->ce_mode) && >> match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL, >> S_ISDIR(ce->ce_mode) || >> S_ISGITLINK(ce->ce_mode))) { >> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, >> int te_len = tree_entry_len(&entry); >> >> if (match != all_entries_interesting) { >> - strbuf_addstr(&name, base->buf + tn_len); >> + if (S_ISSPARSEDIR(check_attr)) { >> + // object is a sparse directory entry >> + strbuf_addbuf(&name, base); >> + } else { >> + // object is a commit or a root tree >> + strbuf_addstr(&name, base->buf + tn_len); >> + } > > Hmm, I'm not entirely sure I follow what's going on with 'name'. I'll try to > talk myself through it. > > Stepping back a bit in the context of 'grep_tree()': the goal of the > function is, given a tree descriptor 'tree', to recursively scan the tree to > find any 'grep' matches within items matching 'pathspec'. It is also called > with a strbuf 'base', a length 'tn_len', and a boolean 'check_attr'; it's > not immediately clear to me what those args are or what they do. What I can I was confused for quite a while about the meaning of these args, too. I think 'base' is the object's ref or SHA, e.g. HEAD, HEAD~, or a <SHA>. Before this patch, the object was expected to be a root tree or a commit. I _think_ 'base' can also be "<submodule>/", e.g. "sub/" when grepping a submodule. 'tn_len' stands for "tree_name_len"? 'check_attr', as you wrote below, is for "commit or not", at lease that was all its use case before this patch. > see is that: > > - 'check_attr' is true iff the "tree" being grepped is actually a commit. I think this is correct. Though as Derrick Stolee said here [1], this patch is abusing the 'check_attr' (passing 'ce_mode' through it), and if that caused any confusions, my apologies. [1] https://lore.kernel.org/git/e74b326d-ce4a-31c3-5424-e35858cdb569@xxxxxxxxxx > - both non-recursive callers ('grep_object()' and 'grep_submodule()') call > 'grep_tree()' with 'tn_len == base.len'. > > Stepping into 'grep_tree()', we iterate over the entries *inside of* 'tree'. > We assign the length of the tree entry's path to 'te_len'. Notably, a tree > entry's path *not* the path from the root of the repo to the entry - it's > just the filename of the entry (e.g., for entry 'folder1/a', the path is > 'a'). Yes. > Next, we skip the first 'tn_len' characters of 'base->buf' and assign that > value to 'name'. Because 'tn_len == base.len', for this first iteration, > it's an empty string. Then, we check if the tree entry is interesting with > path 'name'. But 'name' is an empty string, so 'tree_entry_interesting()' > thinks the tree entry is at the root of the repository, even if it isn't! Yes, that is the reason why it kept ignoring sub-root-level trees: the pathspec can never match a tree that is not at root level if this root-level assumption exists. > At this point, I think I've figured out what the deal with 'base' is. Before > this patch, only 'grep_object()' and 'grep_submodule()'. In the former case, > it's either "<objectname>:", or empty; in the latter, it's the path to the > submodule. Both of those are things you'd want to skip to get the correct Yep, this resonates with my reply above! > path to the tree entry for 'tree_entry_interesting()', but it isn't true in > your case; you need the path from the repository root to your tree for > 'tree_entry_interesting()' to work properly. Well said! I think this phrasing is very accurate. > Based on all of that, I *think* you can drop the 'check_attr' changes to > 'grep_tree()' and update how you provide 'base' and 'tn_len' so > 1) 'base' is the path to the tree root, and 2) 'tn_len' is 0 so that full > path is provided to 'tree_entry_interesting()': > > ----->8----->8----->8----->8----->8----->8----->8----->8----->8----->8----- > diff --git a/builtin/grep.c b/builtin/grep.c > index 8c0edccd8e..85c83190f1 100644 > --- a/builtin/grep.c > +++ b/builtin/grep.c > @@ -546,11 +546,7 @@ static int grep_cache(struct grep_opt *opt, > data = read_object_file(&ce->oid, &type, &size); > init_tree_desc(&tree, data, size); > > - /* > - * sneak in the ce_mode using check_attr parameter > - */ > - hit |= grep_tree(opt, pathspec, &tree, &base, > - base.len, ce->ce_mode); > + hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0); > strbuf_release(&base); > free(data); > } else if (S_ISREG(ce->ce_mode) && > @@ -613,14 +609,6 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > int te_len = tree_entry_len(&entry); > > if (match != all_entries_interesting) { > - if (S_ISSPARSEDIR(check_attr)) { > - // object is a sparse directory entry > - strbuf_addbuf(&name, base); > - } else { > - // object is a commit or a root tree > - strbuf_addstr(&name, base->buf + tn_len); > - } > - > match = tree_entry_interesting(repo->index, > &entry, &name, > 0, pathspec); > -----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<----- Thank you for this diff! I think this is also what Derrick suggested in his review. In fact, this approach is more to the root of the problem: the expected format of the path/base. > I still find all of this confusing, and it's possible I'm still not properly > understanding how 'name' and 'tn_len' are supposed to be used. Regardless, I > *am* fairly certain that finding the right values for those args is the > going to be the cleanest (and least fragile) way to handle sparse > directories, rather than using the 'check_attr' arg for something it isn't. Right. > It might take some time + lots of debugging/experimenting, but it's really > important that the implementation you settle on is something you (and, > ideally, the readers of your patches) confidently and completely understand, > rather than something that seems to work but doesn't have a clear > explanation. As always, I'm happy to help if you'd like another set of eyes > on the problem! Right. I admit that the approach I was taking is pretty shady. The way suggested by you and Derrick is more explainable and to-the-point. Lesson learned! Thanks, Shaoxuan