Hi, Elijah, Stolee and others On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino <matheus.bernardino@xxxxxx> wrote: > > On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@xxxxxxxxx> wrote: > > > > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares > > <matheus.bernardino@xxxxxx> wrote: > > > > > > Something I'm not entirely sure in this patch is how we implement the > > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which > > > is treated in the grep_tree() function). Currently, the patch looks for > > > an index entry that matches the path, and then checks its skip_worktree > > > > As you discuss below, checking the index is both wrong _and_ costly. > > You should use the sparsity patterns; Stolee did a lot of work to make > > those correspond to simple hashes you could check to determine whether > > to even walk into a subdirectory. [...] > OK, makes sense. I've been working on the file skipping mechanism using the sparsity patterns directly. But I'm uncertain about some implementation details. So I wanted to share my current plan with you, to get some feedback before going deeper. The first idea was to load the sparsity patterns a priori and pass them to grep_tree(), which recursively greps the entries of a given tree object. If --recurse-submodules is given, however, we would also need to load each surepo's sparse-checkout file on the fly (as the subrepos are lazily initialized in grep_tree()'s call chain). That's not a problem on its own. But in the most naive implementation, this means unnecessarily re-loading the sparse-checkout files of the submodules for each tree given to git-grep (as grep_tree() is called separately for each one of them). So my next idea was to implement a cache, mapping 'struct repository's to 'struct pattern_list'. Well, not 'struct repository' itself, but repo->gitdir. This way we could load each file once, store the pattern list, and quickly retrieve the one that affect the repository currently being grepped, whether it is a submodule or not. But, is gitidir unique per repository? If not, could we use repo_git_path(repo, "info/sparse-checkout") as the key? I already have a prototype implementation of the last idea (using repo_git_path()). But I wanted to make sure, does this seem like a good path? Or should we avoid the work of having this hashmap here and do something else, as adding a 'struct pattern_list' to 'struct repository', directly? Thanks, Matheus