Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns

Matheus Tavares Bernardino <matheus.bernardino@xxxxxx> · Mon, 20 Apr 2020 23:10:58 -0300

Hi, Elijah, Stolee and others

On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
<matheus.bernardino@xxxxxx> wrote:
>
> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@xxxxxxxxx> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@xxxxxx> wrote:
> > >
> > > Something I'm not entirely sure in this patch is how we implement the
> > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > is treated in the grep_tree() function). Currently, the patch looks for
> > > an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
> > You should use the sparsity patterns; Stolee did a lot of work to make
> > those correspond to simple hashes you could check to determine whether
> > to even walk into a subdirectory.
[...]
> OK, makes sense.

I've been working on the file skipping mechanism using the sparsity
patterns directly. But I'm uncertain about some implementation
details. So I wanted to share my current plan with you, to get some
feedback before going deeper.

The first idea was to load the sparsity patterns a priori and pass
them to grep_tree(), which recursively greps the entries of a given
tree object. If --recurse-submodules is given, however, we would also
need to load each surepo's sparse-checkout file on the fly (as the
subrepos are lazily initialized in grep_tree()'s call chain). That's
not a problem on its own. But in the most naive implementation, this
means unnecessarily re-loading the sparse-checkout files of the
submodules for each tree given to git-grep (as grep_tree() is called
separately for each one of them).

So my next idea was to implement a cache, mapping 'struct repository's
to 'struct pattern_list'. Well, not 'struct repository' itself, but
repo->gitdir. This way we could load each file once, store the pattern
list, and quickly retrieve the one that affect the repository
currently being grepped, whether it is a submodule or not. But, is
gitidir unique per repository? If not, could we use
repo_git_path(repo, "info/sparse-checkout") as the key?

I already have a prototype implementation of the last idea (using
repo_git_path()). But I wanted to make sure, does this seem like a
good path? Or should we avoid the work of having this hashmap here and
do something else, as adding a 'struct pattern_list' to 'struct
repository', directly?

Thanks,
Matheus