On 4/20/2020 11:08 PM, Elijah Newren wrote:
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@xxxxxx> wrote:
>>
>> Hi, Elijah, Stolee and others
>>
>> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
>> <matheus.bernardino@xxxxxx> wrote:
>>>
>>> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@xxxxxxxxx> wrote:
>>>>
>>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
>>>> <matheus.bernardino@xxxxxx> wrote:
>>>>>
>>>>> Something I'm not entirely sure about in this patch is how we implement
>>>>> the mechanism to honor sparsity for the `git grep <commit-ish>` case
>>>>> (which is treated in the grep_tree() function). Currently, the patch
>>>>> looks for an index entry that matches the path, and then checks its
>>>>> skip_worktree bit.
>>>>
>>>> As you discuss below, checking the index is both wrong _and_ costly.
>>>> You should use the sparsity patterns; Stolee did a lot of work to make
>>>> those correspond to simple hashes you could check to determine whether
>>>> to even walk into a subdirectory.
>> [...]
>>> OK, makes sense.
>>
>> I've been working on the file skipping mechanism using the sparsity
>> patterns directly. But I'm uncertain about some implementation
>> details. So I wanted to share my current plan with you, to get some
>> feedback before going deeper.
>>
>> The first idea was to load the sparsity patterns a priori and pass
>> them to grep_tree(), which recursively greps the entries of a given
>> tree object. If --recurse-submodules is given, however, we would also
>> need to load each subrepo's sparse-checkout file on the fly (as the
>> subrepos are lazily initialized in grep_tree()'s call chain). That's
>> not a problem on its own. But in the most naive implementation, this
>> means unnecessarily re-loading the sparse-checkout files of the
>> submodules for each tree given to git-grep (as grep_tree() is called
>> separately for each one of them).
>
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings? And not just fast, but
> essentially in the noise and hard to even measure? I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule. If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files, and grepping through
> them all would take way, way longer. If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.
>
> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo. As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together? Sure, someone will use them,
> but I have a hard time imagining use of both features at a scale
> heavy enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
>
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it, because
> I'm having a hard time imagining any scale where it could matter.
>
>> So my next idea was to implement a cache, mapping 'struct repository's
>> to 'struct pattern_list'. Well, not 'struct repository' itself, but
>> repo->gitdir.
>> This way we could load each file once, store the pattern
>> list, and quickly retrieve the one that affects the repository
>> currently being grepped, whether it is a submodule or not. But is
>> gitdir unique per repository? If not, could we use
>> repo_git_path(repo, "info/sparse-checkout") as the key?
>>
>> I already have a prototype implementation of the last idea (using
>> repo_git_path()). But I wanted to make sure: does this seem like a
>> good path? Or should we avoid the work of having this hashmap here and
>> do something else, such as adding a 'struct pattern_list' to 'struct
>> repository' directly?
>
> Honestly, it sounds a bit like premature optimization to me. Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

My initial thought was to use a stack or queue. It depends on how
git-grep treats submodules.

Imagine directories A, B, C, where B is a submodule.

If results from 'B' are output between results from 'A' and 'C', then
use a stack: "push" the latest sparse-checkout patterns as you deepen
into a submodule, then "pop" the patterns as you leave it.

If results from 'B' are output after results from 'C', then you could
possibly use a queue instead. I find this unlikely, and it would behave
strangely for nested submodules.

Since 'struct pattern_list' has most of the information you require,
it should not be challenging to create a list of them.

Hopefully that provides some ideas.

Thanks,
-Stolee