Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 24 Mar 2020 13:02:02 -0400

On 3/24/2020 12:16 PM, Elijah Newren wrote:
> On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>>
>> On 3/24/2020 3:15 AM, Elijah Newren wrote:
>>> Hi Matheus,
>>>
>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> ...
>>>> Something I'm not entirely sure in this patch is how we implement the
>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>> an index entry that matches the path, and then checks its skip_worktree
>>>
>>> As you discuss below, checking the index is both wrong _and_ costly.
>>
>> I'm not sure why checking the index is _wrong_, but I agree about the
>> performance cost.
> 
> Let's say there are two directories, dir1 and dir2.  Over time, there
> have existed a total of six files:
>    dir1/{a,b,c}
>    dir2/{d,e,f}
> At the current time, there are only four files in the index:
>    dir1/{a,b}
>    dir2/{d,e}
> And the user has done a `git sparse-checkout set dir2` and then at
> some point later run `git grep OTHERCOMMIT foobar`.  What happens?
> 
> Well, since we're in a sparse checkout, we should only search the
> relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
> to figure out the "relevant paths" using the index.  We can tell that
> dir1/a and dir2/a are marked as SKIP_WORKTREE so we don't search them.
> dir1/c is untracked -- what do we do with it?  Include it?  Exclude
> it?  Carrying on with the other files, dir2/d and dir2/e are tracked
> and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
> we do with it?  Include it?  Exclude it?
> 
> We're left without the necessary information to tell whether we should
> search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
> decision we make is going to be wrong for one of the two paths.
> 
> If we instead do not attempt to consult the index (which corresponds
> to a version close to HEAD) in order to ask questions about the
> completely different OTHERCOMMIT, but instead use the sparsity
> patterns to query whether those files/directories are interesting,
> then we get the right answer.  The index can only be consulted for the
> right answer in the case of --cached; in all other cases (including
> OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
> we could also use the sparsity patterns in the case of --cached, it's
> just that for that one particular case consulting the index will also
> give the right answer.

Thanks! This helps a lot.

-Stolee