Re: [PATCH v1 2/2] builtin/grep.c: integrate with sparse index

Shaoxuan Yuan <shaoxuan.yuan02@xxxxxxxxx> · Thu, 25 Aug 2022 05:06:42 +0800

On 8/17/2022 10:23 PM, Derrick Stolee wrote:
> On 8/17/2022 3:56 AM, Shaoxuan Yuan wrote:
>> Turn on sparse index and remove ensure_full_index().
>>
>> Change it to only expands the index when using --sparse.
>>
>> The p2000 tests demonstrate a ~99.4% execution time reduction for
>> `git grep` using a sparse index.
>>
>> Test                                           HEAD~1 HEAD
>> 
-----------------------------------------------------------------------------
>> 2000.78: git grep --cached bogus (full-v3)     0.019 0.018  (-5.2%)
>> 2000.79: git grep --cached bogus (full-v4)     0.017 0.016  (-5.8%)
>> 2000.80: git grep --cached bogus (sparse-v3)   0.29 0.0015 (-99.4%)
>> 2000.81: git grep --cached bogus (sparse-v4)   0.30 0.0018 (-99.4%)
>
> Good results.
>
> I think we could get interesting results even with the --sparse
> option if you go another step further (perhaps as a patch after
> this one).

OK.

>>
>> Optional reading about performance test results
>> -----------------------------------------------
>> Notice that because `git-grep` needs to parse blobs in the index, the
>> index reading time is minuscule comparing to the object parsing time.
>> And because of this, the p2000 test results cannot clearly reflect the
>> speedup for index reading: combining with the object parsing time,
>> the aggregated time difference is extremely close between HEAD~1 and
>> HEAD.
>>
>> Hence, the results presenting here are not directly extracted from the
>> p2000 test results. Instead, to make the performance difference more
>> visible, the test command is manually ran with GIT_TRACE2_PERF in the
>> four repos (full-v3, sparse-v3, full-v4, sparse-v4). The numbers here
>> are then extracted from the time difference between "region_enter" and
>> "region_leave" of label "do_read_index".
>
> This is a good point, but I don't recommend displaying them as if they
> were the output of a "./run HEAD~1 HEAD -- p2000-sparse-operations.sh"
> command. Instead, point out that the performance test does not show a
> major improvement and instead you have these "Before" and "After" results
> from testing manually and extracting trace2 regions.

OK.

>> @@ -519,11 +519,15 @@ static int grep_cache(struct grep_opt *opt,
>>          strbuf_addstr(&name, repo->submodule_prefix);
>>      }
>>
>> +    prepare_repo_settings(repo);
>> +    repo->settings.command_requires_full_index = 0;
>> +
>
> The best pattern is to put this in cmd_grep() immediately after parsing
> options. This guarantees that we don't parse and expand the index in any
> other code path.

Got it.

>>      if (repo_read_index(repo) < 0)
>>          die(_("index file corrupt"));
>>
>> -    /* TODO: audit for interaction with sparse-index. */
>> -    ensure_full_index(repo->index);
>> +    if (grep_sparse)

A side note: this condition should be `grep_sparse && cached`.

>> +        ensure_full_index(repo->index);
>> +
> As mentioned before, this approach is the simplest way to make the case
> without --sparse faster, but the case _with_ --sparse will still be slow.
> The way to fix this would be to modify this portion of the loop:

I'm not sure. If --sparse here means we want to expand the index, it
is expected to be slow (ensure_full_index is slow), isn't it?

>     if (S_ISREG(ce->ce_mode) &&
>         match_pathspec(repo->index, pathspec, name.buf, name.len, 0, 
NULL,
>                S_ISDIR(ce->ce_mode) ||
>                S_ISGITLINK(ce->ce_mode))) {
>
> by adding an initial case
>
>     if (S_ISSPARSEDIR(ce->ce_mode)) {
>         hit |= grep_tree(opt, &ce->oid, name.buf, 0, name.buf);
>     } else if (S_ISREG(ce->ce_mode) &&
>            match_pathspec(repo->index, pathspec, name.buf, name.len, 
0, NULL,
>                   S_ISDIR(ce->ce_mode) ||
>                   S_ISGITLINK(ce->ce_mode))) {
>
> and appropriately implement "grep_tree()" to walk the tree at ce->oid to
> find all matching files within, then call grep_oid() for each of those
> paths.

Tree walking is faster, yes. So, for this approach to be faster, I
think you are suggesting we should not expand the index, even when
--sparse is given? Instead, we just rely on the tree walking logic,
right?

> Bonus points if you recognize that the pathspec uses prefix checks that
> allow pruning the search space and not parsing all of the trees
> recursively. But that can definitely be delayed for a future enhancement.

OK.

>> +test_expect_success 'grep expands index using --sparse' '
>> +    init_repos &&
>> +
>> +    # With --sparse and --cached, do not ignore sparse entries and
>> +    # expand the index.
>> +    test_all_match git grep --sparse --cached a
>> +'
>
> Here, you're testing that the behavior matches, but not testing that the
> index expands. (It does describe why you didn't include it in the later
> ensure_not_expanded tests.)

I was trying to "imply" the index expansion because of the behavior
match. Yes, I think the test should be more explicit.

>> +
>> +test_expect_success 'grep is not expanded' '
>> +    init_repos &&
>> +
>> +    ensure_not_expanded grep a &&
>> +    ensure_not_expanded grep a -- deep/* &&
>> +    # grep does not match anything per se, so ! is used
>
> It can be helpful to say why:
>
>     # All files within the folder1/* pathspec are sparse,
>     # so this command does not find any matches.

OK.

--
Thanks,
Shaoxuan