Re: [PATCH 05/10] status: use sparse-index throughout

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 21 Apr 2021 09:55:59 -0400

On 4/20/2021 8:44 PM, Elijah Newren wrote:
> On Tue, Apr 13, 2021 at 7:01 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@xxxxxxxxx> wrote:
>>
>> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>>
>> By testing 'git -c core.fsmonitor= status -uno', we can check for the
>> simplest index operations that can be made sparse-aware. The necessary
>> implementation details are already integrated with sparse-checkout, so
>> modify command_requires_full_index to be zero for cmd_status().
>>
>> By running the debugger for 'git status -uno' after that change, we find
>> two instances of ensure_full_index() that were added for extra safety,
>> but can be removed without issue.
>>
>> In refresh_index(), we loop through the index entries. The
>> refresh_cache_ent() method copies the sparse directories into the
>> refreshed index without issue.
> 
> I do see the removal of a call to ensure_full_index() in
> refresh_index() that you mention in this paragraph in the patch below.
> 
> I'm confused, though; I would have thought we wanted to avoid a
> refresh_cache_ent() call.  Also, one of your previous patches added a
> 
>     if (istate->sparse_index && S_ISSPARSEDIR(ce->ce_mode))
>         continue;
> 
> check before the code ever gets to the refresh_cache_ent() call, so as
> far as I can tell, that function won't be called from refresh_entry()
> for sparse entries.  Maybe your commit message here is out-of-date?
> Or am I confused somehow?
> 
>> The loop within run_diff_files() skips things that are in stage 0 and
>> have skip-worktree enabled, so seems safe to disable ensure_full_index()
>> here.
> 
> Unlike the above, I don't see a removal of a ensure_full_index() call
> in run_diff_files() as claimed by this paragraph.  Has the commit
> message gotten out of date with refactorings you did while developing
> this series?

I greatly reduced the number of ensure_full_index() calls in the
previous topic (ds/sparse-index-protections) since first writing this
patch, so it is very likely to be out-of-date. Thanks for calling it out.

>> This allows some cases of 'git status' to no longer expand a sparse
>> index to a full one, giving the following performance improvements for
>> p2000-sparse-checkout-operations.sh:
>>
>> Test                                  HEAD~1           HEAD
>> -----------------------------------------------------------------------------
>> 2000.2: git status (full-index-v3)    0.38(0.36+0.07)  0.37(0.31+0.10) -2.6%
>> 2000.3: git status (full-index-v4)    0.38(0.29+0.12)  0.37(0.30+0.11) -2.6%
>> 2000.4: git status (sparse-index-v3)  2.43(2.33+0.14)  0.04(0.05+0.04) -98.4%
>> 2000.5: git status (sparse-index-v4)  2.44(2.35+0.13)  0.05(0.04+0.07) -98.0%
>>
>> Note that since HEAD~1 was expanding the sparse index by parsing trees,
>> it was artificially slower than the full index case. Thus, the 98%
>> improvement is misleading, and instead we should celebrate the 0.37s to
>> 0.05s improvement of 82%. This is more indicative of the peformance
>> gains we are expecting by using a sparse index.
> 
> 82%, very nice.  Was this with git.git as the test repository, or some
> other repo?  If it's git.git, then we'd actually expect a much bigger
> speedup for other repositories, as git.git is pretty small.
This test script takes the input repository (git.git in this case) and
creates a tree that contains that repository many times over, but only
four copies remain in the sparse-checkout definition. This creates the
big speedup, because of the enormous difference in index size.

As I am exploring commands such as 'merge' and 'rebase' I am finding
that this test setup is too expensive to cover those commands. I will
need to reduce the size of the test repository (by a factor of 4) and
that will reduce how impressive these results are while making the more
complicated commands testable in a reasonable amount of time.

Thanks,
-Stolee