Re: [PATCH 01/20] sparse-index: design doc and format update

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 25 Feb 2021 10:29:32 -0500

On 2/23/2021 8:13 PM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@xxxxxxxxx> wrote:>> +This addition of sparse-directory entries violates expectations about the
> 
> Violates current expectations, yes.  Documentation tends to live a
> long time, and I suspect that 2-3 years from now reading this sentence
> might be jarring since we'll have modified the code to have an updated
> set of expectations.  Maybe a simple "As of time of writing, ..." at
> the beginning of the sentence here?  Or maybe I'm just being overly
> worried...

I was hoping that the phrase "this addition of" places this statement in
a moment of time where sparse-directory entries didn't exist, but now they
will. I'm open to alternatives and will give this some thought.

>> +To complete this phase, the commands `git status` and `git add` will be
>> +integrated with the sparse-index so that they operate with O(Populated)
>> +performance. They will be carefully tested for operations within and
>> +outside the sparse-checkout definition.
> 
> Good plan so far, but there's something else bugging me a little here.
> One thing we noticed with our usage of `sparse-checkout` is that
> although unimportant _tracked_ files go away, leftover build files and
> other untracked files stick around.  So, although 'git status'
> shouldn't have to check the tracked files anymore, it is still going
> to have to look at each of the *untracked* files and compare to
> .gitignore files in order to correctly classify each file as ignored
> or just plain untracked.  Our `sparsify` tool has for a long time
> tried to warn about such files when changing the sparsity
> patterns/modules and had an --remove-old-ignores option for clearing
> out ignored files that are found within directories that are sparse
> (Meaning the directories where all files under them are marked
> SKIP_WORKTREE.). I was never sure whether a warning was enough, or if
> pushing that option more made sense, but about a month ago my
> colleagues made the tool just auto-invoke that option from other
> sparsify invocations.  To my knowledge, there have been no complaints
> about that being automatically turned on; but there were
> complaints/confusion before about the directories being left around.
> (Of course, non-ignored files are still left around by that option.)
> 
> I'm worried that since sparse-checkout doesn't do anything to help
> with all these untracked/ignored files, we might not get all the
> performance improvements we want from a `git status` with sparse
> directories.  We'll be dropping from walking O(2*HEAD) files (1 source
> + 1 object file) down to O(HEAD) files (just the object files) rather
> than actually getting down to O(Populated).

This definitely seems like a valuable _enhancement_ to sparse-checkout
that could occur in parallel.

For a workaround in the moment: is "git clean -xdf" sufficient to help
these users?

>> +Phase III: Important command speedups
>> +-------------------------------------
>> +
>> +At this point, the patterns for testing and implementing sparse-directory
>> +logic should be relatively stable. This phase focuses on updating some of
>> +the most common builtins that use the index to operate as O(Populated).
>> +Here is a potential list of commands that could be valuable to integrate
>> +at this point:
>> +
>> +* `git commit`
>> +* `git checkout`
>> +* `git merge`
>> +* `git rebase`
>> +
>> +Along with `git status` and `git add`, these commands cover the majority
>> +of users' interactions with the working directory.
> 
> Sounds like a good plan as well.
> 
> I hope we get to make this specific to the merge-ort backend.  It
> localizes the index-related code to (a) a call to unpack_trees()
> called from checkout-like code (which would probably automatically be
> handled by your updates to git checkout), and (b) a single function
> named record_conflicted_index_entries().  I feel it should be pretty
> easy to update.
> 
> In contrast, the idea of attempting to update merge-recursive with
> this kind of change sounds overwhelming.

Yes, I'm hoping to eventually say "if you are in a sparse-checkout, then
you should use ORT by default." Then, if someone opts to do merge-recursive
instead, then they pay the index expansion cost.

While this is very clear in my head, it might be worth mentioning that
explicitly here.

>>  In addition, we can
>> +integrate with these commands:
>> +
>> +* `git grep`
>> +* `git rm`
>> +
>> +These have been proposed as some whose behavior could change when in a
>> +repo with a sparse-checkout definition. It would be good to include this
>> +behavior automatically when using a sparse-index. Some clarity is needed
>> +to make the behavior switch clear to the user.
> 
> Is this leftover from before recent events?  I think this portion of
> the document should just be stricken.
> 
> I argued about how these were buggy as-is due SKIP_WORKTREE always
> having been an incomplete implementation of an idea at [1], but didn't
> hear a further response from you.  I'm curious if you disagreed with
> my reasoning, or it just slipped through the cracks in a busy schedule
> and this portion of the document was leftover from before.  In my
> opinion, both commands are just buggy and should be fixed for general
> sparse-checkout usage cases, not just for sparse-index.

This is definitely a case of "I've been too busy to read those topics
in detail." I figured that there was something going on that was relevant
to the sparse-checkout and would affect the sparse-index, but I shelved
it in my mind until I had space to think about it directly.

> Anyway, that's a long way of saying I think this section of your
> document is already obsolete.  (Which is a good thing -- less work to
> do to get sparse-index working.  Thanks, Matheus!).

Thank you for your summary, which helps a lot. Thanks, Matheus, too!
If those fixes already include coverage for the behavior, then I'll see
if they also translate to sparse-index tests easily.

I feel like a lot of these later contributions will be more about adding
tests than actually writing a lot of code.

>> +This phase is the first where parallel work might be possible without too
>> +much conflicts between topics.
>> +
>> +Phase IV: The long tail
>> +-----------------------
>> +
>> +This last phase is less a "phase" and more "the new normal" after all of
>> +the previous work.
>> +
>> +To start, the `command_requires_full_index` option could be removed in
>> +favor of expanding only when hitting an API guard.
>> +
>> +There are many Git commands that could use special attention to operate as
>> +O(Populated), while some might be so rare that it is acceptable to leave
>> +them with additional overhead when a sparse-index is present.
>> +
>> +Here are some commands that might be useful to update:
>> +
>> +* `git sparse-checkout set`
>> +* `git am`
>> +* `git clean`
>> +* `git stash`
> 
> Oh, man, git stash is definitely in need of work.  It's still a
> minimalistic transliteration of shell to C, complete with lots of
> process forking and piping output between various low-level commands.
> It might be interesting to rewrite this in terms of the merge
> machinery, though its separate stashing of staged stuff, unstaged
> stuff, and possibly untracked stuff means that there is a sequence of
> two or three merges needed and interesting failure handling to do if
> those merges fail, especially if the user uses --index.  But I
> digress...

I would prefer to leave 'git stash' alone, but it's used by enough
people that I need to care about it eventually.

> Anyway, overall, very nicely written and planned out.  Thanks for
> taking the time to write this all up.

Thanks for your detailed comments!
-Stolee