Re: [PATCH 06/27] checkout-index: ensure full index

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 17 Mar 2021 21:17:37 -0400

On 3/17/2021 6:36 PM, Elijah Newren wrote:
> On Wed, Mar 17, 2021 at 2:33 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>>
>> On 3/17/2021 5:10 PM, Elijah Newren wrote:
>>> On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>>>>
>>>> On 3/17/2021 1:50 PM, Elijah Newren wrote:
>>>>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget
>>>>> <gitgitgadget@xxxxxxxxx> wrote:
>>>>> With the caveat in the commit message, this change looks okay, but
>>>>> checkout-index may be buggy regardless of the presence of
>>>>> ensure_full_index().  If ensure_full_index() really is needed here
>>>>> because it needs to operate on all SKIP_WORKTREE paths and not just
>>>>> leading directories, that's because it's writing all those
>>>>> SKIP_WORKTREE entries to the working tree.  When it writes them to the
>>>>> working tree, is it clearing the SKIP_WORKTREE bit?  If not, we're in
>>>>> a bit of a pickle...
>>>>
>>>> Perhaps I'm unclear in my intentions with this series: _every_
>>>> insertion of ensure_full_index() is intended to be audited with
>>>> tests in the future. Some might need behavior change, and others
>>>> will not. In this series, I'm just putting in the protections so
>>>> we don't accidentally trigger unexpected behavior.
>>>
>>> I think this may be part of my qualms -- what do you mean by not
>>> accidentally triggering unexpected behavior?  In particular, does your
>>> statement imply that whatever behavior you get after putting in
>>> ensure_full_index() is "expected"?  I think I'm reading that
>>> implication into it, and objecting that the behavior with the
>>> ensure_full_index() still isn't expected.  You've only removed a
>>> certain class of unexpected behavior, namely code that wasn't written
>>> to expect tree entries that suddenly gets them.  You haven't handled
>>> the class of "user wants to work with a subset of files, why are all
>>> these unrelated files being munged/updated/computed/shown/etc."
>>> unexpected behavior.
>>
>> My intention is to ensure that (at this moment) choosing to use
>> the on-disk sparse-index format does not alter Git's end-to-end
>> behavior.
>>
>> I want to avoid as much as possible a state where enabling the
>> sparse-index can start changing how Git commands behave, perhaps
>> in destructive ways.
>>
>> By adding these checks, we ensure the in-memory data structure
>> matches whatever a full index would have created, and then the
>> behavior matches what Git would do there. It might not be the
>> "correct" behavior, but it is _consistent_.
> 
> That sounds good.  Could this be included in
> Documentation/technical/sparse-index.txt?  There's only an oblique
> reference to it when talking about grep and rm, which incidentally
> were already updated by Matheus.  Perhaps also a reference to the
> warning in Documentation/git-sparse-checkout.txt would be worthwhile.

As I was thinking about how to make this more clear, I realized
that an update to that doc early in this series would be wise.
Thanks.

>>> I'm worrying that expectations are being set up such that working with
>>> just a small section of the code will be unusably hard.  There may be
>>> several commands/flags where it could make sense to operate on either
>>> (a) all files in the repo or (b) just on files within your sparse
>>> paths.  If, though, folks interpret operate-on-all-files as the
>>> "normal" mode (and history suggests they will), then people start
>>> adding all kinds of --no-do-this-sparsely flags to each command, and
>>> then users who want sparse operation have to remember to type such a
>>> flag with each and every command they ever run -- despite having taken
>>> at least three steps already to get a sparse-index.
>>>
>>> I believe the extended discussions (for _months_!) on just grep & rm,
>>> plus watching a --sparse patch being floated just in the last day for
>>> ls-files suggest to me that this is a _very_ likely outcome and I'm
>>> worried about it.
>>
>> It's these behavior changes that I would like to delay as much as
>> possible and focus on the format and making commands fast that don't
>> need a change in behavior.
> 
> Delaying sounds great, just so long as that delay doesn't also cement
> the behavior and confuse consistency for correctness.
> 
> I still think we don't have correct behavior for sparse-checkouts in
> many cases (I mean, when "git reset --hard" throws errors about not
> removing files and then removes them, we've obviously got some
> problems), but we had a decade long cementing of sorts with
> SKIP_WORKTREE and now a year-or-two long cementing since
> sparse-checkout was introduced and we never went through and cleaned
> up the commands. We should at some point, especially since we put the
> huge scary warning in Documenation/git-sparse-checkout.txt expressly
> for this purpose.
> 
> (I would have started this sooner, but trying to feed merge-ort and
> keep up with patch review already keeps me at less time on non-git
> projects than I think is expected for me.  Once merge-ort is done...)
> 
>> (Yes, there will be exceptions, like when "git add" specifically
>> adds a file that is in a directory that should be out of the cone,
>> but the user added it anyway. Atypical behavior like that can be
>> slow for now.)
> 
> I agree that there will be exceptions where we can't make the behavior
> be fast, but I disagree with that specific example.  "git add" should
> just give a warning message and not add any file outside the cone, for
> both sparse-index and sparse-checkout.  That can be done quickly.

That would be great! I would love behavior changes that make the
performance work I'm interested in easier.

> I know you want to delay the discussion of behavior fixes for specific
> commands somewhat, but this particular discussion has already been
> ongoing for about 4-12 months now (started at [1] and [2]) and it's
> reached a point where I've put my Reviewed-by on it; see [3] for the
> git-add piece specifically.
> 
> [1] https://lore.kernel.org/git/9f2135f90ffea7f4ccb226f506bf554deab324cc.1605205427.git.matheus.bernardino@xxxxxx/
> [2] https://lore.kernel.org/git/0b9b4c4b414a571877163667694afa3053bf8890.1585027716.git.matheus.bernardino@xxxxxx/
> [3] https://lore.kernel.org/git/66d5c71182274c78e1fcfe84e77deb17e4f0d7e6.1615588109.git.matheus.bernardino@xxxxxx/

I'm in full support of these ideas to change the behavior, when
possible. I would love to see that those changes make it really
easy to integrate the sparse-index into the commands.

>>>> Since tests take time to write and review, I was hoping that these
>>>> insertions were minimal enough to get us to a safe place where we
>>>> can remove the guards carefully.
>>>>
>>>> So with that in mind...
>>>>
>>>>> Might be nice to add a
>>>>> /* TODO: audit if this is needed; if it is, we may have other bugs... */
>>>>> or something like that.  But then again, perhaps you're considering
>>>>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths?
>>>>> If so, and we determine we really do need one and want to keep it
>>>>> indefinitely, will we mark those with a comment about why it's
>>>>> considered correct?
>>>>>
>>>>> I just want a way to know what still needs to be audited and what
>>>>> doesn't without doing a lot of history spelunking...
>>>>
>>>> ...every insertion "needs to be audited" in the future. That's a
>>>> big part of the next "phases" in the implementation plan.
>>>>
>>>> As you suggest, it might be a good idea to add a comment to every
>>>> insertion, to mark it as un-audited, such as:
>>>>
>>>>         /* TODO: test if ensure_full_index() is necessary */
>>>>
>>>> We can come back later to delete the comment if it truly is
>>>> necessary (and add tests to guarantee correct behavior). We can
>>>> also remove the comment _and_ the call by modifying the loop
>>>> behavior to do the right thing in some cases.
>>>
>>> If it's "needs to be audited for both performance reasons (can we
>>> operate on fewer entries as an invisible doesn't-change-results
>>> optimization) and correctness reasons (should we operate on fewer
>>> entries and given a modified result within a sparse-index because
>>> users would expect that, but maybe provide a special flag for the
>>> users who want to operate on all files in the repo)" and there's also
>>> an agreement that either audited or unaudited ones will be marked (or
>>> both), then great, I'm happy.  If not, can we discuss which part of my
>>> performance/correctness/marking we aren't in agreement on?
>>
>> I will mark all of the ones I'm inserting. My hope is to eventually
>> remove it entirely except for when disabling the sparse-index. That
>> is likely too far out to really hope for, but it is the direction I
>> am trying to go.
>>
>> As I indicate that we should carefully test each of these instances
>> where ensure_full_index() _might_ be necessary before removing them,
>> it is even more important to test the scenarios where the behavior
>> changes from a full index with sparse-checkout. Preferably, we just
>> change the behavior under sparse-checkout and then the sparse-index
>> can match that (see "test_sparse_match" in t1092).
> 
> Makes sense.  I agree that it'd be nice to have the two generally
> match, though I think we should be open to there being special cases
> that differ.  The only one I can think of right now is `git ls-files`
> (list the entries in the index) -- since there are tree entries in a
> sparse-index, ls-files would naturally show the tree entries.
> However, we can discuss that and any other cases -- if there are any
> -- later.

Here, I at least want to make it very clear that the change is
happening and include updates to the documentation. This is a
case where it is unclear what the "promise" is to _external_
tools that depend on ls-files (somehow). What is their intended
use of ls-files, and how do those expectations change when in a
sparse-checkout?

It's questions like this where I at least want to be making a
clear change that motivates why a change in behavior is merited,
then test that behavior to demonstrate that we expect that to be
the same in perpetuity.

Thanks,
-Stolee