Re: [PATCH 06/27] checkout-index: ensure full index

Derrick Stolee <stolee@xxxxxxxxx> · Wed, 17 Mar 2021 17:33:55 -0400

On 3/17/2021 5:10 PM, Elijah Newren wrote:
> On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>>
>> On 3/17/2021 1:50 PM, Elijah Newren wrote:
>>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget
>>> <gitgitgadget@xxxxxxxxx> wrote:
>>> With the caveat in the commit message, this change looks okay, but
>>> checkout-index may be buggy regardless of the presence of
>>> ensure_full_index().  If ensure_full_index() really is needed here
>>> because it needs to operate on all SKIP_WORKTREE paths and not just
>>> leading directories, that's because it's writing all those
>>> SKIP_WORKTREE entries to the working tree.  When it writes them to the
>>> working tree, is it clearing the SKIP_WORKTREE bit?  If not, we're in
>>> a bit of a pickle...
>>
>> Perhaps I'm unclear in my intentions with this series: _every_
>> insertion of ensure_full_index() is intended to be audited with
>> tests in the future. Some might need behavior change, and others
>> will not. In this series, I'm just putting in the protections so
>> we don't accidentally trigger unexpected behavior.
> 
> I think this may be part of my qualms -- what do you mean by not
> accidentally triggering unexpected behavior?  In particular, does your
> statement imply that whatever behavior you get after putting in
> ensure_full_index() is "expected"?  I think I'm reading that
> implication into it, and objecting that the behavior with the
> ensure_full_index() still isn't expected.  You've only removed a
> certain class of unexpected behavior, namely code that wasn't written
> to expect tree entries that suddenly gets them.  You haven't handled
> the class of "user wants to work with a subset of files, why are all
> these unrelated files being munged/updated/computed/shown/etc."
> unexpected behavior.

My intention is to ensure that (at this moment) choosing to use
the on-disk sparse-index format does not alter Git's end-to-end
behavior.

I want to avoid as much as possible a state where enabling the
sparse-index can start changing how Git commands behave, perhaps
in destructive ways.

By adding these checks, we ensure the in-memory data structure
matches whatever a full index would have created, and then the
behavior matches what Git would do there. It might not be the
"correct" behavior, but it is _consistent_.

> I'm worrying that expectations are being set up such that working with
> just a small section of the code will be unusably hard.  There may be
> several commands/flags where it could make sense to operate on either
> (a) all files in the repo or (b) just on files within your sparse
> paths.  If, though, folks interpret operate-on-all-files as the
> "normal" mode (and history suggests they will), then people start
> adding all kinds of --no-do-this-sparsely flags to each command, and
> then users who want sparse operation have to remember to type such a
> flag with each and every command they ever run -- despite having taken
> at least three steps already to get a sparse-index.
> 
> I believe the extended discussions (for _months_!) on just grep & rm,
> plus watching a --sparse patch being floated just in the last day for
> ls-files suggest to me that this is a _very_ likely outcome and I'm
> worried about it.

It's these behavior changes that I would like to delay as much as
possible and focus on the format and making commands fast that don't
need a change in behavior.

(Yes, there will be exceptions, like when "git add" specifically
adds a file that is in a directory that should be out of the cone,
but the user added it anyway. Atypical behavior like that can be
slow for now.)

>> Since tests take time to write and review, I was hoping that these
>> insertions were minimal enough to get us to a safe place where we
>> can remove the guards carefully.
>>
>> So with that in mind...
>>
>>> Might be nice to add a
>>> /* TODO: audit if this is needed; if it is, we may have other bugs... */
>>> or something like that.  But then again, perhaps you're considering
>>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths?
>>> If so, and we determine we really do need one and want to keep it
>>> indefinitely, will we mark those with a comment about why it's
>>> considered correct?
>>>
>>> I just want a way to know what still needs to be audited and what
>>> doesn't without doing a lot of history spelunking...
>>
>> ...every insertion "needs to be audited" in the future. That's a
>> big part of the next "phases" in the implementation plan.
>>
>> As you suggest, it might be a good idea to add a comment to every
>> insertion, to mark it as un-audited, such as:
>>
>>         /* TODO: test if ensure_full_index() is necessary */
>>
>> We can come back later to delete the comment if it truly is
>> necessary (and add tests to guarantee correct behavior). We can
>> also remove the comment _and_ the call by modifying the loop
>> behavior to do the right thing in some cases.
> 
> If it's "needs to be audited for both performance reasons (can we
> operate on fewer entries as an invisible doesn't-change-results
> optimization) and correctness reasons (should we operate on fewer
> entries and given a modified result within a sparse-index because
> users would expect that, but maybe provide a special flag for the
> users who want to operate on all files in the repo)" and there's also
> an agreement that either audited or unaudited ones will be marked (or
> both), then great, I'm happy.  If not, can we discuss which part of my
> performance/correctness/marking we aren't in agreement on?

I will mark all of the ones I'm inserting. My hope is to eventually
remove it entirely except for when disabling the sparse-index. That
is likely too far out to really hope for, but it is the direction I
am trying to go.

As I indicate that we should carefully test each of these instances
where ensure_full_index() _might_ be necessary before removing them,
it is even more important to test the scenarios where the behavior
changes from a full index with sparse-checkout. Preferably, we just
change the behavior under sparse-checkout and then the sparse-index
can match that (see "test_sparse_match" in t1092).

Thanks,
-Stolee