On 3/17/2021 5:10 PM, Elijah Newren wrote: > On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@xxxxxxxxx> wrote: >> >> On 3/17/2021 1:50 PM, Elijah Newren wrote: >>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget >>> <gitgitgadget@xxxxxxxxx> wrote: >>> With the caveat in the commit message, this change looks okay, but >>> checkout-index may be buggy regardless of the presence of >>> ensure_full_index(). If ensure_full_index() really is needed here >>> because it needs to operate on all SKIP_WORKTREE paths and not just >>> leading directories, that's because it's writing all those >>> SKIP_WORKTREE entries to the working tree. When it writes them to the >>> working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in >>> a bit of a pickle... >> >> Perhaps I'm unclear in my intentions with this series: _every_ >> insertion of ensure_full_index() is intended to be audited with >> tests in the future. Some might need behavior change, and others >> will not. In this series, I'm just putting in the protections so >> we don't accidentally trigger unexpected behavior. > > I think this may be part of my qualms -- what do you mean by not > accidentally triggering unexpected behavior? In particular, does your > statement imply that whatever behavior you get after putting in > ensure_full_index() is "expected"? I think I'm reading that > implication into it, and objecting that the behavior with the > ensure_full_index() still isn't expected. You've only removed a > certain class of unexpected behavior, namely code that wasn't written > to expect tree entries that suddenly gets them. You haven't handled > the class of "user wants to work with a subset of files, why are all > these unrelated files being munged/updated/computed/shown/etc." > unexpected behavior. My intention is to ensure that (at this moment) choosing to use the on-disk sparse-index format does not alter Git's end-to-end behavior. I want to avoid as much as possible a state where enabling the sparse-index can start changing how Git commands behave, perhaps in destructive ways. By adding these checks, we ensure the in-memory data structure matches whatever a full index would have created, and then the behavior matches what Git would do there. It might not be the "correct" behavior, but it is _consistent_. > I'm worrying that expectations are being set up such that working with > just a small section of the code will be unusably hard. There may be > several commands/flags where it could make sense to operate on either > (a) all files in the repo or (b) just on files within your sparse > paths. If, though, folks interpret operate-on-all-files as the > "normal" mode (and history suggests they will), then people start > adding all kinds of --no-do-this-sparsely flags to each command, and > then users who want sparse operation have to remember to type such a > flag with each and every command they ever run -- despite having taken > at least three steps already to get a sparse-index. > > I believe the extended discussions (for _months_!) on just grep & rm, > plus watching a --sparse patch being floated just in the last day for > ls-files suggest to me that this is a _very_ likely outcome and I'm > worried about it. It's these behavior changes that I would like to delay as much as possible and focus on the format and making commands fast that don't need a change in behavior. (Yes, there will be exceptions, like when "git add" specifically adds a file that is in a directory that should be out of the cone, but the user added it anyway. Atypical behavior like that can be slow for now.) >> Since tests take time to write and review, I was hoping that these >> insertions were minimal enough to get us to a safe place where we >> can remove the guards carefully. >> >> So with that in mind... >> >>> Might be nice to add a >>> /* TODO: audit if this is needed; if it is, we may have other bugs... */ >>> or something like that. But then again, perhaps you're considering >>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths? >>> If so, and we determine we really do need one and want to keep it >>> indefinitely, will we mark those with a comment about why it's >>> considered correct? >>> >>> I just want a way to know what still needs to be audited and what >>> doesn't without doing a lot of history spelunking... >> >> ...every insertion "needs to be audited" in the future. That's a >> big part of the next "phases" in the implementation plan. >> >> As you suggest, it might be a good idea to add a comment to every >> insertion, to mark it as un-audited, such as: >> >> /* TODO: test if ensure_full_index() is necessary */ >> >> We can come back later to delete the comment if it truly is >> necessary (and add tests to guarantee correct behavior). We can >> also remove the comment _and_ the call by modifying the loop >> behavior to do the right thing in some cases. > > If it's "needs to be audited for both performance reasons (can we > operate on fewer entries as an invisible doesn't-change-results > optimization) and correctness reasons (should we operate on fewer > entries and given a modified result within a sparse-index because > users would expect that, but maybe provide a special flag for the > users who want to operate on all files in the repo)" and there's also > an agreement that either audited or unaudited ones will be marked (or > both), then great, I'm happy. If not, can we discuss which part of my > performance/correctness/marking we aren't in agreement on? I will mark all of the ones I'm inserting. My hope is to eventually remove it entirely except for when disabling the sparse-index. That is likely too far out to really hope for, but it is the direction I am trying to go. As I indicate that we should carefully test each of these instances where ensure_full_index() _might_ be necessary before removing them, it is even more important to test the scenarios where the behavior changes from a full index with sparse-checkout. Preferably, we just change the behavior under sparse-checkout and then the sparse-index can match that (see "test_sparse_match" in t1092). Thanks, -Stolee