On 3/17/2021 6:36 PM, Elijah Newren wrote: > On Wed, Mar 17, 2021 at 2:33 PM Derrick Stolee <stolee@xxxxxxxxx> wrote: >> >> On 3/17/2021 5:10 PM, Elijah Newren wrote: >>> On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@xxxxxxxxx> wrote: >>>> >>>> On 3/17/2021 1:50 PM, Elijah Newren wrote: >>>>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget >>>>> <gitgitgadget@xxxxxxxxx> wrote: >>>>> With the caveat in the commit message, this change looks okay, but >>>>> checkout-index may be buggy regardless of the presence of >>>>> ensure_full_index(). If ensure_full_index() really is needed here >>>>> because it needs to operate on all SKIP_WORKTREE paths and not just >>>>> leading directories, that's because it's writing all those >>>>> SKIP_WORKTREE entries to the working tree. When it writes them to the >>>>> working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in >>>>> a bit of a pickle... >>>> >>>> Perhaps I'm unclear in my intentions with this series: _every_ >>>> insertion of ensure_full_index() is intended to be audited with >>>> tests in the future. Some might need behavior change, and others >>>> will not. In this series, I'm just putting in the protections so >>>> we don't accidentally trigger unexpected behavior. >>> >>> I think this may be part of my qualms -- what do you mean by not >>> accidentally triggering unexpected behavior? In particular, does your >>> statement imply that whatever behavior you get after putting in >>> ensure_full_index() is "expected"? I think I'm reading that >>> implication into it, and objecting that the behavior with the >>> ensure_full_index() still isn't expected. You've only removed a >>> certain class of unexpected behavior, namely code that wasn't written >>> to expect tree entries that suddenly gets them. You haven't handled >>> the class of "user wants to work with a subset of files, why are all >>> these unrelated files being munged/updated/computed/shown/etc." >>> unexpected behavior. >> >> My intention is to ensure that (at this moment) choosing to use >> the on-disk sparse-index format does not alter Git's end-to-end >> behavior. >> >> I want to avoid as much as possible a state where enabling the >> sparse-index can start changing how Git commands behave, perhaps >> in destructive ways. >> >> By adding these checks, we ensure the in-memory data structure >> matches whatever a full index would have created, and then the >> behavior matches what Git would do there. It might not be the >> "correct" behavior, but it is _consistent_. > > That sounds good. Could this be included in > Documentation/technical/sparse-index.txt? There's only an oblique > reference to it when talking about grep and rm, which incidentally > were already updated by Matheus. Perhaps also a reference to the > warning in Documentation/git-sparse-checkout.txt would be worthwhile. As I was thinking about how to make this more clear, I realized that an update to that doc early in this series would be wise. Thanks. >>> I'm worrying that expectations are being set up such that working with >>> just a small section of the code will be unusably hard. There may be >>> several commands/flags where it could make sense to operate on either >>> (a) all files in the repo or (b) just on files within your sparse >>> paths. If, though, folks interpret operate-on-all-files as the >>> "normal" mode (and history suggests they will), then people start >>> adding all kinds of --no-do-this-sparsely flags to each command, and >>> then users who want sparse operation have to remember to type such a >>> flag with each and every command they ever run -- despite having taken >>> at least three steps already to get a sparse-index. >>> >>> I believe the extended discussions (for _months_!) on just grep & rm, >>> plus watching a --sparse patch being floated just in the last day for >>> ls-files suggest to me that this is a _very_ likely outcome and I'm >>> worried about it. >> >> It's these behavior changes that I would like to delay as much as >> possible and focus on the format and making commands fast that don't >> need a change in behavior. > > Delaying sounds great, just so long as that delay doesn't also cement > the behavior and confuse consistency for correctness. > > I still think we don't have correct behavior for sparse-checkouts in > many cases (I mean, when "git reset --hard" throws errors about not > removing files and then removes them, we've obviously got some > problems), but we had a decade long cementing of sorts with > SKIP_WORKTREE and now a year-or-two long cementing since > sparse-checkout was introduced and we never went through and cleaned > up the commands. We should at some point, especially since we put the > huge scary warning in Documenation/git-sparse-checkout.txt expressly > for this purpose. > > (I would have started this sooner, but trying to feed merge-ort and > keep up with patch review already keeps me at less time on non-git > projects than I think is expected for me. Once merge-ort is done...) > >> (Yes, there will be exceptions, like when "git add" specifically >> adds a file that is in a directory that should be out of the cone, >> but the user added it anyway. Atypical behavior like that can be >> slow for now.) > > I agree that there will be exceptions where we can't make the behavior > be fast, but I disagree with that specific example. "git add" should > just give a warning message and not add any file outside the cone, for > both sparse-index and sparse-checkout. That can be done quickly. That would be great! I would love behavior changes that make the performance work I'm interested in easier. > I know you want to delay the discussion of behavior fixes for specific > commands somewhat, but this particular discussion has already been > ongoing for about 4-12 months now (started at [1] and [2]) and it's > reached a point where I've put my Reviewed-by on it; see [3] for the > git-add piece specifically. > > [1] https://lore.kernel.org/git/9f2135f90ffea7f4ccb226f506bf554deab324cc.1605205427.git.matheus.bernardino@xxxxxx/ > [2] https://lore.kernel.org/git/0b9b4c4b414a571877163667694afa3053bf8890.1585027716.git.matheus.bernardino@xxxxxx/ > [3] https://lore.kernel.org/git/66d5c71182274c78e1fcfe84e77deb17e4f0d7e6.1615588109.git.matheus.bernardino@xxxxxx/ I'm in full support of these ideas to change the behavior, when possible. I would love to see that those changes make it really easy to integrate the sparse-index into the commands. >>>> Since tests take time to write and review, I was hoping that these >>>> insertions were minimal enough to get us to a safe place where we >>>> can remove the guards carefully. >>>> >>>> So with that in mind... >>>> >>>>> Might be nice to add a >>>>> /* TODO: audit if this is needed; if it is, we may have other bugs... */ >>>>> or something like that. But then again, perhaps you're considering >>>>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths? >>>>> If so, and we determine we really do need one and want to keep it >>>>> indefinitely, will we mark those with a comment about why it's >>>>> considered correct? >>>>> >>>>> I just want a way to know what still needs to be audited and what >>>>> doesn't without doing a lot of history spelunking... >>>> >>>> ...every insertion "needs to be audited" in the future. That's a >>>> big part of the next "phases" in the implementation plan. >>>> >>>> As you suggest, it might be a good idea to add a comment to every >>>> insertion, to mark it as un-audited, such as: >>>> >>>> /* TODO: test if ensure_full_index() is necessary */ >>>> >>>> We can come back later to delete the comment if it truly is >>>> necessary (and add tests to guarantee correct behavior). We can >>>> also remove the comment _and_ the call by modifying the loop >>>> behavior to do the right thing in some cases. >>> >>> If it's "needs to be audited for both performance reasons (can we >>> operate on fewer entries as an invisible doesn't-change-results >>> optimization) and correctness reasons (should we operate on fewer >>> entries and given a modified result within a sparse-index because >>> users would expect that, but maybe provide a special flag for the >>> users who want to operate on all files in the repo)" and there's also >>> an agreement that either audited or unaudited ones will be marked (or >>> both), then great, I'm happy. If not, can we discuss which part of my >>> performance/correctness/marking we aren't in agreement on? >> >> I will mark all of the ones I'm inserting. My hope is to eventually >> remove it entirely except for when disabling the sparse-index. That >> is likely too far out to really hope for, but it is the direction I >> am trying to go. >> >> As I indicate that we should carefully test each of these instances >> where ensure_full_index() _might_ be necessary before removing them, >> it is even more important to test the scenarios where the behavior >> changes from a full index with sparse-checkout. Preferably, we just >> change the behavior under sparse-checkout and then the sparse-index >> can match that (see "test_sparse_match" in t1092). > > Makes sense. I agree that it'd be nice to have the two generally > match, though I think we should be open to there being special cases > that differ. The only one I can think of right now is `git ls-files` > (list the entries in the index) -- since there are tree entries in a > sparse-index, ls-files would naturally show the tree entries. > However, we can discuss that and any other cases -- if there are any > -- later. Here, I at least want to make it very clear that the change is happening and include updates to the documentation. This is a case where it is unclear what the "promise" is to _external_ tools that depend on ls-files (somehow). What is their intended use of ls-files, and how do those expectations change when in a sparse-checkout? It's questions like this where I at least want to be making a clear change that motivates why a change in behavior is merited, then test that behavior to demonstrate that we expect that to be the same in perpetuity. Thanks, -Stolee