On Wed, Mar 17, 2021 at 2:33 PM Derrick Stolee <stolee@xxxxxxxxx> wrote: > > On 3/17/2021 5:10 PM, Elijah Newren wrote: > > On Wed, Mar 17, 2021 at 1:05 PM Derrick Stolee <stolee@xxxxxxxxx> wrote: > >> > >> On 3/17/2021 1:50 PM, Elijah Newren wrote: > >>> On Tue, Mar 16, 2021 at 2:17 PM Derrick Stolee via GitGitGadget > >>> <gitgitgadget@xxxxxxxxx> wrote: > >>> With the caveat in the commit message, this change looks okay, but > >>> checkout-index may be buggy regardless of the presence of > >>> ensure_full_index(). If ensure_full_index() really is needed here > >>> because it needs to operate on all SKIP_WORKTREE paths and not just > >>> leading directories, that's because it's writing all those > >>> SKIP_WORKTREE entries to the working tree. When it writes them to the > >>> working tree, is it clearing the SKIP_WORKTREE bit? If not, we're in > >>> a bit of a pickle... > >> > >> Perhaps I'm unclear in my intentions with this series: _every_ > >> insertion of ensure_full_index() is intended to be audited with > >> tests in the future. Some might need behavior change, and others > >> will not. In this series, I'm just putting in the protections so > >> we don't accidentally trigger unexpected behavior. > > > > I think this may be part of my qualms -- what do you mean by not > > accidentally triggering unexpected behavior? In particular, does your > > statement imply that whatever behavior you get after putting in > > ensure_full_index() is "expected"? I think I'm reading that > > implication into it, and objecting that the behavior with the > > ensure_full_index() still isn't expected. You've only removed a > > certain class of unexpected behavior, namely code that wasn't written > > to expect tree entries that suddenly gets them. You haven't handled > > the class of "user wants to work with a subset of files, why are all > > these unrelated files being munged/updated/computed/shown/etc." > > unexpected behavior. > > My intention is to ensure that (at this moment) choosing to use > the on-disk sparse-index format does not alter Git's end-to-end > behavior. > > I want to avoid as much as possible a state where enabling the > sparse-index can start changing how Git commands behave, perhaps > in destructive ways. > > By adding these checks, we ensure the in-memory data structure > matches whatever a full index would have created, and then the > behavior matches what Git would do there. It might not be the > "correct" behavior, but it is _consistent_. That sounds good. Could this be included in Documentation/technical/sparse-index.txt? There's only an oblique reference to it when talking about grep and rm, which incidentally were already updated by Matheus. Perhaps also a reference to the warning in Documentation/git-sparse-checkout.txt would be worthwhile. > > I'm worrying that expectations are being set up such that working with > > just a small section of the code will be unusably hard. There may be > > several commands/flags where it could make sense to operate on either > > (a) all files in the repo or (b) just on files within your sparse > > paths. If, though, folks interpret operate-on-all-files as the > > "normal" mode (and history suggests they will), then people start > > adding all kinds of --no-do-this-sparsely flags to each command, and > > then users who want sparse operation have to remember to type such a > > flag with each and every command they ever run -- despite having taken > > at least three steps already to get a sparse-index. > > > > I believe the extended discussions (for _months_!) on just grep & rm, > > plus watching a --sparse patch being floated just in the last day for > > ls-files suggest to me that this is a _very_ likely outcome and I'm > > worried about it. > > It's these behavior changes that I would like to delay as much as > possible and focus on the format and making commands fast that don't > need a change in behavior. Delaying sounds great, just so long as that delay doesn't also cement the behavior and confuse consistency for correctness. I still think we don't have correct behavior for sparse-checkouts in many cases (I mean, when "git reset --hard" throws errors about not removing files and then removes them, we've obviously got some problems), but we had a decade long cementing of sorts with SKIP_WORKTREE and now a year-or-two long cementing since sparse-checkout was introduced and we never went through and cleaned up the commands. We should at some point, especially since we put the huge scary warning in Documenation/git-sparse-checkout.txt expressly for this purpose. (I would have started this sooner, but trying to feed merge-ort and keep up with patch review already keeps me at less time on non-git projects than I think is expected for me. Once merge-ort is done...) > (Yes, there will be exceptions, like when "git add" specifically > adds a file that is in a directory that should be out of the cone, > but the user added it anyway. Atypical behavior like that can be > slow for now.) I agree that there will be exceptions where we can't make the behavior be fast, but I disagree with that specific example. "git add" should just give a warning message and not add any file outside the cone, for both sparse-index and sparse-checkout. That can be done quickly. I know you want to delay the discussion of behavior fixes for specific commands somewhat, but this particular discussion has already been ongoing for about 4-12 months now (started at [1] and [2]) and it's reached a point where I've put my Reviewed-by on it; see [3] for the git-add piece specifically. [1] https://lore.kernel.org/git/9f2135f90ffea7f4ccb226f506bf554deab324cc.1605205427.git.matheus.bernardino@xxxxxx/ [2] https://lore.kernel.org/git/0b9b4c4b414a571877163667694afa3053bf8890.1585027716.git.matheus.bernardino@xxxxxx/ [3] https://lore.kernel.org/git/66d5c71182274c78e1fcfe84e77deb17e4f0d7e6.1615588109.git.matheus.bernardino@xxxxxx/ > >> Since tests take time to write and review, I was hoping that these > >> insertions were minimal enough to get us to a safe place where we > >> can remove the guards carefully. > >> > >> So with that in mind... > >> > >>> Might be nice to add a > >>> /* TODO: audit if this is needed; if it is, we may have other bugs... */ > >>> or something like that. But then again, perhaps you're considering > >>> all uses of ensure_full_index() to be need-to-be-reaudited codepaths? > >>> If so, and we determine we really do need one and want to keep it > >>> indefinitely, will we mark those with a comment about why it's > >>> considered correct? > >>> > >>> I just want a way to know what still needs to be audited and what > >>> doesn't without doing a lot of history spelunking... > >> > >> ...every insertion "needs to be audited" in the future. That's a > >> big part of the next "phases" in the implementation plan. > >> > >> As you suggest, it might be a good idea to add a comment to every > >> insertion, to mark it as un-audited, such as: > >> > >> /* TODO: test if ensure_full_index() is necessary */ > >> > >> We can come back later to delete the comment if it truly is > >> necessary (and add tests to guarantee correct behavior). We can > >> also remove the comment _and_ the call by modifying the loop > >> behavior to do the right thing in some cases. > > > > If it's "needs to be audited for both performance reasons (can we > > operate on fewer entries as an invisible doesn't-change-results > > optimization) and correctness reasons (should we operate on fewer > > entries and given a modified result within a sparse-index because > > users would expect that, but maybe provide a special flag for the > > users who want to operate on all files in the repo)" and there's also > > an agreement that either audited or unaudited ones will be marked (or > > both), then great, I'm happy. If not, can we discuss which part of my > > performance/correctness/marking we aren't in agreement on? > > I will mark all of the ones I'm inserting. My hope is to eventually > remove it entirely except for when disabling the sparse-index. That > is likely too far out to really hope for, but it is the direction I > am trying to go. > > As I indicate that we should carefully test each of these instances > where ensure_full_index() _might_ be necessary before removing them, > it is even more important to test the scenarios where the behavior > changes from a full index with sparse-checkout. Preferably, we just > change the behavior under sparse-checkout and then the sparse-index > can match that (see "test_sparse_match" in t1092). Makes sense. I agree that it'd be nice to have the two generally match, though I think we should be open to there being special cases that differ. The only one I can think of right now is `git ls-files` (list the entries in the index) -- since there are tree entries in a sparse-index, ls-files would naturally show the tree entries. However, we can discuss that and any other cases -- if there are any -- later.