Re: Bug report: `git restore --source --staged` deals poorly with sparse-checkout

Elijah Newren <newren@xxxxxxxxx> · Wed, 5 Oct 2022 21:20:33 -0700

On Wed, Oct 5, 2022 at 1:00 PM Martin von Zweigbergk
<martinvonz@xxxxxxxxxx> wrote:
>
> On Wed, Oct 5, 2022 at 12:51 AM Elijah Newren <newren@xxxxxxxxx> wrote:
> >
[...]
> I agree with you and Stolee that there are two different cases: some
> people use sparse checkouts to restrict what they see (behavior A), and
> some people use it just as a performance optimization (behavior B). So I
> suspect we roughly agree about what should happen if you pass
> `--restrict` (or if that becomes the default so you don't actually need to
> pass it). My arguments were about the `--no-restrict` case. Sorry, I
> should have made that clear.
>
> I also agree that having a way to make commands restrict to certain paths
> by default is useful, and I agree that tying that set of paths to the current
> worktree's sparse spec makes sense.
>
> I'll answer the questions below for the `--no-restrict` case
> (behavior B).

I don't think your usecase matches behavior B.  I think we should
label your VFS usecase as behavior C and define it separately.  More
on that below...

[...]
> > What about merge/rebase/cherry-pick/revert?  Should those write
> > additional files to the working tree or avoid it?  What about if there
> > are conflicts outside the sparse specification?
>
> I think they should avoid it, but since the user will need to resolve
> that conflict anyway, I can see it makes sense to write them to disk
> if there are conflicts.
>
> >
> > And if extra files get written outside the sparse specification, are
> > there ways for this to get "cleaned up" where after resolving
> > conflicts or changes we can again remove the file from the working
> > tree?
>
> I've never really used `git sparse-checkout` (until I read your doc),
> but isn't that what `git sparse-checkout reapply` is for?

While that command is available for users that want to manually clean
things up proactively, my suspicion is that it is used very rarely --
especially now that we have the present-despite-skipped class of
issues fixed.  I suspect nearly all cleaning up is actually done as an
implicit side-effect of calls to unpack_trees(), which would affect
commands such `switch`, the switch-like portion of `checkout`, `reset
--hard`, `merge`, `rebase`, and many others.

All of these commands have two types of implicit clean-up they do as
part of their operation (which could be thought of as a
post-processing step): (1) marking *unmodified* files outside the
sparsity patterns as SKIP_WORKTREE in the index and removing them from
the working tree, and (2) taking files which match the sparsity
patterns which were previously SKIP_WORKTREE and flip them to
!SKIP_WORKTREE and restore them to the working tree.  I've got a few
examples of what this clean up looks like over at:
https://lore.kernel.org/git/CABPp-BHGrxLPu_S3y2zG-U6uo0rM5TYYEREZa2A=e=d9VZb2PA@xxxxxxxxxxxxxx/

I have no idea how this cleanup affects the VFS usecase; it's very
focused towards "sparse checkout means many files should NOT be
present in the working tree" which may be at odds with how the VFS
stuff is intended to behave.  But it's also been part of
sparse-checkout behavior the longest; for well over a decade now.

> > What about `git grep PATTERN`?  That's documented to search the
> > tracked files in the working tree.  But should that include the files
> > that would have been there were it not for the "performance
> > optimization" of not checking them out?  (Similarly, what about `git
> > grep --cached PATTERN` or `git grep PATTERN REVISION`?)  I mean, if
> > these commands should behave the same regardless of sparse
> > specification, then you should search those other files, right?  But
> > isn't that a nasty performance penalty if the user has a
> > sparse/partial clone since Git will have to do many network operations
> > to get additional blobs in order to search them?  Is that really
> > wanted?
>
> I think it's consistent to search them with `--no-restrict` (but not
> with `--restrict`, of course).
>
> > What about `git rm '*.png'` to remove all the tracked png files from
> > my working tree.  Should that also remove all the files that would
> > have been there were it not for the "performance optimization"?  Will
> > that result in very negative surprises for those with a "I want to
> > concentrate on just this subset of files" mental model?
>
> Same here.
>
> >
> > What about `git worktree add`?  Should the sparse specification be the
> > same for all worktrees?  Be per-worktree?  Should it default to dense?
> >  To just top-level sparse?  To the same sparsity as the worktree you
> > were in when you created a new one?
>
> That's an interesting case. If someone does `git worktree add` and expects
> all files to be available in the working copy, they might be surprised, yes.
> I think that's a much smaller risk than
> `git restore --source HEAD^ --staged && git commit -m 'undo changes'` being
> partial, however.

After you described the VFS usecase, I was guessing you'd answer how
you did for most of these commands.  Most of your answers do not match
the answers I'd expect for behavior B, which seems to me to support my
suspicion that you've got a third usecase.

In particular, I think the difference between Behavior B and your
usecase hinges on the expectation for the working tree:
   Behavior B: Files outside the sparse specification are NOT present
in the working tree.
   Behavior C (your usecase): Files outside the sparse specification
ARE "present" in the working tree, but Git doesn't have to put them
there (they'll be lazily put into place by something else, and the VFS
will ensure that users don't ever notice them actually missing, so far
all intents and purposes, the files are present).

In particular, that difference is perhaps most notable with `git grep`
(without --cached or REVISION flags); such a command is supposed to
search the worktree.  For Behavior B, files outside the sparse
specification are NOT present in the working tree, and hence those
files should NOT be searched.  For your usecase, as you highlight
above, you view all files as present in the working tree (even if Git
isn't the thing writing those files to the working tree and even if
they aren't technically present until you query whether they are
there), so all those files SHOULD be searched.

This difference about the "presence" of files has other knock-on
effects too.  Under Behavior B, users get used to working on just a
subset of files.  Thus `git rm '*.jpg'` or `git restore --source HEAD^
-- '*.md'` should NOT overwrite files outside the sparse specification
(but an error should be shown if the pathspec doesn't match anything,
and that error should point out how users can affect other files
outside the sparse specification).  Under your usecase, users are
always working on the full set of files and all of them can be viewed
in their working copy (as enforced by the filesystem intercepting any
attempts to view or edit files and suddenly magically materializing
them when users look) -- so users in your usecase are not expecting to
be working on a subset of files, and thus those commands would operate
tree-wide.

Similarly, under Behavior B, `git add outside/of/cone/path` should
throw an error.  If it doesn't, some future command will silently
remove the file from the working copy, which may confuse the user;
they are getting themselves into an erroneous state.  Users are
pointed to an override flag they can use if they want to accept the
consequences.  Under your usecase, since ALL files are always
"present" (but not materialized until an attempt to access is made),
that same command would be expected to run without an override and
with no error or warning.

Related to the above, under Behavior B, `git status` should probably
report the existence of any untracked files that do not match the
sparsity patterns as an erroneous condition (because why wait until
git-add to throw an error; let the user know early).  Under your
usecase, we wouldn't.

You might think I'm describing Behavior A above, but only because
Behavior A and Behavior B overlap on worktree-related operations.  The
primary difference between Behavior A and Behavior B is with behavior
of history-related operations.  Behavior B says "the working tree is
sparse, but history is dense; if I do a query on older revisions of
history (grep/diff/log/etc.) then give me results across all paths",
whereas Behavior A says "I only care about this subset of files, both
for the working tree and for history.  Unless I override,
grep/diff/etc. on history should restrict all output to files within
the sparse specification.

One potential way to (over)simplify this would be:
    Behavior A: `--scope=sparse` for both worktree and history operations
    Behavior B: `--scope=sparse` for worktree operations,
`--scope=all` for history operations
    Behavior C: `--scope=all` for both worktree and history operations