Re: [PATCH 00/10] [RFC] In-tree sparse-checkout definitions

Elijah Newren <newren@xxxxxxxxx> · Wed, 17 Jun 2020 22:03:09 -0700

On Wed, Jun 17, 2020 at 8:01 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 6/17/2020 9:59 PM, Elijah Newren wrote:
> > On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
> >>
> >> On 6/17/2020 7:14 PM, Elijah Newren wrote:
> >> You mentioned in another thread that it is a bit unwieldy for a user
> >> to rely on a committed (or staged?) file, so adding the ability to
> >> check the working directory first is interesting. I wonder how the
> >> timing comes into play when changing HEAD to a new commit? Seems
> >> tricky, but solvable.
> >
> > Isn't that essentially the same timing issue that comes into play if
> > you only look at the index, and are changing HEAD to a new commit?
>
> It adds to the complexity. We can inspect the new index for the
> in-tree definition and update the skip-worktree bits before actually
> changing the working directory to match the new index. However, if
> we trust the working directory copy over the index copy, then we
> need to also decide if the working directory copy _would_ change
> in this move before using it.
>
> Of course, maybe I'm making it over-complicated. I still know so
> little about the index. I got into this feature due to a simple
> pattern-matching problem that I could understand, but now I'm
> getting lost in index states and dirty statuses. ;)

Honestly, I think your first cut for switching branches with this new
feature should just be:
  1) Do a switch/checkout exactly the way it's done today already:
    1a) Load the index and existing sparsity rules (from worktree then
index), according to what existed before the branch switch
    1b) Do the appropriate 1-way or 2-way merge from unpack_trees() to
move to the right branch (as builtin/checkout.c already does)
  2) *If* using in-tree sparsity feature, re-load the sparsity rules
(from worktree then index) and run the equivalent of `sparse-checkout
reapply`

Not only does this avoid messing up any code for the normal case, I'm
pretty sure this gets the behavior the user expects in all the special
cases.  There is one big downside, and a little downside.  The big
downside is that it'll have two progress meters instead of just one
(much like sparse-checkout init && sparse-checkout set do today).  The
little downside is that this isn't optimal from a performance
standpoint, for two reasons:
(a) some files may be updated in the working tree in step 1b despite
the fact that they'll be removed in step 2.
(b) step 2 may just be an expensive no-op, in fact I suspect it will
be most the time

The reason I think the performance isn't a big deal is because:
C) the fact that (b) is a problem means (a) is not often an issue --
the sparsity patterns are usually the same.
D) Even if the sparsity patterns differ, it's often the case that
files within the tree won't change (people tend to only modify a few
files per commit, after all, so even switching branches only updates a
subset of all files).
E) Even if the sparsity patterns differ and files differ and have to
be updated, it's still proportional at most to the number of files
selected by the sparsity patterns, so it won't feel excessive or slow
to the user.
F) `sparse-checkout reapply` is pretty cheap, especially if called as
a function instead of as a subprocess

In my opinion, other than the unfortunate dual progress meters, the
only place where this new feature gets hairy is that `git
sparse-checkout reapply` (or its libified variant) now needs to know
how to "read your new form of sparsity rules".  That sounds easy at
first but in my opinion it is a bit of a mess.  The user could have
written garbage to the file and I don't know what to do with garbage,
but I think we have to do something sane.  (Give up and checkout
everything?  Give up and checkout nothing but the toplevel directory?
Ignore each line we don't understand and treat the rest as stuff that
needs to be checked out?  Can that last one even be done with the .ini
reader?)  I don't think saying "well, tough luck the user should know
better" works because we already know that merges or rebases or
checkout -m could have written garbage to those files (in the form of
conflict markers), and users will say that *we* were the ones that
filled the file with garbage.  So we have to handle it intelligently
in some fashion.

Thoughts?