Re: [PATCH] rebase, cherry-pick, revert: only run from toplevel

Elijah Newren <newren@xxxxxxxxx> · Tue, 31 Aug 2021 23:48:23 -0700

On Tue, Aug 31, 2021 at 9:59 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Tue, Aug 31, 2021 at 09:43:15PM -0700, Elijah Newren wrote:
> > On Tue, Aug 31, 2021 at 7:55 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 01:14:55PM -0700, Elijah Newren wrote:
> > > > > Now I have spent zero time looking into actually coding this, so it may
> > > > > turn out to be much trickier than I am suggesting. But this seems like a
> > > > > much more fruitful direction, where we can protect users in cases where
> > > > > they benefit (and give them sensible and actionable error messages),
> > > > > without bothering people in the majority of cases where their cwd
> > > > > doesn't go away.
> > > >
> > > > Ooh, this sounds intriguing to me...but what if we changed that rule
> > > > slightly and just decided to never make the cwd go away?  Currently,
> > > > the checkout code removes directories if they have no tracked or
> > > > untracked or ignored files left, i.e. if they're empty.  What if we
> > > > decide to only have remove_scheduled_dirs() remove directories that
> > > > are empty AND they are not the current working directory?
> > >
> > > Hmm. My first thought after reading this is that it would cause
> > > surprising behavior for anybody who had 'git add --all' in their 'rebase
> > > -x' script. But would it?
> > >
> > > I.e., imagine somebody doing an in-place sed in a rebase and then `git
> > > add --all`-ing the result at each point in history. If the directory
> > > they were in ever went away, then the *next* revision would add that
> > > directory right back.
> > >
> > > That behavior seems somewhat surprising to me, or at least I could
> > > imagine it being surprising to users.
> >
> > I'm not following.  `git add --all` doesn't add empty directories, so
> > I don't see how my proposed change would cause any problems in such a
> > case; nothing would be added back.
>
> Ahh, it was I who wasn't following. You were proposing to leave the
> directories in place but empty. Agreed that there wouldn't be any
> problems with that.
>
> > > Another thought is what should happen when the current directory goes
> > > away and then comes back as a file? We wouldn't be able to checkout that
> > > file, I don't think, so it might be a dead end.
> >
> > I'm not following this either.  Peff's original suggestion was to
> > error out only when we knew it could cause problems, in particular
> > when the working directory would be removed.  Here I've shifted the
> > way the problem is viewed by just not removing the working directory,
> > but the end result is the same -- it errors out when the removal was
> > needed.  Given that erroring out is exactly what we wanted for a case
> > like this, why does that make it a dead end?
>
> The way you shifted the problem makes it possible for us to discover
> that only right before we're about to fail, right? In other words, if
> you're doing a rebase then you're potentially leaving a lot of wasted
> work on the table if you realize halfway through the operation that you
> couldn't complete it.

Sure, it shifts things towards failing later.  However, for
correctness reasons, merge-ort has already fundamentally shifted
everything working-directory related towards failing later.

merge-recursive (the old algorithm), via unpack-trees, tried to do
checks for "locally modified files in the way" and "untracked
files/directories in the way" early on in the process.  However, such
checks were unaware of rename detection logic and thus had both false
positives and false negatives in terms of throwing errors and saying
the merge could not proceed.  Those false positives and negatives were
intrinsic to the design of merge-recursive; they'll never be fixed.
The only correct fix is to fundamentally change the design -- do the
entire merge operation without touching the working tree or index, and
only after you have a new resulting merge tree, then do a checkout
operation to switch from the old tree to the new merge result tree;
only at that final checkout step do you fail if there are locally
modified files or untracked files/directories in the way.  That's what
merge-ort does.

merge-ort is also written to allow a whole rebase to be done
"in-memory" and only update the working tree and index after we have
the final end-tree.  test-tool fast-rebase already works this way.  In
such a case, we'd only notice an untracked directory being in the way
at the point we hit a conflict or when we got to the end of the
rebase.  So again, erroring out when there is a directory with >= 1
untracked files (and at the end of the operation) isn't a new issue at
all.  The only thing that is new is we'd also error out if a directory
had 0 untracked files in it.

However, I disagree with the "a lot of wasted work"; rebase is
hundreds or thousands of times faster than it used to be as you
highlighted in your blog post.  ;-)  Also, as per my --remerge-diff
thread, merge-ort can remerge 33.5 merge commits/second in linux.git.
And that's just merges; rebases are faster than merges because (a) the
intersection (rather than union) of changes on both sides of history
tends to be much smaller for rebases than merges, due to the fact that
individual commits tend to only modify a handful of files (and
merge-ort optimizations work very well when the intersection of files
changed on both sides is small), and (b) we can cache rename detection
results for the upstream side of history in rebases.

> Even though I think that's *not* what you want for rebase, it ironically

I think it's totally what we want for rebase; it's perfectly aligned
with the merge-ort correctness changes to move things to
fail-at-the-end for working-directory-related problems.

> *is* what you might want for bisect, since the path we'll take isn't
> known ahead of time. So even if there are some paths that would result
> in a directory -> file conversion in the cwd, we don't need to fail the
> operation ahead of time if the bisection doesn't actually take that
> path.
>
> On the other hand, it does still leave a lot of work on the table if the
> bisection does eventually want to change the current working directory
> into a file, depending on where that change happens.
>
> Maybe those cases are pretty niche in practice. I have to imagine that
> they are. If my guess is right, then I think your approach makes sense.

:-)