Re: Comparing rebase --am with --interactive via p3400

Phillip Wood <phillip.wood123@xxxxxxxxx> · Wed, 1 Apr 2020 15:00:19 +0100

Hi Alban and Johannes

On 01/04/2020 12:33, Alban Gruin wrote:
Hi Johannes,

Sorry for the late answer, I was really busy for the last months.

Le 31/01/2020 à 22:23, Johannes Schindelin a écrit :
Hi Alban,
  -%<-

Indeed. I offered these insights in #git-devel (slightly edited):

This `discard_index()` is in an awfully central location. I am rather
certain that it would cause problems to just remove it.

Looking at `do_merge()`: it explicitly discards and re-reads the index if
we had to spawn a `git merge` process (which we do if a strategy other
than `recursive` was specified, or if it is an octopus merge). But I am
wary of other code paths that might not be as careful.

I see that `do_exec()` is similarly careful.

I have to admit that the index is not my area of expertise in git, so
sorry if my question is stupid: isn't there a less heavy way to find
unstaged or uncommitted changes than discarding and then reloading the
index?

One thing I cannot fail to notice: we do not re-read a changed index
after running the `prepare-commit-msg` hook, or for that matter, any other
hook. That could even be an old regression from the conversion of the
interactive rebase to using the sequencer rather than a shell script.

Further, `reset_merge()` seems to spawn `git reset --merge` without
bothering to re-read the possibly modified index. Its callees are
`rollback_single_pick()`, `skip_single_pick()` and `sequencer_rollback()`,
none of which seem to be careful, either, about checking whether the index
was modified in the meantime.

Technically, the in-memory index should also be discarded
in `apply_autostash()`, but so far, we do not have any callers of that
function, I don't think, that wants to do anything but release resources
and exit.

The `run_git_checkout()` function discards, as intended. I
am not quite sure whether it needs to, though, unless the `.git/index`
file _was_ modified (it _is_ possible, after all, to run `git rebase -i
HEAD`, and I do have a use case for that where one of my scripts generates
a todo script, sort of a `git cherry-pick --rebase-merges`, because
`cherry-pick` does not support that mode).

I'm not sure it is worth optimizing the case where .git/index is not 
changed as we only do this once per rebase. In any case I hope that one 
day we'll stop forking git checkout and use the code from 
builtin/rebase.c to do it

The `continue_single_pick()` function spawns a `git
commit` which could potentially modify the index through a hook, but the
first call site does not care and the second one guards against that
(erroring out...).

My biggest concern is with the `run_git_commit()` function: it does not
re-read a potentially-modified index (think of hooks).

I agree that we should be re-reading the index after forking `git 
commit` and also `git merge`. Most of the time we commit without forking 
so that should not impact the performance too much

Thank you for your analysis.

We will need to be very careful with this `discard_index()`, I think, and
in my opinion there is a great opportunity here for cleaning up a little:
rather than discarding and re-reading the in-memory index without seeing
whether the on-disk index has changed at all appears a bit wasteful to me.

This could be refactored into a function that only discards and re-reads
the index if the mtime of `.git/index` changed. That function could then
also be taught to detect when the in-memory index has unwritten changes:
that would constitute a bug.

Hmm, checking if the mtime of the index to see if it changed isn't racy?
  Sub-second changes should happen, and to quote a comment in
is_racy_stat(), "nanosecond timestamped files can also be racy" -- even
if it should not really happen in the case of rebase…

I don't think relying on the index stat data is a good idea, git 
defaults to one second mtime resolution unless it is built with 
-DUSE_NSEC and we do way more than one commit a second. We tried to rely 
on stat data to determine when to re-read the todo list after an exec 
and it is broken (both in the design because it assumes ns mtime 
resolution and the implementation because we don't update the cached 
mtime after we rewrite the todo list). There are not that many places 
where we need to re-read the index so I think we should just have 
explicit re-reads where we need them. Hopefully over time we'll stop 
forking other processes and the problem will go away.

Best Wishes

Phillip

Ciao,
Dscho

Alban