Re: [PATCH 0/3] built-in add -p: support diff-so-fancy better

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Wed, 24 Aug 2022 23:05:38 +0200 (CEST)

Hi Junio & Philippe,

On Wed, 24 Aug 2022, Junio C Hamano wrote:

> Philippe Blain <levraiphilippeblain@xxxxxxxxx> writes:
>
> >>>> However, I've tried it on a more "real-life" example, and then I get:
> >>>>
> >>>>     error: mismatched output from interactive.diffFilter
> >>>>     hint: Your filter must maintain a one-to-one correspondence
> >>>>     hint: between its input and output lines.
> >>>>
> >>>> This is despite using "diff-so-fancy --patch" as interactive.diffFilter, which should
> >>>> keep the number of lines the same.
> >>>
> >>> Would you mind sharing the example with me?
> >>
> >> In trying to write a reproducer, I realize that this is in fact a separate "bug".
> >> I happened to have a submodule in the repository I was testing, and it had modified
> >> content. This is what was causing the "mismatched output" error, although I'm not
> >> sure why. If I use a pathspec that does not include the submodule when invoking
> >> 'git add -p', it works correctly.
> >>
> >> I'm a bit out of time now but I'll try to write a separate
> >> reproducer for this later today.
> >
> > So in trying to write a new reproducer, I found the cause of the
> > bug. ...

Ah. Submodules. The gift that keeps on giving.

I'm not _really_ complaining, though. It seems that with beautiful
regularity, submodules come up as crucial parts of attack vectors in
exploits that are reported to the git-security mailing list, requiring
somebody with my abilities to implement fixes. Therefore, submodules award
me with a sort of job security. :-)

> I somehow had an impression that that the C reimplementation was
> designed to be a bug-for-bug compatible faithful rewrite of the
> original scripted version,

Yes, it was designed as that, with a few things done differently, though,
such as avoiding unnecessary `git diff` invocations, and using C-native
data structures and memory management.

One unfortunate consequence of avoiding the `git diff` invocations of
`git-add--interactive.perl` that are not actually necessary is that I
missed that one of those invocations specifically ignores dirty
submodules, and that the Perl script then only processes files whose
numstat shows up in _that_ diff output.

Fortunately, the fix is easy: simply ignore dirty submodules in _all_ `git
diff` invocations of `add -p`.

I will submit the next iteration as soon as the PR builds pass.

> but looking at the way how this bug is handled, I am not sure (it looks
> as if the initial fix was to patch the code to match the symptom, not to
> make the code to behave more faithfully to the scripted version).

The built-in `add -p` does something slightly different from the Perl
version when it comes to the line ranges in hunk headers: instead of
storing stringified versions of them and parsing & re-rendering them when
the hunks are modified, the C version stores the actual numbers, adjusts
them as needed (e.g. when hunks are split), and, crucially, renders them
on the fly when displaying the hunks.

That means that also the colorized hunk headers are generated on the fly
whenever the hunks are displayed, and they are never stored in rendered
form, neither while parsing diffs nor when hunks are split or otherwise
edited. That's why the built-in `add -p` looked for a `@@ ... @@` part in
the colorized diff, so that it can display the remainder of that line
after showing the rendered line range.

And yes, this is a deviation from the bug-for-bug principle I follow for
conversions from scripted commands to built-in C code, but it was for a
good reason: memory management in C is less trivial than in Perl (where
one essentially YOLOs memory management), and I was not prepared to invent
a rope data structure just to be able to replace parts of a long string
buffer with rendered text of a different length, nor was I prepared to
call `memmove()` tons of times.

Note that the Perl version does something slightly iffy, too: if a diff
colorizer is configured, it initially shows the original hunk headers
(e.g. "@ file:16 @" in the `diff-so-fancy` case) but when the hunk is
modified in any way, it generates "@@ -1,6 +1,7 @@"-style hunk headers and
does not show the ones generated by the diff colorizer _at all_ anymore.

In this patch series, I teach the built-in `add -p` to be more lenient and
simply show the entire original hunk header after the rendered line range
if no `@@ ... @@` part was found. If that part was found, we still replace
it with the rendered line range.

Yes, it is a deviation from what the Perl version does. I consider it an
improvement, though.

> As with any big rewrite, a complete re-implementation always has risks
> to introduce unintended regressions and that is why starting with
> faithful rewrite with the staged transition plan is valuable.
>
> In this case, the staged transition plan, the step to flip the
> default without before remove the old one, is working beautifully.
> And it was only made possible with the actual users reporting issues
> before they say "the new one is broken, so let's use the knob to
> retain the old implementation".

Indeed, and thank you, Philippe, for bringing this to the Git mailing list
so that I could do something about the bug.

Ciao,
Dscho