Re: Feature Request: Custom split location for `git add --patch`

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Sat, 18 Jun 2022 15:01:06 +0200 (CEST)

Hi Dennis,

On Thu, 16 Jun 2022, Dennis van Gerwen wrote:

> ## Problem
>
> If a hunk cannot be split automatically, the `s` (split) option disappears
> from the list of options, and we get:
>
> > (1/6) Stage this hunk [y,n,q,a,d,j,J,g,/,e,?]? s
> > Sorry, cannot split this hunk
>
>
> See for example these questions on StackOverflow: [2], [3]
>
> I am aware that this problem can be solved by using the [`--edit` option][4]
> to edit the hunk, as explained in the docs under [editing patches][5].
>
> However, if we have a large contiguous hunk, and we only want to split at one
> specific location, using `--edit` can be quite cumbersome, because (as far as
> I know) we need to edit all the lines that we do not want to stage.
>
> ## Feature request
>
> Would it be possible to add a "custom split location" feature, allowing the
> user to specify where a contiguous hunk should be split?
>
> Some options:
>
> - Always show the `s` (split hunk) option, and let the user specify a line
> number if an automatic split cannot be made.
>
> - Add a new [patch option][6], in addition to `s` (split hunk), to split any
> hunk at a custom location.
>
> - Add a new operation for editing patches, for example "s" (in addition to
> "+", "-", and " "), indicating where to split (e.g. "split after this line").
> This way the user only needs to modify a single line in edit mode.
>
> I'm sure there are a lot of nasty details to consider, but I do think such a
> feature would be very convenient.

While I am personally not interested in having this option (I find that I
always have to edit other aspects when splitting at non-context lines,
anyway, so I _have_ to use `e`), if you are interested in giving this a
go, I will be more than happy to help.

The most important thing to note is that there are currently _two_
implementations of the `git add -p` command, one in Perl and one in C. The
former one is slated to be dropped in favor of the latter, so I will
concentrate on the implementation in C.

The function implementing the current split logic is not exactly for the
faint of heart, ~150 lines of almost undocumented code:
https://github.com/git/git/blob/v2.37.0-rc0/add-patch.c#L892-L1049

But then, it does both a lot more _and_ a lot less of what you want: It
automatically finds all stretches of context lines and then expands the
indicated single hunk into as many hunks as it can split it into, sharing
the stretches of context lines between adjacent hunks.

Obviously, you would need to add a new function that is inspired by this
function, but differs in the following aspects:

- It requires an additional parameter, the line offset at which to split
  the hunk.

- It always splits the hunk into two hunks, no more, no less.

However, from my experience with the Git GUI code implementing line
staging, the biggest challenge will be that one of the assumptions of the
current hunk splitting design will be broken by your feature: so far,
when applying a hunk, no other hunks need to be touched: they remain
unchanged.

When splitting a hunk by context lines, the new hunks will share
overlapping context lines, but those remain invariant whether hunks are
applied or not.

With your desired feature, this would change. Take this hunk, for example:

	-- snip --
	@@
	 All journeys start with a first step.

	-[tbd]
	+Actually, some journeys, some interesting ones, end in a disaster
	+right at the start, because the first step was skipped.

	 And this concludes the chapter. Good night.
	-- snap --

Let's just split this for simplicity's sake between the `-` and the `+`
lines. So the first hunk would become:

	-- snip --
	@@
	 All journeys start with a first step.

	-[tbd]

	 And this concludes the chapter. Good night.
	-- snap --

and the second:

	-- snip --
	@@
	 All journeys start with a first step.

	 [tbd]
	+Actually, some journeys, some interesting ones, end in a disaster
	+right at the start, because the first step was skipped.

	 And this concludes the chapter. Good night.
	-- snap --

Note how the `-` line became a context line? It _has_ to be, otherwise the
second hunk could not be applied as-is. That is, until the first hunk has
been applied. In which case the second hunk needs to be modified:

	-- snip --
	@@
	 All journeys start with a first step.

	+Actually, some journeys, some interesting ones, end in a disaster
	+right at the start, because the first step was skipped.

	 And this concludes the chapter. Good night.
	-- snap --

Now the `[tbd]` line in the context has vanished.

This is really an important concept to keep in mind: if you split at
context lines, your hunks can remain immutable. If you want to split
elsewhere, after one of the hunks resulting from said split is a applied,
you have to recompute the other hunk.

In Git GUI, that's easy: it always presents the entire diff, and when
something gets (un-)staged, the entire diff is recomputed.

In `add -p`, we try very hard _not_ to stage anything until the user is
done deciding what to stage and what to skip. It then applies the entire
file diff in one fell swoop.

So why does it work with `edit`, then? You can edit a hunk and stage it,
right? That's because it is still a single hunk. It does not modify any
line that might be the context line of a different hunk.

With this in mind, if you are still interested to tackle this feature,
here are a couple of notes that should get you started:

- The hunks are recorded as an array of element type `struct hunk`
  (https://github.com/git/git/blob/v2.37.0-rc0/add-patch.c#L243-L248) in
  the `struct file_diff`
  (https://github.com/git/git/blob/v2.37.0-rc0/add-patch.c#L256-L261).

- Each hunk has a `start` and `end` offset. This is the offset into the
  diff that is stored in plain text in the `plain` strbuf in `struct
  add_p_state`.

  They have to be offsets instead of pointers because that strbuf can be
  modified, i.e. it might need to grow so much that its needs to move to a
  new memory location (which would invalidate all existing pointers).

- Likewise, there are `colored_start` and `colored_end` (byte) offsets
  that point into the `colored` strbuf of the `struct file_diff`. This is
  because the colored version of the diff may optionally be generated by
  another tool than Git itself, and the only requirement is that it
  generates the same amount of lines and Git assumes a 1:1 correlation
  between the lines.

- Each hunk also has a hunk header. The `old_offset`, `old_count`,
  `new_offset` and `new_count` attributes correspond to the diff hunk
  header, i.e. they refer to line numbers/counts in the pre/post image.

- When splitting a hunk, a new element has to be inserted into the `hunk`
  array of the `struct file_diff`, and the new hunk's numbers have to be
  adjusted carefully, as well as the hunk from which the new hunk was
  split out.

- Since the split hunks require context lines that the corresponding lines
  did not have in the original diff, you cannot simply adjust the `start`
  and `end` offsets to point into the existing diff.

  You have to use the same trick as the `edit` feature: append the new
  hunk with its context lines to `plain` (and to `colored`, unless that's
  empty), and then point to those offsets.

- You need to take care to record relationships between hunks so that one
  hunk's content can be adjusted accordingly if another hunk is staged to
  be applied and therefore changes the former hunk's context lines.

  This is similar to the role of the `delta` field that keeps track how
  the post image line number in the hunk header needs to be adjusted
  depending how previous hunks have been edited.

Ciao,
Dscho