On Tue, Nov 21, 2017 at 2:17 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote: > Teach the patience diff to support prohibiting a user-specified line > from appearing as a deletion or addition in the end result. > > Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx> > --- > I'm sending this out to see if a change similar to this would be > welcome. It is useful to me as a reviewer (to check my own code, e.g. > when checking [1]). Probably more design needs to go into this, > including the best way to specify the "anchor" line, and the correct > behavior when the anchor is either not found or appears more than once. > > Any thoughts? The background from this whole idea is that the Myers diff algorithm may produce the shortest diff, which is good for computer consumption or transport, but not ideal for human review. To accommodate human review we need to couple the diff with higher level concepts, such as move detection or ignoring blanks. I would imagine that this anchor can be set by a user to a function header or other significant line, such that the commit is more in line with the diff itself ("move code out of function A into its helper" would not want to have the "function A" line jump around, but the code should be removed from the function, hence you'd anchor the function). The solution you provide is a good thing to experiment with, but longer term, I would want to have huge record of configs in which humans selected the best diff, such that we can use that data to reason about better automatic diff generation. The diff heuristic was based on a lot of human generated data, that was generated by Michael at the time. I wonder if we want to permanently store the anchor so the data collection will happen automatically over time. I had a similar idea, which would affix a given coordinate of the map[1] to be on the path. I imagine it similar to e.g. Google Maps in which you can select intermittent way points on your route. When having this rather abstract coordinate, which can be given as line number in pre and post image, then we would not need to think about questions whether a given line is found or appears multiple times; however users like concise input the best, so the idea of an "anchor line" might be the best representation for the user, which is internally translated into way points. [1] think of http://simplygenius.net/ArticleFiles/DiffTutorial/diagonals.png Stefan > [1] > https://public-inbox.org/git/20171121221256.154741-1-jonathantanmy@xxxxxxxxxx/ > --- > t/t4033-diff-patience.sh | 13 +++++++++++++ > xdiff/xpatience.c | 29 +++++++++++++++++++++++++++-- > 2 files changed, 40 insertions(+), 2 deletions(-) > > diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh > index 113304dc5..2147fd688 100755 > --- a/t/t4033-diff-patience.sh > +++ b/t/t4033-diff-patience.sh > @@ -13,6 +13,19 @@ test_expect_success '--ignore-space-at-eol with a single appended character' ' > grep "^+.*X" diff > ' > > +test_expect_success 'anchor' ' > + printf "a\nb\nc\n" >pre && > + printf "c\na\nb\n" >post && > + > + # without anchor, c is moved > + test_expect_code 1 git diff --no-index --patience pre post >diff && > + grep "^+c" diff && > + > + # with anchor, a is moved > + DIFF_ANCHOR=c test_expect_code 1 git diff --no-index --patience pre post >diff && > + grep "^+a" diff or rather: "c is not moved, we don't care how the diff actually looks like", so maybe ! grep "+c" diff > diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c > index a44e77632..195a60e57 100644 > --- a/xdiff/xpatience.c > +++ b/xdiff/xpatience.c > @@ -62,6 +62,8 @@ struct hashmap { > * initially, "next" reflects only the order in file1. > */ > struct entry *next, *previous; > + > + unsigned anchor : 1; While this is RFC, I should not expect comments, though it would be nice to have them in the final series. ;-) > } *entries, *first, *last; > /* were common records found? */ > unsigned long has_matches; > @@ -70,6 +72,14 @@ struct hashmap { > xpparam_t const *xpp; > }; > > +static int is_anchor(const char *line) > +{ > + char *anchor = getenv("DIFF_ANCHOR"); > + if (!anchor) > + return 0; > + return !strncmp(line, anchor, strlen(anchor)); > +} > + > /* The argument "pass" is 1 for the first file, 2 for the second. */ > static void insert_record(int line, struct hashmap *map, int pass) > { > @@ -110,6 +120,7 @@ static void insert_record(int line, struct hashmap *map, int pass) > return; > map->entries[index].line1 = line; > map->entries[index].hash = record->ha; > + map->entries[index].anchor = is_anchor(map->env->xdf1.recs[line - 1]->ptr); > if (!map->first) > map->first = map->entries + index; > if (map->last) { > @@ -192,14 +203,28 @@ static struct entry *find_longest_common_sequence(struct hashmap *map) > int longest = 0, i; > struct entry *entry; > > + /* > + * If not -1, this entry in sequence must never be overridden. (Also, > + * do not override entries in sequence before this entry, since it is > + * useless.) > + */ > + int anchor_i = -1; > + > for (entry = map->first; entry; entry = entry->next) { > if (!entry->line2 || entry->line2 == NON_UNIQUE) > continue; > i = binary_search(sequence, longest, entry); > entry->previous = i < 0 ? NULL : sequence[i]; > - sequence[++i] = entry; > - if (i == longest) > + ++i; > + if (i <= anchor_i) > + continue; > + sequence[i] = entry; > + if (anchor_i == -1 && entry->anchor) { > + anchor_i = i; > + longest = anchor_i + 1; > + } else if (i == longest) { > longest++; > + } > } > > /* No common unique lines were found */ > -- > 2.15.0.448.gf294e3d99a-goog >