Re: [PATCHv3] diff.c: emit moved lines with a different color

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 5 Sep 2016 19:20:07 -0700

On Mon, Sep 5, 2016 at 6:17 PM, Jacob Keller <jacob.keller@xxxxxxxxx> wrote:
> On Sun, Sep 4, 2016 at 4:42 PM, Stefan Beller <stefanbeller@xxxxxxxxx> wrote:
>> When we color the diff, we'll mark moved lines with a different color.
>>
>
> Excellent idea. This is a very neat way to show extra information
> without cluttering the diff output.
>
>> This is achieved by doing a two passes over the diff. The first pass
>> will inspect each line of the diff and store the removed lines and the
>> added lines in its own hash map.
>> The second pass will check for each added line if that is found in the
>> set of removed lines. If so it will color the added line differently as
>> with the new `moved-new` color mode. For each removed line we check the
>> set of added lines and if found emit that with the new color `moved-old`.
>>
>
> Makes sense.
>
>> When detecting the moved lines, we cannot just rely on a line being equal,
>> but we need to take the context into account to detect when the moves were
>> reordered as we do not want to color moved but per-mutated lines.
>> To do that we use the hash of the preceding line.
>
> Also makes sense.
>
>>
>> This patch was motivated by e.g. reviewing 3b0c4200 ("apply: move
>> libified code from builtin/apply.c to apply.{c,h}", 2016-08-08)
>>
>
> Yes, this would be quite helpful.
>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index 0bcb679..f4f51c2 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -980,8 +980,9 @@ color.diff.<slot>::
>>         of `context` (context text - `plain` is a historical synonym),
>>         `meta` (metainformation), `frag`
>>         (hunk header), 'func' (function in hunk header), `old` (removed lines),
>> -       `new` (added lines), `commit` (commit headers), or `whitespace`
>> -       (highlighting whitespace errors).
>> +       `new` (added lines), `commit` (commit headers), `whitespace`
>> +       (highlighting whitespace errors), `moved-old` (removed lines that
>> +       reappear), `moved-new` (added lines that were removed elsewhere).
>>
>
> I liked Junio's "Moved from" and "moved to" but I think moved old and
> moved new are ok as well.

as we do not want to see dashes ('moved-old'), I think I'l go with
"movedfrom" and "movedto".

>
>> @@ -287,6 +304,25 @@ int git_diff_basic_config(const char *var, const char *value, void *cb)
>>         return git_default_config(var, value, cb);
>>  }
>>
>> +static int moved_entry_cmp(const struct moved_entry *a,
>> +                          const struct moved_entry *b,
>> +                          const void *unused)
>> +{
>> +       return strcmp(a->line, b->line) &&
>> +              a->hash_prev_line == b->hash_prev_line;
>
> So we're only comparing them if they match and have a matching
> previous line? That seems pretty reasonable to reduce the cost of
> computing exact copied sequences.
>
>> +       if (ecbdata->opt->color_moved) {
>> +               int h = memhash(line, len);
>> +               hash_prev_removed = h;
>> +               hash_prev_added = h;
>> +       }
>>  }
>
> If I understand, this is to ensure that we don't keep re-hashing each
> line right?

No, this is to ensure we have the context sensitivity of one prior line.

In the collection phase we look at each line of the patch and make a hash of it.
Then we store the hash temporarily (think of a state machine that goes line by
line and always keeps the hash of the last line)

What we store in the hashmaps is the hash(current line) ^
hash(previous applicable line).
With previous applicable line I mean any line starting with " " or "+"
when the current
line starts with "+" and " " or "-" when the current line starts with "-".

When going through the second pass and actually emitting colored lines
we only find matches in the hash map if the current line AND the previous line
match as we lookup by hash code, i.e. if we have a moved line, but the
previous line
changed we do not find it in the hashmap and we don't color it, such
that the reviewer
can spot a permutation.

So in the second phase we also need to have access to previous line, so maybe
we could also go with just taking the line with us instead of 2 hash codes.
But that implementation detail seems like a trade off to me, where I'd lean
on keeping the hashes around as lines may be very long in bad cases, whereas
the hashcode is short and it is a cheap hash.

(I am referring to http://i.imgur.com/MnaSZ1D.png where in the malicious
case all lines were moved to there as well, but permutated)