Re: [RFC/PATCH 0/2] Color moved code differently

Stefan Beller <sbeller@xxxxxxxxxx> · Sun, 4 Sep 2016 01:28:58 -0700

On Sat, Sep 3, 2016 at 11:41 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Stefan Beller <sbeller@xxxxxxxxxx> writes:
>
>>>  * On 2/2, doing it at xdiff.c level may be limiting this good idea
>>>    to flourish to its full potential, as the interface is fed only
>>>    one diff_filepair at a time.
>>
>> I realized that after I implemented it. I agree we would want to have
>> it function cross file.
>>
>> So from my current understanding of the code,
>> * diffcore_std would call a new function diffcore_detect_moved(void)
>>    just before diffcore_apply_filter is called.
>> * The new function diffcore_detect_moved would then check if the
>>    diff is a valid textual diff (i.e. real files, not submodules, but
>>    deletion/creation of one file is allowed)
>>    If so we generate the diff internally and as in 2/2 would
>>    hash all added/removed lines with context and store it.
>
> I do not think you should step outside diff_flush().

After looking further into it, I think we need to step outside
or at the very least deep down in the xdl cellar from there,
because we need the context around the line both in the
pre and after image.

>  Only when
> producing textual diff, you would have to run the textual diff
> twice by going over the q twice:
>
>  * The first pass would run diff_flush_patch(), which would call
>    into xdiff the usual way, but the callback from xdiff would
>    capture the removed lines and the added lines without making any
>    output.

The callback doesn't have the context around one line as easily
accessible as the xdl emit function, so rather we'd pass in a flag,
i.e. the hashmap where to store the added/removed lines or NULL.

Both passes (the first to find and the second enhanced pass that
outputs the lines) need to have access to the context of the respective
file.

>
>  * The second pass would run diff_flush_patch(), but the callback
>    from xdiff would be called with additional information, namely,
>    the removed and the added lines captured in the first pass.
>
>  * I suspect that the fn_out_consume() function that is used for a
>    normal case (i.e. when we are not doing this more expensive
>    "moved to/moved from" coloring) can be used for the second pass
>    above (i.e. the "priv" aka "ecbdata" may need to be extended so
>    that it can tell which mode of operation it is asked to perform),
>    but if there is not enough similarity between the second pass of
>    this "moved from/moved to" mode and the normal mode of output, it
>    is also OK to have two different callback functions, i.e. the
>    original one to be used in the normal mode, the second one that
>    knows the "these are moved without modification" coloring.  The
>    callback for the first pass is sufficiently different and I think
>    it is better to invent a new callback function to be used in the
>    first pass, instead of reusing fn_out_consume().

The callback doesn't have easy accessible data IMHO,
e.g. we may get these lines in the callback:

(function context not required:)
---8<---
 {
        int i;
        for (i = 0; i < GIT_SHA1_RAWSZ; i++) {
-               unsigned int val;
-               /*
-                * hex[1]=='\0' is caught when val is checked below,
-                * but if hex[0] is NUL we have to avoid reading
-                * past the end of the string:
-                */
-               if (!hex[0])
-                       return -1;
-               val = (hexval(hex[0]) << 4) | hexval(hex[1]);
-               if (val & ~0xff)
+               int val = hex2chr(hex);
+               if (val < 0)
                        return -1;
                *sha1++ = val;
                hex += 2;
---8<---

And for the first + line (int val = hex2chr(hex);) we need
to hash that line and 2 (made up threshold) before and after:

---8<---
        int i;
        for (i = 0; i < GIT_SHA1_RAWSZ; i++) {
               int val = hex2chr(hex);
               if (val < 0)
                        return -1;
---8<---

and this is hard to reconstruct from the above diff fed into the
callback. So we have to do it in the xdl emit phase.

So I think the wrong part of the initial patch 2/2 is the too
late recording of the moved lines.
So rather whenever we call diff_flush_patch(), we have to first
call diff_prepare_moved_line_detection() first for all file pairs.

>
>    The fn_out_consume() function working in the "second pass of
>    moved from/moved to mode" would inspect line[] and see if it is
>    an added or a removed line, and then:
>
>    - if it is an added line, and it appears as a removed line
>      elsewhere in the patchset (you obtained the information in the
>      first pass), you show it as "this was moved from elsewhere".
>
>    - if it is a removed line, and it appears as an added line
>      elsewhere in the patchset (you obtained the information in the
>      first pass), you show it as "this was moved to elsewhere".
>
> Or something like that.