Re: [PATCH 18/19] diff: buffer all output if asked to

Stefan Beller <sbeller@xxxxxxxxxx> · Tue, 16 May 2017 09:42:23 -0700




On Mon, May 15, 2017 at 9:14 PM, Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> Overall, this patch seems larger than it should to me, although there
> might be good reasons for that that I don't know. I'll remark on what
> I find unexpected.
>


>>
>> +static void emit_buffered_patch_line_ws(struct diff_options *o,
>> +                                       struct buffered_patch_line *e,
>> +                                       const char *ws, unsigned ws_rule)
>
> This introduces a new _ws emission function - how is this used and how
> is this different from the non-ws one? I see BPL_EMIT_LINE_WS, but I
> don't see the caller that introduces that constant in this patch.

See emit_line_ws, which makes use of BPL_EMIT_LINE_WS.
The difference between BPL_EMIT_LINE_WS and BPL_EMIT_LINE_AS_IS
is that _WS is emitted marking up whitespace differently (e.g. 8 continuous
spaces instead of a tab or such), see the core.whitespace option.
Relevant hunk:

@@ -557,9 +616,12 @@ static void emit_line_ws(struct diff_options *o,
                         const char *line, int len,
                         const char *ws, unsigned ws_rule)
 {
-       emit_line_0(o, set, reset, sign, "", 0);
-       ws_check_emit(line, len, ws_rule,
-                     o->file, set, reset, ws);
+       struct buffered_patch_line e = {set, reset, line, len, sign,
BPL_EMIT_LINE_WS};
+
+       if (o->use_buffer)
+               append_buffered_patch_line(o, &e);
+       else
+               emit_buffered_patch_line_ws(o, &e, ws, ws_rule);
 }

>> +       switch (e->state) {
>> +               case BPL_EMIT_LINE_ASIS:
>> +                       emit_buffered_patch_line(o, e);
>> +                       break;
>> +               case BPL_EMIT_LINE_WS:
>> +                       emit_buffered_patch_line_ws(o, e, ws, ws_rule);
>> +                       break;
>> +               case BPL_HANDOVER:
>> +                       o->current_filepair++;
>
> If we're just buffering the diff output, do we need to store
> per-file-pair metadata? (I assume that's why you need a special
> handover constant.) Clients can already read what they need from the
> diff output.

Currently we keep only whitespace settings per-file separate as they are
defined per path (via attributes).

So I read this comment as if you consider the per-file buffer unneeded and
we could just detect the next file via the line

    <line-prefix>diff --git a/<file> b/<file>

and then re-read the attributes and remember only the current files settings in
the output pass. I'll look into that.

>> +        *
>> +        * TODO (later in this series):
>> +        * We'll unset this flag in a later patch.
>> +        */
>> +       o->use_buffer = 1;
>
> What I would do is to add a demonstration patch at the end of the
> patch series (which is not supposed to be queued) to avoid such churn
> in history, but I'm not sure how the Git project prefers to do this.

ok, I can omit this part in a reroll.

>> + *
>> + * NEEDSWORK: Instead of storing a copy of the line, add an offset pointer
>> + * into the pre/post image file. This pointer could be a union with the
>> + * line pointer. By storing an offset into the file instead of the literal line,
>> + * we can decrease the memory footprint for the buffered output. At first we
>> + * may want to only have indirection for the content lines, but we could
>> + * also have an enum (based on sign?) that stores prefabricated lines, e.g.
>> + * the similarity score line or hunk/file headers.
>
> This would be nice, but come to think of it, might not be possible.
> When requesting --word-diff, control characters (or others) might
> appear in the output, right?

That is why we'd have even more states. ;)
Or duplicate word diff still, but lines left intact are referenced via offsets.
It's a hard problem to get right, so I defer it via this comment.

>
>> + */
>> +struct buffered_patch_line {
>> +       const char *set;
>> +       const char *reset;
>> +       const char *line;
>> +       int len;
>> +       int sign;
>> +       enum {
>> +               BPL_EMIT_LINE_WS,
>> +               BPL_EMIT_LINE_ASIS,
>> +               BPL_HANDOVER
>> +       } state;
>
> It might be better, for simplicity, just to have one big buffer
> including everything (if we decide that we really can't add pointers
> to input later).

What do you mean here? (Drop the other structs such as the file pair?)

Thanks for the review!
Stefan