Re: [PATCH] diff-highlight: Fix broken multibyte string

"Yi, EungJun" <semtlenori@xxxxxxxxx> · Fri, 3 Apr 2015 11:19:24 +0900

> I timed this one versus the existing diff-highlight. It's about 7%
> slower. That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).

I think the reason is here:

> sub split_line {
>    local $_ = shift;
>    return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs->strsplit('', $_) : split //) }
>           split /($COLOR)/;
> }

I removed "*" from "split /($COLOR*)/". Actually I don't know why "*"
was required but I need to remove it to make my patch works correctly.

On Fri, Apr 3, 2015 at 10:24 AM, Jeff King <peff@xxxxxxxx> wrote:
> On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:
>
>> Subject: [PATCH v2] diff-highlight: do not split multibyte characters
>>
>> When the input is UTF-8 and Perl is operating on bytes instead
>> of characters, a diff that changes one multibyte character to
>> another that shares an initial byte sequence will result in a
>> broken diff display as the common byte sequence prefix will be
>> separated from the rest of the bytes in the multibyte character.
>
> Thanks, I had a feeling we should be able to do something with perl's
> builtin utf8 support.  This doesn't help people with other encodings,
> but I'm not sure the original was all that helpful either (in that we
> don't actually _know_ the file encodings in the first place).
>
> I briefly confirmed that this seems to do the right thing on po/bg.po,
> which has a couple of sheared characters when viewed with the existing
> code.
>
> I timed this one versus the existing diff-highlight. It's about 7%
> slower. That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).
>
>> Fix this by putting Perl into character mode when splitting the
>> line and then back into byte mode after the split is finished.
>
> I also wondered if we could simply put stdin into utf8 mode. But it
> looks like it will barf whenever it gets invalid utf8. Checking for
> valid utf8 and only doing the multi-byte split in that case (as you do
> here) is a lot more robust.
>
>> While the utf8::xxx functions are built-in and do not require
>> any 'use' statement, the utf8::is_utf8 function did not appear
>> until Perl 5.8.1, but is identical to the Encode::is_utf8
>> function which is available in 5.8 so we use that instead of
>> utf8::is_utf8.
>
> Makes sense. I'm happy enough listing perl 5.8 as a dependency.
>
> EungJun, does this version meet your needs?
>
> -Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html