Re: [PATCH] diff-highlight: Fix broken multibyte string

"Kyle J. McKay" <mackyle@xxxxxxxxx> · Thu, 2 Apr 2015 18:59:50 -0700

On Apr 2, 2015, at 18:24, Jeff King wrote:

On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:

Subject: [PATCH v2] diff-highlight: do not split multibyte characters

When the input is UTF-8 and Perl is operating on bytes instead
of characters, a diff that changes one multibyte character to
another that shares an initial byte sequence will result in a
broken diff display as the common byte sequence prefix will be
separated from the rest of the bytes in the multibyte character.

Thanks, I had a feeling we should be able to do something with perl's
builtin utf8 support.  This doesn't help people with other encodings,

It should work as well as the original did for any 1-byte encoding.   
That is, if it's not valid UTF-8 it should pass through unchanged and  
any single byte encoding should just work.  But, as you point out,  
multibyte encodings other than UTF-8 won't work, but they should  
behave the same as they did before.

but I'm not sure the original was all that helpful either (in that we
don't actually _know_ the file encodings in the first place).

I think it should work fine on any single byte encoding (i.e. ISO-8859- 
x, WINDOWS-1252, etc.).

I timed this one versus the existing diff-highlight. It's about 7%
slower.

I'd expect that, we're doing extra work we weren't doing before.

That's not great, but is acceptable to me. The String::Multibyte
version was a lot faster, which was nice (but I'm still unclear on
_why_).

Must be the mbcs->strsplit routine has special case code for splitting  
on '' to just split on character boundaries.

Fix this by putting Perl into character mode when splitting the
line and then back into byte mode after the split is finished.

I also wondered if we could simply put stdin into utf8 mode. But it
looks like it will barf whenever it gets invalid utf8. Checking for
valid utf8 and only doing the multi-byte split in that case (as you do
here) is a lot more robust.

While the utf8::xxx functions are built-in and do not require
any 'use' statement, the utf8::is_utf8 function did not appear
until Perl 5.8.1, but is identical to the Encode::is_utf8
function which is available in 5.8 so we use that instead of
utf8::is_utf8.

Makes sense. I'm happy enough listing perl 5.8 as a dependency.

Maybe that should be added.  The rest of Git's perl code seems to have  
a 'use 5.008;' already, so I figured that was a reasonable  
dependency.  :)

-Kyle
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html