> I timed this one versus the existing diff-highlight. It's about 7%
> slower. That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).

I think the reason is here:

> sub split_line {
>     local $_ = shift;
>     return map { /$COLOR/ ? $_ : ($mbcs ? $mbcs->strsplit('', $_) : split //) }
>            split /($COLOR)/;
> }

I removed "*" from "split /($COLOR*)/". Actually I don't know why "*" was
required, but I needed to remove it to make my patch work correctly.

On Fri, Apr 3, 2015 at 10:24 AM, Jeff King <peff@xxxxxxxx> wrote:
> On Thu, Apr 02, 2015 at 05:49:24PM -0700, Kyle J. McKay wrote:
>
>> Subject: [PATCH v2] diff-highlight: do not split multibyte characters
>>
>> When the input is UTF-8 and Perl is operating on bytes instead
>> of characters, a diff that changes one multibyte character to
>> another that shares an initial byte sequence will result in a
>> broken diff display as the common byte sequence prefix will be
>> separated from the rest of the bytes in the multibyte character.
>
> Thanks, I had a feeling we should be able to do something with perl's
> builtin utf8 support. This doesn't help people with other encodings,
> but I'm not sure the original was all that helpful either (in that we
> don't actually _know_ the file encodings in the first place).
>
> I briefly confirmed that this seems to do the right thing on po/bg.po,
> which has a couple of sheared characters when viewed with the existing
> code.
>
> I timed this one versus the existing diff-highlight. It's about 7%
> slower. That's not great, but is acceptable to me. The String::Multibyte
> version was a lot faster, which was nice (but I'm still unclear on
> _why_).
>
>> Fix this by putting Perl into character mode when splitting the
>> line and then back into byte mode after the split is finished.
>
> I also wondered if we could simply put stdin into utf8 mode. But it
> looks like it will barf whenever it gets invalid utf8. Checking for
> valid utf8 and only doing the multi-byte split in that case (as you do
> here) is a lot more robust.
>
>> While the utf8::xxx functions are built-in and do not require
>> any 'use' statement, the utf8::is_utf8 function did not appear
>> until Perl 5.8.1, but is identical to the Encode::is_utf8
>> function which is available in 5.8 so we use that instead of
>> utf8::is_utf8.
>
> Makes sense. I'm happy enough listing perl 5.8 as a dependency.
>
> EungJun, does this version meet your needs?
>
> -Peff
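
To illustrate the "*" change discussed at the top of this message, here is a small standalone sketch. It is not taken from diff-highlight: the $COLOR pattern is an assumed stand-in for an ANSI color escape, and pieces_with_star / pieces_without_star are hypothetical names. Because "$COLOR*" can match the empty string, split with that pattern may cut between every character, so the outer split hands many more pieces to the map. That could plausibly account for part of the timing difference, though it was not measured here.

```perl
use strict;
use warnings;

# Assumed stand-in for diff-highlight's $COLOR: one ANSI SGR escape sequence.
my $COLOR = qr/\x1b\[[0-9;]*m/;

# With "*", the pattern can also match the empty string, so split may cut
# the text between individual characters (with empty captured separators).
sub pieces_with_star {
    my ($line) = @_;
    return split /($COLOR*)/, $line;
}

# Without "*", split cuts only at real color sequences, leaving each run
# of plain text in a single piece.
sub pieces_without_star {
    my ($line) = @_;
    return split /($COLOR)/, $line;
}

my $line  = "\x1b[31mab\x1b[m";
my @star  = pieces_with_star($line);
my @plain = pieces_without_star($line);

printf "with *: %d pieces, without *: %d pieces\n",
    scalar @star, scalar @plain;
```

Both variants reassemble to the original line when joined; the difference is only in how many pieces the later per-piece processing has to visit.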
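
The "character mode only when the input is valid UTF-8" idea from the quoted thread can be sketched roughly as follows. This is an illustration under my own assumptions, not the code from the patch: split_chars is a hypothetical helper, and it uses only the built-in utf8::decode and utf8::encode functions. utf8::decode returns false when the bytes are not well-formed UTF-8, in which case the sketch falls back to a plain byte split.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the patch): split a byte string into
# characters when it is valid UTF-8, and into single bytes otherwise.
sub split_chars {
    my ($bytes) = @_;
    my $copy = $bytes;

    # utf8::decode converts in place and returns false on invalid UTF-8.
    if (utf8::decode($copy)) {
        # Split in character mode, then encode each piece back to bytes
        # so the caller keeps working with byte strings.
        return map { my $c = $_; utf8::encode($c); $c } split //, $copy;
    }

    # Invalid UTF-8: fall back to splitting the untouched bytes.
    return split //, $bytes;
}

# Two Cyrillic letters that share the leading byte 0xD0.
my @old = split_chars("\xd0\xb0");
my @new = split_chars("\xd0\xb1");
printf "%d piece(s) each, no shared one-byte prefix split off\n", scalar @old;
```

With a split like this, two characters that share a leading byte stay whole, so a later prefix comparison would work per character rather than per byte.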