Re: [PATCH] diff-highlight: Fix broken multibyte string

Jeff King <peff@xxxxxxxx> · Mon, 30 Mar 2015 18:16:35 -0400

On Tue, Mar 31, 2015 at 12:55:33AM +0900, Yi EungJun wrote:

> From: Yi EungJun <eungjun.yi@xxxxxxxxxxxxx>
> 
> Highlighted string might be broken if the common subsequence is a proper subset
> of a multibyte character. For example, if the old string is "진" and the new
> string is "지", then we expect the diff is rendered as follows:
> 
> 	-진
> 	+지
> 
> but actually it was rendered as follows:
> 
>     -<EC><A7><84>
>     +<EC><A7><80>
> 
> This fixes the bug by splitting the string by multibyte characters.

Yeah, I agree the current output is not ideal, and this should address
the problem. I was worried that multi-byte splitting would make things
slower, but in my tests, it actually speeds things up!

That surprised me. The big difference is calling $mbcs->strsplit instead
of regular split. Could it be that it's much faster than regular split?
Or is it that the resulting strings are faster for the rest of the
processing (e.g., perl hits a "slow path" on the sheared characters)? It
doesn't really matter, I guess, but certainly I was curious.

> +use File::Basename;
> +use File::Spec::Functions qw( catdir );
> +use String::Multibyte;

Unfortunately, String::Multibyte is not a standard module, and is not
even packed for Debian systems (I got mine from CPAN). Can we make this
a conditional include (e.g., 'eval "require String::Multibyte"' in
get_mbcs, and return undef if that fails?). Then people without it can
still use the script.

> +# Returns an instance of String::Multibyte based on the charset defined by
> +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support
> +# the charset.

Hrm. The characters we are processing are not in the commit message, but
in the files themselves. In fact, there may be many different charsets
(i.e., a different one for each file), and we really don't have a good
way of knowing which is in play. I'd say that using the commit
encoding is our best guess, though. What happens with $mbcs->split when
the input is not a valid character in the charset (i.e., when we guess
wrong)?

If we are going to use the commit encoding, wouldn't
i18n.logOutputEncoding be a better choice?

> +sub get_mbcs {
> +	my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte');
> +	opendir my $dh, $dir or return;
> +	my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh;

Yuck. This is a lot more intimate with String::Multibyte's
implementation than I'd like to be. Could we perhaps just run the
constructor on any candidates charsets, and then return the first hit
that gives us something besides undef?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html