On Tue, Mar 31, 2015 at 12:55:33AM +0900, Yi EungJun wrote: > From: Yi EungJun <eungjun.yi@xxxxxxxxxxxxx> > > Highlighted string might be broken if the common subsequence is a proper subset > of a multibyte character. For example, if the old string is "진" and the new > string is "지", then we expect the diff is rendered as follows: > > -진 > +지 > > but actually it was rendered as follows: > > -<EC><A7><84> > +<EC><A7><80> > > This fixes the bug by splitting the string by multibyte characters. Yeah, I agree the current output is not ideal, and this should address the problem. I was worried that multi-byte splitting would make things slower, but in my tests, it actually speeds things up! That surprised me. The big difference is calling $mbcs->strsplit instead of regular split. Could it be that it's much faster than regular split? Or is it that the resulting strings are faster for the rest of the processing (e.g., perl hits a "slow path" on the sheared characters)? It doesn't really matter, I guess, but certainly I was curious. > +use File::Basename; > +use File::Spec::Functions qw( catdir ); > +use String::Multibyte; Unfortunately, String::Multibyte is not a standard module, and is not even packed for Debian systems (I got mine from CPAN). Can we make this a conditional include (e.g., 'eval "require String::Multibyte"' in get_mbcs, and return undef if that fails?). Then people without it can still use the script. > +# Returns an instance of String::Multibyte based on the charset defined by > +# i18n.commitencoding or UTF-8, or undef if String::Multibyte doesn't support > +# the charset. Hrm. The characters we are processing are not in the commit message, but in the files themselves. In fact, there may be many different charsets (i.e., a different one for each file), and we really don't have a good way of knowing which is in play. I'd say that using the commit encoding is our best guess, though. What happens with $mbcs->split when the input is not a valid character in the charset (i.e., when we guess wrong)? If we are going to use the commit encoding, wouldn't i18n.logOutputEncoding be a better choice? > +sub get_mbcs { > + my $dir = catdir(dirname($INC{'String/Multibyte.pm'}), 'Multibyte'); > + opendir my $dh, $dir or return; > + my @mbcs_charsets = grep s/[.]pm\z//, readdir $dh; Yuck. This is a lot more intimate with String::Multibyte's implementation than I'd like to be. Could we perhaps just run the constructor on any candidates charsets, and then return the first hit that gives us something besides undef? -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html