Jakub Narebski <jnareb@xxxxxxxxx> writes:

> "Praveen A" <pravi.a@xxxxxxxxx> writes:
>
> > Git currently does not handle unicode special characters ZWJ and ZWNJ,
> > both are heavily used in Malayalam and common in other languages
> > needing complex text layout like Sinhala and Arabic.
> >
> > An example of this is shown in the commit message here
> > http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
> >
> > \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
> > need to handle them as any other unicode character - especially as it is
> > a commit message and the expectation is normal plain text display.
> >
> > I hope someone will fix this.
>
> Well, I am a bit stumped. git_commit calls format_log_line_html, which
> in turn calls esc_html. esc_html looks like this:
>
>    sub esc_html ($;%) {
>        my $str = shift;
>        my %opts = @_;
>
> **     $str = to_utf8($str);
>        $str = $cgi->escapeHTML($str);
>        if ($opts{'-nbsp'}) {
>            $str =~ s/ /&nbsp;/g;
>        }
> **     $str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
>        return $str;
>    }
>
> The two important lines are marked with '**'.
[...]
> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
> belonging to '[:cntrl:]' class. I don't know if it is correct from the
> point of view of Unicode character classes, and therefore whether it is
> a bug in Perl, or just in gitweb.

I checked this, via this simple Perl script:

    #!/usr/bin/perl

    use charnames ":full";

    my $c = ord("\N{ZWNJ}");
    printf "oct=%o dec=%d hex=%x\n", $c, $c, $c;
    "\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";

And the answer was:

    oct=20014 dec=8204 hex=200c
    is [:cntrl:]

'ZERO WIDTH NON-JOINER' _is_ a control character...

We probably should use [^[:print:][:space:]] instead of [[:cntrl:]] here.

[...]
> P.S. Even that might not help much, as Savannah uses git and gitweb
> version 1.5.6.5, which is probably the version released with some major
> distribution. As of now we are at 1.6.0.5...
Which can be seen from the fact that gitweb uses octal escapes
instead of hex escapes...

-- 
Jakub Narebski
Poland
ShadeHawk on #git