Re: gitweb and unicode special characters

Jakub Narebski <jnareb@xxxxxxxxx> writes:
> "Praveen A" <pravi.a@xxxxxxxxx> writes:
> 
> > Git currently does not handle unicode special characters ZWJ and ZWNJ,
> > both are heavily used in Malayalam and common in other languages
> > needing complex text layout like Sinhala and Arabic.
> > 
> > An example of this is shown in the commit message here
> > http://git.savannah.gnu.org/gitweb/?p=smc.git;a=commit;h=c3f368c60aabdc380c77608c614d91b0a628590a
> > 
> > \20014 and \20015 should have been ZWNJ and ZWJ respectively. You just
> > need to handle them like any other unicode character, especially since
> > this is a commit message and the expectation is normal plain text display.
> > 
> > I hope someone will fix this.
> 
> Well, I am a bit stumped.  git_commit calls format_log_line_html, which
> in turn calls esc_html.  esc_html looks like this:
> 
>   sub esc_html ($;%) {
>   	my $str = shift;
>   	my %opts = @_;
>   
>   **	$str = to_utf8($str);
>   	$str = $cgi->escapeHTML($str);
>   	if ($opts{'-nbsp'}) {
>   		$str =~ s/ /&nbsp;/g;
>   	}
>   **	$str =~ s|([[:cntrl:]])|(($1 ne "\t") ? quot_cec($1) : $1)|eg;
>   	return $str;
>   }
> 
> The two important lines are marked with '**'.
[...]

> So it looks like Perl treats \20014 and \20015 (ZWNJ and ZWJ) as
> belonging to the '[:cntrl:]' class. I don't know whether that is correct
> from the point of view of Unicode character classes, and therefore
> whether it is a bug in Perl or just in gitweb.

I checked this, via this simple Perl script:

  #!/usr/bin/perl

  use charnames ":full";

  my $c = ord("\N{ZWNJ}");
  printf "oct=%o dec=%d hex=%x\n", $c, $c, $c;

  "\N{ZWNJ}" =~ /[[:cntrl:]]/ and print "is [:cntrl:]";

And the answer was:

  oct=20014 dec=8204 hex=200c
  is [:cntrl:]

'ZERO WIDTH NON-JOINER' _is_ a control character as far as Perl's
[:cntrl:] is concerned... We probably should use [^[:print:][:space:]]
instead of [[:cntrl:]] here.
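
A minimal sketch of that idea, not a tested patch.  Both sub names
below are hypothetical: quot_cec_sketch only stands in for gitweb's
quot_cec (whose body is not shown in this thread), and
escape_unprintable stands in for the last substitution in esc_html.
Whether ZWNJ and ZWJ match [:cntrl:] or [:print:] may depend on the
Perl version, so the script reports what the running Perl does
instead of assuming a result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's quot_cec(): render one character
# as a backslash-hex escape of its code point.
sub quot_cec_sketch {
	my $chr = shift;
	return sprintf('\\x%02x', ord($chr));
}

# Hypothetical helper: quote only characters that are neither printable
# nor whitespace, rather than everything matching [[:cntrl:]].
sub escape_unprintable {
	my $str = shift;
	$str =~ s|([^[:print:][:space:]])|quot_cec_sketch($1)|eg;
	return $str;
}

# Report, for the Perl actually running this, how each character is
# classified and whether the sketch would quote it.
for my $pair (["ZWNJ", "\x{200C}"], ["ZWJ", "\x{200D}"],
              ["BEL",  "\a"],       ["TAB", "\t"]) {
	my ($name, $chr) = @$pair;
	printf "%-4s cntrl=%-3s quoted_by_sketch=%s\n", $name,
		($chr =~ /[[:cntrl:]]/          ? "yes" : "no"),
		(escape_unprintable($chr) ne $chr ? "yes" : "no");
}
```

One difference from the current code to keep in mind: [:space:] covers
more than TAB, so newline and carriage return would also pass through
unquoted with this class, while the existing substitution exempts only
"\t".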

[...]
> P.S. Even that might not help much, as Savannah uses git and gitweb
> version 1.5.6.5, which is probably the version released with some major
> distribution.  As of now we are at 1.6.0.5...

Which can be seen from the fact that gitweb uses octal escapes,
instead of hex escapes...

-- 
Jakub Narebski
Poland
ShadeHawk on #git