Jakub Narebski <jnareb@xxxxxxxxx> wrote:

> On Thu, 2 Feb 2012, Jakub Narebski wrote:
> > On Thu, 2 Feb 2012, Michał Kiedrowicz wrote:
> > > Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> > >
> > > > Gitweb tries hard to properly process UTF-8 data, by marking
> > > > output from git commands and contents of files as UTF-8 with
> > > > the to_utf8() subroutine. This ensures that gitweb prints
> > > > UTF-8 correctly, e.g. in the 'log' and 'commit' views.
> > > >
> > > > Unfortunately it misses another source of potentially Unicode
> > > > input, namely query parameters. The result is that one cannot
> > > > search for a string containing characters outside US-ASCII.
> > > > For example, searching for "Michał Kiedrowicz" (containing
> > > > letter 'ł' - LATIN SMALL LETTER L WITH STROKE, with Unicode
> > > > codepoint U+0142, represented with 0xc5 0x82 bytes in UTF-8 and
> > > > percent-encoded as %C5%82) results in the following incorrect
> > > > data in the search field
> > > >
> > > >   MichaÅ Kiedrowicz
> > > >
> > > > This is caused by CGI by default treating '0xc5 0x82' bytes as
> > > > two characters in Perl legacy encoding latin-1 (iso-8859-1),
> > > > because the 's' query parameter is not processed explicitly as
> > > > a UTF-8 encoded string.
> > > >
> > > > The solution used here follows the "Using Unicode in a Perl CGI
> > > > script" article on
> > > > http://www.lemoda.net/cgi/perl-unicode/index.html:
> > > >
> > > >   use CGI qw(param);
> > > >   use Encode 'decode_utf8';
> > > >   my $value = param('input');
> > > >   $value = decode_utf8($value);
> > > >
> > > > This is done when filling the %input_params hash; this required
> > > > moving from explicit $cgi->param(<label>) to
> > > > $input_params{<name>} in a few places.
> > >
> > > I'm sorry but this doesn't work for me. I would be happy to help
> > > if you have some questions about it.
> >
> > Strange. http://www.lemoda.net/cgi/perl-unicode/index.html says
> > that those two approaches should be equivalent. The -utf8 pragma
> > version doesn't work for me at all, while this one works in that it
> > finds what it is supposed to, but shows garbage in the search form.
>
> Is that what you mean by "this doesn't work for me", i.e. working
> search, garbage in search field?

I mean "garbage in search field". Search works even without the patch
(at least on Debian with git-1.7.7.3, perl-5.10.1 and CGI-3.43; I don't
have my notebook nearby at the moment to check).

> > Will investigate.

Thanks for the time you are spending on this. I wouldn't call this
problem "production critical", but it seems wrong to support UTF-8
properly everywhere except for one place.

> Damn. If we use $cgi->textfield(-name => "s", -value => $searchtext)
> like in gitweb, CGI.pm would read $cgi->param("s") by itself -
> without decoding.

Makes sense. When I tried calling to_utf8() in the line that defines
the textfield (this was my first approach to this problem), it didn't
change anything.

> To skip this we need to pass -force=>1 or
> -override=>1 (i.e. further changes to gitweb).
>
> -utf8 pragma works with more modern CGI.pm, but does not with 3.10.
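
For reference, here is a minimal sketch of the decoding difference discussed above. It uses only the core Encode module and does not load CGI.pm, so the byte string is a hand-written stand-in (an assumption) for what $cgi->param("s") would return after percent-decoding. The variable names are illustrative and are not taken from gitweb.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for the raw, percent-decoded query parameter "Micha%C5%82".
my $raw = "Micha\xc5\x82";

# Undecoded, Perl treats the two bytes as two latin-1 characters.
my $undecoded_len = length $raw;       # 7

# decode_utf8() turns the byte pair 0xc5 0x82 into one character.
my $decoded     = decode_utf8($raw);
my $decoded_len = length $decoded;     # 6
my $last_cp     = sprintf "U+%04X", ord substr($decoded, -1);   # U+0142

binmode STDOUT, ':encoding(UTF-8)';
print "undecoded length: $undecoded_len\n";
print "decoded length:   $decoded_len\n";
print "last codepoint:   $last_cp\n";
```

This only illustrates the byte-versus-character distinction. It does not cover the textfield case, where CGI.pm re-reads the undecoded parameter unless -force or -override is passed.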