Jakub Narebski <jnareb@xxxxxxxxx> wrote:

> Gitweb tries hard to properly process UTF-8 data, by marking output
> from git commands and contents of files as UTF-8 with the to_utf8()
> subroutine.  This ensures that gitweb prints UTF-8 correctly,
> e.g. in 'log' and 'commit' views.
>
> Unfortunately it misses another source of potentially Unicode input,
> namely query parameters.  The result is that one cannot search for a
> string containing characters outside US-ASCII.  For example, searching
> for "Michał Kiedrowicz" (containing letter 'ł' - LATIN SMALL LETTER L
> WITH STROKE, with Unicode codepoint U+0142, represented with 0xc5 0x82
> bytes in UTF-8 and percent-encoded as %C5%82) results in the following
> incorrect data in the search field:
>
>   MichaÅ Kiedrowicz
>
> This is caused by CGI by default treating '0xc5 0x82' bytes as two
> characters in Perl legacy encoding latin-1 (iso-8859-1), because the 's'
> query parameter is not processed explicitly as a UTF-8 encoded string.
>
> The solution used here follows the "Using Unicode in a Perl CGI script"
> article on http://www.lemoda.net/cgi/perl-unicode/index.html:
>
>   use CGI;
>   use Encode 'decode_utf8';
>   my $cgi = CGI->new;
>   my $value = $cgi->param('input');
>   $value = decode_utf8($value);
>
> Decoding UTF-8 is done when filling the %input_params hash and the
> $path_info variable; the former required moving from explicit
> $cgi->param(<label>) to $input_params{<name>} in a few places, which is
> a good idea anyway.
>
> Another required change was to add the -override=>1 parameter to the
> $cgi->textfield() invocation (in the search form).  Otherwise CGI would
> use the value from the query string if it is present, filling the field
> from $cgi->param... without decode_utf8().  As we are using the value of
> the appropriate parameter anyway, -override=>1 doesn't change the
> situation but makes gitweb fill the search field correctly.
>
> An alternative solution would be to simply use the '-utf8' pragma (via
> "use CGI '-utf8';"), but according to CGI.pm documentation it may
> cause problems with POST requests containing binary files... and
> it requires CGI 3.31 (I think), released with perl v5.8.9.
>
> Noticed-by: Michał Kiedrowicz <michal.kiedrowicz@xxxxxxxxx>
> Signed-off-by: Jakub Narębski <jnareb@xxxxxxxxx>
> ---
> On Fri, 3 Feb 2012, Michal Kiedrowicz wrote:
> > Jakub Narebski <jnareb@xxxxxxxxx> wrote:
>
> > > Is it what you mean by "this doesn't work for me", i.e. working
> > > search, garbage in search field?
> >
> > I mean "garbage in search field". Search works even without the patch
> > (at least on Debian with git-1.7.7.3, perl-5.10.1 and CGI-3.43; I
> > don't have my notebook nearby at the moment to check).
> [...]
>
> > > Damn.  If we use $cgi->textfield(-name => "s", -value => $searchtext)
> > > like in gitweb, CGI.pm would read $cgi->param("s") by itself -
> > > without decoding.
> >
> > Makes sense. When I tried calling to_utf8() in the line that defines
> > textfield (this was my first approach to this problem), it didn't
> > change anything.
>
> Yes, and it doesn't make sense in gitweb's case - we use the value of
> $cgi->param("s") as the default value of the text field anyway, but in
> a Unicode-aware way.
>
> > > To skip this we need to pass -force=>1 or
> > > -override=>1 (i.e. further changes to gitweb).
>
> This patch does this.
>
> Does it make it work for you?
>

Yes, it works for me. The search form properly displays "ł". Thanks!
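
For reference, the byte-versus-character mismatch described in the quoted
commit message can be reproduced outside gitweb. This is only a minimal
sketch using the core Encode module. The variable names are hypothetical
and it is not the code from the patch; it assumes the query value has
already been percent-decoded into raw bytes.

```perl
use strict;
use warnings;
use Encode 'decode_utf8';

# Raw bytes as they would look after percent-decoding "Micha%C5%82"
# (hypothetical stand-in for an undecoded query parameter).
my $raw = "Micha\xc5\x82";

# Without decoding, Perl treats every byte as one latin-1 character,
# so the two-byte UTF-8 sequence for U+0142 counts as two characters.
my $undecoded_length = length $raw;

# decode_utf8() converts the byte sequence into a character string.
my $decoded        = decode_utf8($raw);
my $decoded_length = length $decoded;
my $last_codepoint = ord substr($decoded, -1);

printf "undecoded: %d chars, decoded: %d chars, last char U+%04X\n",
    $undecoded_length, $decoded_length, $last_codepoint;
# prints "undecoded: 7 chars, decoded: 6 chars, last char U+0142"
```

The undecoded string is one character longer because 0xc5 and 0x82 are
counted separately, which is the same effect that puts garbage into the
search field.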