Re: [PATCH/RFC (version B)] gitweb: Allow UTF-8 encoded CGI query parameters and path_info

Jakub Narebski <jnareb@xxxxxxxxx> wrote:

> On Thu, 2 Feb 2012, Jakub Narebski wrote:
> > On Thu, 2 Feb 2012, Michał Kiedrowicz wrote:
> > > Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> > > 
> > > > Gitweb tries hard to properly process UTF-8 data, by marking
> > > > output from git commands and contents of files as UTF-8 with
> > > > the to_utf8() subroutine.  This ensures that gitweb prints
> > > > UTF-8 correctly, e.g. in the 'log' and 'commit' views.
> > > > 
> > > > Unfortunately it misses another source of potentially Unicode
> > > > input, namely query parameters.  The result is that one cannot
> > > > search for a string containing characters outside US-ASCII.
> > > > For example, searching for "Michał Kiedrowicz" (containing
> > > > letter 'ł' - LATIN SMALL LETTER L WITH STROKE, with Unicode
> > > > codepoint U+0142, represented with 0xc5 0x82 bytes in UTF-8 and
> > > > percent-encoded as %C5%82) results in the following incorrect
> > > > data in the search field:
> > > > 
> > > > 	Michał Kiedrowicz
> > > > 
> > > > This is caused by CGI by default treating '0xc5 0x82' bytes as
> > > > two characters in Perl legacy encoding latin-1 (iso-8859-1),
> > > > because 's' query parameter is not processed explicitly as
> > > > UTF-8 encoded string.
> > > > 
> > > > The solution used here follows "Using Unicode in a Perl CGI
> > > > script" article on
> > > > http://www.lemoda.net/cgi/perl-unicode/index.html:
> > > > 
> > > > 	use CGI qw(:standard);
> > > > 	use Encode 'decode_utf8';
> > > > 	my $value = param('input');
> > > > 	$value = decode_utf8($value);
> > > > 
> > > > This is done when filling the %input_params hash; this required
> > > > moving from explicit $cgi->param(<name>) to
> > > > $input_params{<name>} in a few places.
> > > 
> > > I'm sorry but this doesn't work for me. I would be happy to help
> > > if you have some questions about it.
> > 
> > Strange.  http://www.lemoda.net/cgi/perl-unicode/index.html says
> > that those two approaches should be equivalent.  The -utf8 pragma
> > version doesn't work for me at all, while this one works in that it
> > finds what it is supposed to, but shows garbage in the search form.
> 
> Is that what you mean by "this doesn't work for me", i.e. working
> search, garbage in search field?

I mean "garbage in search field". Search works even without the patch
(at least on Debian with git-1.7.7.3, perl-5.10.1 and CGI-3.43; I
don't have my notebook nearby at the moment to check).
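
For what it's worth, the byte-versus-character mix-up described in the
quoted commit message can be reproduced outside gitweb.  The following
is only a minimal sketch using the core Encode module; the variable
names and the hard-coded string are made up for illustration and are
not taken from gitweb:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw bytes as they would look after percent-decoding "Micha%C5%82 Kiedrowicz"
# (hypothetical input, hard-coded here instead of read from a CGI parameter).
my $raw = "Micha\xC5\x82 Kiedrowicz";

# Without decoding, Perl treats every byte as one latin-1 character,
# so the two bytes of U+0142 count as two characters.
my $raw_len = length($raw);

# After decoding, the two bytes become the single character U+0142.
my $decoded     = decode_utf8($raw);
my $decoded_len = length($decoded);

# Sending the undecoded string through a UTF-8 output step encodes each
# byte separately (0xC5 -> C3 85, 0x82 -> C2 82), which is the kind of
# garbage that ends up in the search field.
my $double_encoded = encode_utf8($raw);

printf "undecoded: %d characters, decoded: %d characters\n",
    $raw_len, $decoded_len;
```

The undecoded string is one character longer than the decoded one,
and that extra latin-1 "character" is what gets re-encoded on output.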

>  
> > Will investigate.

Thanks for the time you are spending on this. I wouldn't call this problem
"production critical" but it seems wrong to support UTF-8 everywhere
properly except for one place.

> 
> Damn.  If we use $cgi->textfield(-name => "s", -value => $searchtext)
> like in gitweb, CGI.pm would read $cgi->param("s") by itself -
> without decoding. 

Makes sense. When I tried calling to_utf8() in the line that defines
the textfield (this was my first approach to this problem), it didn't
change anything.

> To skip this we need to pass -force => 1 or
> -override => 1 (i.e. further changes to gitweb).
> 
> -utf8 pragma works with more modern CGI.pm, but does not with 3.10.
> 
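
If I read the above correctly, the difference between the default
"sticky" textfield and one with -override could be sketched roughly as
below.  This assumes CGI.pm is installed (it is no longer bundled with
recent Perl); the query string and variable names are made up for
illustration, and this is not the actual gitweb change:

```perl
use strict;
use warnings;
use CGI;
use Encode qw(decode_utf8);

# Hypothetical query string standing in for a real search request.
my $cgi = CGI->new('s=Micha%C5%82');

# Decode the parameter ourselves, as the patch does for %input_params.
my $searchtext = decode_utf8(scalar $cgi->param('s'));

# Default behaviour: CGI.pm re-reads param('s') itself and ignores -value,
# so the undecoded bytes are what land in the form field.
my $sticky = $cgi->textfield(-name => 's', -value => $searchtext);

# With -override (or -force) the supplied, already decoded -value is used.
my $forced = $cgi->textfield(-name => 's', -value => $searchtext,
                             -override => 1);

# Print on a UTF-8 layer so the decoded character is written out as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$sticky\n$forced\n";
```

Whether the two fields actually render differently will depend on the
CGI.pm version and on the output layer, so please treat it only as a
starting point for testing.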