Re: [PATCH/RFC] Gitweb: Convert UTF-8 encoded file names


 



On Thu, May 15, 2014 at 8:48 PM, Michael Wagner <accounts@xxxxxxxxxxx> wrote:
> On Thu, May 15, 2014 at 10:04:24AM +0100, Peter Krefting wrote:
>> Michael Wagner:
>>
>>>Decoding the UTF-8 encoded file name (again with an additional print
>>>statement):
>>>
>>>$ REQUEST_METHOD=GET QUERY_STRING='p=notes.git;a=blob_plain;f=work/G%C3%83%C2%BCtekriterien.txt;hb=HEAD' ./gitweb.cgi
>>>
>>>work/Gütekriterien.txt
>>>Content-disposition: inline; filename="work/Gütekriterien.txt"
>>
>> You should fix the code path that created that URI, though, as it is not
>> what you expected.
>>
>> %C3%83 decodes to U+00C3 Latin Capital Letter A With Tilde
>> %C2%BC decodes to U+00BC Vulgar Fraction One Quarter
>>
>> The proper UTF-8 encoding for ü (U+00FC) is, as you can probably guess from
>> looking at which two characters the sequence above yielded, C3 BC, which in
>> a URI is represented as %C3%BC.
>>
>> Your QUERY_STRING should thus be
>>
>>   p=notes.git;a=blob_plain;f=work/G%C3%BCtekriterien.txt;hb=HEAD
>>
>> which probably works as expected.
>>
>> What is happening is that whatever is generating the URI is UTF-8-encoding
>> the string twice (i.e., it generates a string with the proper C3 BC in it,
>> and then interprets it as iso-8859-1 data and runs that through a UTF-8
>> encoder again, yielding the C3 83 C2 BC sequence you see above).
>
> The subroutine "git_tree" generates the tree view. It stores the output
> of "git ls-tree -z ..." in an array named "@entries". Printing the content
> of this array yields the following result:
>
> 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a     520 Gütekriterien.txt
>
> This leads to the "doubled" encoding. Declaring the encoding in the call
> to open yields the following result:
>
> 100644 blob 6419cd06a9461c38d4f94d9705d97eaaa887156a     520 Gütekriterien.txt

Good catch.
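
To make the double encoding Peter describes concrete, here is a minimal
sketch. It is not gitweb code: percent_encode() is a hypothetical helper
written only for this illustration, and the input is just the two bytes
for "ü".

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as git would emit them for "ü": the UTF-8 sequence C3 BC.
my $bytes = "\xC3\xBC";

# Mistake: the bytes are never decoded, so Perl treats each byte as a
# Latin-1 character; encoding that string to UTF-8 encodes a second time.
my $double = encode('UTF-8', $bytes);

# Correct: decode the bytes into a character string first, then encode once.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);

# Hypothetical helper (not from gitweb): percent-encode every byte.
sub percent_encode {
    my ($octets) = @_;
    return join '', map { sprintf '%%%02X', ord($_) } split //, $octets;
}

print percent_encode($double), "\n";   # %C3%83%C2%BC
print percent_encode($single), "\n";   # %C3%BC
```

The first line printed matches the sequence in the broken QUERY_STRING,
the second the one Peter says it should contain.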

Writing a test for this would not be easy, and would require some HTML
parser (WWW::Mechanize, Web::Scraper, HTML::Query, pQuery, ..., or a
low-level parser such as HTML::TreeBuilder).

> ---
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index a9f57d6..f1414e1 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -7138,7 +7138,7 @@ sub git_tree {
>         my @entries = ();
>         {
>                 local $/ = "\0";
> -               open my $fd, "-|", git_cmd(), "ls-tree", '-z',
> +               open my $fd, "-|:encoding(UTF-8)", git_cmd(), "ls-tree", '-z',
>                         ($show_sizes ? '-l' : ()), @extra_options, $hash
>                         or die_error(500, "Open git-ls-tree failed");

Or put

                   binmode $fd, ':utf8';

like in the rest of the code.
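
A rough sketch of the effect, outside of gitweb: the child perl below is
only a stand-in for "git ls-tree -z" (an assumption made so the example
runs without a repository), and read_entries() is a hypothetical helper,
not a gitweb function.

```perl
use strict;
use warnings;

# Stand-in for "git ls-tree -z": a child perl printing one NUL-terminated,
# UTF-8 encoded name. (Assumption for illustration; gitweb runs git_cmd().)
my @cmd = ($^X, '-e', 'print "G\xC3\xBCtekriterien.txt\0"');

# Hypothetical helper: read NUL-separated records from the pipe,
# optionally applying an I/O layer to the handle first.
sub read_entries {
    my ($layer) = @_;
    local $/ = "\0";
    open my $fd, '-|', @cmd or die "Cannot run child: $!";
    binmode $fd, $layer if defined $layer;
    my @entries = map { chomp; $_ } <$fd>;
    close $fd;
    return @entries;
}

my ($raw)     = read_entries();          # undecoded bytes
my ($decoded) = read_entries(':utf8');   # decoded characters

# "ü" is two bytes but one character, so the lengths differ by one.
printf "raw: %d, decoded: %d\n", length($raw), length($decoded);
```

Without the layer the name arrives as 18 bytes; with it, as 17
characters, so a later UTF-8 encode on output happens only once. Note
that ':utf8' does not validate its input, while ':encoding(UTF-8)' does.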

>                 @entries = map { chomp; $_ } <$fd>;
>

An even better solution would be to use

    use open IN => ':encoding(utf-8)';

at the beginning of gitweb.perl, once and for all.

Unfortunately, the output equivalent requires creating a Perl
module for gitweb, to be able to use

    use open OUT => ':encoding(utf-8-with-fallback)';

-- 
Jakub Narebski