Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jakub Narębski <jnareb@xxxxxxxxx> writes:

> On 08.04.2013, Junio C Hamano wrote:
>> jk@xxxxxxxxxxxx (Jürgen Kreileder) writes:
>> 
>>> Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. 
>>>
>>> Signed-off-by: Jürgen Kreileder <jk@xxxxxxxxxxxx>
>
> I see that this patch does (or tries to do) two _independent_ things,
> and should be split into at least two separate commits:

Agreed.

>>> ---
>> 
>> Thanks, will queue but not hold until I hear something from Jakub.
>> 
>>>  gitweb/gitweb.perl |    8 +++++++-
>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>>> index 1309196..9cfe5b5 100755
>>> --- a/gitweb/gitweb.perl
>>> +++ b/gitweb/gitweb.perl
>>> @@ -3823,7 +3823,7 @@ sub blob_contenttype {
>>>  	my ($fd, $file_name, $type) = @_;
>>>  
>>>  	$type ||= blob_mimetype($fd, $file_name);
>>> -	if ($type eq 'text/plain' && defined $default_text_plain_charset) {
>>> +	if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) {
>>>  		$type .= "; charset=$default_text_plain_charset";
>>>  	}
>
> First, it extends adding "; charset=$default_text_plain_charset" to
> other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
> mimetypes (without changing name of variable... though this is more
> complicated as it is configuration variable and we would want to
> preserve backward compatibility, but at least a comment would be,
> I think, needed).
>  
> Originally it applied only to 'text/plain' files, which can be
> displayed inline by web browser, and which need charset in
> 'Content-Type:' HTTP header to be displayed correctly, as they
> do not include such information inside the file.

What prompted the change is that some browsers (Chrome and Safari) also
display other file types inline: text/x-chdr, text/x-java, text/x-objc,
text/x-sh, ...

At least on my system these files do get served with charset set!
(ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...)

    $ curl -I "https://git.blackdown.de/old.cgi?p=contactalbum.git;a=blob_plain;f=Classes/ContactAlbum.h;hb=HEAD";
    HTTP/1.1 200 OK
    Date: Tue, 09 Apr 2013 21:47:30 GMT
    Server: Apache
    Content-disposition: inline; filename="Classes/ContactAlbum.h"
    X-Frame-Options: SAMEORIGIN
    Vary: User-Agent
    X-UA-Compatible: IE=edge
    Strict-Transport-Security: max-age=31536000
    Content-Type: text/x-chdr; charset=ISO-8859-1

> 'text/html' and 'application/xhtml+xml' can include such information
> inside them (meta http-equiv for 'text/html' and <?xml ...> for
> 'application/xhtml+xml').  I don't know what browser does when there
> is conflicting information about charset, i.e. which one wins, but
> it is certainly something to consider.

As shown above, even without my patch, this already can happen!

> It might be a good change; I don't know what web browser do when
> serving 'text/css', 'text/javascript', 'text/xml' to client to
> view without media type known.
>
>
> BTW I have noticed that we do $prevent_xss dance outside
> blob_contenttype(), in it's only caller i.e. git_blob_plain()...
> which for example means that 'text/html' converted to 'text/plain'
> don't get '; charset=...' added.  I guess that it *might* be
> what prompted this part of change... but if it is so, it needs
> to be fixed at source, e.g. by moving $prevent_xss to
> blob_contenttype() subroutine.

Jep.

[...]

> Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
> that should be abandoned in favor of 'patch' view; but that is
> a separate issue) and 'patch' views so they use binary-safe output.
>
> Note that in all cases (I think) we use
>
>    $cgi->header(
>      -type => 'text/plain',
>      -charset => 'utf-8',
>      ...
>    );
>
> promising web browser that output is as whole in 'utf-8' encoding.

Yes.

> It is not explained in the commit message what is the reason for this
> change.  Is it better handing of a situation where files being diff-ed
> being in other encoding (like for example in commit that changes
> encoding of files from legacy encoding such like e.g. iso-8859-2
> or cp1250 to utf-8)?

I do see encoding problems when comparing utf8 to utf8 files (i.e. no
encoding change).  For instance:
https://git.blackdown.de/old.cgi?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

I don't claim to be an expert in Perl's utf8 handling but I guess this
because Perl's internal utf8 form differs from the normal utf8 out from
git commands.  Switching to :raw fixes that: We write plain utf8 (and,
as noted above, charset is set to utf8 already)

With the patch: https://git.blackdown.de/?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

> But what about -charset => 'utf-8'  then?

That's a good question.  I think I never tried git with anything besides
ISO-8859-1 (rarely), UTF8 (mostly), and UTF16 (some Xcode files).
(UTF16 definitely causes problems for gitweb.)

> About implementation: I think after this change common code crosses
> threshold for refactoring it into subroutine, for example:
>
>   sub dump_fh_raw {
>   	my $fh = shift;
>
>   	local $/ = undef;
>   	binmode STDOUT, ':raw';
>   	print <$fh>;
>   	binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>
>   	return $fh;
>   }

OK.

I'll submit an updated patch for the second part ('blobdiff_plain',
'commitdiff_plain', and 'plain') tomorrow.

Not sure how to proceed with the 'blob_plain' issue.


    Juergen

== 
https://blackdown.de/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]