Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 09.04.2013 at 23:59, Jürgen Kreileder wrote:
> Jakub Narębski <jnareb@xxxxxxxxx> writes:
> 
>> On 08.04.2013, Junio C Hamano wrote:
>>> jk@xxxxxxxxxxxx (Jürgen Kreileder) writes:
>>>
>>>> Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. 
>>>>
>>>> Signed-off-by: Jürgen Kreileder <jk@xxxxxxxxxxxx>
>>
>> I see that this patch does (or tries to do) two _independent_ things,
>> and should be split into at least two separate commits:
[...]
>> First, it extends adding "; charset=$default_text_plain_charset" to
>> other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
>> mimetypes (without changing name of variable... though this is more
>> complicated as it is configuration variable and we would want to
>> preserve backward compatibility, but at least a comment would be,
>> I think, needed).
>>  
>> Originally it applied only to 'text/plain' files, which can be
>> displayed inline by web browser, and which need charset in
>> 'Content-Type:' HTTP header to be displayed correctly, as they
>> do not include such information inside the file.
> 
> What prompted the change is that some browsers (Chrome and Safari) also
> display other file types inline: text/x-chdr, text/x-java, text/x-objc,
> text/x-sh, ...
> 
> At least on my system these files do get served with charset set!
> (ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...)
[...]

I think this (the reason behind this change) should be explained in the
commit message.

>> 'text/html' and 'application/xhtml+xml' can include such information
>> inside them (meta http-equiv for 'text/html' and <?xml ...> for
>> 'application/xhtml+xml').  I don't know what browser does when there
>> is conflicting information about charset, i.e. which one wins, but
>> it is certainly something to consider.
> 
> As shown above, even without my patch, this already can happen!
> 
>> It might be a good change; I don't know what web browser do when
>> serving 'text/css', 'text/javascript', 'text/xml' to client to
>> view without media type known.

There are few options on how to handle this.

First, there is an issue of $default_text_plain_charset configuration
variable.  Any changes to its use (or adding new configuration
variables) should be accompanied with changing / adding comment near the
place where the default value is set, and changing / adding to
the documentation, namely gitweb.conf.txt

We can either:
1.) Use $default_text_plain_charset variable both for 'text/plain'
    and 'text/*' etc. content types in 'blob_plain' (aka 'raw') view
    for backwards compatibility, just add comment that it applies more
    than 'text/plain'

2.) Add new configuration variable, e.g. $default_blob_plain_charset
    and use it in place of $default_text_plain_charset as above,
    initializing it to $default_text_plain_charset.

    This practically renames $default_text_plain_charset preserving
    backward compatibility.  Documentation would talk now about new
    variable name.

3.) Add new configuration variable, e.g. $default_text_other_charset,
    or $default_blob_plain_charset, or $default_inline_charset, etc.
    which will be used for files other than 'text/plain'.  This would
    make it possible to turn this new feature on and off.

3.a.) New feature default to on, i.e. to $default_text_plain_charset
3.b.) New feature default to off, i.e. undef


Second, there is an issue of file types, like HTML or XML (or XHTML,
or sometimes for LaTeX), which have the information about encoding
embedded in file.

We can:

A.) Skip this issue, and always add charset for 'text/*' and
    '*/*-xml' files

B.) Threat those few media types in special way, excluding them from
    adding charset

C.) Make which types to be added charset configurable.


Anyway I think this feature is much less urgent.  It fixes an annoyance
rather than bug, as you can always choose which charset to use to
display content in a browser.  We can take time to implement it well,
especially that it interacts with config (and backward compatibility
of config variables).

>> BTW I have noticed that we do $prevent_xss dance outside
>> blob_contenttype(), in it's only caller i.e. git_blob_plain()...
>> which for example means that 'text/html' converted to 'text/plain'
>> don't get '; charset=...' added.  I guess that it *might* be
>> what prompted this part of change... but if it is so, it needs
>> to be fixed at source, e.g. by moving $prevent_xss to
>> blob_contenttype() subroutine.
> 
> Jep.

But I see this is yet another separate issue.

>> Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
>> that should be abandoned in favor of 'patch' view; but that is
>> a separate issue) and 'patch' views so they use binary-safe output.
>>
>> Note that in all cases (I think) we use
>>
>>    $cgi->header(
>>      -type => 'text/plain',
>>      -charset => 'utf-8',
>>      ...
>>    );
>>
>> promising web browser that output is as whole in 'utf-8' encoding.
> 
> Yes.
> 
>> It is not explained in the commit message what is the reason for this
>> change.  Is it better handing of a situation where files being diff-ed
>> being in other encoding (like for example in commit that changes
>> encoding of files from legacy encoding such like e.g. iso-8859-2
>> or cp1250 to utf-8)?
> 
> I do see encoding problems when comparing utf8 to utf8 files (i.e. no
> encoding change).

Ah, O.K.

>  For instance:
> https://git.blackdown.de/old.cgi?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

Nice to have online example, BTW.

> I don't claim to be an expert in Perl's utf8 handling but I guess this
> because Perl's internal utf8 form differs from the normal utf8 out from
> git commands.  Switching to :raw fixes that: We write plain utf8 (and,
> as noted above, charset is set to utf8 already)

Hmmm... I wonder if it is yet another case of concatenating content
marked as utf8 and not marked as utf8...

Anyway this fact that it is workaround, or that it fixes symptoms
of said issue, should be in the commit message (perhaps with TODO
explained).  This way we would be able to see it and perhaps fix
underlying issue, maybe with 'open' pragma, i.e.

  use open IN => ':encoding(utf-8)';

(though then we loose $fallback_encoding), or

  use open IN => ':encoding(utf-8-with-fallback)'

(which needs writing a Perl module for gitweb use).

> With the patch: https://git.blackdown.de/?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8
> 
>> But what about -charset => 'utf-8'  then?
> 
> That's a good question.  I think I never tried git with anything besides
> ISO-8859-1 (rarely), UTF8 (mostly), and UTF16 (some Xcode files).
> (UTF16 definitely causes problems for gitweb.)
> 
>> About implementation: I think after this change common code crosses
>> threshold for refactoring it into subroutine, for example:
>>
>>   sub dump_fh_raw {
>>   	my $fh = shift;
>>
>>   	local $/ = undef;
>>   	binmode STDOUT, ':raw';
>>   	print <$fh>;
>>   	binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
>>
>>   	return $fh;
>>   }
> 
> OK.
> 
> I'll submit an updated patch for the second part ('blobdiff_plain',
> 'commitdiff_plain', and 'plain') tomorrow.

Thanks in advance.
-- 
Jakub Narębski
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]