Jakub Narębski <jnareb@xxxxxxxxx> writes: > On 08.04.2013, Junio C Hamano wrote: >> jk@xxxxxxxxxxxx (Jürgen Kreileder) writes: >> >>> Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. >>> >>> Signed-off-by: Jürgen Kreileder <jk@xxxxxxxxxxxx> > > I see that this patch does (or tries to do) two _independent_ things, > and should be split into at least two separate commits: Agreed. >>> --- >> >> Thanks, will queue but not hold until I hear something from Jakub. >> >>> gitweb/gitweb.perl | 8 +++++++- >>> 1 file changed, 7 insertions(+), 1 deletion(-) >>> >>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl >>> index 1309196..9cfe5b5 100755 >>> --- a/gitweb/gitweb.perl >>> +++ b/gitweb/gitweb.perl >>> @@ -3823,7 +3823,7 @@ sub blob_contenttype { >>> my ($fd, $file_name, $type) = @_; >>> >>> $type ||= blob_mimetype($fd, $file_name); >>> - if ($type eq 'text/plain' && defined $default_text_plain_charset) { >>> + if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { >>> $type .= "; charset=$default_text_plain_charset"; >>> } > > First, it extends adding "; charset=$default_text_plain_charset" to > other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml' > mimetypes (without changing name of variable... though this is more > complicated as it is configuration variable and we would want to > preserve backward compatibility, but at least a comment would be, > I think, needed). > > Originally it applied only to 'text/plain' files, which can be > displayed inline by web browser, and which need charset in > 'Content-Type:' HTTP header to be displayed correctly, as they > do not include such information inside the file. What prompted the change is that some browsers (Chrome and Safari) also display other file types inline: text/x-chdr, text/x-java, text/x-objc, text/x-sh, ... At least on my system these files do get served with charset set! (ISO-8859-1 even though Apache has "AddDefaultCharset UTF-8"...) $ curl -I "https://git.blackdown.de/old.cgi?p=contactalbum.git;a=blob_plain;f=Classes/ContactAlbum.h;hb=HEAD" HTTP/1.1 200 OK Date: Tue, 09 Apr 2013 21:47:30 GMT Server: Apache Content-disposition: inline; filename="Classes/ContactAlbum.h" X-Frame-Options: SAMEORIGIN Vary: User-Agent X-UA-Compatible: IE=edge Strict-Transport-Security: max-age=31536000 Content-Type: text/x-chdr; charset=ISO-8859-1 > 'text/html' and 'application/xhtml+xml' can include such information > inside them (meta http-equiv for 'text/html' and <?xml ...> for > 'application/xhtml+xml'). I don't know what browser does when there > is conflicting information about charset, i.e. which one wins, but > it is certainly something to consider. As shown above, even without my patch, this already can happen! > It might be a good change; I don't know what web browser do when > serving 'text/css', 'text/javascript', 'text/xml' to client to > view without media type known. > > > BTW I have noticed that we do $prevent_xss dance outside > blob_contenttype(), in it's only caller i.e. git_blob_plain()... > which for example means that 'text/html' converted to 'text/plain' > don't get '; charset=...' added. I guess that it *might* be > what prompted this part of change... but if it is so, it needs > to be fixed at source, e.g. by moving $prevent_xss to > blob_contenttype() subroutine. Jep. [...] > Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think > that should be abandoned in favor of 'patch' view; but that is > a separate issue) and 'patch' views so they use binary-safe output. > > Note that in all cases (I think) we use > > $cgi->header( > -type => 'text/plain', > -charset => 'utf-8', > ... > ); > > promising web browser that output is as whole in 'utf-8' encoding. Yes. > It is not explained in the commit message what is the reason for this > change. Is it better handing of a situation where files being diff-ed > being in other encoding (like for example in commit that changes > encoding of files from legacy encoding such like e.g. iso-8859-2 > or cp1250 to utf-8)? I do see encoding problems when comparing utf8 to utf8 files (i.e. no encoding change). For instance: https://git.blackdown.de/old.cgi?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8 I don't claim to be an expert in Perl's utf8 handling but I guess this because Perl's internal utf8 form differs from the normal utf8 out from git commands. Switching to :raw fixes that: We write plain utf8 (and, as noted above, charset is set to utf8 already) With the patch: https://git.blackdown.de/?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8 > But what about -charset => 'utf-8' then? That's a good question. I think I never tried git with anything besides ISO-8859-1 (rarely), UTF8 (mostly), and UTF16 (some Xcode files). (UTF16 definitely causes problems for gitweb.) > About implementation: I think after this change common code crosses > threshold for refactoring it into subroutine, for example: > > sub dump_fh_raw { > my $fh = shift; > > local $/ = undef; > binmode STDOUT, ':raw'; > print <$fh>; > binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi > > return $fh; > } OK. I'll submit an updated patch for the second part ('blobdiff_plain', 'commitdiff_plain', and 'plain') tomorrow. Not sure how to proceed with the 'blob_plain' issue. Juergen == https://blackdown.de/ -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html