On 08.04.2013, Junio C Hamano wrote: > jk@xxxxxxxxxxxx (Jürgen Kreileder) writes: > >> Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. >> >> Signed-off-by: Jürgen Kreileder <jk@xxxxxxxxxxxx> I see that this patch does (or tries to do) two _independent_ things, and should be split into at least two separate commits: >> --- > > Thanks, will queue but not hold until I hear something from Jakub. > >> gitweb/gitweb.perl | 8 +++++++- >> 1 file changed, 7 insertions(+), 1 deletion(-) >> >> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl >> index 1309196..9cfe5b5 100755 >> --- a/gitweb/gitweb.perl >> +++ b/gitweb/gitweb.perl >> @@ -3823,7 +3823,7 @@ sub blob_contenttype { >> my ($fd, $file_name, $type) = @_; >> >> $type ||= blob_mimetype($fd, $file_name); >> - if ($type eq 'text/plain' && defined $default_text_plain_charset) { >> + if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { >> $type .= "; charset=$default_text_plain_charset"; >> } First, it extends adding "; charset=$default_text_plain_charset" to other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml' mimetypes (without changing name of variable... though this is more complicated as it is configuration variable and we would want to preserve backward compatibility, but at least a comment would be, I think, needed). Originally it applied only to 'text/plain' files, which can be displayed inline by web browser, and which need charset in 'Content-Type:' HTTP header to be displayed correctly, as they do not include such information inside the file. 'text/html' and 'application/xhtml+xml' can include such information inside them (meta http-equiv for 'text/html' and <?xml ...> for 'application/xhtml+xml'). I don't know what browser does when there is conflicting information about charset, i.e. which one wins, but it is certainly something to consider. It might be a good change; I don't know what web browser do when serving 'text/css', 'text/javascript', 'text/xml' to client to view without media type known. BTW I have noticed that we do $prevent_xss dance outside blob_contenttype(), in it's only caller i.e. git_blob_plain()... which for example means that 'text/html' converted to 'text/plain' don't get '; charset=...' added. I guess that it *might* be what prompted this part of change... but if it is so, it needs to be fixed at source, e.g. by moving $prevent_xss to blob_contenttype() subroutine. About implementation: >> + if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && defined $default_text_plain_charset) { >> $type .= "; charset=$default_text_plain_charset"; would be better with line split >> + if (($type =~ m!^text/\w[-\w]*$! || $type =~ m!^\w[-\w]*/\w[-\w]*\+xml$!) && >> + defined $default_text_plain_charset) { >> $type .= "; charset=$default_text_plain_charset"; Also: do we need to be strict with '\w[-\w]*', or would '.*' suffice? >> @@ -7637,7 +7637,9 @@ sub git_blobdiff { >> last if $line =~ m!^\+\+\+!; >> } >> local $/ = undef; >> + binmode STDOUT, ':raw'; >> print <$fd>; >> + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi >> close $fd; >> } >> } >> @@ -7884,12 +7886,16 @@ sub git_commitdiff { >> >> } elsif ($format eq 'plain') { >> local $/ = undef; >> + binmode STDOUT, ':raw'; >> print <$fd>; >> + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi >> close $fd >> or print "Reading git-diff-tree failed\n"; >> } elsif ($format eq 'patch') { >> local $/ = undef; >> + binmode STDOUT, ':raw'; >> print <$fd>; >> + binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi >> close $fd >> or print "Reading git-format-patch failed\n"; >> } Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think that should be abandoned in favor of 'patch' view; but that is a separate issue) and 'patch' views so they use binary-safe output. Note that in all cases (I think) we use $cgi->header( -type => 'text/plain', -charset => 'utf-8', ... ); promising web browser that output is as whole in 'utf-8' encoding. It is not explained in the commit message what is the reason for this change. Is it better handing of a situation where files being diff-ed being in other encoding (like for example in commit that changes encoding of files from legacy encoding such like e.g. iso-8859-2 or cp1250 to utf-8)? But what about -charset => 'utf-8' then? About implementation: I think after this change common code crosses threshold for refactoring it into subroutine, for example: sub dump_fh_raw { my $fh = shift; local $/ = undef; binmode STDOUT, ':raw'; print <$fh>; binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi return $fh; } -- Jakub Narębski -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html