Hello!

Currently gitweb converts data it receives from git commands to Perl's
internal utf8 representation via the to_utf8() subroutine:

    # decode sequences of octets in utf8 into Perl's internal form,
    # which is utf-8 with utf8 flag set if needed.  gitweb writes out
    # in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
    sub to_utf8 {
        my $str = shift;
        return undef unless defined $str;
        if (utf8::valid($str)) {
            utf8::decode($str);
            return $str;
        } else {
            return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
        }
    }

Each piece of data must be handled separately.  This is quite an
error-prone process, as can be seen from the number of patches that
fix handling of UTF-8 data (the latest from Jürgen).

It would be much, much simpler to force opening of all files (including
output pipes from git commands) in ':utf8' mode:

    use open qw(:std :utf8);

[Note: perhaps instead of ':utf8' it should be ':encoding(UTF-8)'
there...]

But doing this would change gitweb's behavior.  Currently, when
encountering something (usually a line of output) that is not valid
UTF-8, we decode it (to UTF-8) using $fallback_encoding, by default
'latin1'.  Note however that this value is per gitweb installation,
not per repository.

Using "use open qw(:std :utf8);" would be like changing the value of
$fallback_encoding to 'utf8' -- errors would be ignored, and characters
which are invalid in UTF-8 encoding would get replaced[1] with the
substitution character '�' U+FFFD.

Though at least for HTML output we could use Encode::FB_HTMLCREF
handling (which would produce &#NNN;) or Encode::FB_XMLCREF (which
would produce &#xHHHH;), though this must be done after HTML
escaping... and is probably not worth it.  (FYI this can be done by
setting $PerlIO::encoding::fallback to either of those values[2].)

[1] http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
    http://p3rl.org/Encode
[2] http://perldoc.perl.org/PerlIO/encoding.html
    http://p3rl.org/PerlIO::encoding

I don't know if people are relying on the old behavior.
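
To make the difference concrete, here is a minimal sketch (not gitweb
code).  decode_with_fallback() is a hypothetical helper that tries
strict UTF-8 first and then falls back to the given encoding, which is
the behavior $fallback_encoding = 'latin1' is meant to provide.  The
plain Encode::decode() call with FB_DEFAULT only approximates what a
substituting I/O layer would do, and the sample octets are made up.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (an assumption, not part of gitweb):
# strict UTF-8 first, then the fallback encoding.
sub decode_with_fallback {
    my ($octets, $fallback) = @_;
    return undef unless defined $octets;

    # decode() with a CHECK value may modify its argument, so use a copy.
    my $copy = $octets;
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;

    # Not valid UTF-8: reinterpret the original octets in the fallback encoding.
    return Encode::decode($fallback, $octets, Encode::FB_DEFAULT);
}

# "Jürgen" encoded as latin1; the octet 0xFC is not valid UTF-8.
my $latin1_octets = "J\xFCrgen";

# Default error handling: malformed data becomes the substitution character.
my $lossy = Encode::decode('UTF-8', $latin1_octets, Encode::FB_DEFAULT);

# Fallback handling: the octets are read as latin1 instead.
my $kept = decode_with_fallback($latin1_octets, 'latin1');

# Print code points rather than characters to keep the output ASCII-only.
for my $pair ([lossy => $lossy], [kept => $kept]) {
    my ($label, $str) = @$pair;
    my @points = map { sprintf('U+%04X', ord($_)) } split //, $str;
    print "$label: @points\n";
}
```

With the fallback the name survives as U+00FC; with plain substitution
the original octet is lost, which is the behavior change in question.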

I guess it could be emulated by defining our own 'utf-8-with-fallback'
encoding, or by defining our own PerlIO layer with PerlIO::via.  But it
would no longer be a simple solution (though still automatic).

An alternate approach would be to audit the gitweb code and call
to_utf8 before storing extracted output of a git command in a variable
(excluding safe types like SHA-1, filemode, timestamp and timezone).
The fact that to_utf8 is idempotent and can be called multiple times
would help here, I think.

The correct solution would of course be to respect the `gui.encoding`
per-repository config variable and the `encoding` gitattribute...
though the latter is hampered by the fact that there is currently no
way to read an attribute with "git check-attr" from a given tree: think
of a diff of a change of encoding of a file!

-- 
Jakub Narebski
Poland