[RFD] Handling of non-UTF8 data in gitweb

Jakub Narebski <jnareb@xxxxxxxxx> · Sun, 4 Dec 2011 17:09:30 +0100

Hello!

Currently gitweb converts the data it receives from git commands to
Perl's internal utf8 representation via the to_utf8() subroutine:

 # decode sequences of octets in utf8 into Perl's internal form,
 # which is utf-8 with utf8 flag set if needed.  gitweb writes out
 # in utf-8 thanks to "binmode STDOUT, ':utf8'" at beginning
 sub to_utf8 {
 	my $str = shift;
 	return undef unless defined $str;
 	if (utf8::valid($str)) {
 		utf8::decode($str);
 		return $str;
 	} else {
 		return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
 	}
 }

Each piece of data must be handled separately.  This is a quite
error-prone process, as can be seen from the number of patches that
fix handling of UTF-8 data (the latest from Jürgen).

It would be much, much simpler to force opening of all files
(including output pipes from git commands) in ':utf8' mode:

  use open qw(:std :utf8);

[Note: perhaps instead of ':utf8' it should be ':encoding(UTF-8)' 
 there...]
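
For illustration, here is a minimal sketch of what such a decoding
layer does to a single handle.  It applies ':encoding(UTF-8)'
explicitly to an in-memory handle instead of relying on the pragma, so
that it is self-contained; the sample string is made up and merely
stands in for a line of git output.

```perl
use strict;
use warnings;

# Hypothetical sample: "Jürgen\n" as raw UTF-8 octets, standing in for
# one line of output from a git command.
my $octets = "J\xC3\xBCrgen\n";

# Open an in-memory handle with the decoding layer applied explicitly;
# "use open" would make such a layer the default for handles opened in
# its lexical scope.
open(my $fh, '<:encoding(UTF-8)', \$octets)
	or die "cannot open in-memory handle: $!";
my $line = <$fh>;
close($fh);
chomp($line);

# $line now holds characters, not octets.
binmode(STDOUT, ':encoding(UTF-8)');
print length($line), " characters: $line\n";
```

With the layer in place no per-variable to_utf8() call is needed for
this handle; the data arrives already decoded.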

But doing this would change gitweb's behavior.  Currently, when
encountering something (usually a line of output) that is not valid
UTF-8, we decode it using $fallback_encoding, which is 'latin1' by
default.  Note however that this value is per gitweb installation,
not per repository.

Using "use open qw(:std :utf8);" would be like changing the value of 
$fallback_encoding to 'utf8' -- errors would be ignored, and characters 
which are invalid in UTF-8 encoding would get replaced[1] with 
substitution character '�' U+FFFD.
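
The difference can be seen by calling Encode directly.  This is only a
sketch: it uses decode() with Encode::FB_DEFAULT on a made-up latin1
byte string and does not go through a PerlIO layer, whose fallback
handling is configured separately.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "Jürgen" encoded in latin1, which is not valid
# UTF-8 because \xFC is not followed by continuation bytes.
my $octets = "J\xFCrgen";

# Treating the octets as UTF-8 replaces the malformed byte with U+FFFD.
my $as_utf8 = decode('UTF-8', $octets, Encode::FB_DEFAULT);

# Treating them as latin1 maps every octet to the same code point.
my $as_latin1 = decode('latin1', $octets, Encode::FB_DEFAULT);

binmode(STDOUT, ':encoding(UTF-8)');
print "decoded as UTF-8:  $as_utf8\n";
print "decoded as latin1: $as_latin1\n";
```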

At least for HTML output we could use Encode::FB_HTMLCREF handling
(which would produce &#NNN;) or Encode::FB_XMLCREF (which would
produce &#xHHHH;), but this must be done after HTML escaping...
and is probably not worth it.  (FYI this can be done by setting
$PerlIO::encoding::fallback to either of those values[2].)

[1] http://perldoc.perl.org/Encode.html#Handling-Malformed-Data
    http://p3rl.org/Encode
[2] http://perldoc.perl.org/PerlIO/encoding.html
    http://p3rl.org/PerlIO::encoding

I don't know if people are relying on the old behavior.  I guess
it could be emulated by defining our own 'utf-8-with-fallback'
encoding, or by defining our own PerlIO layer with PerlIO::via.
But it would no longer be a simple solution (though still automatic).

An alternative approach would be to audit the gitweb code, and call
to_utf8 before storing extracted output of a git command in a variable
(excluding safe types like SHA-1, filemode, timestamp and timezone).
The fact that to_utf8 is idempotent and can be called multiple times
would help here, I think.
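
To make the idempotency argument concrete, here is a sketch of a
decode-with-fallback helper that can safely be applied twice.  It is a
variant written for illustration, not the subroutine quoted at the top
of this mail: the name to_utf8_sketch is made up, and it tests
utf8::is_utf8() and the return value of utf8::decode() rather than
utf8::valid().

```perl
use strict;
use warnings;
use Encode qw(decode);

our $fallback_encoding = 'latin1';

# Hypothetical variant of to_utf8(), for illustration only.
sub to_utf8_sketch {
	my $str = shift;
	return undef unless defined $str;
	# Already a character string, or octets that decode in place as
	# UTF-8: nothing more to do.
	return $str if utf8::is_utf8($str) || utf8::decode($str);
	# Otherwise treat the octets as being in the fallback encoding.
	return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
}

# Made-up latin1 input; the second call must not change the result.
my $once  = to_utf8_sketch("J\xFCrgen");
my $twice = to_utf8_sketch($once);
print "second call left the string unchanged\n" if $once eq $twice;
```

Because a second call is harmless, an audit could add calls at every
point where git output is stored without worrying about double
decoding further down.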

The correct solution would of course be to respect the `gui.encoding`
per-repository config variable, and the `encoding` gitattribute...
though the latter is hampered by the fact that there is currently
no way to read an attribute with "git check-attr" from a given tree:
think of a diff that changes the encoding of a file!
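
A rough sketch of the config half of that idea follows.  Both helper
names are made up for illustration and nothing like them exists in
gitweb; the sketch assumes `git config --get gui.encoding` is run
against the repository, and it does not attempt the gitattribute part.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: ask git for the repository's gui.encoding.
# Returns undef when the variable is unset or git cannot be run.
sub repo_gui_encoding {
	my ($git_dir) = @_;
	open(my $fh, '-|', 'git', "--git-dir=$git_dir",
	     'config', '--get', 'gui.encoding')
		or return undef;
	my $name = <$fh>;
	close($fh);
	return undef unless defined $name;
	chomp($name);
	return $name;
}

# Hypothetical helper: decode with the repository's encoding if Encode
# knows it, otherwise with the installation-wide default.
sub decode_for_repo {
	my ($octets, $repo_encoding, $default) = @_;
	my $name = (defined $repo_encoding && find_encoding($repo_encoding))
		? $repo_encoding : $default;
	return decode($name, $octets, Encode::FB_DEFAULT);
}
```

For example, decode_for_repo($octets, repo_gui_encoding($git_dir),
$fallback_encoding) would decode with a per-repository encoding when
one is configured and known to Encode, and fall back to the
installation default otherwise.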

-- 
Jakub Narebski
Poland