Re: [PATCH] gitweb: handle non UTF-8 text

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 1 Jun 2007 23:05:40 +0200

On Tue, 29 May 2007, Martin Koegler wrote:
> On Tue, May 29, 2007 at 11:21:11AM +0200, Jakub Narebski wrote:
>> On Tue, 29 May 2007, Petr Baudis wrote:
>>> On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:
>> 
>>>> gitweb assumes, that everything is in UTF-8. If a text contains invalid
>>>> UTF-8 character sequences, the text must be in a different encoding.
>> 
>> But it doesn't tell us _what_ the encoding is. For commit messages,
>> with reasonably new git, we have the 'encoding' header if git knew that
>> the commit message was not in utf-8.
>> 
>> By the way, I wonder why we don't have such a header for tag objects
>> (i18n.tagEncoding ;-)...
> 
> Why do I need to set i18n.commitEncoding on a normal Linux system?  We
> have a locale, which contains this information. With this, it's more
> likely that the commits can be read correctly later if somebody
> forgets to set "i18n.commitEncoding" in a repository.

Because a repository is (or at least can be) _shared_. People working on
the same repository can have different locales set, and the web server
running gitweb can have yet another locale.

>>>> This patch interprets such a text as latin1.
>> 
>> Meaning that it tries to recode text from latin1 (iso-8859-1) to utf-8
>> (not changing gitweb output encoding, which is utf-8).

And this (i.e. what "interprets" means) should be in the commit message
too.

>> It would be much better, and much easier at least for commit message
>> to add --encoding=utf-8 to git-rev-list / git-log invocation.
> 
> It does not help for old commits, where the encoding was not specified
> correctly. If my research is correct, the encoding handling was
> introduced at the end of 2006 and released this february.

True. But it _can_ help.

>>>> If a commit/blob/... is not in UTF-8, it displays the text
>>>> correctly with a very high probability.
>>
>> And I very much doubt this "very high probability to be
>> correct".
> 
> For normal text, this should be true:
> 
> We can divide ISO-8859-1 into some groups:
> a) 0x00-0x7f: shared with UTF-8
> b) 0x80-0xBF: continuation characters in UTF-8 (0x80-0x9F are control characters/unused)
> c) 0xC0-0xDF: start of a two byte UTF-8 character
> d) 0xE0-0xEF: start of a three byte UTF-8 character
> e) 0xF0-0xFF: start of other longer UTF-8 sequences
> 
> To misinterpret an ISO-8859-1 text as UTF-8, each character of class
> c/d/e must be followed by the correct number of characters of class b.
[cut]
> As gitweb processes one line of text at a time, a single UTF-8
> compatible combination has no effect if any other non UTF-8 compatible
> character sequence occurs in the same line.

Thanks for the explanation. In short: if the characters not shared with
UTF-8 (i.e. those outside US-ASCII, the "special characters") usually
occur solo, there is a low probability that a line in a non-UTF-8
encoding will be valid UTF-8. That is perhaps true for German and latin1
(aka iso-8859-1); not necessarily so for, say, Polish and iso-8859-2, see
  zażółć gęsią jaźń
which is a perfectly good fragment containing all Polish special
characters, and as you can see those characters occur one after another.
Well, it could still be an invalid UTF-8 sequence; but what about koi8r
and eucjp (or other non-UTF-8 encodings for Asian languages)?
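
A quick way to check such claims for a given language and charset is to
encode a sample line in the legacy charset and see whether the bytes pass
a strict UTF-8 decode. This is only a sketch in Python (not gitweb's
Perl); the helper name is_valid_utf8 and the sample strings are my own
assumptions for illustration, not anything from the patch.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample lines, each encoded in the legacy charset a committer might use.
SAMPLES = [
    ("iso-8859-1", "Grüße aus Köln"),
    ("iso-8859-2", "zażółć gęsią jaźń"),
    ("koi8-r", "привет мир"),
    # Contrived latin1 text: these two characters form a valid UTF-8 pair,
    # so the "invalid UTF-8 means legacy charset" heuristic misses it.
    ("iso-8859-1", "Ã©"),
]

for charset, text in SAMPLES:
    raw = text.encode(charset)
    verdict = "valid UTF-8" if is_valid_utf8(raw) else "not valid UTF-8"
    print(f"{charset:12} {raw!r}: {verdict}")
```

The Polish sample is rejected already at its first special character:
in iso-8859-2 the letter ż is byte 0xBF, a continuation byte with no
lead byte before it. Samples in other charsets (eucjp, for instance) can
be added to the list in the same way.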

> But I agree that there should be the possibility to choose the
> fallback encoding.

I think that for a start it would be enough to have

  # assume this charset if line contains non-UTF-8 characters
  our $fallback_encoding = "latin1";

or something like that (perhaps different wording in the comment,
perhaps different name of the variable) in the gitweb.perl for your
idea to be accepted.

That, and using to_utf8 (as before e3ad95a8) and not my_decode_utf8
as the subroutine name. If only it were possible to avoid what I think
is a quite costly "eval {....}" invocation...
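
To make the intended behaviour concrete, here is a minimal sketch of the
"try UTF-8 first, otherwise assume the fallback charset" logic. It is
written in Python purely for illustration; FALLBACK_ENCODING and this
to_utf8 are hypothetical stand-ins for the gitweb variable and subroutine
discussed above, not the actual Perl code.

```python
# Hypothetical counterpart of the proposed $fallback_encoding setting.
FALLBACK_ENCODING = "latin1"


def to_utf8(raw: bytes, fallback: str = FALLBACK_ENCODING) -> str:
    """Decode one line: strict UTF-8 first, else the fallback charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The line contains non-UTF-8 sequences; assume the fallback.
        return raw.decode(fallback)


print(to_utf8("zażółć".encode("utf-8")))  # already UTF-8, passed through
print(to_utf8(b"K\xf6ln"))                # latin1 bytes, recoded
```

With latin1 as the fallback the second decode can never fail, because
every byte value maps to some character; with another fallback charset
it could, and that case would need handling of its own.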

[cut]

There are six sources of possibly non-UTF-8 input: commits, tags,
trees (file names), blobs, gitweb files and results of system calls.

Only the first one, commits, comes with the encoding specified... if the
commit was made with new enough git, and if the committer specified the
encoding correctly. Commits are read using git-rev-list, which accepts
the --encoding parameter, so we can easily convert them to utf-8... if
git was compiled with iconv support. It is possible that, due to the
repository, gitweb user or global configuration (i18n.logOutputEncoding,
i18n.commitEncoding), this is done automatically. On the other hand, I
think commit messages are where it is easiest to end up with an
accidentally wrongly encoded sequence.
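
For illustration, this is roughly what honouring the 'encoding' header
of a raw commit object amounts to. It is a Python sketch under my own
assumptions (the function name decode_commit_message and the latin1
fallback are hypothetical); gitweb itself would rather let git do the
recoding via --encoding.

```python
def decode_commit_message(raw: bytes, fallback: str = "latin1") -> str:
    """Decode the message of a raw commit object.

    Uses the 'encoding' header if present, UTF-8 otherwise, and the
    fallback charset if that decode fails or the charset is unknown.
    """
    header, _, body = raw.partition(b"\n\n")
    encoding = "utf-8"
    for line in header.split(b"\n"):
        if line.startswith(b"encoding "):
            encoding = line[len(b"encoding "):].decode("ascii", "replace")
    try:
        return body.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or missing header (e.g. an old commit): use the fallback.
        return body.decode(fallback)


raw_commit = (
    b"tree 0000000000000000000000000000000000000000\n"
    b"author A U Thor <author@example.com> 1180731940 +0200\n"
    b"encoding ISO-8859-2\n"
    b"\n" + "zażółć gęsią jaźń\n".encode("iso-8859-2")
)
print(decode_commit_message(raw_commit))
```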

The second one, tags, really _should_ have an encoding header like
commits. On the other hand, the message is usually version + PGP
signature, so there is nothing in it that would need an encoding. Tags
are read using git-cat-file, which does not do any encoding/decoding.

Third, filenames in tree objects "suffer" from a git design decision:
for performance and simplicity git stores filenames in a tree 'as is',
and relies on the fact that filenames are the same in tree objects,
in the index (dircache), in the filesystem when saving, and as read
from the filesystem. Moreover, I think that the encoding of names on a
filesystem might depend on the filesystem in question and differ from
the encoding specified by the locale (locale is user local, filesystem
is global). On the other hand, one usually does not use special
characters in filenames because of the problems they cause.

Fourth, blobs (file contents). They can use a different encoding than
commit messages; moreover, different files can use different encodings.
The encoding has to be specified externally; there is no place for an
encoding header in the blob object structure.

Fifth, gitweb files include files that are read and transformed, such as
the GIT_DIR/description file or the projects index file $projects_list,
and files containing fragments of HTML, like README.html or header/footer
files.

Sixth, we sometimes have to decode to utf8 the results of system calls
like getpwuid (to get the owner of a file, i.e. of a project), or decode
to utf8 the path (fragment) to the repository.

There are two places to specify the gitweb output charset. The first is
the charset used in HTML output, which is also the default charset
(binmode) of the STDOUT stream. Gitweb uses utf-8 here, and utf-8 is
recommended for XML and XHTML by W3C, although we could theoretically add
an option to use a different charset by default, and decode (or not) to
this charset, instead of recoding everything (see above) to utf-8.

The second place is the default charset for text/plain blob_plain output:
  # default blob_plain mimetype and default charset for text/plain blob
  our $default_blob_plain_mimetype = 'text/plain';
  our $default_text_plain_charset  = undef;
and for other *_plain output, which is written as text/plain;
charset=utf-8 and which is actually dumped :raw to STDOUT.

So what should the solution be? Add global, per gitweb installation
configuration variables $input_encoding and $fallback_input_encoding?
What do you think? Do you have other ideas?

-- 
Jakub Narebski
Poland