[Cc: authors of git-cvsserver]

On Tue, 29 May 2007, Petr Baudis wrote:
> On Mon, May 28, 2007 at 10:47:34PM CEST, Martin Koegler wrote:

>> gitweb assumes, that everything is in UTF-8. If a text contains invalid
>> UTF-8 character sequences, the text must be in a different encoding.

But it doesn't tell us _what_ the encoding is. For commit messages, with
a reasonably new git, we have the 'encoding' header if git knew that the
commit message was not in utf-8. By the way, I wonder why we don't have
such a header for tag objects (i18n.tagEncoding ;-)...

>> This patch interprets such a text as latin1.

Meaning that it tries to recode the text from latin1 (iso-8859-1) to
utf-8 (not changing the gitweb output encoding, which is utf-8).

It would be much better, and much easier, at least for commit messages,
to add --encoding=utf-8 to the git-rev-list / git-log invocation.

>> Signed-off-by: Martin Koegler <mkoegler@xxxxxxxxxxxxxxxxx>
>> ---
>> For correct UTF-8, the patch does not change anything.
>>
>> If commit/blob/... is not in UTF-8, it displays the text
>> with a very high probability correct.

That is: commit (with its 'encoding' header, and the `--encoding' option
we can use instead of doing it in gitweb, provided that git was compiled
with iconv support), tag (similar to commit, but IIRC without the
'encoding' header and the `--encoding' option), blob (with no place to
store the encoding) and pathname in tree (which can be in a different
encoding than the blob).

And I doubt this "very high probability to be correct" very much.

>> As git itself is not aware of any encoding, I know no better
>> possibility to handle non UTF-8 text in gitweb.
>
> I don't think this is a reasonable approach; I actually dispute the high
> probability - in western Europe it's obvious to assume latin1, but does
> majority of users using non-ascii characters come from there? Or rather
> from central Europe (like me, Petr Baudiš? ;-))? Somewhere else?
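
To make the dispute concrete, here is a minimal sketch of the fallback as I understand it from the description above. It is in Python rather than gitweb's Perl, decode_with_fallback is a hypothetical name, and reinterpreting the whole text (rather than, say, a single line) is my assumption, not something taken from the patch itself.

```python
def decode_with_fallback(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode as UTF-8; if that fails, reinterpret the whole text as `fallback`.

    Hypothetical helper: a sketch of the approach under discussion,
    not the code from the patch.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


name = "Baudi\u0161"  # "Baudis" with s-caron (U+0161) as the last letter

# Well-formed UTF-8 passes through unchanged.
utf8_ok = decode_with_fallback(name.encode("utf-8"))

# iso-8859-2 input is invalid UTF-8, so the fallback fires, but it guesses
# the wrong legacy encoding: byte 0xB9 is s-caron in latin2 and
# superscript one (U+00B9) in latin1.
latin2_guess = decode_with_fallback(name.encode("iso-8859-2"))

# latin1 text whose bytes happen to form valid UTF-8 never reaches the
# fallback: A-tilde + copyright sign are bytes C3 A9, which is e-acute in UTF-8.
latin1_missed = decode_with_fallback("\u00c3\u00a9".encode("latin-1"))
```

So the fallback is only right when the input really is latin1 and also happens to be invalid as utf-8; a name written in iso-8859-2 comes out with the wrong last character.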
I also don't think that hardcoding latin1 (iso-8859-1) as the default
alternate encoding is a good idea. I don't think using iso-8859-1
(outside us-ascii) is _nowadays_ that common. On the other hand I think
that not all users of koi8r, eucjp or iso-2022-jp converted (and can
convert) to utf-8; latin1 users can.

And using latin1 (or another encoding) _only_ when there is an invalid
utf-8 sequence is not a good idea either; I think that there are some
latin1 sequences outside us-ascii which are valid utf-8 sequences.
That kind of magic is wrong, wrong, wrong...

> If we do something like this, we should do it properly and look at
> configured i18n.commitEncoding for the project. (But as config lookup
> may be expensive, probably do it only when we need it.)

I think it would be best to make it into a %feature, overridable or not
(which would look at i18n.commitEncoding instead of at
gitweb.commitEncoding, but still a feature).

About config lookup: we can either "borrow" the config reading code in
Perl from git-cvsserver, perhaps by putting it into Git.pm, or we can at
last implement core git support for dumping the whole config in
unambiguous, machine parseable output: "git config --dump", e.g.

  key <LF> value <NUL>

or

  key <NUL>

(the second for "boolean" variables without a set value).

Having an alternate (read-only) config parser has its advantages and
disadvantages. The advantage is that we avoid fork+exec (performance),
and having two implementations is always good for getting the format
standardized. The disadvantage is that it is yet more code to maintain,
and that config parsing (even read-only config parsing) is a bit tricky
with the current git config file format.

-- 
Jakub Narebski
Poland
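
PS. A minimal sketch of how a consumer could read the dump format proposed above (key <LF> value <NUL>, or key <NUL> for a variable without a value). It is in Python for brevity, not Perl; parse_config_dump and the sample bytes are made up for illustration, and "git config --dump" itself is only a proposal here, so nothing below reflects an existing interface.

```python
from typing import Dict, List, Optional


def parse_config_dump(data: bytes) -> Dict[str, List[Optional[str]]]:
    """Parse records of the form key LF value NUL, or key NUL (no value).

    Hypothetical reader for the proposed dump format. Each key maps to a
    list because a config variable may be set more than once.
    """
    config: Dict[str, List[Optional[str]]] = {}
    for record in data.split(b"\0"):
        if not record:
            continue  # skip the empty chunk after the final NUL
        # Split on the first LF only: a value may itself contain newlines.
        key, sep, value = record.partition(b"\n")
        # No LF at all means the variable was set without a value.
        parsed = value.decode("utf-8", "replace") if sep else None
        config.setdefault(key.decode("utf-8", "replace"), []).append(parsed)
    return config


# Made-up sample: one ordinary variable, one valueless ("boolean")
# variable, and one value with an embedded newline.
sample = (
    b"i18n.commitencoding\niso-8859-2\0"
    b"core.bare\0"
    b"remote.origin.fetch\na\nb\0"
)
parsed = parse_config_dump(sample)
```

The point of the NUL terminator is that neither keys nor values can contain it, so the reader needs no quoting or escaping rules at all, which is what makes this so much simpler than parsing the config file format directly.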