> ideally we should be able to say "function X takes non-UTF8 and
> works on it", "function Y takes UTF8 and works on it", and "function
> Z takes non-UTF8 and gives UTF8 data back" for each functions
> clearly, not "function W can take either UTF8 or any other garbage
> and tries to return UTF8".

Yes, I totally agree with that.

> Some codepaths in the resulting codeflow look even harder than they
> already are.  For example, format_rem_add_lines_pair() calls
> untabify() and then feeds its result to esc_html().

Honestly, I discovered this problem precisely through
format_rem_add_lines_pair().  In my environment
($fallback_encoding = 'cp932'), some commitdiffs show garbled text, but
only inside the color-words portions.  I have included steps to
reproduce at the end of this message.

After investigating, I concluded that format_rem_add_lines_pair()
applies split()/index()/substr()/etc. to raw blob contents, and that
this is what produces the funny characters.  These builtin functions
should be applied to decoded strings.

untabify() looks like the proper place to decode blob contents
beforehand, as it is definitely not meant to be used on binary contents
such as images and compressed snapshot files.

I'm sure using to_utf8() in untabify() fixes this problem.  In fact,
there is another, similar problem in the blame function, which assumes
that blob contents are UTF-8 encoded:

    binmode $fd, ':utf8';

Personally, I think text blob contents should be decoded as early as
possible and esc_html() should not contain to_utf8(), but the codeflow
is rather large and I couldn't rule out calls from somewhere else that
do not care about character encodings.

So yes, I would appreciate hearing your thoughts.

> Also, does it even "fix" the problem to use to_utf8() here in the
> first place?  Untabify is about aligning the character after a HT to
> multiple of 8 display position, so we'd want measure display width,
> which is not the same as either byte count or char count.
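To make the distinction between bytes, characters and display columns
concrete, here is a minimal sketch using the two strings from the
reproduction in this message.  It is written in Python purely for
illustration and is not gitweb's Perl code; common_prefix_len() and
display_width() are hypothetical helpers, and the width calculation is a
rough approximation that only counts East Asian Wide/Fullwidth
characters as two columns.

```python
import unicodedata

OLD = "モバイル"
NEW = "インスタント"


def common_prefix_len(a, b):
    """Length of the shared leading run of two sequences (bytes or str)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def display_width(text):
    """Rough column count: East Asian Wide/Fullwidth characters take 2 cells."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


old_raw = OLD.encode("cp932")
new_raw = NEW.encode("cp932")

# On raw bytes the two lines share a 1-byte prefix (the lead byte 0x83),
# so a byte-wise split cuts the first character of each line in half.
byte_prefix = common_prefix_len(old_raw, new_raw)
broken_tail = new_raw[byte_prefix:].decode("cp932", errors="replace")

# On decoded strings there is no common prefix, so nothing gets cut.
char_prefix = common_prefix_len(OLD, NEW)

print(byte_prefix, char_prefix, broken_tail)

# Three different measures of the same text: UTF-8 bytes, characters, columns.
print(len(OLD.encode("utf-8")), len(OLD), display_width(OLD))
```

The broken tail printed by the first line matches the garbled "+" line
in the output I got.  Splitting after decoding avoids that particular
breakage, while the second line shows that display width is still a
separate quantity from both byte count and character count.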
The following steps reproduce the problem:

    $ git --version
    git version 2.17.0
    $ mkdir test
    $ cd test
    $ git init
    $ echo 'モバイル' | iconv -f UTF-8 -t Shift_JIS > dummy
    $ git add .
    $ git commit -m 'init'
    $ echo 'インスタント' | iconv -f UTF-8 -t Shift_JIS > dummy
    $ git commit -am 'change'
    $ git instaweb
    $ echo 'our $fallback_encoding = "cp932";' >> .git/gitweb/gitweb_config.perl
    $ w3m -dump 'http://127.0.0.1:1234/?p=.git;a=commitdiff'

What I got:

    gitprojects / .git / commitdiff
    [commit ] ? search: [ ] [ ]re
    summary | shortlog | log | commit | commitdiff | tree
    raw | patch | inline | side by side (parent: 79e26fe)
    change master
    author Shin Kojima <shin@xxxxxxxxxx>
    Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    committer Shin Kojima <shin@xxxxxxxxxx>
    Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    dummy patch | blob | history
    diff --git a/dummy b/dummy
    index ac37f38..31fb96a 100644 (file)
    --- a/dummy
    +++ b/dummy
    @@ -1 +1 @@
    -coイル
    +Cンスタント
    Unnamed repository; edit this file 'description' to name the repository.
    RSS Atom

What I expected:

    gitprojects / .git / commitdiff
    [commit ] ? search: [ ] [ ]re
    summary | shortlog | log | commit | commitdiff | tree
    raw | patch | inline | side by side (parent: 79e26fe)
    change master
    author Shin Kojima <shin@xxxxxxxxxx>
    Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    committer Shin Kojima <shin@xxxxxxxxxx>
    Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    dummy patch | blob | history
    diff --git a/dummy b/dummy
    index ac37f38..31fb96a 100644 (file)
    --- a/dummy
    +++ b/dummy
    @@ -1 +1 @@
    -モバイル
    +インスタント
    Unnamed repository; edit this file 'description' to name the repository.
    RSS Atom

-- 
Shin Kojima