Re: [PATCH] gitweb: Measure offsets against UTF-8 flagged string

Shin Kojima <shin@xxxxxxxxxx> · Wed, 2 May 2018 20:47:45 +0900

> ideally we should be able to say "function X takes non-UTF8 and
> works on it", "function Y takes UTF8 and works on it", and "function
> Z takes non-UTF8 and gives UTF8 data back" for each functions
> clearly, not "function W can take either UTF8 or any other garbage
> and tries to return UTF8".

Yes, I totally agree with that.

> Some codepaths in the resulting codeflow look even harder than they
> already are.  For example, format_rem_add_lines_pair() calls
> untabify() and then feeds its result to esc_html().

Honestly, I discovered this problem precisely in
format_rem_add_lines_pair().  In my environment ($fallback_encoding
= 'cp932'), some commitdiff pages show garbled text, but only inside
the color-words portions.

I added reproduction steps at the end of this message.

After investigating, I believe format_rem_add_lines_pair() applies
split()/index()/substr()/etc. to raw blob contents, which cuts
multibyte characters apart and produces the garbled characters.  These
builtin functions should be applied to decoded strings.
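
To make this concrete, here is a minimal standalone sketch (not gitweb
code) of what I believe happens.  It assumes the refinement step looks
for a common prefix between the removed and the added line;
common_prefix_len() is a hypothetical helper written only for this
illustration, and the two strings are the ones from the reproduction
steps at the end of this message.

```perl
use strict;
use warnings;
use utf8;                       # the string literals below are UTF-8 source
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: length of the common prefix of two strings,
# counted in whatever units Perl sees (bytes for raw data,
# characters for decoded strings).
sub common_prefix_len {
    my ($x, $y) = @_;
    my $n = 0;
    $n++ while $n < length($x) && $n < length($y)
            && substr($x, $n, 1) eq substr($y, $n, 1);
    return $n;
}

# Raw bytes, as they would be read from a cp932 blob.
my $old_raw = encode('cp932', 'モバイル');
my $new_raw = encode('cp932', 'インスタント');

# Each katakana here starts with the lead byte 0x83, so the byte-wise
# common prefix is 1 and the cut lands inside the first character.
my $raw_len  = common_prefix_len($old_raw, $new_raw);
my $old_tail = substr($old_raw, $raw_len);
my $new_tail = substr($new_raw, $raw_len);
my $old_garbled = decode('cp932', $old_tail);
my $new_garbled = decode('cp932', $new_tail);

# Decoding first makes substr() count characters instead of bytes.
my $old_str  = decode('cp932', $old_raw);
my $new_str  = decode('cp932', $new_raw);
my $char_len = common_prefix_len($old_str, $new_str);

print "byte-wise prefix: $raw_len\n";
print "  -$old_garbled\n  +$new_garbled\n";
print "char-wise prefix: $char_len\n";
print "  -", substr($old_str, $char_len), "\n";
print "  +", substr($new_str, $char_len), "\n";
```

The byte-wise tails decode to the same "ｃoイル" / "Cンスタント" pair
that appears in the garbled output at the end of this message, while
the character-wise comparison leaves both lines intact.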

untabify() looks like the proper place to me to decode blob contents
beforehand, as it is definitely not meant to be used on binary contents
such as images and compressed snapshot files.

I'm sure using to_utf8() in untabify() fixes this problem.  In fact,
there is a similar problem in the blame function, which assumes that
blob contents are UTF-8 encoded:

    binmode $fd, ':utf8';

Personally, I think text blob contents should be decoded as early as
possible and esc_html() should not call to_utf8(), but the codeflow is
rather large and I couldn't rule out calls from elsewhere that do not
care about character encodings.
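
As a rough sketch of what "decode as early as possible" could look
like: read the blob as raw bytes and decode each line once at that
boundary, so everything downstream only sees decoded strings.
decode_blob_line() is a hypothetical name, not an existing gitweb
function; it tries strict UTF-8 first and otherwise falls back to
$fallback_encoding, which is my understanding of the idea behind
to_utf8().  The in-memory filehandle only stands in for the pipe from
git.

```perl
use strict;
use warnings;
use Encode qw(decode);

our $fallback_encoding = 'cp932';   # same setting as in my environment

# Hypothetical boundary helper: raw bytes in, decoded string out.
sub decode_blob_line {
    my ($bytes) = @_;
    my $copy = $bytes;
    # utf8::decode() converts in place and returns false when the
    # bytes are not well-formed UTF-8, leaving $copy untouched.
    return $copy if utf8::decode($copy);
    return decode($fallback_encoding, $bytes);
}

# "モバイル\n" encoded as cp932, read without any decoding layer.
my $blob = "\x83\x82\x83\x6f\x83\x43\x83\x8b\n";
open my $fd, '<:raw', \$blob or die "open: $!";
my @lines = map { decode_blob_line($_) } <$fd>;
close $fd;

# The same helper also accepts UTF-8 input ("モ" is E3 83 A2).
my $from_utf8 = decode_blob_line("\xe3\x83\xa2");

binmode STDOUT, ':encoding(UTF-8)';
print "characters in first line: ", length($lines[0]), "\n";
print $lines[0], $from_utf8, "\n";
```

With this shape, helpers such as untabify() or esc_html() would not
need to guess whether their input has been decoded yet.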

So yes, I would appreciate hearing your thoughts.

> Also, does it even "fix" the problem to use to_utf8() here in the
> first place?  Untabify is about aligning the character after a HT to
> multiple of 8 display position, so we'd want measure display width,
> which is not the same as either byte count or char count.
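
For reference, the three measures do differ even on a two-character
string.  The sketch below is only an approximation: approx_width() is
a hypothetical helper that counts East Asian Wide and Fullwidth
characters as two columns and everything else as one, ignoring
combining characters, control characters and ambiguous-width
characters.

```perl
use strict;
use warnings;
use utf8;                       # the string literal below is UTF-8 source
use Encode qw(encode);

# Hypothetical helper: rough number of terminal columns.
sub approx_width {
    my ($str) = @_;
    # Count the characters that occupy two columns.
    my $wide = () = $str =~
        /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/g;
    return length($str) + $wide;
}

my $text = 'モバ';

my $cp932_bytes = length(encode('cp932', $text));
my $utf8_bytes  = length(encode('UTF-8', $text));
my $chars       = length($text);
my $columns     = approx_width($text);

printf "cp932 bytes: %d\n", $cp932_bytes;
printf "UTF-8 bytes: %d\n", $utf8_bytes;
printf "characters:  %d\n", $chars;
printf "columns:     %d\n", $columns;
```

Decoding turns a byte count into a character count; it does not by
itself give the column count.  For this cp932 input the byte count
and the column count happen to coincide, which is not the case for
the UTF-8 encoding of the same text.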

The following steps reproduce the problem:

    $ git --version
        git version 2.17.0

    $ mkdir test
    $ cd test
    $ git init
    $ echo 'モバイル' | iconv -f UTF-8 -t Shift_JIS > dummy
    $ git add .
    $ git commit -m 'init'
    $ echo 'インスタント' | iconv -f UTF-8 -t Shift_JIS > dummy
    $ git commit -am 'change'
    $ git instaweb
    $ echo 'our $fallback_encoding = "cp932";' >> .git/gitweb/gitweb_config.perl
    $ w3m -dump 'http://127.0.0.1:1234/?p=.git;a=commitdiff'

What I got:

    gitprojects / .git / commitdiff
    [commit   ] ? search: [                    ] [ ]re
    summary | shortlog | log | commit | commitdiff | tree
    raw | patch | inline | side by side (parent: 79e26fe)
    change master

    author    Shin Kojima <shin@xxxxxxxxxx>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    committer Shin Kojima <shin@xxxxxxxxxx>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)

    dummy  patch | blob | history

    diff --git a/dummy b/dummy
    index ac37f38..31fb96a 100644 (file)
    --- a/dummy
    +++ b/dummy
    @@ -1 +1 @@
    -ｃoイル
    +Cンスタント
    Unnamed repository; edit this file 'description' to name the repository.
    RSS Atom

What I expected:

    gitprojects / .git / commitdiff
    [commit   ] ? search: [                    ] [ ]re
    summary | shortlog | log | commit | commitdiff | tree
    raw | patch | inline | side by side (parent: 79e26fe)
    change master

    author    Shin Kojima <shin@xxxxxxxxxx>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)
    committer Shin Kojima <shin@xxxxxxxxxx>
              Wed, 2 May 2018 10:55:01 +0000 (19:55 +0900)

    dummy  patch | blob | history

    diff --git a/dummy b/dummy
    index ac37f38..31fb96a 100644 (file)
    --- a/dummy
    +++ b/dummy
    @@ -1 +1 @@
    -モバイル
    +インスタント
    Unnamed repository; edit this file 'description' to name the repository.
    RSS Atom

-- 
Shin Kojima