Re: [PATCH 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list

Danh Doan <congdanhqx@xxxxxxxxx> · Fri, 1 Nov 2019 11:49:49 +0700

On 2019-10-31 15:26:50 -0400, Jeff King wrote:
> I'm confused about a few things here, though. I agree with you that the
> subjects here are only used for finding the fixup/squash relationships.
> But I don't understand the musl connection.

You're right.

The problem is not specific to musl; musl's iconv implementation only
makes it show up earlier.

> Wouldn't failure to reencode here always be a problem? E.g., if I do:
> 
>   for encoding in utf-8 iso-8859-1; do
>     # commit using the encoding
>     echo $encoding >file && git add file
>     echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
>       git -c i18n.commitEncoding=$encoding commit -F -
>     # and then fixup without it
>     echo "$encoding fixed" >file && git add file
>     git commit --fixup HEAD
>   done
>   
>   GIT_EDITOR='echo; grep -v ^#' git rebase -i --root --autosquash
> 
> then the resulting todo-list output (on my glibc system) is:
> 
>   pick 3a5bace éñcödèd with utf-8
>   fixup aa9f09c fixup! éñcödèd with utf-8
>   pick 6e85d32 éñcödèd with iso-8859-1
>   pick 3ceac05 fixup! éñcödèd with iso-8859-1
> 
> I.e., we don't actually match up the second pair, and I think we
> probably ought to.

Yes, we ought to match up the second pair, and after changing
get_commit_buffer to logmsg_reencode, we do.
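
For illustration, here is a minimal sketch of why the raw comparison
fails. It uses only printf, iconv, sed and cmp, not git's actual code.
It assumes, as in your example, that the fixup commit's subject is
stored as UTF-8 while the target commit's subject is stored as
ISO-8859-1. The same_subject helper and the file names are made up
for this sketch.

```shell
# Illustration only: no git involved, the helper and file names are hypothetical.
tmp=$(mktemp -d)

# "fixup! café" stored as UTF-8 (\303\251 is "é" in UTF-8).
printf 'fixup! caf\303\251\n' >"$tmp/fixup"
# "café" stored as ISO-8859-1, as the target commit would keep it.
printf 'caf\303\251\n' | iconv -f UTF-8 -t ISO-8859-1 >"$tmp/subject"

same_subject() {
    # $1: file with the fixup subject, $2: file with the target subject
    sed 's/^fixup! //' "$1" | cmp -s - "$2"
}

# Compare the bytes as stored: the encodings differ, so this fails.
if same_subject "$tmp/fixup" "$tmp/subject"; then raw=match; else raw=differ; fi

# Re-encode the target subject to UTF-8 first, then compare again.
iconv -f ISO-8859-1 -t UTF-8 "$tmp/subject" >"$tmp/subject.utf8"
if same_subject "$tmp/fixup" "$tmp/subject.utf8"; then reenc=match; else reenc=differ; fi

echo "raw: $raw, reencoded: $reenc"
rm -rf "$tmp"
```

The first comparison differs and the second matches, which is the same
effect as bringing both subjects to one encoding before looking for
fixup/squash pairs.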

> 
> I guess the test in t3900 is less exotic; it uses the same encoding for
> both commits. And it's just that "foo" and "fixup! foo" can (and do in
> musl) end up with different encodings (because of the specific language,
> and the vagaries of each iconv implementation).
> 
> Would we have similar problems in all of the other functions which use
> get_commit_buffer() without reencoding? For instance if I do this:
> 
>   echo base >file && git add file && git commit -m base
>   for encoding in utf-8 iso-8859-1; do
>     echo $encoding >file && git add file
>     echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
>       git -c i18n.commitEncoding=$encoding commit -F -
>   done
>   git checkout -b side HEAD~2
>   git cherry-pick master master^
>   cat .git/sequencer/todo
> 
> then the resulting todo file has a mix of iso-8859-1 and utf-8.
> 
> It seems to me that we should always be working with the subjects in a
> single encoding internally,

I'm in favour of this idea.

> and likewise outputting in that format
> (which should probably be git_log_output_encoding(), for the instances
> where we show it to the user).

This is git's current behaviour, but the function is named
get_log_output_encoding(), not git_log_output_encoding().

> I.e., we should always call logmsg_reencode() instead of
> get_commit_buffer().

-- 
Danh