Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list

Danh Doan <congdanhqx@xxxxxxxxx> · Sat, 2 Nov 2019 08:02:15 +0700

On 2019-11-01 12:59:21 -0400, Jeff King wrote:
> On Fri, Nov 01, 2019 at 03:25:11PM +0700, Doan Tran Cong Danh wrote:
> 
> > for encoding in utf-8 iso-8859-1; do
> > 	# commit using the encoding
> > 	echo $encoding >file && git add file
> > 	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
> > 	  git -c i18n.commitEncoding=$encoding commit -F -
> > 	# and then fixup without it
> > 	echo "$encoding fixed" >file && git add file
> > 	git commit --fixup HEAD
> > done
> > git rebase -i --autosquash --root
> 
> Is it worth adding this as a test in t3900?

I think yes, but with a little more work.
I'll make it as a separated patch in a re-roll.

> >  		parse_commit(item->commit);
> > -		commit_buffer = get_commit_buffer(item->commit, NULL);
> > +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
> 
> I think there are several other spots in this file that could use the
> same treatment. But I can live with it if you want to just fix the one
> that's bugging you and move on. It's still a strict improvement.

There're 6 more occurence of get_commit_buffer in sequencer.c, and 13
occurences in other C source files. I'll try to figure out if it's
safe to change.

Anyway, if we're going to working with a single encoding internally,
can we take other extreme approach: reencode the commit message to
utf-8 before writing the commit object? (Is there any codepoint in
other encoding that can't be reencoded to utf-8?)
Since git-log and friends are doing 2 steps conversion for commit
message for now (reencode to utf-8 first, then reencode again to
get_log_output_encoding()). With this new approach, first step is
likely a noop (but must be kept for backward compatible).

-- 
Danh