Re: [PATCH v2 3/3] sequencer: reencode to utf-8 before arrange rebase's todo list

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 06 Nov 2019 10:30:34 +0900

Jeff King <peff@xxxxxxxx> writes:

> That's normally what we do. The only cases we're covering here are when
> somebody has explicitly asked that the commit object be stored in
> another encoding. Presumably they'd also be using a matching
> i18n.logOutputEncoding in that case, in which case logmsg_reencode()
> would be a noop. I think the only reasons to do that are:
>
>   1. You're stuck on some legacy encoding for your terminal. But in that
>      case, I think you'd still be better off storing utf-8 and
>      translating on the fly, since whatever encoding you do store is
>      baked into your objects for all time (so accept some slowness now,
>      but eventually move to utf-8).
>
>   2. Your preferred language is bigger in utf-8 than in some specific
>      encoding, and you'd rather save some bytes. I'm not sure how big a
>      deal this is, given that commit messages don't tend to be that big
>      in the first place (compared to trees and blobs). And the zlib
>      deflation on the result might help remove some of the redundancy,
>      too.

Perhaps add

    3. You are dealing with a project that originated on and was
       migrated from a foreign SCM, and older parts of the history are
       stored in a non-utf-8 encoding, even though recent history is
       in utf-8.

to the mix?
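
As a rough illustration of points 1 and 2 above, here is a toy sketch
in Python. It is not git's implementation: reencode() is a hypothetical
stand-in for the idea behind logmsg_reencode(), and the sample message
is made up. It shows that re-encoding is a no-op when the stored and
output encodings match, and lets you compare byte counts before and
after zlib deflation.

```python
import zlib

# Hypothetical sample commit message; not taken from any real repository.
message = "日本語のコミットメッセージです。文字コードの変換を確認します。\n"

utf8 = message.encode("utf-8")
sjis = message.encode("shift_jis")


def reencode(raw: bytes, from_enc: str, to_enc: str) -> bytes:
    """Toy model of re-encoding a log message (an assumption, not git code).

    Returns the input untouched when both encodings are the same,
    otherwise decodes and encodes again.
    """
    if from_enc.lower() == to_enc.lower():
        return raw  # matching encodings: nothing to do
    return raw.decode(from_enc).encode(to_enc)


# Translating a legacy-encoded message to utf-8 on the fly.
assert reencode(sjis, "shift_jis", "utf-8") == utf8

# Raw size versus deflated size for each encoding.
for name, raw in (("utf-8", utf8), ("shift_jis", sjis)):
    print(name, len(raw), len(zlib.compress(raw)))
```

For a message this short the deflated sizes say little; the point is
only that the raw size difference comes from the encoding, and that
compression is applied on top of either one.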

> The two-part user-format thing goes back to 7e77df39bf (pretty: two
> phase conversion for non utf-8 commits, 2013-04-19). It does seem like
> it would be cheaper to convert the format string into the output
> encoding (it would need to be an ascii superset, but that's already the
> case, since we expect to parse "author", etc out of the re-encoded
> commit object). But again, I have trouble caring too much about the
> performance of this case, as I consider it to be mostly legacy at this
> point. But I also don't write in (say) Japanese, so maybe I'm being too
> narrow-minded about whether people really want to avoid utf-8.

I suspect even the heavy Windows/Mac users in Japan have migrated
away from legacy encodings (the suspicion comes from an anecdote that
is off-topic here).