Re: [PATCH v2] pretty-options.txt: describe supported encoding

Jeff King <peff@xxxxxxxx> · Fri, 27 Aug 2021 14:03:39 -0400

On Fri, Aug 27, 2021 at 10:03:56AM -0700, Junio C Hamano wrote:

> > +       The encoding must be a system encoding supported by iconv(1),
> > +       otherwise this option will be ignored.
> > +       POSIX character maps used by iconv(1p) are not supported.
> 
> This paragraph is a bit hard to grok.
> 
> I think it is saying that the "-f frommap -t tomap" form in [*1*]
> that can use arbitrary character set description file is not
> supported, but "-f fromcode -t tocode" form, which also is what
> iconv_open() takes [*2*], is supported.  Am I reading it correctly?
> 
> Is there an easier-to-read way to explain the distinction to our
> average reader?
> 
> What I am getting at is this.  Imagine average users who need to see
> their commits recoded to iso-8859-2.  They see "git log" has
> "--encoding=<encoding>" option, read the above paragraph and wonder
> if they are on the supported side or unsupported side of the above
> paragraph.  I want to make it easy for them to stop wondering.
> 
> For that purpose, "iconv(1) vs iconv(1p)" would not help them very
> much, especially considering that not all Git users are UNIX users
> (they probably do not even know what (1) and (1p) means).

I likewise found the mention of character maps confusing. If we were to
refer to anything, it would be iconv(3) or iconv_open(3). But really,
all of the discussion that led to this patch seemed to be about the
distinction between "character set conversion" (or "character encoding",
or "codeset conversion", all terms used by the POSIX pages) and the
syntactic encoding of HTML.

Is there any version of iconv that would convert "<" to "&lt;"?

I guess that _conceptually_ one could think of that as a multi-byte
character conversion, but it seems to me that it is generally considered
a layer above (after all, the original "<" and the characters in the HTML
entity both have to be in some character encoding; generally ASCII, but
I think you could have UTF-16 HTML, too).
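
To make the layering concrete, here is a rough sketch in Python (not
anything Git does; the sample string and variable names are made up for
illustration). Escaping rewrites characters, while the character encoding
only decides which bytes represent those characters:

```python
import html

text = "a < b"

# Layer 1: syntactic (HTML) escaping. This works on characters, not bytes.
escaped = html.escape(text)

# Layer 2: character encoding. This turns the characters into bytes.
as_utf8 = escaped.encode("utf-8")
as_utf16 = escaped.encode("utf-16-le")

# Changing the character encoding changes the bytes, but it never adds
# or removes the entity: decoding gives back the same escaped characters.
assert as_utf8.decode("utf-8") == escaped
assert as_utf16.decode("utf-16-le") == escaped

# Conversely, a plain encode/decode round trip leaves "<" alone.
assert text.encode("utf-8").decode("utf-8") == text
```

The escaped text can be stored in either encoding, which is why I'd call
the entity handling a separate layer rather than a conversion.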

What I'm getting at is that maybe we just need to use a less generic
word than "encoding". Perhaps just s/encoding/character &/ or something?
And maybe add something like:

  Conversions are done using the system iconv(3) function. The set of
  available encodings will depend on your system.

You _can_ use "iconv -l" to get such a list on many systems, but it is
not even necessarily the same list.
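
If somebody wants to probe their system rather than read a list, one
rough approach is to ask the iconv utility to do an empty conversion and
look at the exit status. This is only a sketch: the helper name
iconv_cli_supports is made up, it assumes an "iconv" program on the PATH
that takes -f and -t, and it tests the command-line tool, which (as said
above) may not agree with the iconv(3) library that Git was built
against:

```python
import subprocess

def iconv_cli_supports(name, source="UTF-8"):
    """Hypothetical helper: probe the iconv utility for an encoding name.

    Returns True or False depending on whether the utility accepts a
    conversion from `source` to `name`, or None if no iconv program is
    installed at all.
    """
    try:
        result = subprocess.run(
            ["iconv", "-f", source, "-t", name],
            input=b"",                    # empty input: only the names matter
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return None                       # no iconv utility on this system
    return result.returncode == 0

if __name__ == "__main__":
    for candidate in ("ISO-8859-2", "no-such-charset-xyz"):
        print(candidate, iconv_cli_supports(candidate))
```

A True here is a hint, not a guarantee, that "git log --encoding=<name>"
will do the conversion.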

I also wonder if other mentions of encoding would want to use the same
term (e.g., gitattributes working-tree-encoding), and of course
i18n.commitEncoding (though peeking at the latter, it seems to already
say "Character encoding", so maybe that is sufficient).

-Peff