Re: Effect of -finput-charset in gcc and phases of translation

Jonathan Wakely <jwakely.gcc@xxxxxxxxx> · Thu, 6 Jun 2019 00:32:10 +0100

On Wed, 5 Jun 2019 at 15:29, esoteric escape <manips88@xxxxxxxxx> wrote:
>
> Hello, I am trying to understand the phase of translation where the option
> in gcc -finput-charset comes in effect as given here:
> https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#Preprocessor-Options
>
> > -finput-charset=charset
> >
> > Set the input character set, used for translation from the character set of
> > the input file to the source character set used by GCC. If the locale does
> > not specify, or GCC cannot get this information from the locale, the
> > default is UTF-8. This can be overridden by either the locale or this
> > command-line option. Currently the command-line option takes precedence if
> > there’s a conflict. charset can be any encoding supported by the system’s
> > iconv library routine.
> >
> For phases of translation, I looked at this article at
> https://en.cppreference.com/w/cpp/language/translation_phases
>
> In Phase 1, it mentions:
>
> > The individual bytes of the source code file are mapped (in
> > implementation-defined manner) to the characters of the basic source
> > character set. In particular, OS-dependent end-of-line indicators are
> > replaced by newline characters.
> >
> Then in Phase 5, it says that -finput-charset comes in effect.
>
> > Note: the conversion performed at this stage can be controlled by command
> > line options in some implementations: gcc and clang use -finput-charset
> > to specify the encoding of the source character set ...
> >
> To my understanding, in Phase 1, when the compiler translates the source file
> to the basic source character set, the encoding specified by -finput-charset should
> be already in effect. E.g., the encoding by default is UTF-8 on GCC, then
> the source file is read using UTF-8.

What does "read using UTF-8" mean?

GCC decides how it reads the input. If the input-charset and
exec-charset are the same, then a valid implementation strategy is to
just read in raw bytes in Phase 1 and not alter them in any way, and
then in Phase 5 perform no conversions (because the input characters
are already in the execution character set).
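
As a rough illustration of that strategy (a sketch under stated
assumptions, not GCC's actual code): the helper name convert_bytes is
hypothetical, charset names are compared as exact strings, and the
conversion itself is delegated to the POSIX iconv interface, which is
what the quoted documentation says the option relies on. When both
names match, the bytes are copied through untouched.

```c
#include <assert.h>
#include <iconv.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from GCC: convert `len` bytes of `in`
 * from charset `from` to charset `to`, writing into `out`.
 * Returns the number of bytes written, or (size_t)-1 on failure.
 */
static size_t convert_bytes(const char *from, const char *to,
                            const char *in, size_t len,
                            char *out, size_t out_size)
{
    assert(from && to && in && out);

    /* Same charset on both sides: pass the raw bytes through unaltered. */
    if (strcmp(from, to) == 0) {
        if (len > out_size)
            return (size_t)-1;
        memcpy(out, in, len);
        return len;
    }

    /* Different charsets: let iconv do the work. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;          /* iconv takes a non-const pointer */
    char *dst = out;
    size_t in_left = len;
    size_t out_left = out_size;
    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - out_left;
}
```

Calling it with "UTF-8" for both names copies the input as is; calling
it with "ISO-8859-1" and "UTF-8" turns the single byte 0xE9 into the
two bytes 0xC3 0xA9. On some systems iconv lives in a separate library
and needs -liconv at link time.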

>
> Why do they say in Phase 5 that -finput-charset can be used to specify
> the encoding at that stage? Since the characters were already read in Phase
> 1 from source using UTF-8. Are they correct in this regard?

Since phase 5 conversions are from one character set to another, the
option that specifies the input character set is going to have an
effect here. Whether the actual work happens in "Phase 1" or is
postponed until "Phase 5" (or whether GCC actually works in separate
phases at all, or does things differently while producing the same results)
doesn't matter. The fact is that the -finput-charset option changes
the results of translation.
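
One way to observe this is to look at the bytes that end up in a
string literal. The following is only a sketch: it assumes the file is
saved as UTF-8 and compiled with GCC's default settings, and the names
sample and hex_bytes are made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One non-ASCII character; this file is assumed to be saved as UTF-8. */
static const char sample[] = "é";

/*
 * Write the bytes of `s` into `out` as uppercase hex pairs separated
 * by single spaces. Stops early if `out` is too small.
 */
static void hex_bytes(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL);
    if (out_size == 0)
        return;

    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; s[i] != '\0'; i++) {
        int n = snprintf(out + used, out_size - used, "%s%02X",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            break;                   /* buffer full: keep what fits */
        used += (size_t)n;
    }
}
```

Under those assumptions hex_bytes(sample, ...) yields "C3 A9", the
UTF-8 encoding of the character. If the same file is saved as
ISO-8859-1 instead and compiled with -finput-charset=ISO-8859-1, the
literal should still come out as "C3 A9", because the input is
converted before it reaches the execution character set. Compiling
the UTF-8 file with -fexec-charset=ISO-8859-1 should give "E9".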