Re: Effect of -finput-charset in gcc and phases of translation

esoteric escape <manips88@xxxxxxxxx> · Thu, 6 Jun 2019 07:05:09 +0530

Thanks! It was a little tricky to take in, especially when code pages are
involved.

So,

   - If -finput-charset and -fexec-charset are the same, then no conversion
   is performed. That is clear.
   - If neither -finput-charset nor -fexec-charset is set, then no conversion
   is performed either. It appears GCC cannot get the system locale under
   Windows, so -finput-charset defaults to UTF-8 and so does -fexec-charset,
   so again both are the same.
   - Only if just one of these is specified is a conversion performed,
   and -fexec-charset is always UTF-8.

Does that mean that if I use a Windows-1251 or a UTF-8 source file, with
neither of these options specified, then only raw bytes will be read and no
conversion will happen at all?

On Thu, Jun 6, 2019 at 5:02 AM Jonathan Wakely <jwakely.gcc@xxxxxxxxx>
wrote:

> On Wed, 5 Jun 2019 at 15:29, esoteric escape <manips88@xxxxxxxxx> wrote:
> >
> > Hello, I am trying to understand the phase of translation where the
> > option in gcc -finput-charset comes in effect as given here:
> >
> > https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#Preprocessor-Options
> >
> > > -finput-charset=charset
> > >
> > > Set the input character set, used for translation from the character
> > > set of the input file to the source character set used by GCC. If the
> > > locale does not specify, or GCC cannot get this information from the
> > > locale, the default is UTF-8. This can be overridden by either the
> > > locale or this command-line option. Currently the command-line option
> > > takes precedence if there’s a conflict. charset can be any encoding
> > > supported by the system’s iconv library routine.
> > >
> > For phases of translation, I looked at this article at
> > https://en.cppreference.com/w/cpp/language/translation_phases
> >
> > In Phase 1, it mentions:
> >
> > > The individual bytes of the source code file are mapped (in
> > > implementation-defined manner) to the characters of the basic source
> > > character set. In particular, OS-dependent end-of-line indicators are
> > > replaced by newline characters.
> > >
> > Then in Phase 5, it says that -finput-charset comes in effect.
> >
> > > Note: the conversion performed at this stage can be controlled by
> > > command line options in some implementations: gcc and clang use
> > > -finput-charset to specify the encoding of the source character set ...
> > >
> > To my understanding, in Phase 1, when compiler translates the source file
> > to basic source character set, the encoding specified by -finput-charset
> > should be already in effect. E.g., the encoding by default is UTF-8 on
> > GCC, then the source file is read using UTF-8.
>
> What does "read using UTF-8" mean?
>
> GCC decides how it reads the input. If the input-charset and
> exec-charset are the same, then a valid implementation strategy is to
> just read in raw bytes in Phase 1 and not alter them in any way, and
> then in Phase 5 perform no conversions (because the input characters
> are already in the execution character set).
>
> >
> > Why do they say that in Phase 5 that -finput-charset can be used to
> > specify the encoding at that stage? Since the characters were already
> > read in Phase 1 from source using UTF-8. Are they correct in this regard?
>
> Since phase 5 conversions are from one character set to another, the
> option that specifies the input character set is going to have an
> effect here. Whether the actual work happens in "Phase 1" or is
> postponed until "Phase 5" (or whether GCC actually works in separate
> phases at all, or does it differently producing the same results)
> doesn't matter. The fact is that the -finput-charset option changes
> the results of translation.
>