On Wed, 5 Jun 2019 at 15:29, esoteric escape <manips88@xxxxxxxxx> wrote:
>
> Hello, I am trying to understand the phase of translation where the option
> in gcc -finput-charset comes in effect as given here:
> https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#Preprocessor-Options
>
> > -finput-charset=charset
>
> > Set the input character set, used for translation from the character set of
> > the input file to the source character set used by GCC. If the locale does
> > not specify, or GCC cannot get this information from the locale, the
> > default is UTF-8. This can be overridden by either the locale or this
> > command-line option. Currently the command-line option takes precedence if
> > there’s a conflict. charset can be any encoding supported by the system’s
> > iconv library routine.
>
>
> For phases of translation, I looked at this article at
> https://en.cppreference.com/w/cpp/language/translation_phases
>
> In Phase 1, it mentions:
> > The individual bytes of the source code file are mapped (in
> > implementation-defined manner) to the characters of the basic source
> > character set. In particular, OS-dependent end-of-line indicators are
> > replaced by newline characters.
>
>
> Then in Phase 5, it says that -finput-charset comes in effect.
> > Note: the conversion performed at this stage can be controlled by command
> > line options in some implementations: gcc and clang use -finput-charset
> > to specify the encoding of the source character set ...
>
>
> To my understanding, in Phase 1, when compiler translates the source file
> to basic source character set, the encoding specified by -finput-charset should
> be already in effect. E.g., the encoding by default is UTF-8 on GCC, then
> the source file is read using UTF-8.

What does "read using UTF-8" mean? GCC decides how it reads the input.
If the input-charset and exec-charset are the same, then a valid
implementation strategy is to just read in raw bytes in Phase 1 and not
alter them in any way, and then in Phase 5 perform no conversions
(because the input characters are already in the execution character
set).

>
> Why do they say that in Phase 5 that -finput-charset can be used to specify
> the encoding at that stage? Since the characters were already read in Phase
> 1 from source using UTF-8. Are they correct in this regard?

Since phase 5 conversions are from one character set to another, the
option that specifies the input character set is going to have an
effect here.

Whether the actual work happens in "Phase 1" or is postponed until
"Phase 5" (or whether GCC actually works in separate phases at all, or
does it differently producing the same results) doesn't matter. The
fact is that the -finput-charset option changes the results of
translation.
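
To make that concrete, here is a small Python sketch of the idea. It is
only a toy model and not how GCC is implemented: the function name
translate_literal and its parameters are made up for illustration, and
Python's codecs stand in for iconv. It treats the whole translation as
"bytes of a string literal in the file" going to "bytes of that literal
in the program", and shows that the result depends on the input charset
no matter which phase you imagine the conversion happening in.

```python
import codecs


def translate_literal(source_bytes, input_charset="utf-8", exec_charset="utf-8"):
    """Toy model: bytes of a string literal in the file -> bytes in the program.

    Hypothetical helper for illustration only; it does not mirror GCC internals.
    """
    # Compare canonical codec names so that "latin1" and "iso-8859-1" match.
    same = codecs.lookup(input_charset).name == codecs.lookup(exec_charset).name
    if same:
        # Same charset on both sides: pass the raw bytes through untouched,
        # so no conversion is needed in "Phase 1" or in "Phase 5".
        return source_bytes
    # Otherwise the input charset decides which characters the bytes mean
    # ("Phase 1"), and the execution charset decides how those characters
    # are encoded in the program ("Phase 5").
    text = source_bytes.decode(input_charset)
    return text.encode(exec_charset)


if __name__ == "__main__":
    utf8_e_acute = b"\xc3\xa9"   # U+00E9 saved in a UTF-8 file
    latin1_e_acute = b"\xe9"     # U+00E9 saved in a Latin-1 file

    print(translate_literal(utf8_e_acute).hex())                        # c3a9
    print(translate_literal(latin1_e_acute, "latin1", "utf-8").hex())   # c3a9
    print(translate_literal(latin1_e_acute, "latin1", "latin1").hex())  # e9
    # Same file bytes, different input charset, different result:
    print(translate_literal(utf8_e_acute, "latin1", "utf-8").hex())     # c383c2a9
```

The last two calls are the interesting ones: with matching charsets the
bytes are never touched, and with a different input charset the same
file produces a different program, whichever phase you choose to
attribute the work to.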