Thanks! It was a little tricky to take in, especially when code pages are involved.

So,

- If -finput-charset and -fexec-charset are the same, then no conversion is performed. That is clear.
- If neither -finput-charset nor -fexec-charset is set, then no conversion is performed either. It appears GCC cannot get the system locale under Windows, so -finput-charset is UTF-8 by default and so is -fexec-charset, so again both are the same.
- Only if just one of these is specified is a conversion performed, with the other one left at its UTF-8 default.

That means if I use a Windows-1251 or a UTF-8 source file with none of these options specified, then only raw bytes will be read and no conversion will happen at all?

On Thu, Jun 6, 2019 at 5:02 AM Jonathan Wakely <jwakely.gcc@xxxxxxxxx> wrote:

> On Wed, 5 Jun 2019 at 15:29, esoteric escape <manips88@xxxxxxxxx> wrote:
> >
> > Hello, I am trying to understand the phase of translation where the
> > option in gcc -finput-charset comes in effect as given here:
> > https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html#Preprocessor-Options
> >
> > > -finput-charset=charset
> > >
> > > Set the input character set, used for translation from the character
> > > set of the input file to the source character set used by GCC. If the
> > > locale does not specify, or GCC cannot get this information from the
> > > locale, the default is UTF-8. This can be overridden by either the
> > > locale or this command-line option. Currently the command-line option
> > > takes precedence if there’s a conflict. charset can be any encoding
> > > supported by the system’s iconv library routine.
> >
> > For phases of translation, I looked at this article at
> > https://en.cppreference.com/w/cpp/language/translation_phases
> >
> > In Phase 1, it mentions:
> >
> > > The individual bytes of the source code file are mapped (in
> > > implementation-defined manner) to the characters of the basic source
> > > character set.
> > > In particular, OS-dependent end-of-line indicators are replaced by
> > > newline characters.
> >
> > Then in Phase 5, it says that -finput-charset comes in effect.
> >
> > > Note: the conversion performed at this stage can be controlled by
> > > command line options in some implementations: gcc and clang use
> > > -finput-charset to specify the encoding of the source character set ...
> >
> > To my understanding, in Phase 1, when compiler translates the source file
> > to basic source character set, the encoding specified by -finput-charset
> > should be already in effect. E.g., the encoding by default is UTF-8 on
> > GCC, then the source file is read using UTF-8.
>
> What does "read using UTF-8" mean?
>
> GCC decides how it reads the input. If the input-charset and
> exec-charset are the same, then a valid implementation strategy is to
> just read in raw bytes in Phase 1 and not alter them in any way, and
> then in Phase 5 perform no conversions (because the input characters
> are already in the execution character set).
>
> > Why do they say that in Phase 5 that -finput-charset can be used to
> > specify the encoding at that stage? Since the characters were already
> > read in Phase 1 from source using UTF-8. Are they correct in this regard?
>
> Since phase 5 conversions are from one character set to another, the
> option that specifies the input character set is going to have an
> effect here. Whether the actual work happens in "Phase 1" or is
> postponed until "Phase 5" (or whether GCC actually works in separate
> phases at all, or does it differently producing the same results)
> doesn't matter. The fact is that the -finput-charset option changes
> the results of translation.
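
To check my own understanding of the "same charset means raw bytes pass through" strategy described above, here is a small toy model in Python. It is only a sketch, not how GCC is actually implemented: convert_literal is a made-up helper name, and the two keyword arguments merely stand in for -finput-charset and -fexec-charset, both assumed to default to UTF-8.

```python
def convert_literal(raw: bytes,
                    input_charset: str = "utf-8",
                    exec_charset: str = "utf-8") -> bytes:
    """Toy model of the bytes a string literal would end up with.

    Hypothetical helper for illustration only; it does not call GCC.
    """
    if input_charset.lower() == exec_charset.lower():
        # Same charset on both sides: hand the raw bytes through untouched,
        # without even checking that they are valid in that charset.
        return raw
    # Charsets differ: decode from the input charset, re-encode for execution.
    return raw.decode(input_charset).encode(exec_charset)


zhe_cp1251 = b"\xc6"                    # Cyrillic ZHE as stored in a Windows-1251 file
zhe_utf8 = "\u0416".encode("utf-8")     # the same letter stored as UTF-8

# No options: both sides default to UTF-8, so the bytes are left alone.
print(convert_literal(zhe_cp1251))
# Input charset given (stand-in for -finput-charset=cp1251): converted to UTF-8.
print(convert_literal(zhe_cp1251, input_charset="cp1251"))
# Exec charset given (stand-in for -fexec-charset=cp1251): converted from UTF-8.
print(convert_literal(zhe_utf8, exec_charset="cp1251"))
```

In this model a Windows-1251 source byte compiled with no options comes out unchanged, which matches the "only raw bytes will be read" reading; whether real GCC behaves identically in every case is something I have not verified.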