On 16 October 2016 at 16:25, Marvin Gülker <m-guelker@xxxxxxxxxxxxxx> wrote:
> Hi everyone,
>
> GCC currently supports two options for dealing with the content of
> strings hardcoded in source files, -fexec-charset and -finput-charset
> (and -fwide-exec-charset, but let's keep that one aside for now).
>
> In my understanding, when GCC reads a source file from disk, it assumes
> the file to be in the "input charset" specified with -finput-charset, or
> in lack there of, in the locale's charset, or in lack there of, in
> UTF-8. The content is then transcoded to whatever internal charset GCC
> uses. This includes any ordinary string constants' contents.
>
> When GCC then creates an executable, the strings are transcoded from
> GCC's internal charset into the charset specified with -fexec-charset,
> or in lack there of, into UTF-8. So even if my source file is written
> in, say, UTF-32, any ordinary string literals end up as UTF-8 in the
> executable.
>
> If the above was incorrect, please correct me. I would appreciate it if
> you gave me a pointer where to read up the correct process then.
>
> Now comes the question. The above is true for ordinary string
> literals. The string literal in the following source code:
>
> int main()
> {
>     const char* some_string = "Bärenstark";
>     return 0;
> }
>
> will thus always end up being transcoded to UTF-8 and stored as UTF-8 in
> the executable if -fexec-charset=UTF-8 is set and the input charset is
> set or detected correctly. If on the other hand I specify
> -fexec-charset=ISO-8859-1, it should be stored in the executable in
> *that* charset.

I believe that's true.

> Which effect does -fexec-charset have if the source code uses the new
> C++11 charset-aware literals? For example, if the source code looks like
> this:
>
> int main()
> {
>     const char* some_string = u8"Bärenstark";
>     return 0;
> }
>
> u8 denotes a string encoded in UTF-8, so in my expectation, this string
> literal should *always* end up in UTF-8 in the final executable,
> i.e. the value of the option -fexec-charset should be ignored,
> especially if it is unset. However, even if I set
> -fexec-charset=ISO-8859-1, I would expect the string still to be in
> UTF-8 in the final executable, since there is an explicit request for
> UTF-8 in the source code (and GCC should probably emit a warning that
> this doesn't fit together well). Even more, this assumption should be
> true on all conformant C++ compilers, shouldn't it?

Yes, the characters of the u8 string literal are required to be UTF-8
encoded code units by the standard. So I think your description is
correct.

I don't see why a warning should be issued though. u8 literals are
useful when the execution character set is *not* UTF-8, because you can
use them to ensure a string is UTF-8 encoded when it otherwise wouldn't
be. Warning for those use cases seems unnecessary.
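
If you want to check this yourself, something like the following rough
sketch should do. It dumps the bytes of both kinds of literal so you can
compare them under different -fexec-charset settings. The names
hex_dump, u8_bytes, ordinary_bytes and print_report are made up for this
example and are not part of any library. I wrote the "ä" as the
universal character name \u00E4 so that the result does not depend on
how the source file is encoded or on -finput-charset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: hex-dump the code units of a narrow or u8 string
// literal, excluding the terminating NUL. Works whether u8"" has element
// type char (C++11 to C++17) or char8_t (C++20).
template <typename CharT, std::size_t N>
std::string hex_dump(const CharT (&lit)[N]) {
    static_assert(sizeof(CharT) == 1, "expects a narrow or u8 literal");
    std::string out;
    char buf[4];  // two hex digits, one space, NUL
    for (std::size_t i = 0; i + 1 < N; ++i) {
        std::snprintf(buf, sizeof buf, "%02X ",
                      static_cast<unsigned>(static_cast<unsigned char>(lit[i])));
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}

// u8 literal: required to be UTF-8, whatever the execution charset is.
std::string u8_bytes() { return hex_dump(u8"B\u00E4renstark"); }

// Ordinary literal: encoded in the execution charset (-fexec-charset).
std::string ordinary_bytes() { return hex_dump("B\u00E4renstark"); }

void print_report() {
    // 'ä' (U+00E4) is the two bytes C3 A4 in UTF-8.
    assert(u8_bytes() == "42 C3 A4 72 65 6E 73 74 61 72 6B");
    std::printf("u8 literal:       %s\n", u8_bytes().c_str());
    std::printf("ordinary literal: %s\n", ordinary_bytes().c_str());
}
```

Call print_report() from main and build it twice, once with the default
options and once with -fexec-charset=ISO-8859-1. The u8 line should be
the same in both builds. In the second build I would expect the ordinary
literal to contain the single byte E4 where the u8 literal has C3 A4,
assuming your GCC can convert to that charset.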