Effect of -fexec-charset with C++11 string literals

Marvin Gülker <m-guelker@xxxxxxxxxxxxxx> · Sun, 16 Oct 2016 17:25:15 +0200

Hi everyone,

GCC currently supports two options for dealing with the content of
strings hardcoded in source files, -fexec-charset and -finput-charset
(and -fwide-exec-charset, but let's keep that one aside for now).

As I understand it, when GCC reads a source file from disk, it assumes
the file is in the "input charset" specified with -finput-charset, or,
failing that, in the locale's charset, or, failing that, in UTF-8. The
content is then transcoded to whatever internal charset GCC uses. This
includes the contents of any ordinary string constants.

When GCC then creates an executable, the strings are transcoded from
GCC's internal charset into the charset specified with -fexec-charset,
or, failing that, into UTF-8. So even if my source file is written in,
say, UTF-32, any ordinary string literals end up as UTF-8 in the
executable.

If any of the above is incorrect, please correct me. I would appreciate
a pointer to where I can read up on the actual process.

Now comes the question. The above is true for ordinary string
literals. The string literal in the following source code:

    int main()
    {
        const char* some_string = "Bärenstark";
        return 0;
    }

will thus be transcoded to UTF-8 and stored as UTF-8 in the executable
whenever -fexec-charset=UTF-8 is set and the input charset is set or
detected correctly. If, on the other hand, I specify
-fexec-charset=ISO-8859-1, it should be stored in the executable in
*that* charset.

What effect does -fexec-charset have if the source code uses the new
C++11 charset-aware literals? For example, if the source code looks like
this:

    int main()
    {
        const char* some_string = u8"Bärenstark";
        return 0;
    }

u8 denotes a string encoded in UTF-8, so I would expect this string
literal to *always* end up as UTF-8 in the final executable, i.e. the
value of the option -fexec-charset should be ignored, and certainly so
if it is unset. Even if I set -fexec-charset=ISO-8859-1, I would still
expect the string to be UTF-8 in the final executable, since the source
code explicitly requests UTF-8 (and GCC should probably emit a warning
that the two don't fit together well). Moreover, this assumption should
hold on all conforming C++ compilers, shouldn't it?

Is this correct? Could an explanation of this be added to the
documentation of the -fexec-charset command-line switch?

Greetings
Marvin

-- 
Blog: http://www.guelkerdev.de
PGP/GPG ID: F1D8799FBCC8BC4F