Re: Effect of -fexec-charset with C++11 string literals

Jonathan Wakely <jwakely.gcc@xxxxxxxxx> · Wed, 26 Oct 2016 16:55:44 +0100

On 16 October 2016 at 16:25, Marvin Gülker <m-guelker@xxxxxxxxxxxxxx> wrote:
> Hi everyone,
>
> GCC currently supports two options for dealing with the content of
> strings hardcoded in source files, -fexec-charset and -finput-charset
> (and -fwide-exec-charset, but let's keep that one aside for now).
>
> In my understanding, when GCC reads a source file from disk, it assumes
> the file to be in the "input charset" specified with -finput-charset,
> or, failing that, in the locale's charset, or, failing that, in UTF-8.
> The content is then transcoded to whatever internal charset GCC uses.
> This includes the contents of any ordinary string constants.
>
> When GCC then creates an executable, the strings are transcoded from
> GCC's internal charset into the charset specified with -fexec-charset,
> or, failing that, into UTF-8. So even if my source file is written
> in, say, UTF-32, any ordinary string literals end up as UTF-8 in the
> executable.
>
> If the above is incorrect, please correct me. In that case I would
> appreciate a pointer to where I can read up on the correct process.
>
> Now comes the question. The above is true for ordinary string
> literals. The string literal in the following source code:
>
>     int main()
>     {
>         const char* some_string = "Bärenstark";
>         return 0;
>     }
>
> will thus always end up being transcoded to UTF-8 and stored as UTF-8 in
> the executable if -fexec-charset=UTF-8 is set and the input charset is
> set or detected correctly. If on the other hand I specify
> -fexec-charset=ISO-8859-1, it should be stored in the executable in
> *that* charset.

I believe that's true.
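
One way to check this yourself is to dump the bytes of the literal and
compile the same file with different -fexec-charset values. The following
is only a minimal sketch: to_hex and ordinary_literal_bytes are
hypothetical helper names, and the literal is spelled with the universal
character name \u00e4 instead of a raw 'ä' so that the result does not
depend on how the input charset is detected.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: renders the bytes of a narrow string as
// space-separated upper-case hex, e.g. "AB" -> "41 42".
std::string to_hex(const char* s)
{
    assert(s != nullptr);
    std::string out;
    char buf[4];
    for (std::size_t i = 0, n = std::strlen(s); i < n; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0)
            out += ' ';
        out += buf;
    }
    return out;
}

// An ordinary (unprefixed) literal: its bytes are chosen by the
// execution charset the file was compiled with.
std::string ordinary_literal_bytes()
{
    return to_hex("B\u00e4renstark");
}
```

Printing ordinary_literal_bytes() from main should give
"42 C3 A4 72 65 6E 73 74 61 72 6B" when the execution charset is UTF-8
and "42 E4 72 65 6E 73 74 61 72 6B" with -fexec-charset=ISO-8859-1.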

> Which effect does -fexec-charset have if the source code uses the new
> C++11 charset-aware literals? For example, if the source code looks like
> this:
>
>     int main()
>     {
>         const char* some_string = u8"Bärenstark";
>         return 0;
>     }
>
> u8 denotes a string encoded in UTF-8, so in my expectation, this string
> literal should *always* end up in UTF-8 in the final executable,
> i.e. the value of the option -fexec-charset should be ignored,
> especially if it is unset. However, even if I set
> -fexec-charset=ISO-8859-1, I would expect the string still to be in
> UTF-8 in the final executable, since there is an explicit request for
> UTF-8 in the source code (and GCC should probably emit a warning that
> this doesn't fit together well). Even more, this assumption should be
> true on all conformant C++ compilers, shouldn't it?

Yes, the standard requires the characters of a u8 string literal to be
encoded as UTF-8 code units, whatever the execution character set is.
So I think your description is correct.
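
If you want the compiler to confirm this, something like the sketch
below should do. It assumes nothing beyond the standard; the function
names are made up for illustration. The checks are meant to pass
whatever -fexec-charset is given. The bytes are viewed as unsigned char
because the element type of a u8 literal is char up to C++17 and
char8_t from C++20.

```cpp
#include <cassert>
#include <cstring>

// U+00E4 takes two UTF-8 code units (0xC3 0xA4), so the array holds
// 11 code units plus the terminating null.
static_assert(sizeof(u8"B\u00e4renstark") == 12,
              "u8 literal does not have the UTF-8 length");
static_assert(static_cast<unsigned char>(u8"\u00e4"[0]) == 0xC3,
              "first code unit of U+00E4 is not 0xC3");
static_assert(static_cast<unsigned char>(u8"\u00e4"[1]) == 0xA4,
              "second code unit of U+00E4 is not 0xA4");

// Hypothetical accessor: exposes the literal's storage as raw bytes.
const unsigned char* utf8_bytes()
{
    return reinterpret_cast<const unsigned char*>(u8"B\u00e4renstark");
}

// Hypothetical run-time check against the expected UTF-8 byte sequence.
bool u8_literal_is_utf8()
{
    static const unsigned char expected[] = {
        0x42, 0xC3, 0xA4, 0x72, 0x65, 0x6E,
        0x73, 0x74, 0x61, 0x72, 0x6B, 0x00
    };
    return std::memcmp(utf8_bytes(), expected, sizeof expected) == 0;
}

void self_check()
{
    assert(u8_literal_is_utf8());
}
```

Compiling this once normally and once with -fexec-charset=ISO-8859-1
is a quick way to see that the option does not affect u8 literals.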

I don't see why a warning should be issued though. u8 literals are
useful when the execution character set is *not* UTF-8, because you
can use them to ensure a string is UTF-8 encoded when it otherwise
wouldn't be. Warning for those use cases seems unnecessary.