On Mon, 3 Jun 2019 at 10:07, esoteric escape wrote:
>
> Hello, I am on Windows OS where CHAR_BIT == 8.
>
> I am trying to understand whether this behavior is
> implementation-defined or not.
>
> I have this string in UTF-8, and I am trying to understand if it is
> implementation-defined:
>
> std::string s = u8"€";
>
> It's clear to me that char c = 0xC8 is implementation-defined, for
> these reasons:
> 1. char's signedness depends on the compiler.
> 2. If the value is beyond the representable range of char, say -128
> to 127, then again it is implementation-defined.

The char value is implementation-defined, but there will be some unique
value that corresponds to 0xC8 and can be unambiguously converted back
to (unsigned char)0xC8. For GCC (and in C++20) the conversion to char
is the obvious two's complement one, producing (char)-56.

> In the same way, I am trying to understand how the std::string case
> is handled, because it also uses char.
>
> So, € in UTF-8 means the byte sequence E2 82 AC in hex. If
> std::string uses, say, the signed version of char, don't those values
> fall beyond the representable range, and therefore become
> implementation-defined in std::string? Or is this case actually
> well-defined?

Both. The code is perfectly well-defined in C++11, C++14 and C++17. The
precise numerical values are implementation-defined, but there is a
one-to-one mapping from 8-bit UTF-8 code units to char values, and back
again (see the first sketch below).

N.B. In C++20 the code is ill-formed and won't compile, because the
type of u8"€" is const char8_t[4], which cannot be used to initialize a
std::string. You'd need to cast it to (const char*) or use std::u8string
instead (see the second sketch below).
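
To illustrate the round trip described above, here is a minimal sketch
(assuming GCC, or any compiler in C++17 mode where u8"..." still yields
plain char). It recovers the original UTF-8 code units by converting
back through unsigned char, and shows the two's complement conversion
of 0xC8:

#include <cstdio>
#include <string>

int main() {
    std::string s = u8"€";  // three UTF-8 code units: E2 82 AC

    for (char c : s) {
        // Converting each char back to unsigned char recovers the
        // original code unit, regardless of char's signedness.
        unsigned char u = static_cast<unsigned char>(c);
        std::printf("%02X ", u);  // prints: E2 82 AC
    }
    std::printf("\n");

    // Where char is signed (as on Windows and x86 Linux), 0xC8
    // converts to 0xC8 - 0x100 == -56; on targets with unsigned
    // char it stays 200.
    std::printf("%d\n", static_cast<int>(static_cast<char>(0xC8u)));
}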
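
And a sketch of the two C++20 workarounds mentioned above (assuming a
compiler in C++20 mode):

#include <cstring>
#include <string>

int main() {
    // std::string s = u8"€";     // ill-formed in C++20

    std::u8string u = u8"€";      // the natural C++20 type

    // Casting to const char* is allowed, because char may alias any
    // object representation:
    std::string s(reinterpret_cast<const char*>(u8"€"));

    // Both strings hold the same three bytes, E2 82 AC.
    return (s.size() == 3 && std::memcmp(s.data(), u.data(), 3) == 0)
               ? 0 : 1;
}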