On Mon, 3 Jun 2019 at 10:49, esoteric escape <manips88@xxxxxxxxx> wrote:
>
> Thanks! I see, yes speaking of C++17. Just to make sure I grasped it I'll say how I get it:
>
> 1. In the std::string's case, we care about bits regardless of the value of the chars inside std::string, so because mapping is precise that makes it well-defined.
> 2. In case of char, the underlying bit representation changes

On most implementations, no. The underlying bit representation is the same. 0xC8 as an unsigned char is 11001000 and as a char is also 11001000. What is implementation-defined is the *value* of 11001000 as a char. For signed char with GCC that value is (char)-56. For an unsigned char it's (char)200. On a one's complement system it would be (char)-55.

> if value overflows range and its implementation-defined.

There's no difference between #1 and #2. In both cases the UTF-8 encoding produces some (implementation-defined) char value for each UTF-8 code unit. If you want UTF-8 encoded data then that's what you get. The resulting chars will work perfectly well with anything that expects UTF-8 encoded data. The fact that some characters in the string might have negative values is irrelevant.

If converting the code unit to a char produces a negative value then that's what you get. You probably don't need to worry about it in any more detail than that. If you're using that string somewhere that expects UTF-8 then everything just works. If you're using it somewhere that expects 7-bit ASCII values then it might not work, but that's always true of UTF-8 data; it has nothing to do with whether char is signed or unsigned.

> Say, I decide to manually do this:
>
> std::string s = "\xE2\x82\xAC";
>
> Or,
>
> char c[3];
> c[0] = 0xE2;
> c[1] = 0x82;
> c[2] = 0xAC;
>
> Then, I suppose these cases will be more like my #2 above than #1, true?

Case #2 and #1 are the same.