From: Pali Rohár
> Sent: 20 January 2020 16:27
...
> > Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> > to wchar_t (or unicode code points),
>
> I was talking about valid UTF-8 sequence (invalid, illformed is out of
> game and for sure would always cause problems).

Except that they are always likely to happen.
I've been pissed off by programs crashing because they assume that
an input string (eg an email) is UTF-8 but happens to contain a single
0xa3 byte in the otherwise 7-bit data.

The standard ought to have defined a translation for such sequences
and just a 'warning' from the function(s) that unexpected bytes were
processed.

> > nor a 1:1 mapping of all possible wchar_t values to UTF-8.
>
> This is not truth. There is exactly only one way how to convert sequence
> of Unicode code points to UTF-8. UTF is Unicode Transformation Format
> and has exact definition how is Unicode Transformed.

But a wchar_t can hold lots of values that aren't Unicode code points.
Prior to the 2003 changes half of the 2^32 values could be converted.
Afterwards only a small fraction.

> If you have valid UTF-8 sequence then it describe one exact sequence of
> Unicode code points. And if you have sequence (ordinals) of Unicode code
> points there is exactly one and only one its representation in UTF-8.
>
> I would suggest you to read Unicode standard, section 2.5 Encoding Forms.

That all assumes everyone is playing the correct game.

> > Really both need to be defined - even for otherwise 'invalid' sequences.
> >
> > Even the 16-bit values above 0xd000 can appear on their own in
> > windows filesystems (according to wikipedia).
>
> If you are talking about UTF-16 (which is _not_ 16-bit as you wrote),
> look at my previous email:

UTF-16 is a sequence of 16-bit values....
It can contain 0xd000 to 0xffff (usually in pairs) but they aren't UTF-8
codepoints.
David
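
For illustration only, here is a minimal sketch of the kind of "translate and warn" decoding asked for above, rather than failing on the first unexpected byte. The function name `utf8_decode_lenient` and its parameters are hypothetical, not an existing API, and passing a stray byte through as the code point with the same value (so a lone 0xa3 becomes U+00A3) is an assumption, not something any standard defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode UTF-8 into code points without ever failing.
 * A byte that is not part of a well-formed sequence is passed through as
 * the code point with the same value and counted in *unexpected, so the
 * caller gets a 'warning' instead of an error.
 * Returns the number of code points written to out[].
 */
static size_t utf8_decode_lenient(const unsigned char *in, size_t in_len,
                                  uint32_t *out, size_t out_max,
                                  size_t *unexpected)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL && unexpected != NULL);
    *unexpected = 0;

    while (i < in_len && n < out_max) {
        unsigned char b = in[i];
        uint32_t cp = 0, min = 0;
        size_t extra = 0, k;

        if (b < 0x80) {                 /* plain 7-bit data */
            out[n++] = b;
            i++;
            continue;
        }

        /* Lead byte: number of continuation bytes and smallest legal value. */
        if ((b & 0xe0) == 0xc0) {
            extra = 1; cp = b & 0x1f; min = 0x80;
        } else if ((b & 0xf0) == 0xe0) {
            extra = 2; cp = b & 0x0f; min = 0x800;
        } else if ((b & 0xf8) == 0xf0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        }
        /* extra == 0 here means a stray continuation byte or 0xf8..0xff. */

        for (k = 1; k <= extra; k++) {
            if (i + k >= in_len || (in[i + k] & 0xc0) != 0x80)
                break;                  /* truncated or bad continuation */
            cp = (cp << 6) | (in[i + k] & 0x3f);
        }

        if (extra != 0 && k > extra && cp >= min && cp <= 0x10ffff &&
            !(cp >= 0xd800 && cp <= 0xdfff)) {
            out[n++] = cp;              /* well-formed sequence */
            i += extra + 1;
        } else {
            out[n++] = b;               /* unexpected byte: pass through */
            (*unexpected)++;
            i++;
        }
    }
    return n;
}
```

With this sketch the three bytes 'a', 0xa3, 'b' decode to three code points with one warning, whereas a strict decoder would reject the whole string. Overlong forms and encoded surrogates are treated as unexpected bytes too, which is one possible choice among several.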