> > UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
> > It just says "Unicode".
>
> That is because in MS world, "Unicode" term lot of times means UCS-2 or UTF-16.

For example, the Joliet Specification describes using UCS-2 for character
sets. Similarly, the UDF Specification describes using Unicode Version 2.0
for character sets.
However, Windows file systems also accept UTF-16 encoded UCS-4.
The foundation of their main product (Windows NT) was designed in the era
when UTF-16 and UCS-2 were equivalent.
The non-BMP planes were probably not fully considered.

> You need to have a crystal ball to correctly understand their specifications.

Exactly!!
My crystal ball says ...
"They designed D800-DFFF as a mysterious area, so it just passes through
untouched."

> > Microsoft's File Systems uses the UTF-16 encoded UCS-4 code set.
> > The character type is basically 'wchar_t'(16bit).
> > The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
> >
> > This “0000h to FFFFh” also includes surrogate characters(U+D800 to
> > U+DFFF), but these should not be converted to upper case.
> > Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows, just returns the same value.
> > (* RtlUpcaseUnicodeChar() is one of Windows native API)
> >
> > If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
> > With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.
> >
> > The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
> > To be more strict...
> > D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().
>
> Exactly, that is why surrogate pairs cannot be put into any "to upper"
> function. Or rather "to upper" function needs to be identity for them to not break anything. "to upper" does not make
> any sense on one u16 item from UTF-16 sequence when you do not have a complete code point.
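
To make the "identity for surrogates" point concrete, here is a minimal
sketch of a per-code-unit upcase lookup that never maps D800h to DFFFh,
even if the table has an entry there. It is not the driver's code:
sketch_is_surrogate(), sketch_toupper() and the flat table layout (one
entry per code unit, 0 meaning "no mapping") are assumptions made only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for any UTF-16 surrogate half, U+D800..U+DFFF. */
static int sketch_is_surrogate(uint16_t c)
{
	return c >= 0xD800 && c <= 0xDFFF;
}

/*
 * Hypothetical stand-in for exfat_toupper(). Assumption: 'table' is an
 * expanded upcase table with 'len' entries indexed by UTF-16 code unit,
 * where 0 means "no mapping".
 */
static uint16_t sketch_toupper(const uint16_t *table, size_t len, uint16_t c)
{
	if (sketch_is_surrogate(c))
		return c;	/* identity, whatever the table contains */
	if (c < len && table[c] != 0)
		return table[c];
	return c;
}

/* Usage: a table with one sane entry and one bogus surrogate entry. */
static void sketch_toupper_selfcheck(void)
{
	static uint16_t table[0x10000];	/* zero-initialised: no mappings */

	table['a'] = 'A';
	table[0xD800] = 0x0041;		/* corrupt entry that must be ignored */

	assert(sketch_toupper(table, 0x10000, 'a') == 'A');
	assert(sketch_toupper(table, 0x10000, 'Z') == 'Z');
	assert(sketch_toupper(table, 0x10000, 0xD800) == 0xD800);
}
```

The same effect could be had by dropping surrogate entries while loading
the table; doing the check in the lookup just keeps the sketch short.
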
> So API for UTF-16 "to upper" function needs to take full string, not just one u16.
>
> So for code points above U+FFFF it is needed some other mechanism how to represent upcase table (e.g. by providing full
> UTF-16 pair or code point encoded in UTF-32). And this is unknown and reason why I put question which was IIRC forwarded
> to MS.

That is exactly the case for a "generic" UTF-16 toupper function.
However, exfat (and the other MS file systems) does not require uppercase
conversion for non-BMP characters.
For non-BMP characters, I think it is enough to do nothing (no skip, no
conversion), just like Windows.

> > WTF-8 is new to me.
> > That's an interesting idea, but is it needed for exfat?
> >
> > For characters over U+FFFF,
> > -For UTF-32, a value of 0x10000 or more
> > -For UTF-16, the value from 0xd800 to 0xdfff
> > I think these are just "don't convert to uppercase."
> >
> > If the File Name Directory Entry contains illegal surrogate
> > characters(such as one unpaired surrogate half), it will simply be ignored by utf16s_to_utf8s().
>
> This is the example why it can be useful for exfat on linux. exfat filename can contain just sequence of unpaired halves
> of surrogate pairs. Such thing is not representable in UTF-8, but valid in exfat.
> Therefore current linux kernel exfat driver with UTF-8 encoding cannot handle such filenames. But with WTF-8 it is possible.

In fact, exfat (and the other MS file systems) accepts unpaired surrogate
characters, but that is illegal Unicode.
It is also very rarely generated by normal user operation (except for VFAT
short names).
Illegal Unicode sequences have often been a security risk, and I think they
should not be accepted even where that is possible.

> So if we want that userspace would be able to read such files from exfat fs, some mechanism for converting "unpaired halves"
> to NULL-term char* string suitable for filenames is needed. And WTF-8 seems like a good choice as it is backward compatible
> with UTF-8.

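
For reference, the choices for an unpaired surrogate half during UTF-16 to
UTF-8 conversion can be sketched as below. This is only an illustration,
not the kernel's utf16s_to_utf8s(): the function name, the policy enum and
the buffer convention are all made up here. The WTF-8 branch simply encodes
the lone half with the ordinary three-byte pattern (EDh A0h..BFh xx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical policies for an unpaired surrogate half. */
enum lone_policy { LONE_FAIL, LONE_REPLACE, LONE_WTF8 };

/* Encode one value up to 0x10FFFF with the UTF-8 bit layout; returns bytes written. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Hypothetical converter. Returns the number of bytes written (without the
 * terminating NUL), or -1 if the policy rejects the input or 'out' is too
 * small. The space check is conservative: it always wants room for four
 * bytes plus the NUL.
 */
static long sketch_utf16_to_utf8(const uint16_t *in, size_t n,
				 unsigned char *out, size_t cap,
				 enum lone_policy policy)
{
	size_t i = 0, o = 0;

	if (cap == 0)
		return -1;
	while (i < n) {
		uint32_t cp = in[i++];

		if (cp >= 0xD800 && cp <= 0xDBFF && i < n &&
		    in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
			/* well-formed pair: combine into one code point */
			cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
		} else if (cp >= 0xD800 && cp <= 0xDFFF) {
			/* unpaired half */
			if (policy == LONE_FAIL)
				return -1;
			if (policy == LONE_REPLACE)
				cp = '_';
			/* LONE_WTF8: keep cp, it becomes a 3-byte sequence */
		}
		if (cap - o < 5)
			return -1;
		o += put_utf8(cp, out + o);
	}
	out[o] = '\0';
	return (long)o;
}

/* Usage: the name "A" followed by a lone high surrogate. */
static void sketch_convert_selfcheck(void)
{
	const uint16_t name[] = { 0x0041, 0xD800 };
	unsigned char buf[16];

	assert(sketch_utf16_to_utf8(name, 2, buf, sizeof(buf), LONE_FAIL) == -1);
	assert(sketch_utf16_to_utf8(name, 2, buf, sizeof(buf), LONE_REPLACE) == 2);
	assert(memcmp(buf, "A_", 3) == 0);
	assert(sketch_utf16_to_utf8(name, 2, buf, sizeof(buf), LONE_WTF8) == 4);
}
```

With LONE_REPLACE, two names that differ only in their unpaired halves end
up as the same byte string; with LONE_WTF8 they stay distinct, at the price
of producing bytes that are not valid UTF-8.
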
I think there is very little need to access such file names.
It is rare to use non-BMP characters in file names, and it is even rarer to
illegally record only half of one.

> > string after utf8 conversion does not include illegal byte sequence.
>
> Yes, but this is loosy conversion. When you would have two filenames with different "surrogate halves" they would be converted
> to same file name. So you would not be able to access both of them.

I also think there is a problem with this conversion.
Illegal byte sequences are stripped off, and the result behaves as if they
had never existed (like a legal UTF-8 string).
I think it is safest to fail the conversion when an illegal byte sequence
is detected.
It is also common to replace it with another character (such as '_').
(Not perfect, but it works reasonably well.)

Anyway, we don't need to convert non-BMP characters or unpaired surrogate
characters to uppercase in exfat (and the other MS file systems).

BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx>