> We do not know how code points above U+FFFF could be converted to upper case.

Code points above U+FFFF do not need to be converted to upper case.

> Basically from exfat specification can be deduced it only for
> U+0000 .. U+FFFF code points.

The exFAT specification (sec. 7.2.5.1) says:

  ... table shall cover the complete Unicode character range
  (from character codes 0000h to FFFFh inclusive).

The terms UCS-2, UCS-4, and UTF-16 do not appear in the exFAT specification.
It just says "Unicode".

> Second problem is that all MS filesystems (vfat, ntfs and exfat) do not use UCS-2 nor UTF-16, but rather some mix between
> it. Basically any sequence of 16bit values (except those :/<>... vfat chars) is valid, even unpaired surrogate half. So
> surrogate pair (two 16bit values) represents one unicode code point (as in UTF-16), but one unpaired surrogate half is
> also valid and represent (invalid) unicode code point of its value. In unicode are not defined code points for values
> of single / half surrogate.

Microsoft's file systems use the UCS-4 code set encoded as UTF-16.
The character type is basically 'wchar_t' (16 bit).
The description "0000h to FFFFh" also assumes the use of 'wchar_t'.

This "0000h to FFFFh" range also includes the surrogate characters
(U+D800 to U+DFFF), but these should not be converted to upper case.
Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows just
returns the same value.
(* RtlUpcaseUnicodeChar() is one of the Windows native APIs.)

If the upcase-table contained entries for surrogate characters,
exfat_toupper() would convert them incorrectly.
With the current implementation, the results of exfat_utf8_d_cmp() and
exfat_uniname_ncmp() may differ in that case.

The normal exFAT upcase-table does not contain surrogate characters, so the
problem does not occur in practice.
To be stricter, D800h to DFFFh should be excluded, either when loading the
upcase-table or in exfat_toupper().
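As a rough sketch of the two options for excluding surrogates (this is not the
driver code: sketch_toupper(), sketch_sanitize_upcase() and the convention
that a zero table entry means "no mapping" are assumptions made for
illustration only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed table size: one entry per 16-bit code unit (0000h..FFFFh). */
#define SKETCH_UPCASE_ENTRIES 0x10000u

/* True for any UTF-16 surrogate code unit (D800h..DFFFh). */
static inline int sketch_is_surrogate(uint16_t c)
{
	return (c & 0xF800u) == 0xD800u;
}

/*
 * Option 1: exclude surrogates at lookup time.
 * Hypothetical stand-in for exfat_toupper(): a surrogate is returned
 * unchanged, whatever the table says. In this sketch a zero entry
 * means "no mapping", so the input is returned as-is.
 */
static inline uint16_t sketch_toupper(const uint16_t *upcase, uint16_t c)
{
	assert(upcase != NULL);

	if (sketch_is_surrogate(c))
		return c;
	return upcase[c] ? upcase[c] : c;
}

/*
 * Option 2: exclude surrogates at load time.
 * Clear every surrogate entry once, after the table has been read,
 * so that a plain table lookup can never map a surrogate.
 */
static inline void sketch_sanitize_upcase(uint16_t *upcase)
{
	uint32_t c;

	assert(upcase != NULL);

	for (c = 0xD800u; c <= 0xDFFFu; c++)
		upcase[c] = 0;
}
```

Either option alone would be enough. Filtering at load time keeps the lookup
path unchanged, while filtering in the lookup does not depend on what the
on-disk table contains.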
> Therefore if we talk about encoding UTF-16 vs UTF-32 we first need to fix a way how to handle those non-representative
> values in VFS encoding (iocharset=) as UTF-8 is not able to represent it too. One option is to extend UTF-8 to WTF-8
> encoding [1] (yes, this is a real and make sense!) and then ideally change exfat_toupper() to UTF-32 without restriction
> for surrogate pairs values.

WTF-8 is new to me.
That's an interesting idea, but is it needed for exFAT?

Characters above U+FFFF appear as:
 - in UTF-32, a value of 0x10000 or more
 - in UTF-16, values from 0xd800 to 0xdfff
I think the rule for these is simply "don't convert to upper case."

If a File Name Directory Entry contains illegal surrogate characters (such as
an unpaired surrogate half), they will simply be ignored by utf16s_to_utf8s().
The string after UTF-8 conversion does not contain any illegal byte sequence.

> Btw, same problem with UTF-16 also in vfat, ntfs and also in iso/joliet kernel drivers.

Ugh...

BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx>