RE: [PATCH 1/4] exfat: Simplify exfat_utf8_d_hash() for code points above U+FFFF

"Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx" <Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx> · Wed, 15 Apr 2020 07:46:27 +0000

> > UCS-2, UCS-4, and UTF-16 terms do not appear in the exfat specification.
> > It just says "Unicode".
> 
> That is because in MS world, "Unicode" term lot of times means UCS-2 or UTF-16. 

For example, the Joliet Specification describes using UCS-2 for character sets.
Similarly, the UDF Specification describes using Unicode Version 2.0 for character sets.
However, Windows File Systems also accepts UTF-16 encoded UCS-4.
The foundation of their main product(Windows NT) was designed in the era when UTF-16 and UCS-2 were equal.
The non-BMP plains were probably not fully considered.

> You need to have a crystal ball to correctly understand their specifications.

Exactly!!
My crystal ball says ...
"They've designed D800-DFFF to be a mysterious area, so it's going through it."

> > Microsoft's File Systems uses the UTF-16 encoded UCS-4 code set.
> > The character type is basically 'wchar_t'(16bit).
> > The description "0000h to FFFFh" also assumes the use of 'wchar_t'.
> >
> > This “0000h to FFFFh” also includes surrogate characters(U+D800 to
> > U+DFFF), but these should not be converted to upper case.
> > Passing a surrogate character to RtlUpcaseUnicodeChar() on Windows, just returns the same value.
> > (* RtlUpcaseUnicodeChar() is one of Windows native API)
> >
> > If the upcase-table contains surrogate characters, exfat_toupper() will cause incorrect conversion.
> > With the current implementation, the results of exfat_utf8_d_cmp() and exfat_uniname_ncmp() may differ.
> >
> > The normal exfat's upcase-table does not contain surrogate characters, so the problem does not occur.
> > To be more strict...
> > D800h to DFFFh should be excluded when loading upcase-table or in exfat_toupper().
> 
> Exactly, that is why surrogate pairs cannot be put into any "to upper"
> function. Or rather "to upper" function needs to be identity for them to not break anything. "to upper" does not make
> any sense on one u16 item from UTF-16 sequence when you do not have a complete code point.
> So API for UTF-16 "to upper" function needs to take full string, not just one u16.
>
> So for code points above U+FFFF it is needed some other mechanism how to represent upcase table (e.g. by providing full
> UTF-16 pair or code point encoded in UTF-32). And this is unknown and reason why I put question which was IIRC forwarded
> to MS.

That's exactly the case with the "generic" UTF-16 toupper function.
However, exfat (and other MS-FS's) does not require uppercase conversion for non-BMP plains characters.
For non-BMP characters, I think it's enough to just do nothing (no skip, no conversion).So like Windows.

> > WTF-8 is new to me.
> > That's an interesting idea, but is it needed for exfat?
> >
> > For characters over U+FFFF,
> >  -For UTF-32, a value of 0x10000 or more  -For UTF-16, the value from
> > 0xd800 to 0xdfff I think these are just "don't convert to uppercase."
> >
> > If the File Name Directory Entry contains illegal surrogate
> > characters(such as one unpaired surrogate half), it will simply be ignored by utf16s_to_utf8s().
> 
> This is the example why it can be useful for exfat on linux. exfat filename can contain just sequence of unpaired halves
> of surrogate pairs. Such thing is not representable in UTF-8, but valid in exfat.
> Therefore current linux kernel exfat driver with UTF-8 encoding cannot handle such filenames. But with WTF-8 it is possible.

In fact, exfat(and other MS-FSs) accept unpaired surrogate characters.
But this is illegal unicode.
Also, it is very rarely generated by normal user operation (except for VFAT shortname).
Illegal unicode characters were often a security risk and I think they should not be accepted. even if possible.

> So if we want that userspace would be able to read such files from exfat fs, some mechanism for converting "unpaired halves"
> to NULL-term char* string suitable for filenames is needed. And WTF-8 seems like a good choice as it is backward compatible
> with UTF-8.

I think there are very few requirements to access such file names.
It is rare to use non-BMP characters in file names, and it is even rarer to illegally record only half of them.

> > string after utf8 conversion does not include illegal byte sequence.
> 
> Yes, but this is loosy conversion. When you would have two filenames with different "surrogate halves" they would be converted
> to same file name. So you would not be able to access both of them.

I also think there is a problem with this conversion.
Illegal byte sequences are stripped off, and behave as if they didn't exist 
from the beginning (like a legal UTF-8 string).
I think it's safest to fail the conversion if it detects an illegal byte sequence.
And it's also popular to replace it with another character(such as'_ ').
(not perfect, but works reasonably)

Anyway, we don't need to convert non-BMP characters or unpaired surrogate characters 
to uppercase in exfat(and other MS-FSs).

BR
---
Kohada Tetsuhiro <Kohada.Tetsuhiro@xxxxxxxxxxxxxxxxxxxxxxxxxxx>