On Thursday 02 January 2020 16:20:33 Namjae Jeon wrote:
> This adds the implementation of nls operations for exfat.

Hello!

The patch series uses several different naming conventions for
NLS/Unicode related terms, e.g. uni16s, utf16s, nls, vfsname, ...

Could this be fixed, so that everything is named unambiguously?

The name "uni16s" is misleading, as Unicode does not fit into a 16-bit type.

Based on what is in nls.h I would propose the following names:

* unicode_t *utf32s always for strings in UTF-32/UCS-4 encoding (host endianness)
  (or "unicode_t *unis", as this is the fixed-width encoding for all Unicode code points)
* wchar_t *utf16s always for strings in UTF-16 encoding (host endianness)
* u8 *utf8s always for strings in UTF-8 encoding
* wchar_t *ucs2s always for strings in UCS-2 encoding (host endianness)

Plus, in case you need to work with UTF-16 or UCS-2 in little endian,
add appropriate naming suffixes.

And use e.g. "vfsname" (char * OR unsigned char * OR u8 *), as you already
do in some places, for strings in the iocharset= encoding.

Looking at the whole code + the exfat specification, the usage is:

Kernel NLS functions convert between UCS-2 and iocharset=.

The exfat upcase table has definitions only for UCS-2 characters.

All exfat string structures are stored in UTF-16LE, except the upcase table,
which is in UCS-2LE.

The specification is a great mess here, especially when it talks about
a Unicode upcase table for case insensitivity that is limited to code
points up to U+FFFF, and it says nothing about Unicode Normalization
and Normal Forms.

=======================================================================

And this opens a new question: what should the kernel do if userspace asks
to create these 4 files? (Assume iocharset=utf8 for full Unicode support.)

1. U+00e9
2. U+0065, U+0301
3. U+00c9
4. U+0045, U+0301

According to the Unicode uppercase algorithm, all 4 filenames result in the
same grapheme, "LATIN CAPITAL LETTER E WITH ACUTE".
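
To make the four names above concrete, here is a small userspace sketch in
Python. It is not kernel code and not the exfat algorithm: str.upper() only
stands in for an upcase-table lookup, NFC is just one candidate normalization
form, and the function names are made up for illustration.

```python
import unicodedata

# The four names from the list above, written as code point sequences.
NAMES = {
    1: "\u00e9",        # U+00e9
    2: "\u0065\u0301",  # U+0065, U+0301
    3: "\u00c9",        # U+00c9
    4: "\u0045\u0301",  # U+0045, U+0301
}


def upcase_only(name: str) -> str:
    """Upcase code point by code point, with no normalization.

    Assumption: str.upper() stands in for an upcase-table lookup;
    it is not the exfat upcase table.
    """
    return "".join(ch.upper() for ch in name)


def nfc_then_upcase(name: str) -> str:
    """Normalize to NFC first (a hypothetical policy), then upcase."""
    return upcase_only(unicodedata.normalize("NFC", name))


def same_name(a: int, b: int, key) -> bool:
    """Compare two of the numbered names under the given folding function."""
    return key(NAMES[a]) == key(NAMES[b])


if __name__ == "__main__":
    for a, b in [(1, 3), (2, 4), (1, 4)]:
        print(a, b,
              "upcase only:", same_name(a, b, upcase_only),
              "NFC + upcase:", same_name(a, b, nfc_then_upcase))
```

With NFC applied first, all four names fold to the same string; without it
they fall into two groups.
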
But with the current exfat implementation, the first and third are treated
as the same name, and the second and fourth are treated as the same name.
Therefore the first and fourth are treated as different filenames, even
though they represent the same grapheme and differ only in that one is
upper case and the other lower case.

To prevent this we need to use some kind of Unicode normalization form
here.

What do you think the kernel's exfat driver should do in this case?

CCing Gabriel, as he was implementing Unicode normalization for the
ext4 driver and may be able to shed some light on the new exfat driver too.

--
Pali Rohár
pali.rohar@xxxxxxxxx