[BUG] UTF-16 names in multiple fs's are handled as UCS-2

"Mingye Wang (Artoria2e5)" <arthur200126@xxxxxxxxx> · Thu, 12 Apr 2018 23:43:29 -0400

Hello,

The Linux kernel supports many filesystems that use some form of UTF-16
for filenames. For 5 of these filesystems, the "UTF-16" is only
implemented as UCS-2, i.e. without surrogate handling. This prevents a
file with SMP (Supplementary Multilingual Plane) characters in the name
(represented in UTF-16 as a pair of surrogates), like "🐧.tar", from
being created, listed, or both. This email serves as a batch bug report.

For many of the affected filesystems, the utf16->8-bit routine loops
over the name and calls "nls->uni2char" or a similar routine once per
code unit. Such handling breaks surrogates, which can only be decoded in
pairs. (The uni2char/char2uni interface is bad to start with -- it only
accepts a single wchar_t, which only goes up to 0xffff.) The
utf16s_to_utf8s routine, by contrast, handles surrogate pairs correctly.
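
To make the pairing requirement concrete, here is a minimal userspace
sketch of a UTF-16 to UTF-8 conversion that consumes surrogates two at a
time. It is not the kernel's utf16s_to_utf8s; the names put_utf8 and
utf16_to_utf8_sketch are made up for illustration, input is assumed to
be in host byte order, and replacing unpaired surrogates with U+FFFD is
an assumption of this sketch rather than kernel behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one code point as UTF-8 and return the number of bytes (1..4). */
static size_t put_utf8(uint32_t cp, uint8_t *out)
{
	if (cp < 0x80) {
		out[0] = (uint8_t)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (uint8_t)(0xC0 | (cp >> 6));
		out[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (uint8_t)(0xE0 | (cp >> 12));
		out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (uint8_t)(0xF0 | (cp >> 18));
	out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (uint8_t)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Hypothetical helper: convert host-order UTF-16 to UTF-8.
 * Returns the number of bytes written, or -1 if out is too small.
 * Assumption of this sketch: unpaired surrogates become U+FFFD.
 */
static long utf16_to_utf8_sketch(const uint16_t *in, size_t inlen,
				 uint8_t *out, size_t outlen)
{
	size_t i = 0, o = 0;

	while (i < inlen) {
		uint32_t cp = in[i++];
		uint8_t tmp[4];
		size_t n, k;

		if (cp >= 0xD800 && cp <= 0xDBFF && i < inlen &&
		    in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
			/* High surrogate followed by low: combine the pair. */
			cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00u);
		} else if (cp >= 0xD800 && cp <= 0xDFFF) {
			/* Lone surrogate: a per-unit loop always ends up here. */
			cp = 0xFFFD;
		}

		n = put_utf8(cp, tmp);
		if (outlen - o < n)
			return -1;
		for (k = 0; k < n; k++)
			out[o++] = tmp[k];
	}
	return (long)o;
}
```

For the name above, the code units D83D DC27 002E 0074 0061 0072 come
out as the 8 bytes F0 9F 90 A7 2E 74 61 72. A loop that hands D83D and
DC27 to a converter one at a time never sees the pair and cannot
produce the F0 9F 90 A7 sequence.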

AFFECTED FILESYSTEMS
--------------------

The following table shows fs drivers affected by a buggy or near-buggy
implementation. Most of the results come from grepping for "uni2char"
and reading the surrounding code; UDF, NTFS, and vfat have been manually
tested. A broken utf16->x8 conversion typically affects listing; a
broken inverse typically affects creation.

FS		UTF16->x8	x8->UTF16	Note
joliet		OK[^1]		N/A	
jfs[^2]		BAD		BAD
hfsplus		BAD		BAD	
udf[^2]		BAD		BAD		bz #199291, ack'ed
ntfs		BAD		BAD		bz #199361
vfat		BAD		OK[^1]	
cifs		OK		OK?[^3]		

  [^1]: utf16s_to_utf8s (or its inverse) is used for utf8, dodging the 
uni2char problem. Currently no other code pages under fs/nls can encode 
any character in SMP.
  [^2]: Filesystem advertises "Unicode" support with 16-bit code units.
The meaning of that term changed from UCS-2 to UTF-16 circa 2000 in
commercial systems such as Microsoft Windows.
  [^3]: Only some routines have surrogate handling.

REPRODUCING THE BUG
-------------------

The most reliable way to reproduce the listing ("dir") bug is to create
a "🐧.txt" in the target filesystem from a known-good OS, such as
post-2000 Microsoft Windows. After that, mount the fs in Linux and run
"ls".

To reproduce the file creation bug, use `touch $'\U1F427.txt'` in bash
(4.2 or later, in a UTF-8 locale).
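
For reference, this is what the creation path has to produce for that
name. The following is a userspace sketch only, not the kernel's
utf8s_to_utf16s: the function name utf8_to_utf16_sketch is made up, the
output is in host byte order, and the input is assumed to be reasonably
well-formed UTF-8 (overlong forms and encoded surrogates are not
rejected).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert UTF-8 to host-order UTF-16 code units.
 * Returns the number of units written, or -1 on a bad lead byte, a bad
 * or missing continuation byte, a code point above U+10FFFF, or a full
 * output buffer.
 */
static long utf8_to_utf16_sketch(const uint8_t *in, size_t inlen,
				 uint16_t *out, size_t outlen)
{
	size_t i = 0, o = 0;

	while (i < inlen) {
		uint8_t b = in[i++];
		uint32_t cp;
		size_t extra, k;

		if (b < 0x80) {
			cp = b;
			extra = 0;
		} else if ((b & 0xE0) == 0xC0) {
			cp = b & 0x1F;
			extra = 1;
		} else if ((b & 0xF0) == 0xE0) {
			cp = b & 0x0F;
			extra = 2;
		} else if ((b & 0xF8) == 0xF0) {
			cp = b & 0x07;
			extra = 3;
		} else {
			return -1;	/* stray continuation or invalid lead */
		}

		if (inlen - i < extra)
			return -1;	/* truncated sequence */
		for (k = 0; k < extra; k++) {
			if ((in[i] & 0xC0) != 0x80)
				return -1;
			cp = (cp << 6) | (in[i++] & 0x3Fu);
		}

		if (cp >= 0x10000) {
			/* SMP and above: must be stored as a surrogate pair. */
			if (cp > 0x10FFFF || outlen - o < 2)
				return -1;
			cp -= 0x10000;
			out[o++] = (uint16_t)(0xD800 | (cp >> 10));
			out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
		} else {
			if (outlen - o < 1)
				return -1;
			out[o++] = (uint16_t)cp;
		}
	}
	return (long)o;
}
```

Fed the 8 bytes of "🐧.txt" (F0 9F 90 A7 2E 74 78 74), this yields the
6 code units D83D DC27 002E 0074 0078 0074. A char2uni-style interface
that returns a single 16-bit value per character has no way to return
the first two.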

FIXING THE BUG
--------------

One way to work around the bug is to special-case utf8, which is for now
the only encoding in fs/nls capable of handling the Unicode SMP at all,
and have it use the specialized nls routines utf16s_to_utf8s and
utf8s_to_utf16s.

The 8/16 NLS routines may not be sufficiently tolerant of malformed
surrogates depending on usage, but we are not introducing WTF-8 today.