Hello,
Linux kernel supports many filesystems that use some form of UTF-16 for
filenames. For 5 of these filesystems, the "UTF-16" is only implemented
as UCS-2, i.e. without surrogate handling. This prevents a file with SMP
characters in the name (represented in UTF-16 as a pair of surrogates),
like "🐧.tar", from being created, listed, or both. This email serves a
batch bug report.
For many of the affected filesystems, a loop is used in the utf16->8-bit
routine to call "nls->uni2char" or a similar routine for each code unit.
Such handling breaks surrogate elements that can only be decoded in
pairs. (The uni2char/char2uni interface is bad to start with -- they
only accept a single wchar_t, which only goes up to 0xffff.) The
utf16s_to_utf8s routine is correct.
AFFECTED FILESYSTEMS
--------------------
The following table shows fs drivers affected by a buggy or near-buggy
implementation. Most of the results come from grepping for "uni2char"
and reading the surrounding code; UDF, NTFS, and vfat has been manually
tested. utf16->x8 typically affects listing; its inverse typically
affects creation.
FS UTF16->x8 x8->UTF16 Note
joliet OK[^1] N/A
jfs[^2] BAD BAD
hfsplus BAD BAD
udf[^2] BAD BAD bz #199291, ack'ed
ntfs BAD BAD bz #199361
vfat BAD OK[^1]
cifs OK OK?[^3]
[^1]: utf16s_to_utf8s (or its inverse) is used for utf8, dodging the
uni2char problem. Currently no other code pages under fs/nls can encode
any character in SMP.
[^2]: Filesystem advertises "Unicode" support with 16-bit code units,
which has undergone a semantic change from UCS-2 to UTF-16 circa 2000 in
commercial systems such as Microsoft Windows.
[^3]: Only some routines have surrogate handling.
REPRODUCING THE BUG
-------------------
The most reliable way to reproduce the "dir" bug is by creating a
"🐧.txt" in the target filesystem in a known-good OS, such as post-2000
Microsoft Windows. After that mount the fs in Linux and do a "ls".
To reproduce the file creation bug, use `touch $'\U1F427.txt'` in bash.
FIXING THE BUG
--------------
A way to go around the bug is by special-casing utf8, which is for now
the only other encoding in fs/nls capable of handling Unicode SMP at
all, to use the specialized nls routines utf16s_to_utf8s & utf8s_to_utf16s.
The 8/16 NLS routines may be not sufficiently tolerant of malformed
surrogates depending on usage, but we are not introducing WTF-8 today.