[BUG] UTF-16 names in multiple fs's are handled as UCS-2

Hello,

The Linux kernel supports many filesystems that use some form of UTF-16 for filenames. For 5 of these filesystems, the "UTF-16" is only implemented as UCS-2, i.e. without surrogate handling. This prevents a file with SMP characters in its name (each represented in UTF-16 as a pair of surrogates), like "🐧.tar", from being created, listed, or both. This email serves as a batch bug report.

For many of the affected filesystems, the utf16->8-bit routine loops over the name and calls "nls->uni2char" or a similar routine on each code unit. This breaks surrogates, which can only be decoded in pairs. (The uni2char/char2uni interface is bad to start with -- it only accepts a single wchar_t, which only goes up to 0xffff.) The utf16s_to_utf8s routine, by contrast, is correct.
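To make the difference concrete, here is a minimal userspace sketch of the two approaches. It is not kernel code: the function names (put_utf8, ucs2_units_to_utf8, utf16_to_utf8) are made up for illustration, and replacing an unconvertible unit with '?' is an assumption about what a typical caller does, not a description of any particular driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (<= 0x10FFFF) as UTF-8; returns bytes written (1..4). */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * UCS-2 style: every 16-bit unit is converted on its own.
 * A surrogate has no meaning alone, so it is replaced with '?'
 * (assumed fallback). "out" must hold at least 3 * n bytes.
 */
size_t ucs2_units_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;

	for (i = 0; i < n; i++) {
		if (in[i] >= 0xD800 && in[i] <= 0xDFFF)
			out[o++] = '?';
		else
			o += put_utf8(in[i], out + o);
	}
	return o;
}

/*
 * UTF-16 style: a high surrogate followed by a low surrogate is
 * combined into one code point before encoding. Unpaired surrogates
 * still become '?'. "out" must hold at least 3 * n bytes.
 */
size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
	size_t i, o = 0;

	for (i = 0; i < n; i++) {
		uint32_t cp = in[i];

		if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
		    in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
			cp = 0x10000 + ((cp - 0xD800) << 10) +
			     (uint32_t)(in[i + 1] - 0xDC00);
			i++;	/* the low surrogate has been consumed */
		} else if (cp >= 0xD800 && cp <= 0xDFFF) {
			cp = '?';
		}
		o += put_utf8(cp, out + o);
	}
	return o;
}
```

Feeding the six code units of "🐧.tar" (0xD83D 0xDC27 '.' 't' 'a' 'r') to utf16_to_utf8 yields the 8 bytes F0 9F 90 A7 2E 74 61 72, while ucs2_units_to_utf8 yields "??.tar".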

AFFECTED FILESYSTEMS
--------------------

The following table shows the fs drivers affected by a buggy or near-buggy implementation. Most of the results come from grepping for "uni2char" and reading the surrounding code; UDF, NTFS, and vfat have been manually tested. A bad utf16->x8 conversion typically affects listing; a bad inverse typically affects creation.

FS        UTF16->x8   x8->UTF16   Note
joliet    OK[^1]      N/A
jfs[^2]   BAD         BAD
hfsplus   BAD         BAD
udf[^2]   BAD         BAD         bz #199291, ack'ed
ntfs      BAD         BAD         bz #199361
vfat      BAD         OK[^1]
cifs      OK          OK?[^3]


[^1]: utf16s_to_utf8s (or its inverse) is used for utf8, dodging the uni2char problem. Currently no other code pages under fs/nls can encode any character in SMP.
[^2]: Filesystem advertises "Unicode" support with 16-bit code units, which has undergone a semantic change from UCS-2 to UTF-16 circa 2000 in commercial systems such as Microsoft Windows.
[^3]: Only some routines have surrogate handling.

REPRODUCING THE BUG
-------------------

The most reliable way to reproduce the "dir" (listing) bug is to create a "🐧.txt" in the target filesystem from a known-good OS, such as post-2000 Microsoft Windows. Then mount the fs in Linux and run "ls".

To reproduce the file creation bug, use `touch $'\U1F427.txt'` in bash.

FIXING THE BUG
--------------

One way to work around the bug is to special-case utf8, which is for now the only other encoding in fs/nls capable of handling the Unicode SMP at all, so that it uses the specialized nls routines utf16s_to_utf8s & utf8s_to_utf16s.
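As a rough illustration of that special-casing, here is a userspace sketch of the creation direction (8-bit name to UTF-16). Everything in it is a stand-in: char2uni_fn, latin1_char2uni, utf8_to_utf16 and name_to_utf16 are hypothetical names, the charset is identified by a plain string, and the real utf8s_to_utf16s routine in the kernel has a different signature. The decoder is also deliberately simple and does not reject overlong forms.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-character hook: one byte in, one 16-bit unit out. */
typedef int (*char2uni_fn)(unsigned char c, uint16_t *uni);

/* Example hook for a single-byte code page (Latin-1 maps 1:1). */
int latin1_char2uni(unsigned char c, uint16_t *uni)
{
	*uni = c;
	return 1;
}

/* Decode one UTF-8 sequence; returns bytes consumed, 0 if malformed. */
static size_t get_utf8(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t len, i;
	uint32_t c = s[0];

	if (c < 0x80) {
		*cp = c;
		return 1;
	} else if ((c & 0xE0) == 0xC0) {
		len = 2; c &= 0x1F;
	} else if ((c & 0xF0) == 0xE0) {
		len = 3; c &= 0x0F;
	} else if ((c & 0xF8) == 0xF0) {
		len = 4; c &= 0x07;
	} else {
		return 0;
	}
	if (len > n)
		return 0;
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c > 0x10FFFF)
		return 0;
	*cp = c;
	return len;
}

/* Whole-string conversion: code points above 0xFFFF become a surrogate pair. */
size_t utf8_to_utf16(const unsigned char *in, size_t n, uint16_t *out)
{
	size_t i = 0, o = 0;

	while (i < n) {
		uint32_t cp;
		size_t used = get_utf8(in + i, n - i, &cp);

		if (used == 0) {	/* malformed byte: substitute and resync */
			out[o++] = '?';
			i++;
			continue;
		}
		i += used;
		if (cp >= 0x10000) {
			cp -= 0x10000;
			out[o++] = (uint16_t)(0xD800 | (cp >> 10));
			out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
		} else {
			out[o++] = (uint16_t)cp;
		}
	}
	return o;
}

/*
 * The special case: utf8 takes the whole-string path, every other
 * charset keeps the per-character loop. "out" must hold n units.
 */
size_t name_to_utf16(const char *charset, char2uni_fn fn,
		     const unsigned char *in, size_t n, uint16_t *out)
{
	size_t i, o = 0;

	if (strcmp(charset, "utf8") == 0)
		return utf8_to_utf16(in, n, out);

	for (i = 0; i < n; i++) {
		if (fn(in[i], &out[o]) <= 0)
			out[o] = '?';
		o++;
	}
	return o;
}
```

With charset "utf8", the 8-byte name "🐧.txt" comes out as 6 code units starting with the pair 0xD83D 0xDC27; with a single-byte charset the same bytes are mapped one by one, which is the existing behaviour and is left untouched.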

The 8/16 NLS routines may not be sufficiently tolerant of malformed surrogates depending on usage, but we are not introducing WTF-8 today.



