Unification of filesystem encoding options

Pali Rohár <pali.rohar@xxxxxxxxx> · Thu, 2 Jan 2020 22:18:55 +0100

Hello!

While reading a new patch series for the exfat filesystem driver, I
saw the proposed implementation for converting exfat's UTF-16LE
filenames for userspace, so I decided to investigate what the
filesystems that are already part of the Linux kernel are doing.

I looked at the filesystems supported by the Linux kernel which do not
store filenames as a sequence of octets, but rather expect the on-disk
filenames to be in some specific encoding.

Below is a list of these filesystems with their native encodings:

befs     UTF-8
cifs     UTF-16LE
msdos    unspecified OEM codepage
vfat     unspecified OEM codepage or UTF-16LE
hfs      octets
hfsplus  UTF-16BE-NFD-Apple
isofs    octets or UTF-16BE
jfs      UTF-16LE
ntfs     UTF-16LE
udf      Latin1 or UTF-16BE

Filesystems msdos, vfat, hfs and isofs are bogus, as their on-disk
structures do not say in which encoding a filename is stored. For vfat
and isofs there is information on whether it is UTF-16 (LE for vfat,
BE for isofs) or some unspecified encoding. A user who accesses such a
filesystem must know which encoding was used when the data was stored
on it. For this purpose vfat and hfs have the mount option
codepage=<codepage>.
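
To illustrate why the user has to know the codepage, here is a small
Python sketch (not kernel code). The raw name is made up, and Python
codec names only stand in for what codepage=<codepage> would select:

```python
# Sketch only: Python codecs stand in for the codepage=<codepage> mount
# option, and the raw bytes below are made up for illustration.
RAW_NAME = b"CAF\x9b.TXT"  # 8-bit name as it might sit in a directory entry


def decode_name(raw: bytes, codepage: str) -> str:
    """Interpret the on-disk bytes under the given OEM codepage."""
    return raw.decode(codepage)


for codepage in ("cp437", "cp850", "cp852"):
    ch = decode_name(RAW_NAME, codepage)[3]
    # Print the code point rather than the glyph to keep the output ASCII.
    print(f"{codepage}: byte 0x9b -> U+{ord(ch):04X}")
```

The same byte comes out as a different character depending on the
codepage, and nothing on disk says which one is right.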

All other filesystems define the encoding of filenames in their
structures, either implicitly (hfsplus is always UTF-16BE with Apple's
modified NFD normalization) or explicitly (UDF has a byte which says
whether a filename is in Latin1 or in UTF-16BE).
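
As a rough illustration of how much the native encodings differ, here
is what one short name looks like in three of them. This is a Python
sketch, and standard NFD is used only as an approximation of Apple's
modified NFD, which is not identical:

```python
import unicodedata

# "\u00e9" is a precomposed e-acute; the name itself is just an example.
name = "\u00e9.txt"

# UTF-16LE, as listed above for cifs, jfs, ntfs and vfat long names.
utf16le = name.encode("utf-16-le")

# UTF-16BE, as listed above for isofs and udf.
utf16be = name.encode("utf-16-be")

# hfsplus: UTF-16BE after decomposition. Standard NFD is an assumption
# here; Apple's variant leaves some code point ranges untouched.
nfd_utf16be = unicodedata.normalize("NFD", name).encode("utf-16-be")

for label, data in (("UTF-16LE", utf16le),
                    ("UTF-16BE", utf16be),
                    ("UTF-16BE NFD", nfd_utf16be)):
    print(f"{label:13} {data.hex(' ')}")
```

The decomposed form is one 16-bit unit longer, because the accent
becomes a separate combining character (U+0301).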

As UTF-16(LE|BE) buffers cannot be passed via null-terminated strings
in any VFS syscall, the Linux kernel translates these Unicode filenames
to some charset. This is controlled by various mount options. I looked
at which mount options are understood by our Linux filesystem
implementations. In all following paragraphs, by filesystem I mean the
Linux driver implementation (and not the structure on disk), so do not
be confused.
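
The reason UTF-16 cannot go through a null-terminated string can be
shown in a few lines of Python. c_string_view() is a hypothetical
helper that mimics what C code sees when it stops at the first NUL
byte:

```python
def c_string_view(buf: bytes) -> bytes:
    """Return what a NUL-terminated C string reader would see in buf."""
    return buf.split(b"\x00", 1)[0]


name = "readme.txt"
utf16 = name.encode("utf-16-le")
utf8 = name.encode("utf-8")

# Every ASCII character in UTF-16LE is followed by a zero byte, so the
# string appears to end after the first character.
print(c_string_view(utf16))

# UTF-8 never produces a zero byte for a non-NUL character, so the
# whole name survives.
print(c_string_view(utf8))
```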

Below is the table:

befs     iocharset=<charset>
cifs     iocharset=<charset>
msdos    (unsupported)
vfat     utf8=0|no|false|1|yes|true OR utf8 OR iocharset=<charset>
hfs      iocharset=<charset>
hfsplus  nls=<charset>
isofs    iocharset=<charset> OR utf8
jfs      iocharset=<charset>
ntfs     nls=<charset> OR iocharset=<charset> OR utf8
udf      utf8 OR iocharset=<charset>

Filesystem msdos does not support specifying the OEM codepage encoding.
It passes the 8-bit buffer through to userspace and expects userspace
to know the correct OEM codepage. There is no support for re-encoding
it to UTF-8 (or any other charset). The same applies to isofs when the
Joliet structure is not stored on the filesystem.

Filesystem vfat has the most obscure way of specifying the charset.
Details are in the mount(8) manual page. What is important: the option
iocharset=utf8 is buggy and may break filesystem consistency (it allows
creating two directory entries which differ only in case, which is not
allowed by the FAT specification). Due to this problem there is a fix,
the mount option utf8=1 (or utf8=yes or utf8=true or just utf8), which
does what you would expect from iocharset=utf8 if it were not buggy.
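
The consistency problem can be modelled like this. It is a toy Python
sketch, not the vfat driver's logic: can_create() is a hypothetical
helper, and str.casefold() is only a stand-in for whatever case
mapping the driver actually uses:

```python
from typing import Iterable


def can_create(existing: Iterable[str], new_name: str,
               case_sensitive: bool) -> bool:
    """Decide whether new_name may be added to a directory listing."""
    if case_sensitive:
        # Byte-for-byte comparison: names differing only in case are
        # treated as different files.
        return new_name not in set(existing)
    # Case-insensitive comparison, which is what FAT requires.
    folded = {name.casefold() for name in existing}
    return new_name.casefold() not in folded


directory = ["Readme.txt"]

# A case-sensitive check lets the second entry in, which violates FAT.
print(can_create(directory, "README.TXT", case_sensitive=True))

# A case-insensitive check rejects it.
print(can_create(directory, "README.TXT", case_sensitive=False))
```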

Filesystem ntfs has the option iocharset=<charset>, which is just an
alias for nls=<charset> and is marked as deprecated. The same applies
to the option utf8, which is just an alias for nls=utf8.

Filesystems isofs and udf have two ways to specify UTF-8 encoding: the
utf8 mount option and the iocharset=utf8 option. It looks like there is
only one difference: iocharset=utf8 supports only Unicode code points
up to U+FFFF (limited to 3-byte UTF-8 sequences, like the utf8/utf8mb3
encoding in MySQL/MariaDB), while the utf8 option also supports code
points above U+FFFF, so full Unicode and not just a limited subset.
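
The U+FFFF boundary corresponds exactly to the step from 3-byte to
4-byte UTF-8 sequences. A quick Python check, where both helper names
are hypothetical and only for illustration:

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes the code point occupies in UTF-8."""
    return len(chr(code_point).encode("utf-8"))


def fits_three_byte_utf8(name: str) -> bool:
    """True if every character in name is at or below U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in name)


for cp in (0x7F, 0x7FF, 0xFFFF, 0x10000, 0x1F600):
    print(f"U+{cp:04X}: {utf8_len(cp)} byte(s)")

# A name with a character above U+FFFF would not survive a conversion
# limited to 3-byte sequences.
print(fits_three_byte_utf8("photo_\U0001F600.jpg"))
```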

Filesystem cifs in UTF-8 mode (via iocharset=utf8) always supports code
points above U+FFFF. But the remaining filesystems befs, hfs, hfsplus,
jfs and ntfs seem to support only Unicode code points up to U+FFFF, so
effectively they do not support UTF-16, just UCS-2. This limitation
comes from the kernel NLS table framework/API, which is limited to
16-bit integers, so the maximal Unicode code point is U+FFFF.
Filesystems cifs, isofs, udf and vfat have their own special code to
work with surrogate pairs and do not use the limited NLS table
functions. There are also the functions utf8s_to_utf16s() and
utf16s_to_utf8s() for this purpose.
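
For reference, this is the UTF-16 surrogate pair arithmetic that a
16-bit-per-character API cannot express. It is a Python sketch of the
standard algorithm, not of the kernel functions named above, and the
helper names are made up:

```python
import struct


def fits_in_16_bits(code_point: int) -> bool:
    """True if the code point can be held in one 16-bit integer."""
    return code_point <= 0xFFFF


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    value = code_point - 0x10000          # 20 bits remain
    high = 0xD800 | (value >> 10)         # top 10 bits
    low = 0xDC00 | (value & 0x3FF)        # bottom 10 bits
    return high, low


cp = 0x1F600
high, low = to_surrogate_pair(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16BE encoder.
assert struct.pack(">HH", high, low) == chr(cp).encode("utf-16-be")
```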

And here are the improvements I see for all of the above filesystems:

1) Unify mount options for specifying charset.

Currently all filesystems except msdos and hfsplus have the mount
option iocharset=<charset>. hfsplus has nls=<charset> and msdos does
not implement re-encoding support. In addition, vfat, udf and isofs
have a broken iocharset=utf8 option (but a working utf8 option), and
ntfs has a deprecated iocharset=<charset> option.

I would suggest the following changes for unification:

* Add a new alias iocharset= for hfsplus which would do the same as nls=
* Make the iocharset=utf8 option for vfat, udf and isofs do the same as utf8
* Un-deprecate the iocharset=<charset> option for ntfs

As a result, all filesystems would have an iocharset=<charset> option
which works for any charset, including iocharset=utf8. It would also
fix the broken iocharset=utf8 for vfat, udf and isofs.
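
To make the intended end state concrete, here is a Python model of how
the existing spellings would collapse into one charset value. It is
only a sketch of the proposal: requested_charset() is a hypothetical
helper, and the real option parsing is of course per-filesystem C code:

```python
from typing import Optional

# Values accepted by vfat's utf8= option for "enabled", per the table.
TRUE_VALUES = {"1", "yes", "true"}


def requested_charset(mount_options: str) -> Optional[str]:
    """Return the charset selected by a comma-separated option string.

    Treats nls= as an alias of iocharset= and a bare or true utf8
    option as iocharset=utf8. The last matching option wins.
    """
    charset = None
    for opt in mount_options.split(","):
        key, _, value = opt.partition("=")
        if key in ("iocharset", "nls") and value:
            charset = value
        elif key == "utf8" and (value == "" or value in TRUE_VALUES):
            charset = "utf8"
    return charset


for opts in ("rw,nls=utf8", "utf8", "utf8=yes", "iocharset=iso8859-2"):
    print(f"{opts!r:24} -> {requested_charset(opts)}")
```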

2) Add support for Unicode code points above U+FFFF to filesystems
befs, hfs, hfsplus, jfs and ntfs, so that the iocharset=utf8 option
would also work with userspace filenames containing 4-byte UTF-8
sequences.

3) Add support for the iocharset= and codepage= options to the msdos
filesystem. It shares a lot of code with the vfat driver.

What do you think about these improvements? The first improvement
should be relatively simple, and if we agree that this unification of
the mount option iocharset= makes sense, I could do it.

-- 
Pali Rohár
pali.rohar@xxxxxxxxx