Re: Unification of filesystem encoding options

On Tuesday 07 January 2020 14:32:33 Jan Kara wrote:
> On Thu 02-01-20 22:18:55, Pali Rohár wrote:
> > 1) Unify mount options for specifying charset.
> > 
> > Currently all filesystems except msdos and hfsplus have mount option
> > iocharset=<charset>. hfsplus has nls=<charset> and msdos does not
> > implement re-encoding support. Plus vfat, udf and isofs have broken
> > iocharset=utf8 option (but working utf8 option) And ntfs has deprecated
> > iocharset=<charset> option.
> > 
> > I would suggest following changes for unification:
> > 
> > * Add a new alias iocharset= for hfsplus which would do same as nls=
> > * Make iocharset=utf8 option for vfat, udf and isofs to do same as utf8
> > * Un-deprecate iocharset=<charset> option for ntfs
> > 
> > This would cause that all filesystems would have iocharset=<charset>
> > option which would work for any charset, including iocharset=utf8.
> > And it would fix also broken iocharset=utf8 for vfat, udf and isofs.
> 
> Makes sense to me.

Ok!

> > 2) Add support for Unicode code points above U+FFFF for filesystems
> > befs, hfs, hfsplus, jfs and ntfs, so iocharset=utf8 option would work
> > also with filenames in userspace which would be 4 bytes long UTF-8.
> 
> Also looks good but when doing this, I'd suggest we extend NLS to support
> full UTF-8 rather than implementing it by hand like e.g. we did for UDF.

The current kernel NLS API supports upper-case / lower-case conversion
only for single-byte encodings, so there is no case-insensitive support
for the UTF-8 encoding. And for Unicode conversion it supports only
UCS-2, i.e. code points up to U+FFFF, which for UTF-8 means sequences
of at most 3 bytes.

This really cannot be fixed without rewriting the existing filesystems
which use the NLS API.
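
As an illustration of the UCS-2 limit described above, here is a minimal
userspace sketch (not kernel code; the function names are made up for
this example) that computes how many UTF-8 bytes a code point needs and
whether it fits into one 16-bit UCS-2 unit:

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes needed to encode a code point
 * as UTF-8, or 0 if the value is not a valid Unicode scalar value. */
int utf8_encoded_len(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not scalar values */
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;               /* last range reachable through UCS-2 */
    if (cp <= 0x10FFFF)
        return 4;               /* needs more than one 16-bit unit */
    return 0;
}

/* Hypothetical helper: 1 if the code point fits into one UCS-2 unit. */
int fits_in_ucs2(uint32_t cp)
{
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

Every code point that fits into UCS-2 needs at most 3 bytes in UTF-8,
and every 4-byte UTF-8 sequence encodes a code point above U+FFFF.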

One hacky option would be to extend the NLS API from UCS-2 to UTF-16 and
fix all users of the NLS API to expect UTF-16 surrogate pairs.
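
For reference, this is roughly what every caller would have to be
prepared for. A minimal userspace sketch of UTF-16 encoding (the
function name is hypothetical, it is not an existing NLS call):

```c
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if the code point is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;          /* same value as in UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Code which assumes "one 16-bit unit is one character" keeps working for
Plane 0 but silently mishandles the two-unit case.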

But I dislike UTF-16 and would rather use unicode_t (UTF-32), which is
already present in the kernel. But because existing filesystem drivers
pass their UCS-2/UTF-16 buffers from the FS to the NLS API, it is not
easy to change the whole NLS API from UCS-2 to UTF-32.

And this change still does not add support for case-insensitivity, so
it is useless for all MS filesystems (msdos, vfat, ntfs), which are the
majority.

The kernel already provides functions for converting between UTF-8 and
UTF-16, so this seems to be the easiest way to provide full UTF-8
support for filesystems which internally use UTF-16, similar to how it
is implemented in UDF.
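
I believe the in-kernel helpers meant here are the ones like
utf8s_to_utf16s() and utf16s_to_utf8s() declared in include/linux/nls.h,
but check the tree for exact signatures. The following is only a
userspace sketch of the same idea with hypothetical names, not the
kernel implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence.
 * Returns the number of bytes consumed, or 0 on invalid input. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or bad lead */
    }
    if (len < n)
        return 0;                       /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates and out-of-range values */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}

/* Hypothetical helper: convert a UTF-8 buffer to UTF-16 units.
 * Returns the number of units written, or -1 on error/overflow. */
long utf8_to_utf16(const unsigned char *in, size_t inlen,
                   uint16_t *out, size_t outmax)
{
    size_t used = 0;

    while (inlen > 0) {
        uint32_t cp;
        size_t n = utf8_decode(in, inlen, &cp);

        if (n == 0)
            return -1;
        if (cp <= 0xFFFF) {
            if (used + 1 > outmax)
                return -1;
            out[used++] = (uint16_t)cp;
        } else {
            if (used + 2 > outmax)
                return -1;
            cp -= 0x10000;
            out[used++] = (uint16_t)(0xD800 | (cp >> 10));
            out[used++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
        in += n;
        inlen -= n;
    }
    return (long)used;
}
```

The point is that a 4-byte UTF-8 sequence maps directly to a surrogate
pair, without going through a per-character UCS-2 interface.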

Moreover, all NLS encodings except UTF-8 are single-byte encodings and
map into Plane 0, so they can be represented by the currently used UCS-2
encoding. Therefore conversion to Unicode works correctly for them, and
so do their case-insensitivity functions (or rather tables).
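
To show why per-byte case tables are fine for single-byte charsets but
cannot work for UTF-8, here is a small userspace sketch. All names are
hypothetical, and the Latin-1 folding rule is just an example charset,
not a copy of any kernel table:

```c
#include <stddef.h>

/* Hypothetical per-byte fold for ISO-8859-1: one byte is one character,
 * so a byte-indexed mapping can lower-case every letter. */
unsigned char latin1_fold(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)    /* 0xD7 is the times sign */
        return (unsigned char)(c + 0x20);
    return c;
}

/* Hypothetical per-byte fold for UTF-8 data: bytes >= 0x80 are only
 * fragments of a character, so only ASCII can be folded safely. */
unsigned char ascii_fold(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 0x20);
    return c;
}

/* Compare two buffers of equal length byte by byte after folding. */
int equal_folded(const unsigned char *a, const unsigned char *b, size_t len,
                 unsigned char (*fold)(unsigned char))
{
    size_t i;

    for (i = 0; i < len; i++)
        if (fold(a[i]) != fold(b[i]))
            return 0;
    return 1;
}
```

In Latin-1, upper and lower case E with acute are the bytes 0xC9 and
0xE9, and the table matches them. In UTF-8 they are C3 89 and C3 A9, and
no per-byte mapping can know that 0x89 and 0xA9 belong together only
after a 0xC3 lead byte.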

Adding support for case-insensitivity to the UTF-8 NLS encoding would
mean creating a completely new kernel NLS API (one which supports
variable-length encodings) and rewriting all NLS filesystems to use this
new API. All existing NLS encodings would also need to be ported to this
new API.

Is this really something which has value? Just because of UTF-8?

To me it looks like the better option would be to remove the UTF-8 NLS
encoding, as it is broken. Some filesystems already do not use the NLS
API for their UTF-8 support (e.g. vfat, udf or the newly prepared
exfat), and others could be modified/extended/fixed in a similar way.

> > 3) Add support for iocharset= and codepage= options for msdos
> > filesystem. It shares a lot of its code with the vfat driver.
> 
> I guess this is for msdos filesystem maintainers to decide.

Yes!

-- 
Pali Rohár
pali.rohar@xxxxxxxxx
