Re: [PATCH 00/20] EXT4 encoding support

Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx> · Tue, 17 Jul 2018 20:27:28 -0400

"Theodore Y. Ts'o" <tytso@xxxxxxx> writes:

> So maybe we need to talk about is having a feature called
> EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in
> the superblock.  One is the encoding identifier (and 8 or 16 bits is
> probably *plenty*), and the other is the "encoding flags" field.

The current patchset makes encoding an INCOMPAT feature, but I'm
using 32 bits for the encoding identifier.  I will change it to 16 bits
in the next iteration of the patch.
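
As a rough illustration of how small the pair of fields would be, here
is a userspace sketch in Python.  This is not the patch itself: the
helper names, the field order and the example values are all made up
for illustration.  The only thing taken from ext4 is that on-disk
fields are little-endian.

```python
import struct

# Hypothetical layout: 16-bit encoding identifier followed by 16-bit
# encoding flags, both little-endian like other ext4 on-disk fields.
ENCODING_FMT = "<HH"

def pack_encoding(encoding_id: int, flags: int) -> bytes:
    """Serialize the two fields, rejecting values that do not fit in 16 bits."""
    for value in (encoding_id, flags):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("value does not fit in 16 bits: %r" % (value,))
    return struct.pack(ENCODING_FMT, encoding_id, flags)

def unpack_encoding(raw: bytes) -> tuple:
    """Return (encoding_id, flags) from the 4 raw bytes."""
    return struct.unpack(ENCODING_FMT, raw)
```

The two fields together take 4 bytes, the same space the 32-bit
identifier alone takes in the current patchset.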

> Some of these flags might specify an encoding --- e.g., the file
> system supports normalization- and/or case- insensitive lookups in an
> efficient way by normalizing the string before calculating the
> dir_index hash.  Some of these might specify the default behavior
> (e.g., case-insensitive or normalization-insensitive) file lookups if
> not overridden by a mount option.

I like the idea of encoding flags for selecting the default for
case/normalization sensitivity.  But I'm not really sure about a flag
stating support for normalized hashes.  It could be made redundant with
the feature/casefold flag itself, if we make tune2fs or similar rehash
the disk when enabling/disabling the encoding feature flag.

Feature flag is set  ->  Hash(normalization(x))
Feature flag and parent inode casefold flag are set  ->  Hash(casefold(x))

The casefold superblock flag would state whether the casefold inode
flag defaults to true or false.
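
To make the selection above concrete, here is a rough userspace sketch
in Python, not the kernel code.  The names hash_key(), name_hash(),
feature_enabled and dir_casefold are made up for illustration.  It
assumes NFKD as the normalization, uses Python's str.casefold() as a
stand-in for whatever casefold table we end up with, and uses CRC32 as
a stand-in for the real dir_index hash.

```python
import unicodedata
import zlib

def normalize(name: str) -> str:
    # Assumption: NFKD is the normalization form in use.
    return unicodedata.normalize("NFKD", name)

def casefold(name: str) -> str:
    # Normalize, fold case, then normalize again so that case folding
    # cannot leave the string in a non-normalized form.
    return normalize(normalize(name).casefold())

def hash_key(name: str, feature_enabled: bool, dir_casefold: bool) -> str:
    """Pick the string that gets hashed for a directory entry."""
    if not feature_enabled:
        return name                 # legacy behavior: hash the raw name
    if dir_casefold:
        return casefold(name)       # feature + parent casefold flag
    return normalize(name)          # feature only

def name_hash(name: str, feature_enabled: bool, dir_casefold: bool) -> int:
    # CRC32 is only a placeholder for the dir_index hash function.
    key = hash_key(name, feature_enabled, dir_casefold)
    return zlib.crc32(key.encode("utf-8"))
```

With this, a precomposed "é" and "e" followed by a combining acute
accent hash identically whenever the feature is enabled, and names that
differ only in case hash identically only inside a casefold directory.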

> This assumes that normalization and case sensitivity are completely
> orthogonal.

I'm thinking of casefolding as a special case of the normalization
problem, just because its semantics are interesting for users.  In fact,
it could be seen as just a different normalization function, from the
implementation point of view.

So, it is not completely orthogonal per se, but it deserves some
special attention to be more useful, like being per-directory and
carrying its own activation flags.

> The other thing is there seems to be some debate (and Apple isn't even
> consistent over time) over what kind of normalization is considered
> "best" or "correct".  e.g., NFD, NFC, NFKD, NFKC.  And if you want to
> export the file system over APFS, it might make a difference which one
> you use.  (This is usually the point where some people will assert
> that teaching everyone in the world English really *would* be easier
> than supporting full I18N.  :-) Is this something we can or should
> consider when deciding what we want to support in Linux long-term?

Since the implementation is normalization-preserving on-disk, isn't this
something that can be changed in the future if it is ever needed?
Provided we can rehash the dentries if we need to change the
normalization, a flag in the superblock, stating what normalization
method is used, should suffice if we ever want to support other
normalization methods.  I have to say, it is not in my plans to support
anything other than NFKD. :)
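
A toy model of what I mean by normalization-preserving, again in Python
and again only a sketch: the Directory class and its methods are
hypothetical and a dict stands in for the hashed directory index.  The
name is stored exactly as the user created it, only the lookup key is
normalized, so switching normalization is a matter of recomputing keys.

```python
import unicodedata

class Directory:
    """Toy directory: original names are kept, lookup keys are normalized."""

    def __init__(self, form: str = "NFKD"):
        self.form = form        # normalization recorded "in the superblock"
        self._entries = {}      # normalized key -> name exactly as created

    def _key(self, name: str) -> str:
        return unicodedata.normalize(self.form, name)

    def create(self, name: str) -> None:
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name: str):
        """Return the stored name equivalent to `name`, or None."""
        return self._entries.get(self._key(name))

    def rehash(self, form: str) -> None:
        """Recompute every key under a different normalization form."""
        rebuilt = {}
        for original in self._entries.values():
            key = unicodedata.normalize(form, original)
            if key in rebuilt:
                # Moving to a coarser form can make two existing names
                # equivalent; a real tool would need a policy for this.
                raise FileExistsError(original)
            rebuilt[key] = original
        self.form = form
        self._entries = rebuilt
```

Since nothing about the stored names changes, a rehash from NFKD to NFC
only affects which lookups match: a name created with the "ﬁ" ligature
is found by "fi" under NFKD but not under NFC.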

> ... and what I'm really asking is do we really want to be specifying
> whether or not normalization is a Thing as a property of the encoding,
> or a property of the file system (or object, or document) that uses
> that particular encoding?

I see normalization as an inherent property of the encoding, since, for
the user, equivalent strings should mean the same thing in the natural
language.  But I see the point of filesystems wanting to ignore
normalization.  I am leaning towards the permissive route, where this
can be enabled/disabled when loading an NLS charset table.  This way we
can merge utf8 and utf8n, and satisfy the normalization case, while
keeping compatibility with older users.  What do you think?

-- 
Gabriel Krisman Bertazi