"Theodore Y. Ts'o" <tytso@xxxxxxx> writes: > So maybe we need to talk about is having a feature called > EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in > the superblock. One is the encoding identifier (and 8 or 16 bits is > probably *plenty*), and the other is the "encoding flags" field. The current patchset makes encoding an INCOMPAT feature, but I'm using 32 bits for the encoding identifier. I will change it to 16 bits in the next iteration of the patch. > Some of these flags might specify an encoding --- e.g., the file > system supports normalization- and/or case- insensitive lookups in an > efficient way by normalizing the string before calculating the > dir_index hash. Some of these might specify the default behavior > (e.g., case-insensitive or normalization-insensitive) file lookups if > not overridden by a mount option. I like the idea of encoding flags for selecting the default for case/normalization -sensitiveness. But I'm not really sure about a flag stating support for normalized hashes. It could be made redundant with the feature/casefold flag itself, if we make tune2fs or similar rehash the disk when enabling/disabling the encoding feature flag. Feature flag is set -> Hash(normalization(x)) Feature flag and parent inode casefold flag are set -> Hash(casefold(x)) The casefold superblock flag would state whether the casefold inode flags defaults to true or false. > This assumes that normalization and case sensitivity are completely > orthogonal. I'm thinking of casefolding as a special case of the normalization problem, just because its semantics are interesting for users. In fact, it could be seen as just a different normalization function, from the implementation point of view. So, it is not completely orthogonal per-se, but it also deserves some special stuff attention be more useful, like being per-directory, and to carrying its on activation flags. 

> The other thing is there seems to be some debate (and Apple isn't even
> consistent over time) over what kind of normalization is considered
> "best" or "correct". e.g., NFD, NFC, NFKD, NFKC. And if you want to
> export the file system over APFS, it might make a difference which one
> you use. (This is usually the point where some people will assert
> that teaching everyone in the world English really *would* be easier
> than supporting full I18N. :-) Is this something we can or should
> consider when deciding what we want to support in Linux long-term?

Since the implementation is normalization-preserving on disk, isn't
this something that can be changed in the future if it is ever needed?
Provided we can rehash the dentries when the normalization changes, a
flag in the superblock stating which normalization method is in use
should suffice if we ever want to support other methods. I have to
say, it is not in my plans to support anything other than NFKD. :)

> ... and what I'm really asking is do we really want to be specifying
> whether or not normalization is a Thing as a property of the encoding,
> or a property of the file system (or object, or document) that uses
> that particular encoding?

I see normalization as an inherent property of the encoding since, for
the user, equivalent strings should mean the same thing in the natural
language. But I see the point of filesystems wanting to ignore
normalization. I am leaning towards the permissive route, where
normalization can be enabled or disabled when loading an NLS charset
table. This way we can merge utf8 and utf8n and satisfy the
normalization case, while keeping compatibility with older users.

What do you think?

-- 
Gabriel Krisman Bertazi