On Thu, Jul 12, 2018 at 01:16:15PM -0400, Gabriel Krisman Bertazi wrote:
> "Theodore Y. Ts'o" <tytso@xxxxxxx> writes:
>
> > On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote:
> >> Since not every NLS tables support normalization operations, we limit
> >> which encodings can be used by an ext4 volume.  Right now, ascii and
> >> utf8n are supported, utf8n being a new version of the utf8 charset, but
> >> with normalization support using the SGI patches, which are part of this
> >> patchset.
> >
> > Why do we need to have to distinguish between utf8n vs utf8?  Why
> > can't we just add normalization to existing utf8 character set?  What
> > would break?
>
> The reason I made it separate charsets is that if we ever decide to
> support normalization on filesystems that already implement some support
> for utf8 already (fat, for instance), we don't want to change the
> behavior of existing disks, where strings wouldn't be normalized, since
> that would be an ABI breakage.  By separating the non-normalized and
> normalized version of the charset, we let the user decide, or at least
> the superblock inform whether the disk wants normalization or not by
> setting the right charset.

Hmm, so there's a philosophical question hiding here, I think.  Does a
file system which is encoding aware have to do normalization?  Or more
generally what does it *mean* for a file system to be encoding aware?

These are all things that a file system could do given that it is
encoding aware and the file system is declared to be using a
particular encoding:

A) Filenames that are "invalid" with respect to an encoding are rejected
B) Filenames are normalized before they are stored in the directory
C) Filenames are compared in a normalization-insensitive manner
D) Filenames are forced to a case before they are stored in a directory
E) Filenames are compared in a case-insensitive manner

Some of these behaviors are orthogonal; that is, you could do A, or
you could do C, or you could do both, or you could do neither.  And
some of these behaviors can be format-dependent (e.g., you can't
change an encoding without running some kind of off-line fsck-like
program across the entire file system); and some of them are not
format-dependent (and so could be overridden by a mount option).

So maybe what we need to talk about is having a feature called
EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in the
superblock.  One is the encoding identifier (and 8 or 16 bits is
probably *plenty*), and the other is the "encoding flags" field.

Some of these flags might specify an encoding --- e.g., the file
system supports normalization- and/or case-insensitive lookups in an
efficient way by normalizing the string before calculating the
dir_index hash.  Some of these might specify the default behavior of
file lookups (e.g., case-insensitive or normalization-insensitive) if
not overridden by a mount option.

This assumes that normalization and case sensitivity are completely
orthogonal.  The other thing is there seems to be some debate (and
Apple isn't even consistent over time) over what kind of normalization
is considered "best" or "correct", e.g., NFD, NFC, NFKD, NFKC.  And if
you want to export the file system over APFS, it might make a
difference which one you use.

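To make the orthogonality of (C) and (E) a bit more concrete, here is a
rough userspace sketch in Python.  It is not kernel code and not a
proposed interface: the flag names and values are made up for
illustration, and it simply leans on Python's unicodedata module to
stand in for whatever normalization tables the kernel would use.

```python
import unicodedata
from enum import IntFlag


class EncodingFlags(IntFlag):
    """Hypothetical "encoding flags" bits; names and values are illustrative only."""
    NORMALIZE_COMPARE = 0x1  # behavior (C): normalization-insensitive lookups
    CASEFOLD_COMPARE = 0x2   # behavior (E): case-insensitive lookups


def comparison_key(name: str, flags: EncodingFlags, form: str = "NFD") -> str:
    """Return the string two filenames are compared on.

    Each flag is applied independently, so any combination is possible.
    Case folding runs first so that the result is still normalized when
    both flags are set.
    """
    key = name
    if flags & EncodingFlags.CASEFOLD_COMPARE:
        key = key.casefold()
    if flags & EncodingFlags.NORMALIZE_COMPARE:
        key = unicodedata.normalize(form, key)
    return key


def all_forms(name: str) -> dict:
    """Show how the four normalization forms can disagree on one name."""
    return {form: unicodedata.normalize(form, name)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


# The same visible name, once with a precomposed e-acute (U+00E9) and
# once as 'e' followed by a combining acute accent (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"
```

With no flags set the two spellings of the name are different directory
entries; with only NORMALIZE_COMPARE they collide but "CAF\u00c9" stays
distinct; with only CASEFOLD_COMPARE the opposite happens.  And
all_forms() on a name containing a compatibility character such as the
"fi" ligature (U+FB01) gives different answers for NFC and NFKC, which
is the sort of thing that makes the choice of form matter.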
(This is usually the point where some people will assert that teaching
everyone in the world English really *would* be easier than supporting
full I18N.  :-)

Is this something we can or should consider when deciding what we want
to support in Linux long-term?

> If, for some reason, this is not a problem in this case, I can change it
> in the next iteration, to merge utf8n and utf8, and also allow other
> charsets.

... and what I'm really asking is do we really want to be specifying
whether or not normalization is a Thing as a property of the encoding,
or a property of the file system (or object, or document) that uses
that particular encoding?

						- Ted