Re: [PATCH 00/20] EXT4 encoding support

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Thu, 12 Jul 2018 17:40:35 -0400

On Thu, Jul 12, 2018 at 01:16:15PM -0400, Gabriel Krisman Bertazi wrote:
> "Theodore Y. Ts'o" <tytso@xxxxxxx> writes:
> 
> > On Tue, Jul 03, 2018 at 01:06:40PM -0400, Gabriel Krisman Bertazi wrote:
> >> Since not every NLS table supports normalization operations, we limit
> >> which encodings can be used by an ext4 volume.  Right now, ascii and
> >> utf8n are supported, utf8n being a new version of the utf8 charset, but
> >> with normalization support using the SGI patches, which are part of this
> >> patchset.
> >
> > Why do we need to distinguish between utf8n and utf8?  Why
> > can't we just add normalization to the existing utf8 character set?
> > What would break?
> 
> The reason I made them separate charsets is that if we ever decide to
> support normalization on filesystems that already implement some
> support for utf8 (fat, for instance), we don't want to change the
> behavior of existing disks, where strings wouldn't be normalized, since
> that would be an ABI breakage.  By separating the non-normalized and
> normalized versions of the charset, we let the user decide, or at least
> let the superblock indicate, whether the disk wants normalization or
> not by setting the right charset.

Hmm, so there's a philosophical question hiding here, I think.  Does a
file system which is encoding aware have to do normalization?  Or more
generally what does it *mean* for a file system to be encoding aware?

Here are the things that a file system could do, given that it is
encoding aware and has been declared to be using a particular
encoding:

A) Filenames that are "invalid" with respect to an encoding are rejected
B) Filenames are normalized before they are stored in the directory
C) Filenames are compared in a normalization-insensitive manner
D) Filenames are forced to a case before they are stored in a directory
E) Filenames are compared in a case-insensitive manner

Some of these behaviors are orthogonal; that is, you could do A, or
you could do C, or you could do both, or you could do neither.  And
some of these behaviors can be format-dependent (e.g., you can't
change an encoding without running some kind of off-line fsck-like
program across the entire file system); and some of them are not
format-dependent (and so could be overridden by a mount option).
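
For concreteness, here is a minimal userspace sketch of behaviors A
through E in Python.  It is an illustration only, not kernel code: it
assumes UTF-8 as the declared encoding, picks NFD arbitrarily as the
normalization form, and all of the function names are hypothetical.

```python
import unicodedata


def is_valid_name(raw: bytes) -> bool:
    """(A) Accept only byte strings that are valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def normalize_name(name: str, form: str = "NFD") -> str:
    """(B) Normalize a name before it is stored (NFD is an assumption)."""
    return unicodedata.normalize(form, name)


def equal_ignoring_normalization(a: str, b: str) -> bool:
    """(C) Compare two names, ignoring normalization differences only."""
    return normalize_name(a) == normalize_name(b)


def force_case(name: str) -> str:
    """(D) Force a name to a single case before it is stored."""
    return name.lower()


def equal_ignoring_case(a: str, b: str) -> bool:
    """(E) Compare two names, ignoring case differences only."""
    return a.casefold() == b.casefold()
```

The orthogonality shows up directly: equal_ignoring_case("\u00c9",
"e\u0301") is false because the two strings differ in normalization as
well as in case, so C and E have to be combined explicitly if both are
wanted.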

So maybe what we need to talk about is having a feature called
EXT4_FEATURE_INCOMPAT_CHARSET_ENCODING, which enables two fields in
the superblock.  One is the encoding identifier (and 8 or 16 bits is
probably *plenty*), and the other is the "encoding flags" field.

Some of these flags might specify an encoding --- e.g., the file
system supports normalization- and/or case- insensitive lookups in an
efficient way by normalizing the string before calculating the
dir_index hash.  Some of these might specify the default behavior
(e.g., case-insensitive or normalization-insensitive file lookups) if
not overridden by a mount option.
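
As a rough sketch of how an encoding identifier plus a flags field
could drive the directory hash, consider the Python below.  Everything
in it is an assumption made for illustration: the flag names and bit
values, the 16-bit little-endian field widths, the identifier value,
the use of NFD, and zlib.crc32 standing in for the real dir_index hash.

```python
import enum
import struct
import unicodedata
import zlib


class EncodingFlags(enum.IntFlag):
    # Hypothetical flag names and bit values.
    HASH_NORMALIZED = 0x0001           # format-dependent: hash the normalized name
    HASH_CASEFOLDED = 0x0002           # format-dependent: hash the casefolded name
    DEFAULT_NORM_INSENSITIVE = 0x0004  # default only; a mount option could override
    DEFAULT_CASE_INSENSITIVE = 0x0008  # default only; a mount option could override


ENCODING_UTF8 = 1  # hypothetical encoding identifier


def pack_fields(encoding_id: int, flags: EncodingFlags) -> bytes:
    """Pack the two fields as 16-bit little-endian values."""
    return struct.pack("<HH", encoding_id, int(flags))


def unpack_fields(raw: bytes):
    """Inverse of pack_fields()."""
    encoding_id, flags = struct.unpack("<HH", raw)
    return encoding_id, EncodingFlags(flags)


def hash_key(name: str, flags: EncodingFlags) -> bytes:
    """Return the bytes that get hashed for a directory entry name."""
    if flags & EncodingFlags.HASH_NORMALIZED:
        name = unicodedata.normalize("NFD", name)
    if flags & EncodingFlags.HASH_CASEFOLDED:
        name = name.casefold()
    return name.encode("utf-8")


def dir_hash(name: str, flags: EncodingFlags) -> int:
    """Stand-in for the dir_index hash; crc32 is NOT what ext4 uses."""
    return zlib.crc32(hash_key(name, flags))
```

With HASH_NORMALIZED set, "caf\u00e9" and "cafe\u0301" produce the same
hash key and so land in the same hash bucket; with no flags set they do
not, and a normalization-insensitive lookup would have to fall back to
a linear scan.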

This assumes that normalization and case sensitivity are completely
orthogonal.

The other thing is there seems to be some debate (and Apple isn't even
consistent over time) over what kind of normalization is considered
"best" or "correct".  e.g., NFD, NFC, NFKD, NFKC.  And if you want to
export the file system over APFS, it might make a difference which one
you use.  (This is usually the point where some people will assert
that teaching everyone in the world English really *would* be easier
than supporting full I18N.  :-) Is this something we can or should
consider when deciding what we want to support in Linux long-term?
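
To make the difference between the four forms concrete, here is a small
Python sketch using the standard unicodedata module.  The helper name
all_forms is hypothetical, and the sample strings are chosen only for
illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(text: str) -> dict:
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def code_points(text: str) -> str:
    """Render a string as a list of U+XXXX code points for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    # Precomposed e-acute, and the "fi" ligature (a compatibility character).
    for sample in ("\u00e9", "\ufb01"):
        for form, out in all_forms(sample).items():
            print(f"{code_points(sample)} {form}: {code_points(out)}")
```

The canonical forms (NFC, NFD) only differ in whether e-acute is stored
as one code point or two.  The compatibility forms (NFKC, NFKD) also
rewrite the ligature U+FB01 into the two letters "fi", which loses
information, so the choice of form changes which names collide.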

> If, for some reason, this is not a problem in this case, I can change it
> in the next iteration, to merge utf8n and utf8, and also allow other
> charsets.

... and what I'm really asking is do we really want to be specifying
whether or not normalization is a Thing as a property of the encoding,
or a property of the file system (or object, or document) that uses
that particular encoding?

						- Ted