Re: [PATCH 17/20] ext4: Include encoding information in the superblock

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Thu, 12 Jul 2018 17:54:30 -0400

On Thu, Jul 12, 2018 at 10:13:14AM -0400, Gabriel Krisman Bertazi wrote:
> 
> My concern here is that the casefold and normalization operations don't
> make sense, semantically, when dealing with opaque byte sequences.  We
> can assume that no-encoding means ASCII, but this is an arbitrary
> decision that only makes sense for English speakers.  I think it is
> safer/less confusing to only allow this kind of operation when an
> explicit encoding format is in place.

The real question which we need to answer (and document, so everyone
understands what should happen) is what should we do if we come across
an invalid byte sequence for a particular encoding?  And there are two
versions of this question --- what should we do if a stored name in a
directory is an invalid byte sequence?  What should we do if the user
has passed an invalid byte sequence to a system call?  (And for the
latter, should it be different depending on whether it is a creation,
lookup, deletion, or rename operation?)

We don't have a way of specifying the encoding of a filename being
passed in the system call, so usually people will assume that it's
some fixed encoding (the native encoding of the system, whatever that
means, which in practice was most commonly ASCII, ISO-Latin-1, or
UTF-8, with the last being the most common in more modern systems).

In your patches, it looks like you aren't actually doing any
processing (either enforcing that the byte sequence is valid, or
normalizing, which I understand is highly controversial and has caused
much hand-wringing in the Apple world recently since the defaults have
changed post-APFS) on filenames when they are passed to ext4 for
creation.  So there will quite possibly be invalid byte sequences in
a directory entry.  So we need to be clear how they should be handled.
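
To make "invalid byte sequence" concrete for the UTF-8 case, here is a
minimal userspace sketch of a strict well-formedness check.  The
function name is hypothetical and this is not code from the patch
series or from the kernel's NLS layer; it only assumes that "invalid"
means what the UTF-8 definition says: bad lead bytes, missing
continuation bytes, overlong forms, surrogates, or code points above
U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): returns true if the len
 * bytes at s are well-formed UTF-8.  An empty name is treated as valid.
 */
static bool name_is_valid_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		size_t need;		/* continuation bytes expected */
		uint32_t cp, min;	/* decoded value, smallest legal value */

		if (c < 0x80) {		/* plain ASCII byte */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			need = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			need = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			need = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return false;	/* stray continuation or bad lead byte */
		}

		if (len - i <= need)
			return false;	/* sequence truncated at end of name */

		for (size_t k = 1; k <= need; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return false;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		/* Reject overlong forms, surrogates, and out-of-range values. */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return false;

		i += need + 1;
	}
	return true;
}
```

A check like this only answers whether a name is valid; what to do
when it returns false (reject, fall back to byte-wise comparison, or
something else) for each of creation, lookup, deletion, and rename is
exactly the policy question above.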

And even if we did prevent those file names from being created during
the normal course of events, we still need to understand what to do if
we come across one (if the file system was corrupted in some way,
either accidentally or deliberately).

> > The normalization for ASCII is the identity function, so it's kind of
> > pointless to support ASCII if we only have case-folding support and not
> > normalization for now, right?
> 
> Yes.  The ASCII normalization is boilerplate code, in a sense, since it
> is just the identity function, but I'm trying to make the NLS interface
> generic enough to minimize how much EXT4 needs to know about encoding.
> Ideally, this knowledge should be left for the NLS system, in my
> opinion.  Does it make sense?

It does.  How big does the kernel get if we enable only NLS and ASCII?
If it's small, maybe we don't need to worry all that much.

   	       	     	      	      	    	- Ted