Re: [PATCH 17/20] ext4: Include encoding information in the superblock

Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxx> · Mon, 16 Jul 2018 16:57:43 -0400

"Theodore Y. Ts'o" <tytso@xxxxxxx> writes:

> On Thu, Jul 12, 2018 at 10:13:14AM -0400, Gabriel Krisman Bertazi wrote:
>> 
>> My concern here is that the casefold and normalization operations don't
>> make sense, semantically, when dealing with opaque byte sequences.  We
>> can assume that no-encoding means ASCII, but this is an arbitrary
>> decision, that only make sense for english speakers.  I think it is
>> safer/less confusing to only allow this kind of operation when an
>> explicit encoding format is in place.
>
> The real question which we need to answer (and document, so everyone
> understands what should happen) is what should we do if we come across
> an invalid byte sequence for a particular encoding?  And there are two
> versions of this question --- what should we do if a stored name in a
> directory is an invalid byte sequence?  What should we do if the user
> has passed an invalid byte sequence to a system call?  (And for the
> latter, should it be different depending on whether it is a creation,
> lookup, deletion, or rename operation?)

Hi Ted,

I gave this more thought over the weekend, and I'm convinced now that we
cannot simply reject invalid filenames, like I wanted to do. There are
many cases where the user won't control the file names. For instance, in
the example of APFS you gave, there was a case where it was failing to
extract data from an archive that had an utf8 invalid sequence file
name.  I'm sure there are much other cases out there of programs that
save files with arbitrary encodings, and we would be breaking them by
blindly rejecting these files.

For the similar reason of not breaking programs, we should be able to
return the exact match file if the user requested it.  This would
provide a way for the user to access files with 'broken' names to fix
them or just use them normally.  The only case where we need to special
handle exact-match files is when dealing with invalid sequences if we
can prevent the possibility of name collisions, for instance, if we
require encoding to only be set at superblock creation time, and only
allow setting the casefold flag if the directory is empty.  Assuming
this is correct, I think it is safe to define for ext4 that invalid
sequences are considered opaque byte sequences, and fallback to treat
them as such.

I don't see why treating invalid names as opaque sequences would be
problematic, for the sake of the argument would it be equivalent to say
that invalid code-points are valid and decompose and fold to themselves?

> We don't have a way of specifying the encoding of a filename being
> passed in the system call, so usually people will either assume that
> it's some fixed encoding (the native encoding of the system, whatever
> that means, which in practice was most commonly ASCII, ISO-Latin-1, or
> UTF-8, with the last being the most common in more modern systems).
>
> In your patches, it looks like you aren't actually doing any
> processing (either enforcing that the byte sequence is valid, or
> normalizing, which I understand is highly controversial and has caused
> much hand-wringing in the Apple world recently since the defaults have
> changed post-APFS) on filenames when they are passed to ext4 for
> creation.  So there will quite possibly be invalid byte characters in
> a directory entry.  So we need to be clear how they should be handled.

I didn't want to save normalization on-disk because I'm thinking of
casefold as parallel to normalization, which just adds the handling of
case.  If we store the normalization/casefold in the disk as the dentry
name, the file system is no longer normalization/casefold-preserving to
what the user requested, and a readdir() would be confusing in the
casefold case.  I'd like to eventually try to store it side-by-side with
the user created name in the future, but it would require changing the
directory entry layout.

In the current patchset, I'm rejecting invalid sequences (or at least I
thought so) as soon as the normalization fails.  The open syscall will
trigger an -EINVAL once it identifies a bad sequence, for instance.

>> > The normalization for ASCII is the identify function, so it's kind of
>> > pointless to support ASCII if we ony have case-folding support and not
>> > normalization for now, right?
>> 
>> Yes.  the ASCII normalization is boilerplate code, in a sense, since it
>> is just the identity function, but I'm trying to make the NLS interface
>> generic enough to minimize how much EXT4 needs to know about encoding.
>> Ideally, this knowledge should be left for the NLS system, in my
>> opinion.  Does it make sense?
>
> It does.  How big does the kernel get if we enable only NLS and ASCII?
> If it's small, maybe we don't need to worry all that much.

looks like bzImage increased by 4k.

$ diff -u without-nls/config with-nls/config

--- without-nls/config	2018-07-16 16:28:21.833479753 -0400
+++ with-nls/config	2018-07-16 16:31:16.466500014 -0400
@@ -1511,7 +1511,58 @@
 CONFIG_9P_FS=y
 CONFIG_9P_FS_POSIX_ACL=y
 CONFIG_9P_FS_SECURITY=y
+CONFIG_NLS=y
+CONFIG_NLS_DEFAULT="ascii"
+CONFIG_NLS_ASCII=y

[trimmed out the 'CONFIG_NLS_* is not set' lines]

$ ls -l without-nls/bzImage with-nls/bzImage
-rw-r--r-- 1 krisman krisman 3477552 Jul 16 16:31 with-nls/bzImage
-rw-r--r-- 1 krisman krisman 3473456 Jul 16 16:28 without-nls/bzImage

Thanks!

-- 
Gabriel Krisman Bertazi