"Theodore Y. Ts'o" <tytso@xxxxxxx> writes: > On Thu, Jul 12, 2018 at 10:13:14AM -0400, Gabriel Krisman Bertazi wrote: >> >> My concern here is that the casefold and normalization operations don't >> make sense, semantically, when dealing with opaque byte sequences. We >> can assume that no-encoding means ASCII, but this is an arbitrary >> decision, that only make sense for english speakers. I think it is >> safer/less confusing to only allow this kind of operation when an >> explicit encoding format is in place. > > The real question which we need to answer (and document, so everyone > understands what should happen) is what should we do if we come across > an invalid byte sequence for a particular encoding? And there are two > versions of this question --- what should we do if a stored name in a > directory is an invalid byte sequence? What should we do if the user > has passed an invalid byte sequence to a system call? (And for the > latter, should it be different depending on whether it is a creation, > lookup, deletion, or rename operation?) Hi Ted, I gave this more thought over the weekend, and I'm convinced now that we cannot simply reject invalid filenames, like I wanted to do. There are many cases where the user won't control the file names. For instance, in the example of APFS you gave, there was a case where it was failing to extract data from an archive that had an utf8 invalid sequence file name. I'm sure there are much other cases out there of programs that save files with arbitrary encodings, and we would be breaking them by blindly rejecting these files. For the similar reason of not breaking programs, we should be able to return the exact match file if the user requested it. This would provide a way for the user to access files with 'broken' names to fix them or just use them normally. The only case where we need to special handle exact-match files is when dealing with invalid sequences if we can prevent the possibility of name collisions, for instance, if we require encoding to only be set at superblock creation time, and only allow setting the casefold flag if the directory is empty. Assuming this is correct, I think it is safe to define for ext4 that invalid sequences are considered opaque byte sequences, and fallback to treat them as such. I don't see why treating invalid names as opaque sequences would be problematic, for the sake of the argument would it be equivalent to say that invalid code-points are valid and decompose and fold to themselves? > We don't have a way of specifying the encoding of a filename being > passed in the system call, so usually people will either assume that > it's some fixed encoding (the native encoding of the system, whatever > that means, which in practice was most commonly ASCII, ISO-Latin-1, or > UTF-8, with the last being the most common in more modern systems). > > In your patches, it looks like you aren't actually doing any > processing (either enforcing that the byte sequence is valid, or > normalizing, which I understand is highly controversial and has caused > much hand-wringing in the Apple world recently since the defaults have > changed post-APFS) on filenames when they are passed to ext4 for > creation. So there will quite possibly be invalid byte characters in > a directory entry. So we need to be clear how they should be handled. I didn't want to save normalization on-disk because I'm thinking of casefold as parallel to normalization, which just adds the handling of case. If we store the normalization/casefold in the disk as the dentry name, the file system is no longer normalization/casefold-preserving to what the user requested, and a readdir() would be confusing in the casefold case. I'd like to eventually try to store it side-by-side with the user created name in the future, but it would require changing the directory entry layout. In the current patchset, I'm rejecting invalid sequences (or at least I thought so) as soon as the normalization fails. The open syscall will trigger an -EINVAL once it identifies a bad sequence, for instance. >> > The normalization for ASCII is the identify function, so it's kind of >> > pointless to support ASCII if we ony have case-folding support and not >> > normalization for now, right? >> >> Yes. the ASCII normalization is boilerplate code, in a sense, since it >> is just the identity function, but I'm trying to make the NLS interface >> generic enough to minimize how much EXT4 needs to know about encoding. >> Ideally, this knowledge should be left for the NLS system, in my >> opinion. Does it make sense? > > It does. How big does the kernel get if we enable only NLS and ASCII? > If it's small, maybe we don't need to worry all that much. looks like bzImage increased by 4k. $ diff -u without-nls/config with-nls/config --- without-nls/config 2018-07-16 16:28:21.833479753 -0400 +++ with-nls/config 2018-07-16 16:31:16.466500014 -0400 @@ -1511,7 +1511,58 @@ CONFIG_9P_FS=y CONFIG_9P_FS_POSIX_ACL=y CONFIG_9P_FS_SECURITY=y +CONFIG_NLS=y +CONFIG_NLS_DEFAULT="ascii" +CONFIG_NLS_ASCII=y [trimmed out the 'CONFIG_NLS_* is not set' lines] $ ls -l without-nls/bzImage with-nls/bzImage -rw-r--r-- 1 krisman krisman 3477552 Jul 16 16:31 with-nls/bzImage -rw-r--r-- 1 krisman krisman 3473456 Jul 16 16:28 without-nls/bzImage Thanks! -- Gabriel Krisman Bertazi