On Tuesday 05 February 2019 14:08:00 Gabriel Krisman Bertazi wrote:
> Pali Rohár <pali.rohar@xxxxxxxxx> writes:
>
> > On Monday 28 January 2019 16:32:12 Gabriel Krisman Bertazi wrote:
> >> The main change presented here is a proposal to migrate the
> >> normalization method from NFKD to NFD. After our discussions, and
> >> reviewing other operating systems and languages aspects, I am more
> >> convinced that canonical decomposition is a more viable solution than
> >> compatibility decomposition, because it doesn't eliminate any
> >> semantic meaning, like the definitive case of superscript numbers. NFD
> >> is also the documented method used by HFS+ and APFS, so there is
> >> precedent. Notice however, that as far as my research goes, APFS doesn't
> >> completely follow NFD, and in some cases, like <compat> flags, it
> >> actually does NFKD, but not in others (<fraction>), where it applies the
> >> canonical form. We take a more consistent approach and always do plain NFD.
> >>
> >> This RFC, therefore, aims to resume/start conversation with some
> >> stakeholders that may have something to say regarding the normalization
> >> method used. I added people from SMB, NFS and FS development who
> >> might be interested in this.
> >
> > Hello! I think that choice of NFD normalization is not the right decision.
> > Some reasons:
> >
> > 1) NFD is not widely used. Even Apple does not use it (as you wrote
> > Apple has its own normalization form).
>
> To be exact, Apple claims to use NFD in their specification [1].

Interesting...

> What I
> observed is that they don't ignore some types of compatibility
> characters correctly as they should. For instance, the ff ligature is
> decomposed into f + f.

I'm sure that Apple does not do NFD, but its own invented normal form.
Some graphemes are decomposed, and some are not.

> > 2) All filesystems which I know either do not use any normalization or
> > use NFC.
> > 3) Lots of existing Linux applications generate file names in NFC.
> >
>
> Most do use NFC. But this is an internal representation for ext4 and it
> is name preserving.

Ok. I was under the impression that it does not preserve original names,
just like the implementation in Apple's system, where the char* passed to
creat() does not appear in readdir().

> We only use the normalization when comparing if names
> match and to calculate dcache and dx hashes. The Unicode standard
> recommends the D forms for internal representation.

Ok, this should be less destructive and less visible to userspace.

> > 4) Linux GUI libraries like Qt and Gtk generate strings from key strokes
> > in NFC. So if a user types a file name in a Qt/Gtk box it would be in NFC.
> >
> > So why use NFD in the ext4 filesystem if the Linux userspace ecosystem
> > already uses NFC?
>
> NFC is costlier to calculate, usually requiring an intermediate NFD
> step. Whether it is prohibitively expensive to do in the dcache path, I
> don't know, but since it is a critical path, any gain matters.
>
> > NFD here just makes another layer of problems, unexpected things and
> > makes it somehow different.
>
> Is there any case where
>   NFC(x) == NFC(y) && NFD(x) != NFD(y), or
>   NFC(x) != NFC(y) && NFD(x) == NFD(y)

This is a good question. And I think we should get a definite answer to it
prior to inclusion of normalization into the kernel.

> I am having a hard time thinking of an example. This is the main
> (only?) scenario where choosing C or D form for an internal
> representation would affect userspace.

For the decision between normal forms, probably yes.

> >
> > Why not rather choose NFC? It would be more compatible with Linux GUI
> > applications and also with Microsoft Windows systems, which use NFC
> > too.
> >
> > Please, really consider not using NFD. Most Linux applications really
> > do not do any normalization or do NFC. And usage of a decomposition form
> > for applications which do not implement full Unicode grapheme algorithms
> > just makes more problems for them.
> >
> > Yes, there are still lots of legacy applications which expect that one
> > code point = one visible symbol (therefore one Unicode grapheme). And
> > because GUIs in most cases generate NFC strings, and existing file
> > names are also in NFC, these applications work in most cases without
> > problems. Forcing usage of NFD filenames just breaks them.
>
> As I said, this shouldn't be a problem because what the application
> creates and retrieves is the exact name that was used before, we'd
> only use this format for internal metadata on the disk (hashes) and for
> in-kernel comparisons.

There is another problem for userspace applications:

Currently ext4 accepts as a file name any sequence of bytes which does not
contain a nul byte or '/'. So having a Latin1 file name is perfectly
correct.

What would happen if a userspace application wants to create the following
two file names: "\xDF" and "\xF0"? The first one is sharp S, the second one
is eth (in Latin1). But both file names are invalid UTF-8 sequences. Is it
disallowed to create such file names? Or are both file names internally
converted to U+FFFD (replacement character), and because
NFD(first U+FFFD) == NFD(second U+FFFD) only the first file would be
created?

And what happens in general with invalid UTF-8 sequences? There are many
different types of invalid UTF-8 sequences: a non-shortest sequence for a
valid code point, a valid sequence for an invalid code point (either
surrogate code points, or code points above U+10FFFF, ...), an incorrect
byte where a new code point should start, an incorrect byte in the middle
of decoding a code point, ...

Different (userspace) applications handle these invalid UTF-8 sequences
differently. Some of them accept some kinds of "incorrectness" (e.g. the
non-shortest form of a code point representation), some do not. Some
applications replace invalid parts of a UTF-8 sequence by a sequence of
UTF-8 replacement characters, some do not.
Also it can be observed that some applications use just one replacement
character, while others replace an invalid UTF-8 sequence by several
replacement characters.

So "recovering" from an invalid UTF-8 sequence to a valid one is done in
more than one way... And usage of any existing way could cause problems...
E.g. it would not be possible to create the two files "\xDF\xF0" and
"\xF0\xDF"...

> > (PS: I think that only 2 programming languages implement Unicode
> > grapheme algorithms correctly: Elixir and Perl 6; which is not so
> > much)
>
> [1] https://developer.apple.com/support/apple-file-system/Apple-File-System-Reference.pdf
>

-- 
Pali Rohár
pali.rohar@xxxxxxxxx
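
To make two of the points above concrete, here is a minimal userspace sketch. It assumes Python's unicodedata module and its built-in UTF-8 decoder as stand-ins, and says nothing about what the ext4 patches actually do. The helper names (same_name, lossy_decode, is_valid_utf8) are made up for illustration. The first helper compares two names under a given normalization form; the other two show what happens to Latin1 names when invalid bytes are either rejected or mapped to U+FFFD before comparison.

```python
import unicodedata


def same_name(form: str, a: str, b: str) -> bool:
    """Compare two names after normalizing both to `form`
    (one of "NFC", "NFD", "NFKC", "NFKD")."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def lossy_decode(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for invalid parts.

    How many U+FFFD are emitted per invalid sequence is decoder-specific;
    this is Python's behaviour, not a kernel policy.
    """
    return raw.decode("utf-8", errors="replace")


def is_valid_utf8(raw: bytes) -> bool:
    """Strict check: True only if `raw` is well-formed UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Canonical equivalence: precomposed vs. decomposed e-acute.
    print(same_name("NFC", "\u00e9", "e\u0301"),
          same_name("NFD", "\u00e9", "e\u0301"))

    # Compatibility characters: ff ligature (U+FB00), superscript two (U+00B2).
    for ch, plain in (("\ufb00", "ff"), ("\u00b2", "2")):
        print(same_name("NFD", ch, plain), same_name("NFKD", ch, plain))

    # Latin1 sharp S and eth are not well-formed UTF-8 on their own.
    print(is_valid_utf8(b"\xdf"), is_valid_utf8(b"\xf0"))

    # After replacement the distinct byte strings can no longer be told apart.
    print(lossy_decode(b"\xdf") == lossy_decode(b"\xf0"))
    print(lossy_decode(b"\xdf\xf0") == lossy_decode(b"\xf0\xdf"))
```

With this particular decoder both "\xDF" and "\xF0" end up as U+FFFD, so a replace-then-compare policy would treat them as the same name. Rejecting invalid sequences, or comparing them byte-for-byte without normalization, would avoid that collision. On the NFC vs NFD question: per UAX #15, NFC is defined as canonical decomposition followed by canonical composition, and both forms are meant to identify exactly the canonically equivalent strings, which would suggest that neither of the two cases asked about can occur. That should still be confirmed against the standard rather than taken from this sketch.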