On 3/18/19 1:27 PM, Gabriel Krisman Bertazi wrote:
> Hi,
>
> This version is pretty much the same as v5. I am resending because the
> previous version didn't draw much discussion on the main topic of
> moving from KD to D.
>
> Same as version 5, at a first glance, you will notice the series got a
> lot smaller, with the separation of unicode code from the NLS subsystem,
> as Linus requested. The ext4 parts are pretty much the same, with only
> the addition of a verification in ext4_feature_set_ok() to fail encoding
> mounts on newer kernels built without CONFIG_UNICODE.
>
> The main change presented here is a proposal to migrate the
> normalization method from NFKD to NFD. After our discussions, and
> reviewing other operating systems and languages aspects, I am more
> convinced that canonical decomposition is a more viable solution than
> compatibility decomposition, because it doesn't eliminate any
> semantic meaning, like the definitive case of superscript numbers. NFD
> is also the documented method used by HFS+ and APFS, so there is
> precedent. Notice however, that as far as my research goes, APFS doesn't
> completely follow NFD, and in some cases, like <compat> flags, it
> actually does NFKD, but not in others (<fraction>), where it applies the
> canonical form. We take a more consistent approach and always do plain
> NFD.
>
> This RFC, therefore, aims to resume/start conversation with some
> stakeholders that may have something to say regarding the normalization
> method used. I added people from SMB, NFS and FS development who
> might be interested in this.
>
> Regarding Casefold, I am unsure whether Casefold Common + Full still
> makes sense after migrating from the compatibility to the canonical
> form.
> While Casefold Full, by definition, addresses cases where the
> casefolding grows in size, like the casefold of the German eszett to SS,
> it also is responsible for folding lowercase ligatures without a
> corresponding uppercase to their compatible counterpart. Which means
> that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
> +F directories they will match. This seems unacceptable to me,
> suggesting that we should start to use Common + Simple instead of Common
> + Full, but I would like more input on what seems more reasonable to
> you.
>
> After we decide on this, I will be sending new patches to update
> e2fsprogs to the agreed method and remove the normalization/casefold
> type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
> EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
> patch series for inclusion in the kernel.
>
> For the record, I am aware that Unicode 12 was released 2 weeks ago. The
> world can't live without a new set of emojis every 6 months. I will
> withhold updating the unicode version until we get something
> upstreamable, then I will update to the latest version and send a new
> version. This way I avoid having to update versions that will never
> actually be used.
>
> Practical things, w.r.t. this patch series:
>
> - As usual, the UCD files are not part of the series, because they
>   would cause the email to bounce. To test this one would need to fetch
>   the files as explained in the commit message.
>
> - If you prefer, you can check out from
>   https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls
>
> - More details on the design decisions restricted to ext4 are
>   available in the corresponding commit messages.
>
> Thanks!
>

Hi,

I briefly scanned but did not look terribly closely:
Does this patch series ignore ext3 filesystems that are being handled
by the ext4fs code?

Thanks.
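
For anyone weighing the NFKD vs. NFD and Common + Full vs. Common + Simple questions quoted above, here is a small userspace sketch of the difference. It uses Python's unicodedata module, not the fs/unicode code from this series, so it says nothing about how the kernel implementation behaves. str.casefold() performs full case folding; str.lower() is only a rough stand-in for Common + Simple (an assumption that holds for the characters shown here, not in general), and the helper names are made up for illustration.

```python
import unicodedata


def nfd(s):
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", s)


def nfkd(s):
    """Compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", s)


def fold_full(s):
    """Full case folding via str.casefold(), followed by NFD."""
    return nfd(s.casefold())


def fold_simple_approx(s):
    """Hypothetical stand-in for Common + Simple: str.lower(), then NFD.

    This is an approximation for illustration only; it is not the
    Unicode simple case folding table.
    """
    return nfd(s.lower())


if __name__ == "__main__":
    # Superscript two, the "ff" ligature (U+FB00), and a precomposed e-acute.
    for name in ("x\u00b2", "o\ufb00ice", "caf\u00e9"):
        print(ascii(name), "NFD:", ascii(nfd(name)), "NFKD:", ascii(nfkd(name)))

    # Compare a ligature name against its plain spelling under both folds.
    ligature, plain = "o\ufb00ice", "OFFICE"
    print("full fold match:  ", fold_full(ligature) == fold_full(plain))
    print("simple fold match:", fold_simple_approx(ligature) == fold_simple_approx(plain))
```

Under NFD the superscript and the ligature survive normalization, while NFKD rewrites them to a plain "2" and "ff". With full folding the ligature name matches the plain spelling even under NFD, which is the +F behavior described in the cover letter; with the simple stand-in the two names stay distinct.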
>
> Gabriel Krisman Bertazi (7):
>   unicode: Implement higher level API for string handling
>   unicode: Introduce test module for normalized utf8 implementation
>   MAINTAINERS: Add Unicode subsystem entry
>   ext4: Include encoding information in the superblock
>   ext4: Support encoding-aware file name lookups
>   ext4: Implement EXT4_CASEFOLD_FL flag
>   docs: ext4.rst: Document encoding and case-insensitive
>
> Olaf Weber (4):
>   unicode: Add unicode character database files
>   scripts: add trie generator for UTF-8
>   unicode: Introduce code for UTF-8 normalization
>   unicode: reduce the size of utf8data[]
>
>  Documentation/admin-guide/ext4.rst |   41 +
>  MAINTAINERS                        |    6 +
>  fs/Kconfig                         |    1 +
>  fs/Makefile                        |    1 +
>  fs/ext4/dir.c                      |   43 +
>  fs/ext4/ext4.h                     |   42 +-
>  fs/ext4/hash.c                     |   38 +-
>  fs/ext4/ialloc.c                   |    2 +-
>  fs/ext4/inline.c                   |    2 +-
>  fs/ext4/inode.c                    |    4 +-
>  fs/ext4/ioctl.c                    |   18 +
>  fs/ext4/namei.c                    |  104 +-
>  fs/ext4/super.c                    |   91 +
>  fs/unicode/Kconfig                 |   13 +
>  fs/unicode/Makefile                |   22 +
>  fs/unicode/ucd/README              |   33 +
>  fs/unicode/utf8-core.c             |  183 ++
>  fs/unicode/utf8-norm.c             |  797 +++++++
>  fs/unicode/utf8-selftest.c         |  320 +++
>  fs/unicode/utf8n.h                 |  117 +
>  include/linux/fs.h                 |    2 +
>  include/linux/unicode.h            |   30 +
>  scripts/Makefile                   |    1 +
>  scripts/mkutf8data.c               | 3418 ++++++++++++++++++++++++++++
>  24 files changed, 5307 insertions(+), 22 deletions(-)
>  create mode 100644 fs/unicode/Kconfig
>  create mode 100644 fs/unicode/Makefile
>  create mode 100644 fs/unicode/ucd/README
>  create mode 100644 fs/unicode/utf8-core.c
>  create mode 100644 fs/unicode/utf8-norm.c
>  create mode 100644 fs/unicode/utf8-selftest.c
>  create mode 100644 fs/unicode/utf8n.h
>  create mode 100644 include/linux/unicode.h
>  create mode 100644 scripts/mkutf8data.c
>

-- 
~Randy