Re: [PATCH RFC v6 00/11] Ext4 Encoding and Case-insensitive support

Randy Dunlap <rdunlap@xxxxxxxxxxxxx> · Thu, 21 Mar 2019 15:30:35 -0700

On 3/18/19 1:27 PM, Gabriel Krisman Bertazi wrote:
> Hi,
> 
> This version is pretty much the same as v5.  I am resending because the
> previous version didn't get much discussion on the main topic of moving
> from NFKD to NFD.
> 
> As with version 5, you will notice at first glance that the series got
> a lot smaller, with the separation of the unicode code from the NLS
> subsystem, as Linus requested.  The ext4 parts are pretty much the same,
> with only the addition of a check in ext4_feature_set_ok() to fail
> encoding mounts on newer kernels built without CONFIG_UNICODE.
> 
> The main change presented here is a proposal to migrate the
> normalization method from NFKD to NFD.  After our discussions, and
> after reviewing how other operating systems and languages handle this,
> I am more convinced that canonical decomposition is a more viable
> solution than compatibility decomposition, because it doesn't
> eliminate any semantic meaning, as in the clear-cut case of
> superscript numbers.  NFD is also the documented method used by HFS+
> and APFS, so there is precedent.  Notice, however, that as far as my
> research goes, APFS doesn't completely follow NFD: in some cases, like
> <compat> flags, it actually does NFKD, but in others (<fraction>) it
> applies the canonical form.  We take a more consistent approach and
> always do plain NFD.
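
To make the NFD/NFKD difference described above concrete, here is a
minimal userspace sketch.  It assumes Python's standard unicodedata
module as a stand-in for the kernel's utf8 normalization code (it is not
the code from this series), and the sample strings and the nfd()/nfkd()
helper names are made up for illustration.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition only."""
    return unicodedata.normalize("NFD", s)


def nfkd(s: str) -> str:
    """Compatibility decomposition (also applies <super>, <compat>, ...)."""
    return unicodedata.normalize("NFKD", s)


# Illustrative samples, chosen to hit the cases named in the cover letter.
samples = {
    "precomposed e-acute": "\u00e9",     # canonical decomposition exists
    "superscript two": "x\u00b2",        # <super> compatibility mapping
    "ff ligature": "o\ufb00ice",         # <compat> compatibility mapping
    "vulgar fraction 1/2": "\u00bd",     # <fraction> compatibility mapping
}

for label, s in samples.items():
    print(f"{label:22} NFD={nfd(s)!a:18} NFKD={nfkd(s)!a}")
```

Under NFD only the e-acute changes; the superscript, ligature and
fraction keep their own code points, so names using them stay distinct.
Under NFKD all four are rewritten, which is where "x²" and "x2" would
start to collide.
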
> 
> This RFC, therefore, aims to resume/start a conversation with some
> stakeholders who may have something to say regarding the normalization
> method used.  I added people from SMB, NFS and FS development who
> might be interested in this.
> 
> Regarding Casefold, I am unsure whether Casefold Common + Full still
> makes sense after migrating from the compatibility to the canonical
> form.  While Casefold Full, by definition, addresses cases where the
> casefolding grows in size, like the casefold of the German eszett to SS,
> it is also responsible for folding lowercase ligatures without a
> corresponding uppercase to their compatible counterpart.  This means
> that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
> +F directories they will match.  This seems unacceptable to me,
> suggesting that we should start to use Common + Simple instead of Common
> + Full, but I would like more input on what seems more reasonable to
> you.
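
The ligature mismatch described above can be reproduced in userspace.
The sketch below is only an approximation under stated assumptions:
Python's str.casefold() stands in for Common + Full folding, and
fold_simple_approx() is a hypothetical helper that imitates Common +
Simple by accepting only one-to-one mappings.  Neither function is the
kernel implementation from this series.

```python
import unicodedata


def fold_full(name: str) -> str:
    """NFD plus full case folding (str.casefold() may grow the string)."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


def fold_simple_approx(name: str) -> str:
    """NFD plus an approximation of simple (one-to-one) case folding.

    A character is folded only when the result is a single code point;
    otherwise lower() is tried, and failing that the character is kept.
    """
    out = []
    for c in unicodedata.normalize("NFD", name):
        folded = c.casefold()
        if len(folded) != 1:
            lowered = c.lower()
            folded = lowered if len(lowered) == 1 else c
        out.append(folded)
    return unicodedata.normalize("NFD", "".join(out))


for name in ("office", "OFFICE", "o\ufb00ice", "Stra\u00dfe"):
    print(f"{name!a:14} full={fold_full(name)!a:12} "
          f"simple={fold_simple_approx(name)!a}")
```

With full folding the ligature spelling compares equal to "office" even
though plain NFD keeps the two apart; with the simple approximation the
ligature and the eszett are left alone, so case-sensitive and
case-insensitive directories agree on which names are distinct.
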
> 
> After we decide on this, I will be sending new patches to update
> e2fsprogs to the agreed method and remove the normalization/casefold
> type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
> EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
> patch series for inclusion in the kernel.
> 
> For the record, I am aware that Unicode 12 was released 2 weeks ago. The
> world can't live without a new set of emojis every 6 months.  I will
> withhold updating the Unicode version until we get something
> upstreamable, then I will update to the latest version and send a new
> version.  This way I avoid having to update versions that will never
> actually be used.
> 
> Practical things, w.r.t. this patch series:
> 
>   - As usual, the UCD files are not part of the series, because they
>   would cause the email to bounce.  To test this one would need to fetch
>   the files as explained in the commit message.
> 
>   - If you prefer, you can check out from
>      https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls
> 
>   - More details on the design decisions restricted to ext4 are
>     available in the corresponding commit messages.
> 
> Thanks!
> 

Hi,
I briefly scanned but did not look terribly closely:

Does this patch series ignore ext3 filesystems that are being handled
by the ext4fs code?

Thanks.

> 
> Gabriel Krisman Bertazi (7):
>   unicode: Implement higher level API for string handling
>   unicode: Introduce test module for normalized utf8 implementation
>   MAINTAINERS: Add Unicode subsystem entry
>   ext4: Include encoding information in the superblock
>   ext4: Support encoding-aware file name lookups
>   ext4: Implement EXT4_CASEFOLD_FL flag
>   docs: ext4.rst: Document encoding and case-insensitive
> 
> Olaf Weber (4):
>   unicode: Add unicode character database files
>   scripts: add trie generator for UTF-8
>   unicode: Introduce code for UTF-8 normalization
>   unicode: reduce the size of utf8data[]
> 
>  Documentation/admin-guide/ext4.rst |   41 +
>  MAINTAINERS                        |    6 +
>  fs/Kconfig                         |    1 +
>  fs/Makefile                        |    1 +
>  fs/ext4/dir.c                      |   43 +
>  fs/ext4/ext4.h                     |   42 +-
>  fs/ext4/hash.c                     |   38 +-
>  fs/ext4/ialloc.c                   |    2 +-
>  fs/ext4/inline.c                   |    2 +-
>  fs/ext4/inode.c                    |    4 +-
>  fs/ext4/ioctl.c                    |   18 +
>  fs/ext4/namei.c                    |  104 +-
>  fs/ext4/super.c                    |   91 +
>  fs/unicode/Kconfig                 |   13 +
>  fs/unicode/Makefile                |   22 +
>  fs/unicode/ucd/README              |   33 +
>  fs/unicode/utf8-core.c             |  183 ++
>  fs/unicode/utf8-norm.c             |  797 +++++++
>  fs/unicode/utf8-selftest.c         |  320 +++
>  fs/unicode/utf8n.h                 |  117 +
>  include/linux/fs.h                 |    2 +
>  include/linux/unicode.h            |   30 +
>  scripts/Makefile                   |    1 +
>  scripts/mkutf8data.c               | 3418 ++++++++++++++++++++++++++++
>  24 files changed, 5307 insertions(+), 22 deletions(-)
>  create mode 100644 fs/unicode/Kconfig
>  create mode 100644 fs/unicode/Makefile
>  create mode 100644 fs/unicode/ucd/README
>  create mode 100644 fs/unicode/utf8-core.c
>  create mode 100644 fs/unicode/utf8-norm.c
>  create mode 100644 fs/unicode/utf8-selftest.c
>  create mode 100644 fs/unicode/utf8n.h
>  create mode 100644 include/linux/unicode.h
>  create mode 100644 scripts/mkutf8data.c
> 

-- 
~Randy