On Thu, Oct 19, 2017 at 12:18:42AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 18, 2017 at 04:37:55PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> >
> > The upcoming xfs_scrub tool will have the ability to warn about
> > suspicious UTF-8 normalization collisions.  We want generic/45[34] to be
> > able to test this functionality, but to do that we have to forcibly set
> > the codeset to UTF-8 via LC_ALL since the rest of xfstests only uses
> > LC_ALL=C.
>
> Wait.  Where do you want to validate UTF-8 normalization?  There is
> absolutely no guarantee that someone uses UTF-8, so any reliance on
> the character set in the file system is bogus.

I'll start by summarizing a problem statement[1].

In XFS (and nearly all other filesystems), neither the on-disk format
nor the kernel driver cares about the contents of file names or
attribute names; they treat them as arbitrary byte sequences.  Userspace
can set whatever localization and encoding parameters it wants, and the
kernel doesn't care except for '\0' and '/'.  That doesn't change.

In modern Linux userspace, however, we /do/ care about being able to
encode Unicode codepoints into byte streams, so we encode them in UTF-8.
Because Unicode allows the same character to be written in more than one
way (for example, precomposed as in NFC or decomposed as in NFD), this
leads to the funny situation where two distinct filename byte sequences
can render the same but point to totally different files:

$ echo NFC > "$(echo -e "french_caf\xc3\xa9.txt")"
$ echo NFD > "$(echo -e "french_cafe\xcc\x81.txt")"
$ ls -lai
133 -rw-r--r-- 1 root root 4 Oct 20 10:40 french_café.txt
132 -rw-r--r-- 1 root root 4 Oct 20 10:40 french_café.txt
$ echo $LANG
en_US.UTF-8

At least on my computer, the two filenames render identically yet point
to different inodes.  This could be used to mislead people into opening
a malicious file whose name appears identical to a legitimate file.

xfs_scrub is the (proposed) userspace component of XFS online fsck.
The first four phases simply call the in-kernel fsck code and pass
status back, but the fifth phase walks the directory tree looking for
problems.  If xfs_scrub was built with libunistring and the LC_MESSAGES
string contains "UTF-8", phase 5 will warn if it finds multiple
filenames in a directory that normalize to the same string but point to
different inodes.  Similarly, it will warn about colliding attribute
names.  Warnings in xfs_scrub are for situations that warrant
administrative review but are not filesystem corruptions.

IOW, if userspace is configured for UTF-8, the userspace part of online
fsck will flag suspicious-looking uses of Unicode for admin review.  The
kernel remains uninvolved.

--D

[1] https://eclecticlight.co/2017/04/06/apfs-is-currently-unusable-with-most-non-english-languages/
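
To make the phase 5 check concrete, here is a rough Python sketch of the
idea.  This is not the xfs_scrub code: the function name, the
(name, inode) input format, and the choice of NFC as the normalization
form are all assumptions made for illustration.  It groups the names in
one directory by their normalized form and reports a group only when
distinct byte sequences in it point to more than one inode.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(entries, form="NFC"):
    """Group directory entries whose names normalize to the same string.

    entries: iterable of (name_bytes, inode_number) pairs (hypothetical
    input format; a real tool would read these from the directory).
    Returns {normalized_name: [(name_bytes, inode), ...]} for every
    normalized name shared by distinct byte sequences that point to
    more than one inode.
    """
    groups = defaultdict(list)
    for raw, ino in entries:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: the name is just bytes, nothing to normalize.
            continue
        groups[unicodedata.normalize(form, text)].append((raw, ino))

    return {
        norm: hits
        for norm, hits in groups.items()
        if len({raw for raw, _ in hits}) > 1
        and len({ino for _, ino in hits}) > 1
    }


# The two names from the shell example above, plus an unrelated file.
listing = [
    (b"french_caf\xc3\xa9.txt", 133),   # precomposed e-acute
    (b"french_cafe\xcc\x81.txt", 132),  # 'e' + combining acute accent
    (b"README", 134),
]
collisions = find_normalization_collisions(listing)
for norm, hits in collisions.items():
    print(f"warning: {len(hits)} names normalize to {norm!r}:")
    for raw, ino in hits:
        print(f"  inode {ino}: {raw!r}")
```

Names that are not valid UTF-8 are skipped rather than rejected, which
matches the point above that the kernel treats names as plain bytes and
only userspace assigns them a meaning.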