On Wed, Apr 05, 2023 at 03:48:00AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 10:07:06AM -0700, Darrick J. Wong wrote:
> > Which means that the kernel and userspace do not agree on the hash value
> > for a directory filename that contains those higher values.  The hash
> > values are written into the leaf index block of directories that are
> > larger than two blocks in size, which means that xfs_repair will flag
> > these directories as having corrupted hash indexes and rewrite the index
> > with hash values that the kernel now will not recognize.
> >
> > Because the ascii-ci feature is not frequently enabled and the kernel
> > touches filesystems far more frequently than xfs_repair does, fix this
> > by encoding the kernel's toupper predicate and tolower functions into
> > libxfs.  This makes userspace's behavior consistent with the kernel.
>
> I agree with making the userspace behavior consistent with the actual
> kernel behavior.  Sadly the documented behavior differs from both
> of them, so I think we need to also document the actual tables used
> in the mkfs.xfs manpage, as it isn't actually just ASCII.

Agreed.  Given that kernel tolower() behavior has been stable since 1996
(and remaps the ISO 8859-1 accented letters), the "ASCII CI" feature
most closely maps to "ISO 8859-1 CI".  But at this point there's not
even a shared understanding (Dave said latin1, you said 7-bit ascii,
IDGAF) so I agree that documenting the exact transformations in the
manpage is the only sane way forward.

I propose changing the mkfs.xfs manpage wording from:

"The version=ci option enables ASCII only case-insensitive filename
lookup and version 2 directories.  Filenames are case-preserving, that
is, the names are stored in directories using the case they were created
with."

into:

"If the version=ci option is specified, the kernel will transform
certain bytes in filenames before performing lookup-related operations.
The byte sequence given to create a directory entry is persisted without
alterations.  The lookup transformations are defined as follows:

    0x41 - 0x5a -> 0x61 - 0x7a
    0xc0 - 0xd6 -> 0xe0 - 0xf6
    0xd8 - 0xde -> 0xf8 - 0xfe

This transformation roughly corresponds to case insensitivity in ISO
8859-1 and may cause problems with other encodings (e.g. UTF-8).  The
feature will be disabled by default in September 2025, and removed from
the kernel in September 2030."

> Does the kernel tolower behavior map to an actual documented charset?
> In that case we can just point to it, which would be way easier than
> documenting all the details.

It *seems* to operate on ISO 8859-1 (aka latin1), but Linus implied that
the history of lib/ctype.c is lost to the ages.  Or at least 1996-era
mailing list archives.

--D
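
For illustration only, the three ranges in the proposed manpage text can
be sketched as a byte-folding routine.  This is a minimal sketch and not
the kernel's lib/ctype.c or the libxfs code; the names ci_fold_byte() and
ci_names_equal() are hypothetical, and the comparison helper assumes both
names have the same length.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fold one filename byte according to the three
 * ranges in the proposed manpage text.  Each range maps to the byte
 * 0x20 above it; every other byte is returned unchanged.
 */
static unsigned char ci_fold_byte(unsigned char c)
{
	if ((c >= 0x41 && c <= 0x5a) ||
	    (c >= 0xc0 && c <= 0xd6) ||
	    (c >= 0xd8 && c <= 0xde))
		return c + 0x20;
	return c;
}

/*
 * Hypothetical helper: compare two names of equal length after folding
 * each byte.  Returns 1 if they match, 0 otherwise.
 */
static int ci_names_equal(const unsigned char *a, const unsigned char *b,
			  size_t len)
{
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < len; i++)
		if (ci_fold_byte(a[i]) != ci_fold_byte(b[i]))
			return 0;
	return 1;
}
```

Under this sketch 0xd7 and 0xdf are left alone, matching the gaps in the
table.  It also shows the UTF-8 hazard: 0xc3, the lead byte for U+00C0
through U+00FF, falls inside the second range and folds to 0xe3.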