Re: [PATCH 1/3] xfs: stabilize the tolower function used for ascii-ci dir hash computation

On Wed, Apr 05, 2023 at 03:48:00AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 10:07:06AM -0700, Darrick J. Wong wrote:
> > Which means that the kernel and userspace do not agree on the hash value
> > for a directory filename that contains those higher values.  The hash
> > values are written into the leaf index block of directories that are
> > larger than two blocks in size, which means that xfs_repair will flag
> > these directories as having corrupted hash indexes and rewrite the index
> > with hash values that the kernel now will not recognize.
> > 
> > Because the ascii-ci feature is not frequently enabled and the kernel
> > touches filesystems far more frequently than xfs_repair does, fix this
> > by encoding the kernel's toupper predicate and tolower functions into
> > libxfs.  This makes userspace's behavior consistent with the kernel.
> 
> I agree with making the userspace behavior consistent with the actual
> kernel behavior.  Sadly the documented behavior differs from both
> of them, so I think we need to also document the actual tables used
> in the mkfs.xfs manpage, as it isn't actually just ASCII.

Agreed.  Given that kernel tolower() behavior has been stable since 1996
(and remaps the ISO 8859-1 accented letters), the "ASCII CI" feature
most closely maps to "ISO 8859-1 CI".  But at this point there's not
even a shared understanding (Dave said latin1, you said 7-bit ascii,
IDGAF) so I agree that documenting the exact transformations in the
manpage is the only sane way forward.

I propose changing the mkfs.xfs manpage wording from:

"The version=ci  option  enables  ASCII  only case-insensitive filename
lookup and version 2 directories. Filenames  are  case-preserving, that
is, the names are stored in directories using  the  case  they  were
created with."

into:

"If the version=ci option is specified, the kernel will transform
certain bytes in filenames before performing lookup-related operations.
The byte sequence given to create a directory entry is persisted without
alterations.  The lookup transformations are defined as follows:

0x41 - 0x5a -> 0x61 - 0x7a
0xc0 - 0xd6 -> 0xe0 - 0xf6
0xd8 - 0xde -> 0xf8 - 0xfe

This transformation roughly corresponds to case insensitivity in ISO
8859-1 and may cause problems with other encodings (e.g. UTF-8).  The
feature will be disabled by default in September 2025, and removed from
the kernel in September 2030."
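
For illustration only, the three ranges in the proposed wording can be
sketched in C as below.  The names ci_fold_byte and ci_names_equal are
made up for this example and are not the kernel or libxfs functions; the
sketch assumes nothing beyond the table itself.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fold one filename byte according to the three
 * ranges listed in the proposed manpage text.  Every range maps to the
 * byte 0x20 higher; all other bytes pass through unchanged.
 */
static unsigned char ci_fold_byte(unsigned char c)
{
	if ((c >= 0x41 && c <= 0x5a) ||	/* 0x41 - 0x5a -> 0x61 - 0x7a */
	    (c >= 0xc0 && c <= 0xd6) ||	/* 0xc0 - 0xd6 -> 0xe0 - 0xf6 */
	    (c >= 0xd8 && c <= 0xde))	/* 0xd8 - 0xde -> 0xf8 - 0xfe */
		return (unsigned char)(c + 0x20);
	return c;			/* includes 0xd7 and 0xdf */
}

/*
 * Hypothetical helper: compare two names byte by byte after folding.
 * Returns 1 if they match under the table, 0 otherwise.
 */
static int ci_names_equal(const unsigned char *a, size_t alen,
			  const unsigned char *b, size_t blen)
{
	size_t i;

	if (alen != blen)
		return 0;
	for (i = 0; i < alen; i++)
		if (ci_fold_byte(a[i]) != ci_fold_byte(b[i]))
			return 0;
	return 1;
}
```

Under this sketch "README" and "readme" compare equal.  It also shows the
encoding hazard: byte 0xc3, which leads many two-byte UTF-8 sequences,
falls inside the second range and folds to 0xe3.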

> Does the kernel tolower behavior map to an actual documented charset?
> In that case we can just point to it, which would be way easier than
> documenting all the details.

It *seems* to operate on ISO 8859-1 (aka latin1), but Linus implied that
the history of lib/ctype.c is lost to the ages.  Or at least 1996-era
mailing list archives.

--D


