Re: [PATCH 1/3] xfs: stabilize the tolower function used for ascii-ci dir hash computation

On Tue, Apr 04, 2023 at 11:32:14AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 04, 2023 at 10:54:27AM -0700, Linus Torvalds wrote:
> > On Tue, Apr 4, 2023 at 10:07 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > >
> > > +       if (c >= 0xc0 && c <= 0xd6)     /* latin A-O with accents */
> > > +               return true;
> > > +       if (c >= 0xd8 && c <= 0xde)     /* latin O-Y with accents */
> > > +               return true;
> > 
> > Please don't do this.
> > 
> > We're not in the dark ages any more. We don't do crazy locale-specific
> > crud. There is no such thing as "latin1" any more in any valid model.
> > 
> > For example, it is true that 0xC4 is 'Ä' in Latin1, and that the
> > lower-case version is 'ä', and you can do the lower-casing exactly the
> > same way as you do for US-ASCII: you just set bit 5 (or "add 32" or
> > "subtract 0xE0" - the latter is what you seem to do, crazy as it is).
> > 
> > So the above was fine back in the 80s, and questionably correct in the
> > 90s, but it is COMPLETE GARBAGE to do this in the year 2023.
> 
> Yeah, I get that.  Fifteen years ago, Barry Naujok and Christoph merged
> this weird ascii-ci feature for XFS that purportedly does ... something.
> It clearly only works properly if you force userspace to use latin1,
> which is totally nuts in 2023 given that the distros default to UTF8
> and likely don't test anything else.  It probably wasn't even a good
> idea in *2008*, but it went in anyway.  Nobody tested this feature,
> metadump breaks with this feature enabled, but as maintainer I get to
> maintain these poorly designed half baked projects.

It was written specifically for an NFS/CIFS fileserver appliance
product, and Samba needed filesystem-side CI to be able to perform
even vaguely well on the industry-standard fileserver benchmarketing
workloads that were all the rage at the time.

Because of the inherent problems with CI and UTF-8 encoding, etc., it
was decided that Samba would be configured to export latin1
encodings, as that covered >90% of the target markets for the
product. Hence the "ascii-ci" casefolding code could be encoded into
the XFS directory operations, removing all the overhead of
casefolding from Samba.
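
To illustrate why a byte-wise fold only makes sense when names are
latin1, here is a minimal sketch. It is not the XFS code; the names
latin1_fold() and fold_name() are hypothetical, and the ranges are
the ones from the quoted patch plus plain ASCII A-Z.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not the XFS implementation: fold a single
 * Latin-1 byte to lower case by setting bit 5 (0x20).  The two
 * accented ranges are the ones in the quoted patch; 0xd7 (the
 * multiplication sign) and 0xdf (sharp s) are left untouched.
 */
unsigned char latin1_fold(unsigned char c)
{
	if ((c >= 'A' && c <= 'Z') ||
	    (c >= 0xc0 && c <= 0xd6) ||
	    (c >= 0xd8 && c <= 0xde))
		return c | 0x20;
	return c;
}

/* Fold a name in place, one byte at a time, ignoring the encoding. */
void fold_name(unsigned char *name, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		name[i] = latin1_fold(name[i]);
}
```

With latin1 input, 0xc4 ('Ä') folds to 0xe4 ('ä') as intended.  With
UTF-8 input, 'Ä' is the two bytes 0xc3 0x84; the byte-wise fold turns
that into 0xe3 0x84, which is neither 'ä' (0xc3 0xa4) nor a complete
UTF-8 sequence.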

In various "important" directory benchmarketing workloads, ascii-ci
resulted in speedups of 100-1000x. These were competitive results
compared to the netapp/bluearc/etc appliances of the time in the
same cost bracket.  Essentially, it was a special case solution to
meet a specific product requirement and was never intended to be
used outside that specific controlled environment.

Realistically, this is the one major downside of the "upstream first"
development principle, i.e. when the vendor product that required
a specific feature is long gone, upstream still has to support that
functionality even though there may be no users of it remaining
and/or no good reason for it continuing to exist.

I'd suggest that after this is fixed we deprecate ascii-ci and make
it go away at the same time v4 filesystems go away. It was, after
all, a feature written for v4 filesystems....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


