On 26/10/2022 20.10, Linus Torvalds wrote:
> On Tue, Oct 25, 2022 at 5:10 PM Rasmus Villemoes
> <linux@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Only very tangentially related (because it has to do with chars...): Can
>> we switch our ctype to be ASCII only, just as it was back in the good'ol
>> mid 90s
>
> Those US-ASCII days weren't really very "good" old days, but I forget
> why we did this (it's attributed to me, but that's from the
> pre-BK/pre-git days before we actually tracked things all that well,
> so..)
>
> Anyway, I think anybody using ctype.h on 8-bit chars gets what they
> deserve, and I think Latin1 (or something close to it) is better than
> US-ASCII, in that it's at least the same as Unicode in the low 8
> chars.

My concern is that it's currently somewhat ill specified what our ctype
actually represents, and that would be a lot easier to specify if we
just said ASCII: everything above 0x7f is neither punct nor cntrl nor
alpha nor anything else.

For example, people may do stuff like

  isprint(c) ? c : '.'

in a printk() call, but most likely the consumer (somebody doing dmesg)
would, at least these days, use utf-8, so that just results in a broken
utf-8 sequence. Now, I see that a lot of callers actually do
"isascii(c) && isprint(c)", so they already know about this, but there
are also many instances where isprint() is used by itself.

There's also stuff like fs/afs/cell.c and other places that use
isprint/isalnum/... to make decisions on what is allowed on the wire
and/or in a disk format, where it's then hard to reason about just
exactly what is accepted. And there are places that use toupper() on
their strings to normalize them; that's broken when toupper() isn't
idempotent.

> So no, I'm disinclined to go back in time to what I think is an even
> worse situation. Latin1 isn't great, but it sure beats US-ASCII. And
> if you really want just US-ASCII, then don't use the high bit, and make
> your disgusting 7-bit code be *explicitly* 7-bit.
>
> Now, if there are errors in that table wrt Latin1 / "first 256
> codepoints of Unicode" too, then we can fix those.

AFAICT, the differences are:

- 0xaa (FEMININE ORDINAL INDICATOR), 0xb5 (MICRO SIGN) and 0xba
  (MASCULINE ORDINAL INDICATOR) should be lower (hence alpha and
  alnum), not punct.

- depending a little on just exactly what one wants latin1 to mean, but
  if it does mean "first 256 codepoints of Unicode", 0x80-0x9f should
  be cntrl

- for some reason at least glibc seems to classify 0xa0 as punctuation
  and not space (hence also as isgraph)

- 0xdf and 0xff are correctly classified as lower, but since they don't
  have upper-case versions (at least not any that are representable in
  latin1), correct toupper() behaviour is to return them unchanged; we
  just subtract 0x20, so 0xff becomes 0xdf, which isn't isupper(), and
  0xdf becomes something that isn't even isalpha().

Fixing the first would create more instances of the last, and I think
the only sane way to fix that would be a 256-byte lookup table for
toupper() to use (rough sketch below).

> Not that anybody has apparently cared since 2.0.1 was released back in
> July of 1996 (btw, it's sad how none of the old linux git archive
> creations seem to have tried to import the dates, so you have to look
> those up separately)

Huh? That commit has 1996 as the author date, while its commit date is
indeed 2007. The very first line says:

  author  linus1 <torvalds@xxxxxxxxxxxxxxxxxxx>  1996-07-02 11:00:00 -0600

> And if nobody has cared since 1996, I don't really think it matters.

Indeed, I don't think it's a huge problem in practice.
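For concreteness, the table-based toupper() I'm thinking of would be
roughly as below. This is just an untested userspace sketch with names
invented for the example (latin1_toupper, build_latin1_toupper), not
actual lib/ctype.c code or a patch:

/*
 * Untested sketch: a 256-byte toupper() table for "latin1 == first
 * 256 codepoints of Unicode".
 */
#include <stdio.h>

static unsigned char latin1_toupper[256];

static void build_latin1_toupper(void)
{
        int c;

        /* Identity by default: this covers 0xaa, 0xb5, 0xba, 0xdf and
         * 0xff, which are lower-case but have no latin1 upper-case
         * counterpart. */
        for (c = 0; c < 256; c++)
                latin1_toupper[c] = c;

        for (c = 'a'; c <= 'z'; c++)
                latin1_toupper[c] = c - 0x20;

        /* 0xe0..0xfe map to 0xc0..0xde, except 0xf7 (DIVISION SIGN). */
        for (c = 0xe0; c <= 0xfe; c++)
                if (c != 0xf7)
                        latin1_toupper[c] = c - 0x20;
}

int main(void)
{
        build_latin1_toupper();

        /* Today's subtract-0x20 turns 0xff into 0xdf; the table keeps
         * both unchanged, and still uppercases real letters. */
        printf("toupper(0xff) = 0x%02x\n", latin1_toupper[0xff]);
        printf("toupper(0xdf) = 0x%02x\n", latin1_toupper[0xdf]);
        printf("toupper(0xe9) = 0x%02x\n", latin1_toupper[0xe9]);
        return 0;
}

The point is just that a table makes toupper() a no-op for the
characters that have no upper-case counterpart, so reclassifying
0xaa/0xb5/0xba as lower wouldn't create new breakage.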
But it still bothers me that such a simple (and usually overlooked)
corner of the kernel's C library is ill-defined and arguably a little
buggy.

Rasmus