Re: [PATCH] isalpha.3: behavior is undefined if c is out-of-range

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 6/5/23 14:34, Yedidyah Bar David wrote:
> Hi,
> 
> On Mon, Jun 5, 2023 at 2:35 PM Alejandro Colomar <alx.manpages@xxxxxxxxx> wrote:
>>
>> Hi Yedidyah,
>>
>> On 6/5/23 13:17, Yedidyah Bar David wrote:
>>> Clarify that the behavior of these functions is undefined if c is
>>> neither in the unsigned char range nor EOF.
>>>
>>> I copied the added text from toupper.3.
>>>
>>> In practice, calling them on out-of-range values - tested with recent
>>> glibc/gcc - is simply reading from random process memory - meaning, you
>>> either get some garbage, if the memory was readable, or a segmentation
>>> fault. See also:
>>>
>>> https://stackoverflow.com/questions/65514890/glibcs-isalpha-function-and-the-en-us-utf-8-locale
>>>
>>> Signed-off-by: Yedidyah Bar David <didi@xxxxxxxxxx>
>>
>> This is already covered by the NOTES section, isn't it?
> 
> It's _mentioned_ there, correct - but not sure it's covered.

You're right.  That's why I've sent the patch mentioning UB.
What do you think about that one?  (I now see that you like it).

> 
> It's also mentioned in toupper.3's NOTES.

I'll check that page to see if it needs some simplifying.

> 
> I think it's helpful to explicitly say that behavior is undefined in this case.

Yep.

> If you feel like doing this inside NOTES, one way or another, ok for me.
> 
> Right now, NOTES says what you must do, but not what happens if you
> don't do that.
> 
> It also says that for the common case of using them on signed char, you should
> explicitly cast to unsigned char, first. It also tries to explain why this is
> necessary. The explanation explains why it's necessary for compliance with the
> standard, but not why it's a good thing more generally - latter is not
> explained,
> and AFAICT from reading glibc sources, is not necessary - see e.g. this comment
> from ctype.h:
> 
>    These point into arrays of 384, so they can be indexed by any `unsigned
>    char' value [0,255]; by EOF (-1); or by any `signed char' value
>    [-128,-1).  ISO C requires that the ctype functions work for `unsigned
>    char' values and for EOF; we also support negative `signed char' values
>    for broken old programs.

Consider what happens with character 0xFF.  If char is signed, it will be
interpreted as -1 (i.e., EOF).  We're lucky, because 0xFF is not a meaningful
char, so probably all isXXX() functions return false for it, but it's slightly
different from EOF semantically.  If no locales give a meaning for 0xFF, maybe
the cast can be removed from ISO C.  I do something different: use
-funsigned-char when compiling, so char is effectively unsigned char (except
that pointers do not convert automatically).

> 
> The real reason why you should not call them on negative values other than
> EOF - casted to unsigned char or not - is simply that most likely this isn't
> what you meant to do. But that's not about compliance with the standard...

I guess the standard was cautious to not make 0xFF a useless char.  If that's
not an issue, I agree, and these functions could do the conversion internally.

Cheers,
Alex

> 
> In my patch, I felt like "something should be done", but when deciding what
> actually should be done, I decided to simply go with what's in toupper.3,
> which seems good enough for me.
> 
> So, what do you think?
> 
>>  BTW, I'll
>> rename that section to CAVEATS.
> 
> Not sure all of its current content is a caveat.
> 
> Best regards,
> 
>>
>> Thanks,
>> Alex
>>
>>> ---
>>>  man3/isalpha.3 | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/man3/isalpha.3 b/man3/isalpha.3
>>> index 443c2aa09..8ad997c29 100644
>>> --- a/man3/isalpha.3
>>> +++ b/man3/isalpha.3
>>> @@ -145,6 +145,15 @@ is the special locale object
>>>  .BR duplocale (3))
>>>  or is not a valid locale object handle.
>>>  .PP
>>> +If
>>> +.I c
>>> +is neither an
>>> +.I "unsigned char"
>>> +value nor
>>> +.BR EOF ,
>>> +the behavior of these functions
>>> +is undefined.
>>> +.PP
>>>  The list below explains the operation of the functions without
>>>  the "_l" suffix;
>>>  the functions with the "_l" suffix differ only in using the locale object
>>> --
>>> 2.31.1
>>>
>>> Best regards,
>>> --
>>> Didi
>>>
>>
>> --
>> <http://www.alejandro-colomar.es/>
>> GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5
> 
> 
> 

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


[Index of Archives]     [Kernel Documentation]     [Netdev]     [Linux Ethernet Bridging]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux