Re: [PATCH] string_copying.7: don't grant strl{cpy,cat} magic

Alejandro Colomar <alx@xxxxxxxxxx> · Sat, 29 Jul 2023 14:04:59 +0200

Hi Lennart,

On 2023-07-29 01:51, Lennart Jablonka wrote:
>>> A function can't check whether a pointer points to the start of a
>>> string.  What it certainly can do is to keep reading until you either
>>> find a null byte or read the secret key that lies adjacent in memory and
>>> post it to your favorite mailing list.
>>>
>>> strlcpy and strlcat behave the exact same way any other function
>>> accepting a string behaves:  If you don't pass a string, the behavior is
>>> undefined.  And that, I believe, does not deserve a special mention
>>> here, seeing as all the other string functions don't get such a mention
>>> either.
>>
>> Hmm, you're right.
>>
>> What I intended to mean is that while most other functions --e.g.,
>> strcpy(3)-- overwrite after the buffer, the design of strlcpy(3) is a
>> bit more clever and makes it so that when the caller invokes UB, it
>> tries to exploit that UB in a way that the input string is entirely
>> read before starting to write, which makes it more likely to crash in
>> a read, rather than writing to random memory (which might still happen
>> if the read is not enough to crash, though).
> 
> I’m interested in where you got that from.  This is strlcpy:

D'oh!  I was talking from memory.  I shouldn't do that!
Please forget and forgive what I said above.  :)

> 
> 	size_t
> 	strlcpy(char *dst, const char *src, size_t dsize)
> 	{
> 		const char *osrc = src;
> 		size_t nleft = dsize;
> 
> 		/* Copy as many bytes as will fit. */
> 		if (nleft != 0) {
> 			/*
> 			 * <humm> This is where reading and
> 			 * writing take place.  src doesn’t get
> 			 * read entirely before writing begins.
> 			 */
> 			while (--nleft != 0) {
> 				if ((*dst++ = *src++) == '\0')
> 					break;
> 			}
> 		}
> 
> 		/* Not enough room in dst, add NUL and traverse rest of src. */
> 		if (nleft == 0) {
> 			if (dsize != 0)
> 				*dst = '\0';		/* NUL-terminate dst */
> 			while (*src++)
> 				;
> 		}
> 
> 		return(src - osrc - 1);	/* count does not include NUL */
> 	}
> 
> I don’t see what you mean in there.

I lied.  I should have said that it writes what is safe to write, and
then uses a somewhat "safer" version of undefined behavior (compared
to other string copying functions).  The standard differentiates
"bounded UB", which doesn't perform out-of-bounds stores, from
"critical UB", which performs them.  In usual jargon, UB is UB, and
there's no mild form of UB; however, the standard prescribes a bounded
form of UB.  However, I'm not sure compilers --and specifically GCC--
follow such a prescription of bounded UB, so it's better to consider
all UB to be critical UB, just to fall on the safe side.

In the case of strlcpy(3), it forces a bounded UB, to increase the
likeliness of a crash (due to reading out-of-bounds), to prevent
critical UB being possibly invoked in the following lines of code.

In the end, that's better hidden as a benevolent implementation detail,
and users should just use the functions with defined behavior, and
taking care of calling the appropriate function for each use case.

References:

<https://port70.net/~nsz/c/c11/n1570.html#L.2>

> 
>>> The "definitions" at the top don't make it clear enough that they aren't
>>> supposed to be precise definitions used in your usual C jargon; that
>>> while string and a string's length and an object's size are defined
>>> by C, and while you understand and sometimes use most of these terms,
>>> there is no norm that says "When you talk about a pointer to one past
>>> a buffer's last byte, you call it 'end'!"  That there is no norm that
>>> says "When you say 'copy,' you write to the beginning, not to
>>> elsewhere!"
>>
>> True.  My intention was to settle the jargon and pseudo-standardize
>> these terms (of course in decades, not tomorrow).  Every other project
>> uses a different term, and I'd like to unify.
> 
> I’m not quite sure yet what to think of that, but I do think you 
> should mention that in the man page.  “Here are some terms; 
> C defines ‘string,’ but you are invited to re-use all these 
> terms—it’d be great if we were in unity, after all.”

I'm open to discussion regarding how to express that.  I'm just not sure.

> 
>>> Furthermore, the terminology around "character sequences" confused me
>>> while reading the page.  When do you have a buffer, neither
>>> null-terminated nor null-padded, that is defined not to contain null
>>> bytes?
>>
>> NGINX uses these internally:
>>
>> $ grepc ngx_str_t
>> ./src/core/ngx_string.h:16:
>> typedef struct {
>>    size_t      len;
>>    u_char     *data;
>> } ngx_str_t;
>>
>>
>> Basically it's a non-zero buffer plus its length.  They have interesting
>> properties; for example, you can take a substring (or should I call it
>> sub-sequence) just by taking a pointer to somewhere in the middle, and
>> the length of the substring, without really copying the string.
> 
> same with a buffer without restriction of null bytes

Yep; in fact, ustpcpy() is just mempcpy(3) but with char* instead of void*.

> 
>>>  And how do functions behave that want a character sequence if
>>> that does contain a null byte?  Do they take the null byte to signal the
>>> character sequence's end?  Need they accept the null byte as part of the
>>> character sequence?  Is the behavior undefined?
>>
>> NGINX handles these strings by the length stored in the buffer.  Any
>> null byte in the middle of a string would be treated as any other
>> character, although they would be problematic when interfacing libc; in
>> general, care is taken to not have null bytes in those strings.  NGINX
>> uses mempcpy(3) (or rather, ngx_cpymem(), which is the same thing) to
>> copy these things, or other more sophisticated functions and macros
>> based on mempcpy(3).
>>
>> $ grepc ngx_cpymem
>> ./src/core/ngx_string.h:97:
>> #define ngx_cpymem(dst, src, n)   (((u_char *) ngx_memcpy(dst, src, n)) + (n))
>>
>>
>> ./src/core/ngx_string.h:107:
>> #define ngx_cpymem(dst, src, n)   (((u_char *) memcpy(dst, src, n)) + (n))
> 
> I think if you want to pseudo-standardize terminology, you should 
> mention how functions are supposed to behave when seeing a null 
> byte in a character sequence.  (I plead for undefined behavior.)
> 
> Now, do you think character sequences are more common than simple 
> buffers, than not caring at all about the specific bytes?  That’s 
> what I do; that’s what I think should be usually done.  And do you 
> suppose character sequences are more valuable than simple buffers?  
> Sure, iff the sequences are null-terminated, you can use them as 
> strings, too, but that doesn’t seem like much of a benefit.

The benefit of differentiating them, and making sure that certain buffers
(let's call those character sequences) do not have null bytes, allows you
to transform into a string just by appending a null byte, and you already
know the length.  That's extensively (ab)used in the code base in which I
work (nginx).

Of course, you could argue that you'll know if you have null bytes in
those buffers or not, and if you're going to use them as strings, you
probably don't write zeros in the middle, and can just reuse the term
buffer for them.  But I prefer to use a term to differentiate them with
a different term in this page, I think.

> 
>>> And lastly, the man page doesn't list the functions' standards or who
>>> invented them.
>>
>> That was deliberate.  The specific pages of each of those functions
>> already documents that.  Since the intention was to differentiate the
>> use cases of each of the functions, I believe mentioning the standards
>> would just deviate from that main point, and so I omitted that info.
>> The point is that choosing one of these functions shouldn't depend on
>> what standards are available to the programmer.  Instead, the
>> programmer should use the appropriate function, and then if it's not
>> available, it should be written within the project (probably as a
>> wrapper around other functions) to be able to use it.  That's why I
>> provided some naive implementations of some of them.
> 
> That’s not a bad idea, but for some functions, this is the man 
> page.  I wondered where stpecpy comes from, I tried to open 
> stpecpy(3), I got string_copying(7).

Hmm, there's some notes that they aren't defined elsewhere:

              This function is not provided by any library; See  EXAM‐
              PLES for a reference implementation.

and then in EXAMPLES:

   Implementations
       Here  are  reference implementations for functions not provided
       by libc.

I could have added a STANDARDS section that covers those, but I decided
to not do that, and instead force one to read the entire page to learn
that info.  :)

Cheers,
Alex

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5

Attachment:
OpenPGP_signature

Description: OpenPGP digital signature