Re: Assignment of union containing const-qualifier member

Alejandro Colomar via Gcc-help <gcc-help@xxxxxxxxxxx> · Sun, 4 Feb 2024 19:40:23 +0100

Hi Amol,

On Sun, Feb 04, 2024 at 01:03:48PM +0530, Amol Surati wrote:
> On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help
> <gcc-help@xxxxxxxxxxx> wrote:
> >
> > On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > > Hi,
> > >
> 
> [ ... ]
> 
> > structure, that doesn't help.  memcpy(3) does help, but it loses all
> > type safety.
> >
> > Maybe this could be allowed as an extension.  Any thoughts?
> >
> 
> Does it make sense to propose that, if the first top-level member of a
> union is completely (i.e. recursively) writable, then a non-const union
> object as a whole is writable? If so, then, for union objects a and b of
> a union that has such const members, a = b can be expected to not
> raise errors about const-correctness.

To have a specific proposal, I'll specify it as a diff of ISO C11:

	$ diff -u c11 suggestion 
	--- c11	2024-02-04 19:37:27.520851005 +0100
	+++ suggestion	2024-02-04 19:38:56.785402567 +0100
	@@ -8,8 +8,8 @@
	 does not have array type,
	 does not have an incomplete type,
	 does not have a const- qualified type,
	-and if it is a structure or union,
	-does not have any member
	+and if it is a structure does not have any member,
	+or if it is a union does not have all members,
	 (including, recursively,
	 any member or element of all contained aggregates or unions)
	 with a const- qualified type.

(Modifying <http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1p1>)
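
To make the effect of that wording concrete, here is a minimal sketch.
The names `union u` and `copy_u` are made up for illustration.  Under
C11 as written, the commented-out assignment is rejected, because
`union u` has a const-qualified member and so `*dst` is not a
modifiable lvalue.  Under the suggested wording it would be accepted,
since not all members are const.  The memcpy(3) call is the workaround
available today, which as far as I can tell is fine as long as the
object itself wasn't defined const.

```c
#include <string.h>

union u {
	int        a;
	const int  b;
};

/* Copy *src into *dst.  Hypothetical helper, for illustration only. */
static void
copy_u(union u *dst, const union u *src)
{
	/* *dst = *src; */  /* Rejected today: *dst is not a modifiable lvalue. */

	/* Accepted today, but the compiler no longer checks the types. */
	memcpy(dst, src, sizeof(*dst));
}
```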

> 
> It seems that a union only provides a view of the object. The union
> object doesn't automatically become const qualified if a member
> of the union is const-qualified. This seems to be the reason v.w = u.w
> works; otherwise, that modification can also be viewed as the
> modification of an object (v.r) defined with a const-qualified type through
> the use of an lvalue (v.w) with non-const-qualified type - something that's
> forbidden by the std.

Modifying a union via a non-const member is fine in C, I believe.  I
think you're creating a new object, and discarding the old one, so you
don't need to care if there was an old object defined via a
const-qualified type.  That is, the following code is valid C, AFAIK:

	alx@debian:~/tmp$ cat u.c 
	union u {
		int        a;
		const int  b;
	};

	int
	main(void)
	{
		union u  u = {.b = 42};

		u.a = 7;
		return u.b;
	}
	alx@debian:~/tmp$ gcc-14 -Wall -Wextra u.c 
	alx@debian:~/tmp$ ./a.out ; echo $?
	7
	alx@debian:~/tmp$ clang-17 -Weverything u.c 
	alx@debian:~/tmp$ ./a.out ; echo $?
	7

> More towards the use of the string as described:
> If there are multiple such union objects that point to the same string,
> and if a piece of code decides to modify the string, other consumers of
> this string remain unaware of the modification, unless they check for it,
> for e.g., by keeping a copy, calc. hash, etc., to ensure that the string was
> indeed not silently modified behind their backs.

`const` only guarantees that an object is not modified through that
pointer.  As long as you keep another pointer to the same object, it can
be modified via that other pointer.  To guarantee that an object is
really constant --at least for what concerns a function--, you need to
also specify `restrict`.  If you have a `const type* restrict`, then you
know for sure it is constant, as far as the current function is
concerned.
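
A small sketch of that difference, with function names invented for the
example.  In the first function the two pointers may alias, so the value
read through the const pointer can change in the middle of the function.
In the second one, `restrict` is a promise made by the caller, so calling
it with aliasing pointers would be undefined behavior.

```c
/* Without restrict: ro may alias rw, so the two reads of *ro can differ. */
static int
diff_alias(const int *ro, int *rw)
{
	int  before = *ro;

	*rw += 1;
	return *ro - before;
}

/*
 * With restrict: the caller promises that the object read through ro
 * is not modified by any other means during the call.
 */
static int
diff_restrict(const int *restrict ro, int *restrict rw)
{
	int  before = *ro;

	*rw += 1;
	return *ro - before;  /* The compiler may assume this is 0. */
}
```

Calling `diff_alias(&x, &x)` is valid and sees the modification; the
same call to `diff_restrict` would break the promise.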

If you're worried about multi-threaded programs, well, unions aren't any
more problematic here than passing a `const T *restrict` to a function,
and modifying it in another thread via a non-const lvalue.  As long as
the original object wasn't const, that's fair game.  It's the
programmer's task to make sure the functions behave well if that can
happen.

> 
> I think it is better to have a 'class' and associated APIs.

But we can't have that in C.

> See [1], for e.g., or the implementation of c++ std::string.
> 
> The ownership of an object of such a class can be passed by passing
> a non-const pointer to the object.
> 
> Functions that are not supposed to own the object can be passed a
> const pointer. Despite that, if such functions need to modify it for local
> needs, they can create a copy to work with.
> 
> One can additionally maintain a ref-count on the char pointer, to avoid
> having to unnecessarily copy a string if it is going to be placed in several
> stay-resident-after-return data-structures.

I normally prefer simple C strings, with a simple pointer.  The reason
I'm using this struct+union is performance.  In nginx, to reduce memory
consumption (you can get substrings by copying a pointer and specifying
a length), and also avoid calculating lengths of strings more than once,
we use these structures.

So far, we were using a simple struct:

	typedef struct string {
		size_t  length;
		u_char  *start;
	} string_t;  // it has a different name, but let's keep it simple

But that means we basically can't use `const` at all with our strings.
Because if you specify

	void foo(const string_t *str);

that means that you can't modify the pointer, but you can actually
modify the pointee.  Which means that you can't guarantee that a string
isn't corrupted after some call, unless you inspect all the code that
the function calls, recursively.
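
A minimal sketch of the problem, using the simplified struct from above
(with `unsigned char` spelled out instead of `u_char`, and a made-up
body for foo()):

```c
#include <stddef.h>

typedef struct string {
	size_t         length;
	unsigned char  *start;  /* u_char in the real code */
} string_t;

/* Takes a pointer to const, and yet it can corrupt the string. */
static void
foo(const string_t *str)
{
	/* str->start = NULL; */  /* Rejected: the pointer member is const here. */

	if (str->length > 0)
		str->start[0] = 'X';  /* Accepted: the pointee is not const. */
}
```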

I started working on a way to improve these strings around a year ago,
and have recently come up with something.

> 
> -Amol
> 
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3210.pdf

Maybe you can get something from what I've learnt with strings in Nginx,
since they're quite close to what that proposal has.

The main concern I have with that proposal is the same concern I've had
with strings in Nginx so far: you can't really make them `const`.
Unless you make the type opaque, and only provide accessors via
functions that protect the strings even if they could modify them.

You can only make them const if you use two distinct types: a read-only
version, let's call it rstring, and a read-write version, let's call it
string.

	typedef struct rstring_s  rstring_t;

	struct rstring_s {
	    size_t                        length;
	    const char                    *start;
	};

	union nxt_string_u {
	    struct {
		size_t                    length;
		char                      *start;
	    };
	    struct {
		size_t                    length;
		char                      *start;
	    } w;
	    const rstring_t               r;
	};
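
Here is a sketch of how such a pair of types might be used.  It assumes
C11 anonymous structs, drops the `w` member for brevity, and the
typedef `nxt_string_t` and both functions are invented for the example.
The owner writes through the writable view and hands out `&s->r` to
code that should only read.

```c
#include <stddef.h>

typedef struct rstring_s    rstring_t;
typedef union nxt_string_u  nxt_string_t;

struct rstring_s {
	size_t      length;
	const char  *start;
};

union nxt_string_u {
	struct {
		size_t  length;
		char    *start;
	};
	const rstring_t  r;
};

/* Read-only consumer: it can't write through s->start. */
static size_t
count_char(const rstring_t *s, char c)
{
	size_t  n = 0;

	for (size_t i = 0; i < s->length; i++)
		n += (s->start[i] == c);
	return n;
}

/* Owner: writes via the writable view, then reads via the read-only one. */
static size_t
blank_and_count(nxt_string_t *s, char c)
{
	for (size_t i = 0; i < s->length; i++) {
		if (s->start[i] == c)
			s->start[i] = ' ';
	}
	return count_char(&s->r, ' ');
}
```

With this, an attempt to write `s->start[i] = ' '` inside count_char()
is diagnosed by the compiler, which is what the plain struct can't give.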

In Nginx we have another complexity: we don't necessarily terminate our
strings: this allows getting a substring in the middle of another string
without needing to make an actual copy of the memory.  But then it means
we need more types to have type safety.  I haven't finished developing
that, so I can't tell you if the code below does work, but this is what
I'm really working with at the moment:

	typedef struct nxt_rstr_s   nxt_rstr_t;
	typedef union nxt_rstrz_u   nxt_rstrz_t;

	struct nxt_rstr_s {
	    size_t                          length;
	    const u_char                    *start;
	};

	union nxt_str_u {
	    struct {
		size_t                      length;
		u_char                      *start;
	    };
	    struct {
		size_t                      length;
		u_char                      *start;
	    } w;
	    const nxt_rstr_t                r;
	};

	union nxt_rstrz_u {
	    struct {
		size_t                      length;
		union {
		    const u_char            *start;
		    const char              *cstrz;
		};
	    };
	    struct {
		size_t                      length;
		const u_char                *start;
	    } w;
	    const nxt_rstr_t                r;
	};

	union nxt_strz_u {
	    struct {
		size_t                      length;
		union {
		    u_char                  *start;
		    char                    *cstrz;
		};
	    };
	    struct {
		size_t                      length;
		u_char                      *start;
	    } w;
	    const nxt_rstr_t                r;
	    const nxt_rstrz_t               rz;
	};

The `*z` types contain null-terminated strings, while the other ones
don't.  You can read terminated strings as non-terminated ones, but not
the other way around.  And you can access writable strings as read-only
strings, but not the other way around.
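
A reduced sketch of the terminated vs non-terminated distinction.  The
types `rstr_t` and `rstrz_t` are simplified stand-ins for the read-only
types above, and both functions are hypothetical.  A terminated string
exposes a `cstrz` view that libc can use; a substring is only a pointer
and a length, so it has no such view.

```c
#include <stddef.h>
#include <string.h>

/* Read-only, not necessarily terminated. */
typedef struct {
	size_t               length;
	const unsigned char  *start;
} rstr_t;

/* Read-only, terminated. */
typedef union {
	struct {
		size_t  length;
		union {
			const unsigned char  *start;
			const char           *cstrz;  /* libc-compatible view */
		};
	};
	const rstr_t  r;
} rstrz_t;

/* Wrap a terminated C string, computing its length once. */
static rstrz_t
rstrz_from_cstr(const char *s)
{
	rstrz_t  z = {.length = strlen(s), .cstrz = s};

	return z;
}

/* Substring without copying memory; the result is clipped to s. */
static rstr_t
rstr_sub(const rstr_t *s, size_t pos, size_t len)
{
	rstr_t  sub = {0, s->start + s->length};

	if (pos < s->length) {
		sub.start = s->start + pos;
		sub.length = (len < s->length - pos) ? len : s->length - pos;
	}
	return sub;
}
```

A terminated string can be passed as `&z.r` wherever an `rstr_t` is
expected, but the result of rstr_sub() has no `cstrz` member, so it
can't be handed to strlen(3) by accident.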

(We use `u_char` to avoid the problems that `char` has due to its
ambiguous sign; I would personally prefer using -funsigned-char, but
that's what it is, for historic reasons.)
Anyway, that `u_char` makes sure we don't mix our strings with libc
calls accidentally, and I only provide the `cstrz` member in unions that
actually provide a libc-compatible string view.

Have a lovely day,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.