Re: Assignment of union containing const-qualifier member

Amol Surati via Gcc-help <gcc-help@xxxxxxxxxxx> · Wed, 7 Feb 2024 09:43:55 +0530

Hi Alex,

On Mon, 5 Feb 2024 at 00:10, Alejandro Colomar <alx@xxxxxxxxxx> wrote:
>
> Hi Amol,
>
> On Sun, Feb 04, 2024 at 01:03:48PM +0530, Amol Surati wrote:
> > On Wed, 31 Jan 2024 at 23:46, Alejandro Colomar via Gcc-help
> > <gcc-help@xxxxxxxxxxx> wrote:
> > >
> > > On Tue, Jan 30, 2024 at 10:45:11PM +0100, Alejandro Colomar wrote:
> > > > Hi,
> > > >
> >
> > [ ... ]
> >
> > > structure, that doesn't help.  memcpy(3) does help, but it looses all
> > > type safety.
> > >
> > > Maybe this could be allowed as an extension.  Any thoughts?
> > >
> >
> > Does it make sense to propose that, if the first top-level member of a
> > union is completely (i.e. recursively) writable, then a non-const union
> > object as a whole is writable? If so, then, for union objects a and b of
> > a union that has such const members, a = b can be expected to not
> > raise errors about const-correctness.
>
> To have a specific proposal, I'll specify it as a diff of ISO C11:
>
>         $ diff -u c11 suggestion
>         --- c11 2024-02-04 19:37:27.520851005 +0100
>         +++ suggestion  2024-02-04 19:38:56.785402567 +0100
>         @@ -8,8 +8,8 @@
>          does not have array type,
>          does not have an incomplete type,
>          does not have a const- qualified type,
>         -and if it is a structure or union,
>         -does not have any member
>         +and if it is a structure does not have any member,
>         +or if it is a union does not have all members,

As mentioned in my other reply, the assignment from one object to
another (of the same union type that has a const member) will have
to copy the size of the largest member, regardless of which member
was active the last time the source object was written to. Such a
copy, I believe, is a sort of memcpy that the compiler inserts. And
as you had already noticed before, memcpy loses all type-safety.
These might be the reasons the std flags an error when such union
objects are assigned to - assignment implies type-erasing memcpy.

>          (including, recursively,
>          any member or element of all contained aggregates or unions)
>          with a const- qualified type.
>
> (Modifying <http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1p1>)

[ ... ]

> Modifying a union via a non-const member is fine in C, I believe.  I

That's true, but as you noted in your other reply, unions allow diluting
the type-safety to a great extent.

Linux kernel has a piece of code that does modify such a union
although it possesses decorations for the sparse semantic checker.
In its essence, it is similar to:

union {
    const int flags;
    int __attribute__ ((noderef)) __flags;
};
See https://sparse.docs.kernel.org/en/latest/annotations.html#noderef

[ ... ]

> > I think it is better to have a 'class' and associated APIs.
>
> But we can't have that in C.

By 'class', I meant a rigid way of accessing the struct strictly through the
provided APIs.

[ ... ]

> The main concern I have with that proposal is the same concern I've had
> with strings in Nginx so far: you can't really make them `const`.
> Unless you make the type opaque, and only provide accessors via
> functions that protect the strings even if they could modify them.

The type need not be opaque, to allow for optimizations. But the users of
the type must behave as if the type were opaque. The bulk of the API can
be 'static inline' functions within the header, to allow for optimizations and
inlining within a single translation unit. There's also LTO and thin-LTOs to
carry out optimizations across translation units, if needed.

>
> You can only make them const, if you use two distinct types: a read-only
> version, let's call it rstring, and a read-write version, let's call it
> string.

But do we need both views (r/w and r/o) at the same time?

>
>         struct rstring_s {
>             size_t                        length;
>             const char                    *start;
>         };
>
>
>         union nxt_string_u {
>             struct {
>                 size_t                    length;
>                 char                      *start;
>             };
>             struct {
>                 size_t                    length;
>                 char                      *start;
>             } w;
>             const rstring_t               r;
>         };
>
> In Nginx we have another complexity: we don't necessarily terminate our
> strings: this allows getting a substring in the middle of another string
> without needing to make an actual copy of the memory.  But then it means
> we need more types to have type safety.  I haven't finished developing
> that, so I can't tell you if the code below does work, but this is what
> I'm really working with at the moment:
>
>         struct nxt_rstr_s {
>             size_t                          length;
>             const u_char                    *start;
>         };
>
>
>         union nxt_str_u {
>             struct {
>                 size_t                      length;
>                 u_char                      *start;
>             };
>             struct {
>                 size_t                      length;
>                 u_char                      *start;
>             } w;
>             const nxt_rstr_t                r;
>         };
>
>
>         union nxt_rstrz_u {
>             struct {
>                 size_t                      length;
>                 union {
>                     const u_char            *start;
>                     const char              *cstrz;
>                 };
>             };
>             struct {
>                 size_t                      length;
>                 const u_char                *start;
>             } w;
>             const nxt_rstr_t                r;
>         };
>
>
>         union nxt_strz_u {
>             struct {
>                 size_t                      length;
>                 union {
>                     u_char                  *start;
>                     char                    *cstrz;
>                 };
>             };
>             struct {
>                 size_t                      length;
>                 u_char                      *start;
>             } w;
>             const nxt_rstr_t                r;
>             const nxt_rstrz_t               rz;
>         };
>
> Structures `***z` contain null-terminated strings, while the other ones
> don't.  You can read terminated strings as non-terminated ones, but not
> the other way.  And you can access writable strings as read-only
> strings, but not the other way around.
>
> (We use `u_char` to avoid the problems that `char` has due to its
> ambiguous sign; I would personally prefer using -funsigned-char, but
> that's what it is, for historic reasons.)
> Anyway, that `u_char` makes sure we don't mix our strings with libc
> calls accidentally, and I only provide the `cstrz` member in unions that
> actually provide a libc-compatible string view.

Passing these structs+unions is still passing the string as r/w. It does
not seem too far away from passing non-const pointers and sizes.

In a sort of a project of mine, I had to implement a buffer that can be used
to store ref-counted (non-terminated) strings and binary data. I did initially
create two different types, buf_rw and buf_ro, but then scrapped that for
what I have now, a skeleton of which is shown below:

struct string {
    size_t size;
    const char *data;
};

data[i] is now marked as non-modifiable.

Passing 'const struct string *' to a function implies that the function cannot
modify (in addition to data[i]), the fields size and data, through the
passed ptr.

A function that is passed 'struct string *' can still be allowed to
inplace-modify data[i] by invoking appropriate string APIs that temporarily
casts away the constness of data[i].

The SEI CERT coding standard for C allows for an exception for such
modifications. See
https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const

Some other options:
Make use of the sparse semantic checker.

It might help to name the fields with underscores to highlight the fact that
they are internal fields.  For e.g. 'str_size__' and 'str_data__'. Any use of
these fields outside of the dedicated API functions can now be easily
traced by simple grep searches.

It might also help to create a visible type such as

struct string {
    const uintptr_t raw[2];
};

and let the API manage the translation between uintptr_t and size_t/char *.

-Amol

[1] https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const

>
> Have a lovely day,
> Alex
>
> --
> <https://www.alejandro-colomar.es/>
> Looking for a remote C programming job at the moment.