Re: Assignment of union containing const-qualifier member

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Amol,

On Wed, Feb 07, 2024 at 09:43:55AM +0530, Amol Surati wrote:
> > To have a specific proposal, I'll specify it as a diff of ISO C11:
> >
> >         $ diff -u c11 suggestion
> >         --- c11 2024-02-04 19:37:27.520851005 +0100
> >         +++ suggestion  2024-02-04 19:38:56.785402567 +0100
> >         @@ -8,8 +8,8 @@
> >          does not have array type,
> >          does not have an incomplete type,
> >          does not have a const- qualified type,
> >         -and if it is a structure or union,
> >         -does not have any member
> >         +and if it is a structure does not have any member,
> >         +or if it is a union does not have all members,
> 
> As mentioned in my other reply, the assignment from one object to
> another (of the same union type that has a const member) will have
> to copy the size of the largest member, regardless of which member
> was active the last time the source object was written to. Such a
> copy, I believe, is a sort of memcpy that the compiler inserts. And
> as you had already noticed before, memcpy loses all type-safety.

The fact that the compiler does the equivalent of memcpy(3) doesn't
imply a removal of type safety.

The copiler is aware of the union type at the assignment, and so can
emit diagnostics if anything is wrong.

When assigning to a union member, only that member needs to be copied,
and I'm not sure what happens to other members that have a larger size.
As far as I can read in the standard, it's Undefined Behavior.  Maybe
someone can correct me.

When assigning to a union (not its members), the standard doesn't seem
to specify at all the semantics, so I can't really tell.  Depending on
the semantics of assigning to a member, I can tell.  At first glance, I
don't see anything we should worry.

> These might be the reasons the std flags an error when such union
> objects are assigned to - assignment implies type-erasing memcpy.
> 
> >          (including, recursively,
> >          any member or element of all contained aggregates or unions)
> >          with a const- qualified type.
> >
> > (Modifying <http://port70.net/~nsz/c/c11/n1570.html#6.3.2.1p1>)
> 
> [ ... ]
> 
> > Modifying a union via a non-const member is fine in C, I believe.  I
> 
> That's true, but as you noted in your other reply, unions allow diluting
> the type-safety to a great extent.

Not really.  If well used, they do actually improve type safety.  Think
of my string unions vs the struct proposed by Martin.  My union has
const correctness, which is impossible to achieve via structs.  A
similar problem happens with struct sockaddr, which for historic reasons
is a series of unrelated structures.  If they had been designed as a
hierarchic union, it wouldn't be so problematic, and you wouldn't need
to cast everywhere to use them.

> 
> Linux kernel has a piece of code that does modify such a union
> although it possesses decorations for the sparse semantic checker.
> In its essence, it is similar to:
> 
> union {
>     const int flags;
>     int __attribute__ ((noderef)) __flags;
> };
> See https://sparse.docs.kernel.org/en/latest/annotations.html#noderef
> 
> [ ... ]
> 
> > > I think it is better to have a 'class' and associated APIs.
> >
> > But we can't have that in C.
> 
> By 'class', I meant a rigid way of accessing the struct strictly through the
> provided APIs.

That's precisely what a union offers.  You can access it via a union
member, which then restricts you to use the functions that accept that
type, or you can keep the entire union for unlimited use.

> 
> [ ... ]
> 
> > The main concern I have with that proposal is the same concern I've had
> > with strings in Nginx so far: you can't really make them `const`.
> > Unless you make the type opaque, and only provide accessors via
> > functions that protect the strings even if they could modify them.
> 
> The type need not be opaque, to allow for optimizations. But the users of
> the type must behave as if the type were opaque. The bulk of the API can
> be 'static inline' functions within the header, to allow for optimizations and
> inlining within a single translation unit. There's also LTO and thin-LTOs to
> carry out optimizations across translation units, if needed.

Indeed.  But I'm not sure I'd like a string type that I must treat as a
FILE*.  Also, how do you initialize such strings?  With an EMPTY_STRING
macro?  Do you always get a heap pointer, so you can't declare a string
in the stack as this?

	struct string  s;

What's the content of that string?  Is a NULL pointer a valid state of
the string?  In NGINX we found this week several cases where we were
doing pointer arithmetics on strings that were holding NULL, precisely
due to this (or it was after some calloc(3), more likely, I don't
remember).  It was NULL + 0, which is the least of undefined behaviors
I'm concerned about, but hey, there's another NULL to worry about.

So, you'll need to do

	struct string  s = EMPTY_STRING;

to avoid ever having a NULL, which is yet another thing to worry about.

Having suffered these strings, I wouldn't recommend them as something
better than a char*.  In NGINX, we have them due to historic reasons,
and rewriting the entire code base would be unreasonable, and while they
may have been a good optimization 2 decades ago, I bet that it's a
useless micro-optimization today, and we could just use char* everywhere
to have less problems.

> > You can only make them const, if you use two distinct types: a read-only
> > version, let's call it rstring, and a read-write version, let's call it
> > string.
> 
> But do we need both views (r/w and r/o) at the same time?

Yes: I want to be able to take a r/w string and pass it to a function
that accepts a r/o string.  Since distinct types are incompatible, and
this is the main problem with things like sockaddr, you can't just cast;
that's likely to invoke Undefined Behavior.

It's the same as I can pass a char* to a function that accepts a
const char*, without doing a cast.

> 
> >
> >         struct rstring_s {
> >             size_t                        length;
> >             const char                    *start;
> >         };
> >
> >
> >         union nxt_string_u {
> >             struct {
> >                 size_t                    length;
> >                 char                      *start;
> >             };
> >             struct {
> >                 size_t                    length;
> >                 char                      *start;
> >             } w;
> >             const rstring_t               r;
> >         };
> >
> > In Nginx we have another complexity: we don't necessarily terminate our
> > strings: this allows getting a substring in the middle of another string
> > without needing to make an actual copy of the memory.  But then it means
> > we need more types to have type safety.  I haven't finished developing
> > that, so I can't tell you if the code below does work, but this is what
> > I'm really working with at the moment:
> >
> >         struct nxt_rstr_s {
> >             size_t                          length;
> >             const u_char                    *start;
> >         };
> >
> >
> >         union nxt_str_u {
> >             struct {
> >                 size_t                      length;
> >                 u_char                      *start;
> >             };
> >             struct {
> >                 size_t                      length;
> >                 u_char                      *start;
> >             } w;
> >             const nxt_rstr_t                r;
> >         };
> >
> >
> >         union nxt_rstrz_u {
> >             struct {
> >                 size_t                      length;
> >                 union {
> >                     const u_char            *start;
> >                     const char              *cstrz;
> >                 };
> >             };
> >             struct {
> >                 size_t                      length;
> >                 const u_char                *start;
> >             } w;
> >             const nxt_rstr_t                r;
> >         };
> >
> >
> >         union nxt_strz_u {
> >             struct {
> >                 size_t                      length;
> >                 union {
> >                     u_char                  *start;
> >                     char                    *cstrz;
> >                 };
> >             };
> >             struct {
> >                 size_t                      length;
> >                 u_char                      *start;
> >             } w;
> >             const nxt_rstr_t                r;
> >             const nxt_rstrz_t               rz;
> >         };
> >
> > Structures `***z` contain null-terminated strings, while the other ones
> > don't.  You can read terminated strings as non-terminated ones, but not
> > the other way.  And you can access writable strings as read-only
> > strings, but not the other way around.
> >
> > (We use `u_char` to avoid the problems that `char` has due to its
> > ambiguous sign; I would personally prefer using -funsigned-char, but
> > that's what it is, for historic reasons.)
> > Anyway, that `u_char` makes sure we don't mix our strings with libc
> > calls accidentally, and I only provide the `cstrz` member in unions that
> > actually provide a libc-compatible string view.
> 
> Passing these structs+unions is still passing the string as r/w. It does
> not seem too far away from passing non-const pointers and sizes.

The thing is, I don't always pass the union.

If I want to write the string in a function, I pass the union.  If I
want it r/o, I pass the const member.

> In a sort of a project of mine, I had to implement a buffer that can be used
> to store ref-counted (non-terminated) strings and binary data. I did initially
> create two different types, buf_rw and buf_ro, but then scrapped that for
> what I have now, a skeleton of which is shown below:
> 
> struct string {
>     size_t size;
>     const char *data;
> };
> 
> data[i] is now marked as non-modifiable.
> 
> Passing 'const struct string *' to a function implies that the function cannot
> modify (in addition to data[i]), the fields size and data, through the
> passed ptr.
> 
> A function that is passed 'struct string *' can still be allowed to
> inplace-modify data[i] by invoking appropriate string APIs that temporarily
> casts away the constness of data[i].

That would be a violation of const correctness.  If necessary, yeah, go
ahead.  But personally, I'd prefer avoiding that.  For something that is
proposed as an inclusion to ISO C, I'd rather not have it.

I would prefer that a libstring adds those APIs, rather than having them
in the standard.  They aren't uncontroversial.  Neither char* is
uncontroversial, but at least it's simple.

> The SEI CERT coding standard for C allows for an exception for such
> modifications. See
> https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const
> 
> Some other options:
> Make use of the sparse semantic checker.
> 
> It might help to name the fields with underscores to highlight the fact that
> they are internal fields.  For e.g. 'str_size__' and 'str_data__'. Any use of
> these fields outside of the dedicated API functions can now be easily
> traced by simple grep searches.
> 
> It might also help to create a visible type such as
> 
> struct string {
>     const uintptr_t raw[2];
> };
> 
> and let the API manage the translation between uintptr_t and size_t/char *.

But you'll need a transparent union to do the magic.  All of that shows
that it isn't an easy thing through structures.  unions at least have it
easier.

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux C Programming]     [Linux Kernel]     [eCos]     [Fedora Development]     [Fedora Announce]     [Autoconf]     [The DWARVES Debugging Tools]     [Yosemite Campsites]     [Yosemite News]     [Linux GCC]

  Powered by Linux