Hi Alex, On Wed, 7 Feb 2024 at 18:59, Alejandro Colomar <alx@xxxxxxxxxx> wrote: > > Hi Amol, > > On Wed, Feb 07, 2024 at 09:43:55AM +0530, Amol Surati wrote: [ ... ] > The fact that the compiler does the equivalent of memcpy(3) doesn't > imply a removal of type safety. > > The copiler is aware of the union type at the assignment, and so can > emit diagnostics if anything is wrong. True. That's the diagnostics that was emitted when running 'v=u'. > > When assigning to a union member, only that member needs to be copied, > and I'm not sure what happens to other members that have a larger size. > As far as I can read in the standard, it's Undefined Behavior. Maybe > someone can correct me. The std says: "When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values." > When assigning to a union (not its members), the standard doesn't seem > to specify at all the semantics, so I can't really tell. Depending on > the semantics of assigning to a member, I can tell. At first glance, I > don't see anything we should worry. [ ... ] > Not really. If well used, they do actually improve type safety. Think > of my string unions vs the struct proposed by Martin. My union has > const correctness, which is impossible to achieve via structs. A I am not sure I fully understand how using structs forbids achieving const-correctness (vs the unions as declared), but I am also not aware of all the ways in which nginx uses strings. [ ... ] > Indeed. But I'm not sure I'd like a string type that I must treat as a > FILE*. Also, how do you initialize such strings? With an EMPTY_STRING > macro? Do you always get a heap pointer, so you can't declare a string > in the stack as this? > > struct string s; > > What's the content of that string? Is a NULL pointer a valid state of > the string? In NGINX we found this week several cases where we were > doing pointer arithmetics on strings that were holding NULL, precisely > due to this (or it was after some calloc(3), more likely, I don't > remember). It was NULL + 0, which is the least of undefined behaviors > I'm concerned about, but hey, there's another NULL to worry about. > > So, you'll need to do > > struct string s = EMPTY_STRING; > > to avoid ever having a NULL, which is yet another thing to worry about. > > Having suffered these strings, I wouldn't recommend them as something > better than a char*. In NGINX, we have them due to historic reasons, > and rewriting the entire code base would be unreasonable, and while they > may have been a good optimization 2 decades ago, I bet that it's a > useless micro-optimization today, and we could just use char* everywhere > to have less problems. If a variable is going to be overwritten without being read first, then there's no need to initialize it. Within my implementation, the state resulting from zero-init, i.e. 'struct string s = {0};', is deliberately chosen to be well-defined for the implementation. For e.g. appending a char to a zero-init'd string is well-defined. > > > You can only make them const, if you use two distinct types: a read-only > > > version, let's call it rstring, and a read-write version, let's call it > > > string. > > > > But do we need both views (r/w and r/o) at the same time? > > Yes: I want to be able to take a r/w string and pass it to a function > that accepts a r/o string. Since distinct types are incompatible, and > this is the main problem with things like sockaddr, you can't just cast; > that's likely to invoke Undefined Behavior. The std says: "A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer." So typecasting between object types is well-defined, given that the resulting pointer is indeed correctly aligned for the target type. The various sock_addr types all impose a common default alignment, because they all begin with a member of the same type. The std. also says: "An object shall have its stored value accessed only by an lvalue expression that has one of the following types: . . . — a character type." For e.g., the _sys_bind system-call under Linux kernel receives the same 'const struct sockaddr *', along with the size of the data, from glibc. The kernel then copies the data using memcpy (or equivalent routines). Does the pointer conversion forbid the kernel from memcpy'ing the object using that converted pointer? (It is likely that the converted pointer, if it is not of compatible type, may not be dereferenced using that type; but here the kernel isn't dereferencing the pointer using the 'const struct sockaddr *' type; it is reading the obj-representation using memcpy.) If no, then this behaviour is well-defined. If yes, then this behaviour is undefined, although, in practice, both gcc and clang (the compilers that can compile Linux kernels) must be allowing the expected behaviour here, since the compiled kernel binaries do indeed work wherever they are deployed. This only means that, *if* the behaviour turns out to be theoretically undefined, *then* the std. can be augmented to specifically allow reading of representations of objects pointed to by such suitably converted pointers, if it does not already allow such reads. [ ... ] > > A function that is passed 'struct string *' can still be allowed to > > inplace-modify data[i] by invoking appropriate string APIs that temporarily > > casts away the constness of data[i]. > > That would be a violation of const correctness. If necessary, yeah, go > ahead. But personally, I'd prefer avoiding that. For something that is > proposed as an inclusion to ISO C, I'd rather not have it. I keep two bits of info within the string: (1) whether the 'struct string' is heap-allocated, or not. (2) whether the data (const char *) it points to is heap-allocated, or not. The std. says: "If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined." Given that a malloc'd object doesn't have its own declaration or a declared type to begin with, it also doesn't have a definition, let alone a const-qualified definition. Modifying malloc'd objects should not cause const-correctness to be violated, even when casting away const on a pointer that points to such an object. Modifiable strings can be created from statically allocated data (for e.g., pointing inside the .rodata section), and the data can be malloc'd and changed/appended-to on a copy-on-write basis. -Amol > > I would prefer that a libstring adds those APIs, rather than having them > in the standard. They aren't uncontroversial. Neither char* is > uncontroversial, but at least it's simple. > > > The SEI CERT coding standard for C allows for an exception for such > > modifications. See > > https://wiki.sei.cmu.edu/confluence/display/c/EXP05-C.+Do+not+cast+away+a+const+qualification#:~:text=end%20is%20undefined%20*/-,EXP05%2DC%2DEX3,-%3A%20Because%20const > > > > Some other options: > > Make use of the sparse semantic checker. > > > > It might help to name the fields with underscores to highlight the fact that > > they are internal fields. For e.g. 'str_size__' and 'str_data__'. Any use of > > these fields outside of the dedicated API functions can now be easily > > traced by simple grep searches. > > > > It might also help to create a visible type such as > > > > struct string { > > const uintptr_t raw[2]; > > }; > > > > and let the API manage the translation between uintptr_t and size_t/char *. > > But you'll need a transparent union to do the magic. All of that shows > that it isn't an easy thing through structures. unions at least have it > easier. > > Cheers, > Alex > > -- > <https://www.alejandro-colomar.es/> > Looking for a remote C programming job at the moment.