On Wed, Mar 15, 2023 at 08:10:26PM +0100, Michal Privoznik wrote: > We ran into so many clang bugs recently, that I'm strongly in favor of > making it optional and aim on gcc only. Debugging these issues burns our > time needlessly. Just consider what's happening here. > > I've pushed patches recently that call some virNuma*() APIs during QEMU > cmd line generation. And while at it, I removed virNuma*() mocks from > qemuxml2argvmock.c (v9.1.0-213-g95ae91fdd4). Andrea told me in review to > make sure our CI works. So I've fired up my freebsd VM and compiled my > code there. It worked. Except, I'm compiling with -O0 and our CI > compiles with the default (-O2). And of course this is the source of the > current build breakage. > > Long story short, clang is so eager to produce the fastest code that it > simply ignores written code and generates what it *assumes* behaves the > same. Well, it doesn't and this breaks our mocking and thus our tests. snip > And of course it's happening only with optimizations. snip > To reproduce this issue even on Linux, configure with: > > export CC=clang > meson setup -Dsystem=true -Dexpensive_tests=enabled -Dnumactl=disabled _build > > > At this point, I think we can call clang broken and focus our time on > developing libvirt. Not coming up with hacks around broken compilers. Clang's behaviour is not broken or wrong. The code it generates is functionally correct when used, because according to the C stndard it is forbidden to define the same function more than once. Apps linked aginst libvirt.so or our daemons all operate perfectly. The problem is confined to our test suite where we're trying to selectively replace code at runtime and in doing so, we're making assumptions about code optimization that are not valid. In doing this we're violating the language standard. Of course this is a feature that ELF explicitly supports, despite being outside the language stndards. What this means is that we can't assume the compiler default behaviour will do what we want. The sepecific problem we're facing here is that the compiler is doing inter-procedural analysis because it can see the body of both functions. We've already tried to suppress the optimization by using the 'noinline' annotation but unfortunately is possible to keep a function as non-inlined, but still be wrong. Whuat we're seeing here is that a separate optimization is looking at the return value and determining that it is a compile time constant and thus elide the conditional, while still calling the orginal function ! I can demonstrate this a little more clearly with an isolated example: $ cat inline.h int IsAvailable(int i) __attribute__((noinline)); int IsSomethingOK(int i); $ cat inline.c #include <stdbool.h> int IsAvailable(int i) { fprintf(stderr, "Bad hit\n"); return false; } #endif int IsSomethingOK(int i) { while (i--) { if (IsAvailable(i)) continue; fprintf(stderr, "Bad stuff\n"); return -1; } fprintf(stderr, "OK\n"); return 0; } $ cat mock.c #include <stdbool.h> #include <stdio.h> #include "inline.h" int IsAvailable(int i) { fprintf(stderr, "Good hit\n"); return true; } $ LD_LIBRARY_PATH=. ./something Bad hit Bad stuff Here we see 'Bad hit' so it must be running the original IsAvailable() code body, and we see 'Bad stuff' so it must be seeing the 'return false' return value. Now running with the override $ LD_PRELOAD=mock.so LD_LIBRARY_PATH=. ./something Good hit Bad stuff we see 'Good hit' so it is running our replacement mocked impl of IsAvailable(), but we still see 'Bad stuff' so it must be still using the old 'return value' value. This proves that the "inlining" optimization is separate from the "constant return value" optimization. We're only disabling the former optimization, so our mocks are not safe. We've hit this about 6 years ago which lead to using 'weak' annotation too: commit e4b980c853d2114b25fa805a84ea288384416221 Author: Daniel P. Berrangé <berrange@xxxxxxxxxx> Date: Wed Jul 5 11:46:28 2017 +0100 Prevent more compiler optimization of mockable functions Currently all mockable functions are annotated with the 'noinline' attribute. This is insufficient to guarantee that a function can be reliably mocked with an LD_PRELOAD. The C language spec allows the compiler to assume there is only a single implementation of each function. It can thus do things like propagating constant return values into the caller at compile time, or creating multiple specialized copies of the function body each optimized for a different caller. To prevent these optimizations we must also set the 'noclone' and 'weak' attributes. This fixes the test suite when libvirt.so is built with CLang with optimization enabled. Signed-off-by: Daniel P. Berrange <berrange@xxxxxxxxxx> but this caused unwanted side effects: commit 407a281a8e2b6c5078ba1148535663ea64fd9314 Author: Daniel P. Berrangé <berrange@xxxxxxxxxx> Date: Wed Jul 12 11:07:17 2017 +0100 Revert "Prevent more compiler optimization of mockable functions" This reverts commit e4b980c853d2114b25fa805a84ea288384416221. When a binary links against a .a archive (as opposed to a shared library), any symbols which are marked as 'weak' get silently dropped. As a result when the binary later runs, those 'weak' functions have an address of 0x0 and thus crash when run. This happened with virtlogd and virtlockd because they don't link to libvirt.so, but instead just libvirt_util.a and libvirt_rpc.a. The virRandomBits symbols was weak and so left out of the virtlogd & virtlockd binaries, despite being required by virHashTable functions. Various other binaries like libvirt_lxc, libvirt_iohelper, etc also link directly to .a files instead of libvirt.so, so are potentially at risk of dropping symbols leading to a later runtime crash. This is normal linker behaviour because a weak symbol is not treated as undefined, so nothing forces it to be pulled in from the .a You have to force the linker to pull in weak symbols using -u$SYMNAME which is not a practical approach. This risk is silent bad linkage that affects runtime behaviour is not acceptable for a fix that was merely trying to fix the test suite. So stop using __weak__ again. Signed-off-by: Daniel P. Berrange <berrange@xxxxxxxxxx> There's a bit of a thread here: https://listman.redhat.com/archives/libvir-list/2017-July/150772.html I tied another approach here: https://listman.redhat.com/archives/libvir-list/2017-July/150809.html which didn't work well enough and then IIRC we abandoned attempts at that time. Meanwhile though GCC gained a 'noipa' attribute (no inter procedural analysis) which can suppress these return value optimizations, amongst other things. CLang though doesn't support it yet https://reviews.llvm.org/D101011 Looking at the compiler flags though there is another option which is to pass '-fsemantic-interposition'. With CLang this defaults to off, which means it is free to assume replacement functions have the same semantics. Passing -fsemantic-interposition forces it to assume that replacements might have different semantics, and thus cannot do the return value optimizations. The downside is that it would have a impact on performance across everything we compile with that arg, but I don't think libvirt stuff is generally that performance sensitive, so we can probably just go with -fsemantic-interposition on clang unconditionally. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|