Re: [PATCH 0/2] Work around broken clang. Again.

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Thu, 16 Mar 2023 10:13:28 +0000

On Wed, Mar 15, 2023 at 08:10:26PM +0100, Michal Privoznik wrote:
> We ran into so many clang bugs recently, that I'm strongly in favor of
> making it optional and aim on gcc only. Debugging these issues burns our
> time needlessly. Just consider what's happening here.
> 
> I've pushed patches recently that call some virNuma*() APIs during QEMU
> cmd line generation. And while at it, I removed virNuma*() mocks from
> qemuxml2argvmock.c (v9.1.0-213-g95ae91fdd4). Andrea told me in review to
> make sure our CI works. So I've fired up my freebsd VM and compiled my
> code there. It worked. Except, I'm compiling with -O0 and our CI
> compiles with the default (-O2). And of course this is the source of the
> current build breakage.
> 
> Long story short, clang is so eager to produce the fastest code that it
> simply ignores written code and generates what it *assumes* behaves the
> same. Well, it doesn't and this breaks our mocking and thus our tests.

snip

> And of course it's happening only with optimizations.

snip

> To reproduce this issue even on Linux, configure with:
> 
>   export CC=clang
>   meson setup -Dsystem=true -Dexpensive_tests=enabled -Dnumactl=disabled _build
> 
> 
> At this point, I think we can call clang broken and focus our time on
> developing libvirt. Not coming up with hacks around broken compilers.

Clang's behaviour is not broken or wrong. The code it generates is
functionally correct when used, because according to the C stndard
it is forbidden to define the same function more than once. Apps
linked aginst libvirt.so or our daemons all operate perfectly. 

The problem is confined to our test suite where we're trying to
selectively replace code at runtime and in doing so, we're making
assumptions about code optimization that are not valid. In doing
this we're violating the language standard. Of course this is a
feature that ELF explicitly supports, despite being outside the
language stndards. What this means is that we can't assume the
compiler default behaviour will do what we want. 

The sepecific problem we're facing here is that the compiler is
doing inter-procedural analysis because it can see the body of
both functions. We've already tried to suppress the optimization
by using the 'noinline' annotation but unfortunately is possible
to keep a function as non-inlined, but still be wrong. Whuat we're
seeing here is that a separate optimization is looking at the
return value and determining that it is a compile time constant
and thus elide the conditional, while still calling the orginal
function !

I can demonstrate this a little more clearly with an isolated
example:

$ cat inline.h
int IsAvailable(int i) __attribute__((noinline));
int IsSomethingOK(int i);

$ cat inline.c
#include <stdbool.h>

int IsAvailable(int i)
{
  fprintf(stderr, "Bad hit\n");
  return false;
}

#endif

int IsSomethingOK(int i)
{
  while (i--) {
    if (IsAvailable(i))
      continue;

    fprintf(stderr, "Bad stuff\n");
    return -1;
  }

  fprintf(stderr, "OK\n");
  return 0;
}

$ cat mock.c
#include <stdbool.h>
#include <stdio.h>

#include "inline.h"

int IsAvailable(int i)
{
  fprintf(stderr, "Good hit\n");
  return true;
}

$ LD_LIBRARY_PATH=. ./something 
Bad hit
Bad stuff

Here we see 'Bad hit' so it must be running the original
IsAvailable() code body, and we see 'Bad stuff' so it
must be seeing the 'return false' return value.

Now running with the override

$ LD_PRELOAD=mock.so  LD_LIBRARY_PATH=. ./something 
Good hit
Bad stuff

we see 'Good hit' so it is running our replacement
mocked impl of IsAvailable(), but we still see 'Bad stuff'
so it must be still using the old 'return value' value.

This proves that the "inlining"  optimization is separate
from the "constant return value" optimization.

We're only disabling the former optimization, so our mocks
are not safe.

We've hit this about 6 years ago which lead to using 'weak' annotation
too:

  commit e4b980c853d2114b25fa805a84ea288384416221
  Author: Daniel P. Berrangé <berrange@xxxxxxxxxx>
  Date:   Wed Jul 5 11:46:28 2017 +0100

    Prevent more compiler optimization of mockable functions

    Currently all mockable functions are annotated with the 'noinline'
    attribute. This is insufficient to guarantee that a function can
    be reliably mocked with an LD_PRELOAD. The C language spec allows
    the compiler to assume there is only a single implementation of
    each function. It can thus do things like propagating constant
    return values into the caller at compile time, or creating
    multiple specialized copies of the function body each optimized
    for a different caller. To prevent these optimizations we must
    also set the 'noclone' and 'weak' attributes.

    This fixes the test suite when libvirt.so is built with CLang
    with optimization enabled.

    Signed-off-by: Daniel P. Berrange <berrange@xxxxxxxxxx>

but this caused unwanted side effects:

commit 407a281a8e2b6c5078ba1148535663ea64fd9314
Author: Daniel P. Berrangé <berrange@xxxxxxxxxx>
Date:   Wed Jul 12 11:07:17 2017 +0100

    Revert "Prevent more compiler optimization of mockable functions"

    This reverts commit e4b980c853d2114b25fa805a84ea288384416221.

    When a binary links against a .a archive (as opposed to a shared library),
    any symbols which are marked as 'weak' get silently dropped. As a result
    when the binary later runs, those 'weak' functions have an address of
    0x0 and thus crash when run.

    This happened with virtlogd and virtlockd because they don't link to
    libvirt.so, but instead just libvirt_util.a and libvirt_rpc.a. The
    virRandomBits symbols was weak and so left out of the virtlogd &
    virtlockd binaries, despite being required by virHashTable functions.

    Various other binaries like libvirt_lxc, libvirt_iohelper, etc also
    link directly to .a files instead of libvirt.so, so are potentially
    at risk of dropping symbols leading to a later runtime crash.

    This is normal linker behaviour because a weak symbol is not treated
    as undefined, so nothing forces it to be pulled in from the .a You
    have to force the linker to pull in weak symbols using -u$SYMNAME
    which is not a practical approach.

    This risk is silent bad linkage that affects runtime behaviour is
    not acceptable for a fix that was merely trying to fix the test
    suite. So stop using __weak__ again.

    Signed-off-by: Daniel P. Berrange <berrange@xxxxxxxxxx>

There's a bit of a thread here:

  https://listman.redhat.com/archives/libvir-list/2017-July/150772.html

I tied another approach here:

  https://listman.redhat.com/archives/libvir-list/2017-July/150809.html

which didn't work well enough

and then IIRC we abandoned attempts at that time.

Meanwhile though GCC gained a  'noipa' attribute (no inter procedural
analysis) which can suppress these return value optimizations, amongst
other things. CLang though doesn't support it yet

  https://reviews.llvm.org/D101011

Looking at the compiler flags though there is another option which is
to pass '-fsemantic-interposition'.  With CLang this defaults to off,
which means it is free to assume replacement functions have the same
semantics. Passing -fsemantic-interposition forces it to assume that
replacements might have different semantics, and thus cannot do the
return value optimizations.

The downside is that it would have a impact on performance across
everything we compile with that arg, but I don't think libvirt stuff
is generally that performance sensitive, so we can probably just
go with -fsemantic-interposition on clang unconditionally.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|