Re: Failure to optimize?

☂Josh Chia (謝任中) via Gcc-help <gcc-help@xxxxxxxxxxx> · Tue, 12 Jan 2021 22:05:31 +0800

I didn't mention it earlier: this is with "g++ -O3 -std=c++17" on
GCC 10.2.

If I use "static thread_local int cursor = 0;" then cursor is thread-local
and such a race should be impossible, but the generated code for bar1()
still has a branch.

The branch also persists if I write bar1() like this instead:
char const* bar1() {
    char const* result = foo();
    if (result)
        cursor += 1;
    else
        cursor += 0;
    return result;
}

bar1():
        sub     rsp, 8
        call    foo()
        test    rax, rax
        je      .L1
        add     DWORD PTR fs:cursor@tpoff, 1
.L1:
        add     rsp, 8
        ret
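
For reference, here is a minimal sketch of the two shapes being compared.
The bodies of foo() and bar2() are not shown in this message, so the
versions below are assumptions: foo() is a hypothetical stand-in
controlled by a test knob (foo_succeeds), and bar2() is one plausible
hand-written branchless form, not necessarily the one behind the godbolt
link.

```cpp
// Thread-local counter, as in the variant described above.
static thread_local int cursor = 0;

// Hypothetical knob so the stand-in foo() can return either outcome.
static bool foo_succeeds = true;

// Hypothetical stand-in for the real foo(), which is not shown here.
char const* foo() { return foo_succeeds ? "x" : nullptr; }

// Branchy form: cursor is only written when foo() returns non-null.
char const* bar1() {
    char const* result = foo();
    if (result)
        cursor += 1;
    return result;
}

// Assumed branchless form: cursor is always written, adding 0 or 1.
char const* bar2() {
    char const* result = foo();
    cursor += (result != nullptr);  // bool converts to 0 or 1
    return result;
}
```

Within a single thread both functions leave cursor with the same value;
they differ only in whether a store happens when foo() returns null.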

On Tue, Jan 12, 2021 at 9:36 PM Florian Weimer <fweimer@xxxxxxxxxx> wrote:

> * ☂Josh Chia (謝任中) via Gcc-help:
>
> > I have a code snippet that I'm wondering why GCC didn't optimize the
> > way I think it should:
> > https://godbolt.org/z/1qKvax
> >
> > bar2() is a variant of bar1() that has been manually tweaked to avoid
> > branches. I haven't done any benchmarks, but I would expect the
> > branchless bar2() to perform better than bar1(). However, GCC does not
> > automatically optimize bar1() to be like bar2(); the generated code for
> > bar1() and bar2() is different, and the generated code for bar1()
> > contains a branch.
>
> The optimization is probably valid for C99, but not for C11, where the
> memory model prevents the compiler from introducing spurious writes:
> Another thread may modify the variable concurrently, and if this happens
> only if foo returns NULL, the original bar1 function does not contain a
> data race, but the branchless version would.
>
> Thanks,
> Florian
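
The spurious-write concern in the quoted reply can be made concrete with
a small single-threaded sketch. It does not reproduce a race; it only
counts stores, to show that a branchless rewrite writes to the variable
even when the condition is false. TrackedInt, branchy() and branchless()
are hypothetical names introduced for illustration only.

```cpp
// Hypothetical wrapper that counts how many times it is written.
struct TrackedInt {
    int value = 0;
    int stores = 0;  // number of writes performed so far
    void add(int n) { value += n; ++stores; }
};

// Mirrors the branchy bar1(): writes only when result is non-null.
char const* branchy(TrackedInt& c, char const* result) {
    if (result)
        c.add(1);
    return result;
}

// Mirrors a branchless rewrite: always writes, adding 0 or 1.
char const* branchless(TrackedInt& c, char const* result) {
    c.add(result != nullptr);
    return result;
}
```

With a null result, branchy() performs no store while branchless()
performs one store of an unchanged value. If the variable were shared,
that extra store is the write that could conflict with another thread's
access.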