I didn't mention it earlier, but I'd like to clarify that this is for
"g++ -O3 -std=c++17" on GCC 10.2.

If I use "static thread_local int cursor = 0;" then cursor is
thread-local and such a race should be impossible, but the generated
code for bar1() still has a branch.

The branch also persists if I write bar1() like this instead:

    char const* bar1() {
        char const* result = foo();
        if (result)
            cursor += 1;
        else
            cursor += 0;
        return result;
    }

which compiles to:

    bar1():
            sub     rsp, 8
            call    foo()
            test    rax, rax
            je      .L1
            add     DWORD PTR fs:cursor@tpoff, 1
    .L1:
            add     rsp, 8
            ret

On Tue, Jan 12, 2021 at 9:36 PM Florian Weimer <fweimer@xxxxxxxxxx> wrote:

> * ☂Josh Chia (謝任中) via Gcc-help:
>
> > I have a code snippet that I'm wondering why GCC didn't optimize the
> > way I think it should:
> > https://godbolt.org/z/1qKvax
> >
> > bar2() is a variant of bar1() that has been manually tweaked to avoid
> > branches. I haven't done any benchmarks, but I would expect the
> > branchless bar2() to perform better than bar1(), but GCC does not
> > automatically optimize bar1() to be like bar2(); the generated code
> > for bar1() and bar2() is different, and the generated code for bar1()
> > contains a branch.
>
> The optimization is probably valid for C99, but not for C11, where the
> memory model prevents the compiler from introducing spurious writes:
> Another thread may modify the variable concurrently, and if this happens
> only if foo returns NULL, the original bar1 function does not contain a
> data race, but the branchless version would.
>
> Thanks,
> Florian
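
For readers without access to the godbolt link, here is a minimal sketch
of the two variants being compared. The original source is not reproduced
in this thread, so the body of foo(), the foo_succeeds switch, the demo()
helper, and the exact shape of bar2() are assumptions made purely for
illustration. The point it tries to show is the one Florian raises: the
branching form stores to cursor only when foo() returns non-null, while a
hand-written branchless form stores to it on every call.

```cpp
#include <cassert>

// Assumption: stand-ins for the code behind the godbolt link, which is
// not reproduced in this thread.
static int cursor = 0;
static bool foo_succeeds = true;  // hypothetical switch to steer foo()

char const* foo() { return foo_succeeds ? "token" : nullptr; }

// Branching form: cursor is written only when foo() returns non-null.
char const* bar1() {
    char const* result = foo();
    if (result) cursor += 1;
    return result;
}

// Hand-written branchless form (assumed shape of bar2): cursor is written
// on every call, adding 0 when foo() returns null.
char const* bar2() {
    char const* result = foo();
    cursor += (result != nullptr);
    return result;
}

// Single-threaded check that both forms leave cursor in the same state.
void demo() {
    cursor = 0;
    foo_succeeds = true;
    bar1();  // cursor: 0 -> 1
    bar2();  // cursor: 1 -> 2
    foo_succeeds = false;
    bar1();  // no store to cursor
    bar2();  // stores cursor + 0
    assert(cursor == 2);
}
```

In a single thread the two forms are indistinguishable, which is all
demo() checks. They differ only in whether a store to cursor happens on
the null path, and that difference is what matters if another thread can
touch cursor concurrently.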