On Tue, 29 Oct 2024, Wander Lairson Costa wrote:

> On Thu, Oct 24, 2024 at 12:26:14PM -0400, John B. Wyatt IV wrote:
> > This patch is also an error report seen on RHEL 9 and Fedora 40.
> > pi_stress crashes if you run it near full cores with one core
> > (2 threads) free for housekeeping. The crash usually happens at
> > around 2 hours of pi_stress running. This issue was reproduced
> > on a RHEL server with 2 sockets of 10 cores/20 threads (total 40
> > threads) and a Fedora server with 2 sockets of 56 cores/112
> > threads (total 224 threads).
> >
> > The pthread_barrier_wait should guarantee that only one
> > thread at a time interacts with the variables below that
> > if block.
> >
> > GCC -O2 optimization rearranges the two increments above the wait
> > function calls. This causes a race issue found by Helgrind.
> > You can prove this by commenting out the memory barrier
> > and compiling with `-O0`. The crash does not happen with -O0.
> > Thank you to Valentin Schneider <vschneid@xxxxxxxxxx> for
> > suggesting -O2 as the cause.
> >
> > Add a memory barrier to force GCC to increment the variables
> > after the pthread calls. The memory barrier prevented any crashes
> > across several 24-hour tests. The fence function depends on C11.
> >
> > Reported-by: Valgrind's Helgrind tool
> >
> > Signed-off-by: "John B. Wyatt IV" <jwyatt@xxxxxxxxxx>
> > Signed-off-by: "John B. Wyatt IV" <sageofredondo@xxxxxxxxx>
> > ---
> >  src/pi_tests/pi_stress.c | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/pi_tests/pi_stress.c b/src/pi_tests/pi_stress.c
> > index 371e906..37023a1 100644
> > --- a/src/pi_tests/pi_stress.c
> > +++ b/src/pi_tests/pi_stress.c
> > @@ -44,6 +44,7 @@
> >  #include <stdint.h>
> >  #include <inttypes.h>
> >  #include <limits.h>
> > +#include <stdatomic.h>
> >
> >  #include "rt-sched.h"
> >  #include "rt-utils.h"
> > @@ -952,12 +953,31 @@ void *high_priority(void *arg)
> >  			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
> >  			return NULL;
> >  		}
> > +
> > +		/**
> > +		 * The pthread_barrier_wait should guarantee that only one
> > +		 * thread at a time interacts with the variables below that
> > +		 * if block.
> > +		 *
> > +		 * GCC -O2 rearranges the two increments above the wait
> > +		 * function calls, causing a race if you run this near
> > +		 * full cores with one core (2 threads) free for
> > +		 * housekeeping. This causes a crash at around 2 hours
> > +		 * of running. You can prove this by commenting out the
> > +		 * barrier and compiling with `-O0`. The crash does not
> > +		 * show with -O0.
> > +		 *
> > +		 * Add a memory barrier to force GCC to increment the
> > +		 * variables after the pthread calls. This function
> > +		 * depends on C11.
> > +		 **/
>
> I agree with Crystal. AFAIU, the compiler should not reorder before
> pthread_barrier_wait(). Something odd is going on here. However, I am
> not familiar with the code base. What's the repo for this?

This is one of the programs in the rt-tests suite. Clone one of the
following:

git://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
https://kernel.googlesource.com/pub/scm/utils/rt-tests/rt-tests.git

> Btw, did you look at the generated assembly code?
> > +		atomic_thread_fence(memory_order_seq_cst);
> > +
>
> If the problem is program order, a simple compiler barrier with acquire
> semantics should work:
>
> 	atomic_signal_fence(memory_order_acquire);

> >  		/* update the group stats */
> >  		p->total++;
> >
> >  		/* update the watchdog counter */
> >  		p->watchdog++;
> > -
> >  	}
> >  	set_shutdown_flag();
> >
> > --
> > 2.47.0