This patch is also an error report seen on RHEL 9 and Fedora 40. pi_stress crashes if you run this near full cores with one core (2 threads) free for housekeeping. The crash usually happens at around 2 hours of pi_stress running. This issue was reproduced on a RHEL server with 2 sockets of 10 cores/20 threads (total 40 threads) and a Fedora server with 2 sockets of 56 cores/112 threads (total 226 threads). The pthread_barrier_wait should guarantee that only one thread at a time interacts with the variables below that if block. GCC -O2 optimization rearranges the two increments above the wait function calls. This causes a race issue found by Helgrind. You can prove this by commenting out the memory barrier and compiling with `-O0`. The crash does not happen with -O0. Thank you to Valentin Schneider <vschneid@xxxxxxxxxx> for suggesting about -O2. Add a memory barrier to force GCC to increment the variables below the pthread calls. The memory barrier prevented any crashes for several 24 hours tests. This function depends on C11. Reported-by: Valgrind's Helgrind tool Signed-off-by: "John B. Wyatt IV" <jwyatt@xxxxxxxxxx> Signed-off-by: "John B. Wyatt IV" <sageofredondo@xxxxxxxxx> --- src/pi_tests/pi_stress.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/src/pi_tests/pi_stress.c b/src/pi_tests/pi_stress.c index 371e906..37023a1 100644 --- a/src/pi_tests/pi_stress.c +++ b/src/pi_tests/pi_stress.c @@ -44,6 +44,7 @@ #include <stdint.h> #include <inttypes.h> #include <limits.h> +#include <stdatomic.h> #include "rt-sched.h" #include "rt-utils.h" @@ -952,12 +953,31 @@ void *high_priority(void *arg) ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status); return NULL; } + + + /** + * The pthread_barrier_wait should guarantee that only one + * thread at a time interacts with the variables below that + * if block. + * + * GCC -O2 rearranges the two increments above the wait + * function calls causing a race issue if you run this + * near full cores with one core (2 threads) free for + * housekeeping. This causes a crash at around 2 hour of + * running. You can prove this by commenting out the barrier + * and compiling with `-O0`. The crash does not show with + * -O0. + * + * Add a memory barrier to force GCC to increment the variables + * below the pthread calls. This funcion depends on C11. + **/ + atomic_thread_fence(memory_order_seq_cst); + /* update the group stats */ p->total++; /* update the watchdog counter */ p->watchdog++; - } set_shutdown_flag(); -- 2.47.0