On Thu, 2024-10-24 at 12:26 -0400, John B. Wyatt IV wrote:
> @@ -952,12 +953,31 @@ void *high_priority(void *arg)
> 			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
> 			return NULL;
> 		}
> +
> +
> +		/**
> +		 * The pthread_barrier_wait should guarantee that only one
> +		 * thread at a time interacts with the variables below that
> +		 * if block. How is that guaranteed?
> +		 *
> +		 * GCC -O2 rearranges the two increments above the wait
> +		 * function calls, causing a race if you run this with
> +		 * nearly all cores busy and one core (2 threads) free for
> +		 * housekeeping. This causes a crash at around 2 hours of
> +		 * running. You can prove this by commenting out the barrier
> +		 * and compiling with `-O0`; the crash does not show with
> +		 * -O0.
> +		 *
> +		 * Add a memory barrier to force GCC to increment the variables
> +		 * below the pthread calls. This function depends on C11.
> +		 **/
> +		atomic_thread_fence(memory_order_seq_cst);

That's kind of nuts that pthread_barrier_wait() doesn't even act as a
compiler barrier (which a simple external function call should do), much
less a proper memory barrier. In fact I'd go beyond that and call it a
bug, just as if there were a mutex implementation that required users to
do this. And POSIX agrees:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12

Is it possible that something else is going on here, and it's just being
hidden by changing the timing? What exactly is the race you're seeing?
It looks like there should be only one high priority thread per group,
so is it racing with a reader, or watchdog_clear(), or...? How does the
crash happen? What is the helgrind output?

-Crystal
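
P.S. To spell out what I mean by POSIX agreeing, here is a minimal
standalone sketch (not the rtla code; NTHREADS, counter[], etc. are
made up for illustration) of the pattern that is supposed to be safe
with no explicit fence: plain variables written before
pthread_barrier_wait() must be visible to the one thread that gets
PTHREAD_BARRIER_SERIAL_THREAD, because the barrier call itself
synchronizes memory per POSIX 4.12:

	#include <pthread.h>
	#include <stdio.h>

	#define NTHREADS 2

	static pthread_barrier_t barrier;
	static int counter[NTHREADS];	/* plain ints, no atomics, no fences */

	static void *worker(void *arg)
	{
		int id = *(int *)arg;

		counter[id]++;		/* write before the barrier */

		/*
		 * Per POSIX 4.12 ("Memory Synchronization"),
		 * pthread_barrier_wait() synchronizes memory, so the
		 * serial thread may safely read what the other
		 * threads wrote before they waited.
		 */
		if (pthread_barrier_wait(&barrier) ==
		    PTHREAD_BARRIER_SERIAL_THREAD) {
			int sum = 0;

			for (int i = 0; i < NTHREADS; i++)
				sum += counter[i];
			printf("sum = %d\n", sum);	/* must always print NTHREADS */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NTHREADS];
		int id[NTHREADS];

		pthread_barrier_init(&barrier, NULL, NTHREADS);
		for (int i = 0; i < NTHREADS; i++) {
			id[i] = i;
			pthread_create(&tid[i], NULL, worker, &id[i]);
		}
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);
		pthread_barrier_destroy(&barrier);
		return 0;
	}

Build with `gcc -O2 -pthread` and run it (or the real reproducer) under
`valgrind --tool=helgrind`: if helgrind flags a race on those counters
even with the barrier in place, that would point at a broken barrier
implementation; if it instead flags whatever accesses the fence in your
patch is papering over, that would tell us what is actually racing.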