On Thu, 2024-10-24 at 12:26 -0400, John B. Wyatt IV wrote:
> @@ -952,12 +953,31 @@ void *high_priority(void *arg)
> 			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
> 			return NULL;
> 		}
> +
> +
> +		/**
> +		 * The pthread_barrier_wait should guarantee that only one
> +		 * thread at a time interacts with the variables below that
> +		 * if block. How is that guaranteed?
> +		 *
> +		 * GCC -O2 rearranges the two increments above the wait
> +		 * function calls, causing a race if you run this with
> +		 * nearly all cores busy and one core (2 threads) free for
> +		 * housekeeping. This causes a crash at around 2 hours of
> +		 * running. You can prove this by commenting out the barrier
> +		 * and compiling with `-O0`; the crash does not show with
> +		 * -O0.
> +		 *
> +		 * Add a memory barrier to force GCC to increment the variables
> +		 * below the pthread calls. This function depends on C11.
> +		 **/
> +		atomic_thread_fence(memory_order_seq_cst);

That's kind of nuts that pthread_barrier_wait() doesn't even act as a
compiler barrier (which a simple external function call should do), much
less a proper memory barrier. In fact I'd go beyond that and call it a
bug, just as if there were a mutex implementation that required users to
do this. And POSIX agrees:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12

Is it possible that something else is going on here, and it's just being
hidden by changing the timing? What exactly is the race you're seeing?
It looks like there should be only one high priority thread per group,
so is it racing with a reader, or watchdog_clear(), or...? How does the
crash happen? What is the helgrind output?

-Crystal
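
P.S. To spell out what I mean by POSIX agreeing, here is a minimal
standalone sketch (not the rtla code; NTHREADS, counter[], etc. are
made up for illustration) of the pattern that is supposed to be safe
with no explicit fence: plain variables written before
pthread_barrier_wait() must be visible to the one thread that gets
PTHREAD_BARRIER_SERIAL_THREAD, because the barrier call itself
synchronizes memory per POSIX 4.12:

	#include <pthread.h>
	#include <stdio.h>

	#define NTHREADS 2

	static pthread_barrier_t barrier;
	static int counter[NTHREADS];	/* plain ints, no atomics, no fences */

	static void *worker(void *arg)
	{
		int id = *(int *)arg;

		counter[id]++;		/* write before the barrier */

		/*
		 * Per POSIX 4.12 ("Memory Synchronization"),
		 * pthread_barrier_wait() synchronizes memory, so the
		 * serial thread may safely read what the other
		 * threads wrote before they waited.
		 */
		if (pthread_barrier_wait(&barrier) ==
		    PTHREAD_BARRIER_SERIAL_THREAD) {
			int sum = 0;

			for (int i = 0; i < NTHREADS; i++)
				sum += counter[i];
			printf("sum = %d\n", sum);	/* must always print NTHREADS */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NTHREADS];
		int id[NTHREADS];

		pthread_barrier_init(&barrier, NULL, NTHREADS);
		for (int i = 0; i < NTHREADS; i++) {
			id[i] = i;
			pthread_create(&tid[i], NULL, worker, &id[i]);
		}
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(tid[i], NULL);
		pthread_barrier_destroy(&barrier);
		return 0;
	}

Build with `gcc -O2 -pthread` and run it (or the real reproducer) under
`valgrind --tool=helgrind`: if helgrind flags a race on those counters
even with the barrier in place, that would point at a broken barrier
implementation; if it instead flags whatever accesses the fence in your
patch is papering over, that would tell us what is actually racing.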