Re: [PATCH] pi_stress: Add memory barrier to resolve crash

On Thu, Oct 24, 2024 at 12:26:14PM -0400, John B. Wyatt IV wrote:
> This patch also serves as an error report for a crash seen on RHEL 9
> and Fedora 40. pi_stress crashes if you run it on nearly all cores,
> leaving one core (2 threads) free for housekeeping. The crash usually
> happens after about 2 hours of pi_stress running. The issue was
> reproduced on a RHEL server with 2 sockets of 10 cores/20 threads
> (40 threads total) and a Fedora server with 2 sockets of 56 cores/112
> threads (224 threads total).
> 
> The pthread_barrier_wait should guarantee that only one
> thread at a time interacts with the variables below that
> if block.
> 
> With -O2, GCC moves the two increments above the
> pthread_barrier_wait() calls. This causes a race condition found by
> Helgrind. You can verify this by commenting out the memory barrier
> and compiling with `-O0`; the crash does not happen with -O0.
> Thank you to Valentin Schneider <vschneid@xxxxxxxxxx> for
> suggesting that -O2 might be involved.
> 
> Add a memory barrier to force GCC to keep the increments below the
> pthread_barrier_wait() calls. With the barrier in place, no crashes
> occurred across several 24-hour test runs. The fence function
> (atomic_thread_fence()) requires C11.
> 
> Reported-by: Valgrind's Helgrind tool
> 
> Signed-off-by: "John B. Wyatt IV" <jwyatt@xxxxxxxxxx>
> Signed-off-by: "John B. Wyatt IV" <sageofredondo@xxxxxxxxx>
> ---
>  src/pi_tests/pi_stress.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/src/pi_tests/pi_stress.c b/src/pi_tests/pi_stress.c
> index 371e906..37023a1 100644
> --- a/src/pi_tests/pi_stress.c
> +++ b/src/pi_tests/pi_stress.c
> @@ -44,6 +44,7 @@
>  #include <stdint.h>
>  #include <inttypes.h>
>  #include <limits.h>
> +#include <stdatomic.h>
>  
>  #include "rt-sched.h"
>  #include "rt-utils.h"
> @@ -952,12 +953,31 @@ void *high_priority(void *arg)
>  			    ("high_priority[%d]: pthread_barrier_wait(finish): %x", p->id, status);
>  			return NULL;
>  		}
> +
> +
> +		/**
> +		 * The pthread_barrier_wait should guarantee that only one
> +		 * thread at a time interacts with the variables below that
> +		 * if block.
> +		 *
> +		 * GCC -O2 moves the two increments above the wait
> +		 * function calls, causing a race condition if you run this
> +		 * on nearly all cores with one core (2 threads) free for
> +		 * housekeeping. This causes a crash at around 2 hours of
> +		 * running. You can prove this by commenting out the barrier
> +		 * and compiling with `-O0`. The crash does not show up with
> +		 * -O0.
> +		 *
> +		 * Add a memory barrier to force GCC to increment the variables
> +		 * below the pthread calls. This function depends on C11.
> +		 **/

I agree with Crystal. AFAIU, the compiler should not reorder memory
accesses across pthread_barrier_wait(). Something odd is going on here.
However, I am not familiar with the code base. What's the repo for this?

Btw, did you look at the generated assembly code?
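For example, on the optimized build, something like:

        objdump -d pi_stress | less

and then searching for <high_priority> should show whether the
increments really end up ahead of the pthread_barrier_wait() calls.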

> +		atomic_thread_fence(memory_order_seq_cst);
> +

If the problem is program order, a simple compiler barrier with acquire
semantics should work:

        atomic_signal_fence(memory_order_acquire);
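Roughly what I have in mind, as a standalone sketch (not the actual
pi_stress code; the names here are made up for illustration), assuming
the loop body looks like the hunk above:

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define LOOPS 100000

        static pthread_barrier_t finish;
        static long total[2];

        static void *worker(void *arg)
        {
                int id = *(int *)arg;

                for (int i = 0; i < LOOPS; i++) {
                        /* both threads rendezvous here */
                        pthread_barrier_wait(&finish);

                        /*
                         * Compiler-only barrier: stops the compiler from
                         * hoisting the increment above the wait, but emits
                         * no fence instruction, unlike
                         * atomic_thread_fence(memory_order_seq_cst).
                         */
                        atomic_signal_fence(memory_order_acquire);

                        /* per-thread counter, analogous to p->total++ */
                        total[id]++;
                }
                return NULL;
        }

        int main(void)
        {
                pthread_t t[2];
                int ids[2] = { 0, 1 };

                pthread_barrier_init(&finish, NULL, 2);
                for (int i = 0; i < 2; i++)
                        pthread_create(&t[i], NULL, worker, &ids[i]);
                for (int i = 0; i < 2; i++)
                        pthread_join(t[i], NULL);
                pthread_barrier_destroy(&finish);

                printf("totals: %ld %ld\n", total[0], total[1]);
                return 0;
        }

Building that with gcc -O2 -pthread, with and without the fence, and
diffing the generated assembly should show whether it changes the code
at all.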

>  		/* update the group stats */
>  		p->total++;
>  
>  		/* update the watchdog counter */
>  		p->watchdog++;
> -
>  	}
>  	set_shutdown_flag();
>  
> -- 
> 2.47.0
> 




