Re: Allow multiple GP misses before Panic

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Thu, 13 Aug 2020 11:19:41 -0700

On Thu, Aug 13, 2020 at 10:22:09AM -0700, Chao Zhou wrote:
> Hi,
> 
> Some RCU stalls are transient, and the system is fully capable of
> recovering from them, but we do want Panic after a certain number of
> GP misses.
> 
> The current module parameter rcu_cpu_stall_panic only turns Panic on
> or off, and a single GP miss triggers Panic when it is enabled.
> 
> We plan to add a module parameter that lets users fine-tune how many
> GP misses are allowed before Panic.
> 
> To save time, we have already tested a diff on our systems; it works
> and solves our problem with transient RCU stall events.
> 
> Your insights and guidance are highly appreciated.

Please feel free to post a patch.  I could imagine a number of things
you might be doing from your description above:

1.	Having a different time for panic, so that (for example) an
	RCU CPU stall warning appears at 21 seconds (in mainline), and
	panic() is invoked if the grace period still has not ended at
	some later time specified by some kernel parameter.  For example,
	one approach would be to make the existing panic_on_rcu_stall
	sysctl take an integer instead of a boolean, and to make that
	integer specify how old the stall-warned grace period must be
	before panic() is invoked.

2.	Instead use the number of RCU CPU stall warning messages to
	trigger the panic, so that (for example), the panic would happen
	on the tenth message.  Again, the panic_on_rcu_stall sysctl
	might be used for this.

3.	Like #2, but reset the count every time a new grace period
	starts.  So if the panic_on_rcu_stall sysctl was set to
	ten, there would need to be ten RCU CPU stall warnings for
	the same grace period before panic() was invoked.
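
To make the difference between the three options concrete, here is a
rough userspace sketch in C.  It is only an illustration, not kernel
code: the names stall_state, gp_started(), and stall_warned() are
hypothetical, "limit" stands in for whatever an integer-valued
panic_on_rcu_stall might mean, and the sketch assumes that the age check
for option 1 is made each time a stall warning is printed.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum panic_policy {
	PANIC_ON_GP_AGE,	/* Option 1: age of the stalled grace period. */
	PANIC_ON_TOTAL_WARNS,	/* Option 2: total stall warnings since boot. */
	PANIC_ON_WARNS_PER_GP,	/* Option 3: warnings for the current GP only. */
};

struct stall_state {
	enum panic_policy policy;
	unsigned long limit;	   /* Seconds for option 1, else a count. */
	unsigned long gp_start;	   /* Start of current grace period, seconds. */
	unsigned long total_warns; /* Never reset. */
	unsigned long gp_warns;	   /* Reset when a new grace period starts. */
};

/* Record the start of a new grace period at time "now" (seconds). */
static void gp_started(struct stall_state *s, unsigned long now)
{
	s->gp_start = now;
	s->gp_warns = 0;
}

/*
 * Account for one stall warning printed at time "now" (seconds, assumed
 * to be no earlier than gp_start).  Returns true if the selected policy
 * says that panic() should be invoked.
 */
static bool stall_warned(struct stall_state *s, unsigned long now)
{
	s->total_warns++;
	s->gp_warns++;

	switch (s->policy) {
	case PANIC_ON_GP_AGE:
		return now - s->gp_start >= s->limit;
	case PANIC_ON_TOTAL_WARNS:
		return s->total_warns >= s->limit;
	case PANIC_ON_WARNS_PER_GP:
		return s->gp_warns >= s->limit;
	}
	return false;
}
```

With limit set to two and one warning in each of two successive grace
periods, option 2 would panic on the second warning, but option 3 would
not, because its count starts over when the second grace period begins.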

Of the above three, #1 and #3 seem the most attractive, with a slight
preference for #1.

Or did you have something else in mind?

							Thanx, Paul