Thanks Paul, please see my replies inline, marked @chao.

On Thu, Aug 13, 2020 at 2:21 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Thu, Aug 13, 2020 at 11:50:47AM -0700, Chao Zhou wrote:
> > Thanks Paul for the insights!
> >
> > I studied the 3 options and think that #1+#3 offers both flexibility
> > to users and coverage of the boundary use cases.
> >
> > For example, as a user of RCU, we want the warnings to be emitted at
> > the default 21 seconds so that we know such events are happening. At
> > the same time, we want a panic to happen if the stall is long enough
> > to significantly affect available system memory on our system.
>
> OK, good, you want to panic only on long stalls, not on multiple short
> stalls.  Makes sense!
>
> By the way, the short stalls sometimes indicate bugs that should
> be fixed.  For example, there might be a long loop in the kernel
> that needs an added cond_resched().  Fixing those would make your
> systems less noisy.
>
> I am not asking you to do anything I don't do myself:
>
> 0a3b3c253a1e ("mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls")
> 9f47eb5461aa ("fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls")

@chao: Yes, exactly: any stall will warn, which is why I want to add
miss statistics. But it is mostly the long stalls, which hold up RCU
grace periods and degrade the system's capability to the point where we
have to panic :). cond_resched() is exactly the API I am asking my team
to apply as one of the ways to fix some of the stallers. I feel honored
that we are on the same page.
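For anyone else following the thread, the fix is typically a one-line
addition inside the offending loop. A minimal sketch with hypothetical
function names (the two commits above are the real-world versions):

#include <linux/sched.h>	/* cond_resched() */

static void process_one_item(unsigned long i);	/* hypothetical helper */

/* Hypothetical long-running kernel loop: without the cond_resched(),
 * a large nitems can keep this CPU busy long enough to trigger RCU
 * CPU stall warnings. */
static void process_all_items(unsigned long nitems)
{
	unsigned long i;

	for (i = 0; i < nitems; i++) {
		process_one_item(i);	/* the actual long-running work */
		cond_resched();		/* reschedule if needed, giving RCU a chance */
	}
}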
> > Here is the plan based on our discussion, please advise if it is not
> > in line with the idea:
> > 1. modify panic_on_rcu_stall to be the maximum number of consecutive
> > warnings needed to trigger a panic.
> > 1) change its name to max_rcu_stall_to_panic,
>
> This change would make sense were it not for the possibility that there
> are lots of existing "panic_on_rcu_stall" instances in scripts, boot
> parameters, and so on.  We need the old name for compatibility, and with
> "0" and "1" having the same meanings as before.  We -can- use other
> numbers because those other numbers were not legal before.
>
> So we don't get to change the name.

@chao: Thanks, that is why I posted a follow-up email about this to get
your advice. It is a public API, so we had better find another way.

> > 2) default value to 1, which is the same behavior as today's.
>
> The default value must be zero, which needs to mean "don't ever panic".
> Setting the value to "1" must mean "panic after the very first RCU CPU
> stall warning".  Other values can have other meanings.
>
> But you could use the existing panic_on_rcu_stall to be (say) the number
> of stall-warning messages that will be tolerated in a given grace period
> before panicking.
>
> Then you could create a new panic_on_rcu_stall_age or similar that
> does the time-based panicking.

@chao: Good suggestions, I will use them in the patch.

> Either way, people currently using panic_on_rcu_stall need to be able
> to continue using panic_on_rcu_stall exactly as they are without seeing
> any change in behavior.  We don't break users if we have a choice,
> and here we really do have a choice.  So we don't break users.

@chao: Sure, will do. I am also learning the best practices here.

> > 2. use ((struct rcu_state *)->gpnum - (struct rcu_data *)->gpnum) >=
> > max_rcu_stall_to_panic as the condition to trigger a panic;
>
> This probably won't do what you want.  This would trigger only if
> the rcu_data structure's ->gpnum were to lag behind the rcu_state
> structure's ->gpnum, which happens now whenever a CPU is idle for
> an extended time period.
>
> For the time-based approach (#1 below), the idea would be to compare
> against the rcu_state structure's ->gp_start field, and even then only
> when a grace period is actually in progress.  One way to do this is
> to check against the "gps" local variable in check_cpu_stall() in
> kernel/rcu/tree_stall.h.

@chao: Good advice, will do in the patch.

> > 3. reset (struct rcu_data *)->gpnum to (struct rcu_state *)->gpnum
> > every time a new grace period starts;
>
> Touching every rcu_data structure at the beginning of each grace
> period will slow things down.  This slowdown is avoided in the
> current code by having each CPU update its rcu_data structure's
> ->gpnum when it notices a new grace period.
>
> And ->gpnum and ->completed have been combined into a single ->gp_seq
> in recent kernels.
>
> So it is better to compare against ->gp_start (AKA "gps") as discussed
> above.

@chao: Good advice, will do in the patch.

> > 4. add a new member (struct rcu_data *)->gpmiss that is incremented at
> > each miss to track how many misses so far for statistics/debug
> > purposes.
>
> This is for #3 below, correct?
>
> If so, please instead create an ->nstalls_cur_gp or similar in the
> rcu_state structure.  Zero this at the beginning of every grace
> period (in record_gp_stall_check_time() in kernel/rcu/tree_stall.h)
> and increment and check it in both legs of the "if" statement at the
> end of check_cpu_stall() in this same file.

@chao: Sure, will do.
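Putting these suggestions together, here is a rough sketch of what I
plan to post (the field, variable, and knob names follow your
suggestions above; the surrounding code in kernel/rcu/tree_stall.h is
omitted and the details may change in the actual patch):

/* (a) In record_gp_stall_check_time(), which runs as each grace
 * period starts: reset the per-grace-period stall-warning count. */
	rcu_state.nstalls_cur_gp = 0;	/* proposed new field in struct rcu_state */

/* (b) Near the end of check_cpu_stall(), where the local variable
 * "gps" already holds the jiffies timestamp at which the current
 * grace period began; sysctl_panic_on_rcu_stall_age is the proposed
 * new time-based knob (in seconds, 0 = disabled): */
	if (sysctl_panic_on_rcu_stall_age > 0 &&
	    time_after(jiffies, gps + sysctl_panic_on_rcu_stall_age * HZ))
		panic("RCU stall: grace period exceeded panic_on_rcu_stall_age");

/* (c) In both legs of the "if" statement that emits the stall warning,
 * treating sysctl_panic_on_rcu_stall as the number of warnings
 * tolerated per grace period (0 = never panic, 1 = today's behavior): */
	if (sysctl_panic_on_rcu_stall > 0 &&
	    ++rcu_state.nstalls_cur_gp >= sysctl_panic_on_rcu_stall)
		panic("RCU stall: warning count reached panic_on_rcu_stall");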
> > Your insights and advice are highly appreciated.
>
> Please let me know how it goes!

@chao: Thanks Paul, I think we are at the stage where I should post a
patch to gather further suggestions.

>							Thanx, Paul
>
> > Thanks!
> >
> > Chao
> >
> > On Thu, Aug 13, 2020 at 11:19 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 13, 2020 at 10:22:09AM -0700, Chao Zhou wrote:
> > > > Hi,
> > > >
> > > > Some RCU stalls are transient and the system is fully capable of
> > > > recovering afterward, but we do want a panic after a certain
> > > > number of GP misses.
> > > >
> > > > The current module parameter rcu_cpu_stall_panic only turns panic
> > > > on/off, and a single GP miss will trigger a panic when it is
> > > > enabled.
> > > >
> > > > The plan is to add a module parameter letting users fine-tune how
> > > > many GP misses are allowed before panicking.
> > > >
> > > > To save our precious time, a diff has already been tested on our
> > > > systems; it works and solves our problem with transient RCU stall
> > > > events.
> > > >
> > > > Your insights and guidance are highly appreciated.
> > >
> > > Please feel free to post a patch.  I could imagine a number of things
> > > you might be doing from your description above:
> > >
> > > 1. Having a different time for panic, so that (for example) an
> > >    RCU CPU stall warning appears at 21 seconds (in mainline), and
> > >    the panic happens only if the grace period still has not ended
> > >    at some later time specified by some kernel parameter.  For
> > >    example, one approach would be to make the existing
> > >    panic_on_rcu_stall sysctl take an integer instead of a boolean,
> > >    and to make that integer specify how old the stall-warned grace
> > >    period must be before panic() is invoked.
> > >
> > > 2. Instead use the number of RCU CPU stall warning messages to
> > >    trigger the panic, so that (for example) the panic would happen
> > >    on the tenth message.  Again, the panic_on_rcu_stall sysctl
> > >    might be used for this.
> > >
> > > 3. Like #2, but reset the count every time a new grace period
> > >    starts.  So if the panic_on_rcu_stall sysctl were set to
> > >    ten, there would need to be ten RCU CPU stall warnings for
> > >    the same grace period before panic() was invoked.
> > >
> > > Of the above three, #1 and #3 seem the most attractive, with a slight
> > > preference for #1.
> > >
> > > Or did you have something else in mind?
> > >
> > >							Thanx, Paul
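@chao: P.S. For concreteness, the new time-based knob from option #1
could be wired up as a sysctl modeled on the existing
panic_on_rcu_stall entry in kernel/sysctl.c. A sketch using the
provisional name from our discussion (the exact range, units, and
placement are still to be decided in the patch):

/* Declaration next to the existing sysctl_panic_on_rcu_stall: */
int sysctl_panic_on_rcu_stall_age;	/* seconds; 0 (default) = never panic */

/* New entry in the kernel/sysctl.c table, next to "panic_on_rcu_stall": */
{
	.procname	= "panic_on_rcu_stall_age",
	.data		= &sysctl_panic_on_rcu_stall_age,
	.maxlen		= sizeof(sysctl_panic_on_rcu_stall_age),
	.mode		= 0644,
	.proc_handler	= proc_dointvec_minmax,
	.extra1		= SYSCTL_ZERO,	/* reject negative values */
},

An administrator could then, for example, keep the default 21-second
warnings but panic only once a grace period has been stalled for five
minutes, by writing 300 to /proc/sys/kernel/panic_on_rcu_stall_age.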