Thanks Paul, please see my replies inline, marked @chao.

On Thu, Aug 13, 2020 at 2:21 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Thu, Aug 13, 2020 at 11:50:47AM -0700, Chao Zhou wrote:
> > Thanks Paul for the insights!
> >
> > I studied the 3 options and think that #1+#3 offers both flexibility
> > to users and coverage of the boundary use cases.
> >
> > For example, as a user of RCU, we want the warnings to be emitted at
> > the default 21 seconds so that we know such events are happening. At
> > the same time, we want a panic to happen if the stall is long enough
> > to significantly affect available system memory on our system.
>
> OK, good, you want to panic only on long stalls, not on multiple short
> stalls.  Makes sense!
>
> By the way, the short stalls sometimes indicate bugs that should
> be fixed.  For example, there might be a long loop in the kernel
> that needs an added cond_resched().  Fixing those would make your
> systems less noisy.
>
> I am not asking you to do anything I don't do myself:
>
> 0a3b3c253a1e ("mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls")
> 9f47eb5461aa ("fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls")

@chao: Yes, exactly: any stall will warn, which is why I want to add
miss statistics. But it is mostly the long stalls, which hold up RCU
grace periods and degrade the system's capability to the point where we
have to panic :). cond_resched() is exactly the API I am asking my team
to apply as one of the ways to fix some of the stallers. I feel honored
that we are on the same page.
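For anyone else following the thread, the fix is typically a one-line
addition inside the offending loop. A minimal sketch with hypothetical
function names (the two commits above are the real-world versions):

#include <linux/sched.h>	/* cond_resched() */

static void process_one_item(unsigned long i);	/* hypothetical helper */

/* Hypothetical long-running kernel loop: without the cond_resched(),
 * a large nitems can keep this CPU busy long enough to trigger RCU
 * CPU stall warnings. */
static void process_all_items(unsigned long nitems)
{
	unsigned long i;

	for (i = 0; i < nitems; i++) {
		process_one_item(i);	/* the actual long-running work */
		cond_resched();		/* reschedule if needed, giving RCU a chance */
	}
}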
> > Here is the plan based on our discussion, please advise if it is not
> > in line with the idea:
> > 1. modify panic_on_rcu_stall to be the maximum number of consecutive
> > warnings needed to trigger a panic.
> > 1) change its name to max_rcu_stall_to_panic,
>
> This change would make sense were it not for the possibility that there
> are lots of existing "panic_on_rcu_stall" instances in scripts, boot
> parameters, and so on.  We need the old name for compatibility, and with
> "0" and "1" having the same meanings as before.  We -can- use other
> numbers because those other numbers were not legal before.
>
> So we don't get to change the name.

@chao: Thanks, that is why I posted a follow-up email about this to get
your advice. It is a public API, so we had better find another way.

> > 2) default value to 1, which is the same behavior as today's.
>
> The default value must be zero, which needs to mean "don't ever panic".
> Setting the value to "1" must mean "panic after the very first RCU CPU
> stall warning".  Other values can have other meanings.
>
> But you could use the existing panic_on_rcu_stall to be (say) the number
> of stall-warning messages that will be tolerated in a given grace period
> before panicking.
>
> Then you could create a new panic_on_rcu_stall_age or similar that
> does the time-based panicking.

@chao: Good suggestions, I will use them in the patch.

> Either way, people currently using panic_on_rcu_stall need to be able
> to continue using panic_on_rcu_stall exactly as they are without seeing
> any change in behavior.  We don't break users if we have a choice,
> and here we really do have a choice.  So we don't break users.

@chao: Sure, will do. I am also learning the best practices here.

> > 2. use ((struct rcu_state *)->gpnum - (struct rcu_data *)->gpnum) >=
> > max_rcu_stall_to_panic as the condition to trigger a panic;
>
> This probably won't do what you want.  This would trigger only if
> the rcu_data structure's ->gpnum were to lag behind the rcu_state
> structure's ->gpnum, which happens now whenever a CPU is idle for
> an extended time period.
>
> For the time-based approach (#1 below), the idea would be to compare
> against the rcu_state structure's ->gp_start field, and even then only
> when a grace period is actually in progress.  One way to do this is
> to check against the "gps" local variable in check_cpu_stall() in
> kernel/rcu/tree_stall.h.

@chao: Good advice, will do in the patch.

> > 3. reset (struct rcu_data *)->gpnum to (struct rcu_state *)->gpnum
> > every time a new grace period starts;
>
> Touching every rcu_data structure at the beginning of each grace
> period will slow things down.  This slowdown is avoided in the
> current code by having each CPU update its rcu_data structure's
> ->gpnum when it notices a new grace period.
>
> And ->gpnum and ->completed have been combined into a single ->gp_seq
> in recent kernels.
>
> So it is better to compare against ->gp_start (AKA "gps") as discussed
> above.

@chao: Good advice, will do in the patch.

> > 4. add a new member (struct rcu_data *)->gpmiss that is incremented at
> > each miss to track how many misses so far for statistics/debug
> > purposes.
>
> This is for #3 below, correct?
>
> If so, please instead create an ->nstalls_cur_gp or similar in the
> rcu_state structure.  Zero this at the beginning of every grace
> period (in record_gp_stall_check_time() in kernel/rcu/tree_stall.h)
> and increment and check it in both legs of the "if" statement at the
> end of check_cpu_stall() in this same file.

@chao: Sure, will do.
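Putting these suggestions together, here is a rough sketch of what I
plan to post (the field, variable, and knob names follow your
suggestions above; the surrounding code in kernel/rcu/tree_stall.h is
omitted and the details may change in the actual patch):

/* (a) In record_gp_stall_check_time(), which runs as each grace
 * period starts: reset the per-grace-period stall-warning count. */
	rcu_state.nstalls_cur_gp = 0;	/* proposed new field in struct rcu_state */

/* (b) Near the end of check_cpu_stall(), where the local variable
 * "gps" already holds the jiffies timestamp at which the current
 * grace period began; sysctl_panic_on_rcu_stall_age is the proposed
 * new time-based knob (in seconds, 0 = disabled): */
	if (sysctl_panic_on_rcu_stall_age > 0 &&
	    time_after(jiffies, gps + sysctl_panic_on_rcu_stall_age * HZ))
		panic("RCU stall: grace period exceeded panic_on_rcu_stall_age");

/* (c) In both legs of the "if" statement that emits the stall warning,
 * treating sysctl_panic_on_rcu_stall as the number of warnings
 * tolerated per grace period (0 = never panic, 1 = today's behavior): */
	if (sysctl_panic_on_rcu_stall > 0 &&
	    ++rcu_state.nstalls_cur_gp >= sysctl_panic_on_rcu_stall)
		panic("RCU stall: warning count reached panic_on_rcu_stall");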
> > Your insights and advice are highly appreciated.
>
> Please let me know how it goes!

@chao: Thanks Paul, I think we are at the stage where I should post a
patch to gather further suggestions.

>							Thanx, Paul
>
> > Thanks!
> >
> > Chao
> >
> > On Thu, Aug 13, 2020 at 11:19 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Aug 13, 2020 at 10:22:09AM -0700, Chao Zhou wrote:
> > > > Hi,
> > > >
> > > > Some RCU stalls are transient and the system is fully capable of
> > > > recovering afterward, but we do want a panic after a certain
> > > > number of GP misses.
> > > >
> > > > The current module parameter rcu_cpu_stall_panic only turns panic
> > > > on/off, and a single GP miss will trigger a panic when it is
> > > > enabled.
> > > >
> > > > The plan is to add a module parameter letting users fine-tune how
> > > > many GP misses are allowed before panicking.
> > > >
> > > > To save our precious time, a diff has already been tested on our
> > > > systems; it works and solves our problem with transient RCU stall
> > > > events.
> > > >
> > > > Your insights and guidance are highly appreciated.
> > >
> > > Please feel free to post a patch.  I could imagine a number of things
> > > you might be doing from your description above:
> > >
> > > 1. Having a different time for panic, so that (for example) an
> > >    RCU CPU stall warning appears at 21 seconds (in mainline), and
> > >    the panic happens only if the grace period still has not ended
> > >    at some later time specified by some kernel parameter.  For
> > >    example, one approach would be to make the existing
> > >    panic_on_rcu_stall sysctl take an integer instead of a boolean,
> > >    and to make that integer specify how old the stall-warned grace
> > >    period must be before panic() is invoked.
> > >
> > > 2. Instead use the number of RCU CPU stall warning messages to
> > >    trigger the panic, so that (for example) the panic would happen
> > >    on the tenth message.  Again, the panic_on_rcu_stall sysctl
> > >    might be used for this.
> > >
> > > 3. Like #2, but reset the count every time a new grace period
> > >    starts.  So if the panic_on_rcu_stall sysctl were set to
> > >    ten, there would need to be ten RCU CPU stall warnings for
> > >    the same grace period before panic() was invoked.
> > >
> > > Of the above three, #1 and #3 seem the most attractive, with a slight
> > > preference for #1.
> > >
> > > Or did you have something else in mind?
> > >
> > >							Thanx, Paul
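@chao: P.S. For concreteness, the new time-based knob from option #1
could be wired up as a sysctl modeled on the existing
panic_on_rcu_stall entry in kernel/sysctl.c. A sketch using the
provisional name from our discussion (the exact range, units, and
placement are still to be decided in the patch):

/* Declaration next to the existing sysctl_panic_on_rcu_stall: */
int sysctl_panic_on_rcu_stall_age;	/* seconds; 0 (default) = never panic */

/* New entry in the kernel/sysctl.c table, next to "panic_on_rcu_stall": */
{
	.procname	= "panic_on_rcu_stall_age",
	.data		= &sysctl_panic_on_rcu_stall_age,
	.maxlen		= sizeof(sysctl_panic_on_rcu_stall_age),
	.mode		= 0644,
	.proc_handler	= proc_dointvec_minmax,
	.extra1		= SYSCTL_ZERO,	/* reject negative values */
},

An administrator could then, for example, keep the default 21-second
warnings but panic only once a grace period has been stalled for five
minutes, by writing 300 to /proc/sys/kernel/panic_on_rcu_stall_age.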