On Tue, Apr 19, 2022 at 01:34:26PM +0800, Zqiang wrote: > If the rcutree.use_softirq is configured, when RCU Stall event > happened, dump status of all rcuc kthreads who due to starvation > prevented grace period ends on CPUs that not report quiescent > state. >Please accept my apologies for the delay, and please let me try >again. ;-) > >Your earlier patch added at most one line and one stack backtrace to >the RCU CPU stall warning text, which is OK. Sort of, anyway. I was >relying on the fact that the people who have (rightly) complained about >RCU CPU stall-warning verbosity never run with !use_softirq. But it is >only a matter of time. Yes, we could argue that they should use faster >console serial lines, faster management-console hardware, faster networks, >faster mass storage, and so on, but I would expect them to in turn ask >us if we were volunteering to pay for all that. > >In contrast, this patch can add one line per stalled CPU on top of the >existing output. Which is better than your earlier patch, which could >add a line plus a stack trace per stalled CPU. But that can still be >a lot of added output, and that added output can cause problems. > >So, could you please merge this rcuc-stalled information into the >existing per-CPU line printed by print_cpu_stall_info()? Right now, >each such line looks something like this: > >rcu: 0-....: (4 ticks this GP) idle=1e6/1/0x4000000000000002 softirq=12470/12470 fqs=2 > >One approach would be to add the number of jiffies that the rcuc >task was stalled to this line, maybe something like this: > >rcu: 0-....: (4 ticks this GP) idle=1e6/1/0x4000000000000002 softirq=12470/12470 fqs=2 rcuc=15384 > >Of course, this "rcuc=" string should only ut only if the stall lasted >for longer than (say) one eighth of the stall timeout. > >Any "(false positive?)" needs to remain at the end of the line: > >rcu: 0-....: (4 ticks this GP) idle=1e6/1/0x4000000000000002 softirq=12470/12470 fqs=2 rcuc=15384 (false positive?) > >Thoughts? Thanks suggestion, I will resend v3. > > Thanx, Paul > > Signed-off-by: Zqiang <qiang1.zhang@xxxxxxxxx> > --- > v1->v2: > rework rcuc_kthread_dump function > > kernel/rcu/tree_stall.h | 32 ++++++++++++++++++++++++++------ > 1 file changed, 26 insertions(+), 6 deletions(-) > > diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h > index d7956c03edbd..fcf0b2e1a71c 100644 > --- a/kernel/rcu/tree_stall.h > +++ b/kernel/rcu/tree_stall.h > @@ -465,11 +465,13 @@ static void print_cpu_stall_info(int cpu) > falsepositive ? " (false positive?)" : ""); > } > > -static void rcuc_kthread_dump(struct rcu_data *rdp) > +static void __rcuc_kthread_dump(int cpu) > { > - int cpu; > - unsigned long j; > + struct rcu_data *rdp; > struct task_struct *rcuc; > + unsigned long j; > + > + rdp = per_cpu_ptr(&rcu_data, cpu); > > rcuc = rdp->rcu_cpu_kthread_task; > if (!rcuc) > @@ -488,6 +490,21 @@ static void rcuc_kthread_dump(struct rcu_data *rdp) > dump_cpu_task(cpu); > } > > +static void rcuc_kthread_dump(void) > +{ > + int cpu; > + struct rcu_node *rnp; > + unsigned long flags; > + > + rcu_for_each_leaf_node(rnp) { > + raw_spin_lock_irqsave_rcu_node(rnp, flags); > + for_each_leaf_node_possible_cpu(rnp, cpu) > + if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) > + __rcuc_kthread_dump(cpu); > + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); > + } > +} > + > /* Complain about starvation of grace-period kthread. */ > static void rcu_check_gp_kthread_starvation(void) > { > @@ -597,6 +614,9 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) > if (ndetected) { > rcu_dump_cpu_stacks(); > > + if (!use_softirq) > + rcuc_kthread_dump(); > + > /* Complain about tasks blocking the grace period. */ > rcu_for_each_leaf_node(rnp) > rcu_print_detail_task_stall_rnp(rnp); > @@ -659,11 +679,11 @@ static void print_cpu_stall(unsigned long gps) > rcu_check_gp_kthread_expired_fqs_timer(); > rcu_check_gp_kthread_starvation(); > > - if (!use_softirq) > - rcuc_kthread_dump(rdp); > - > rcu_dump_cpu_stacks(); > > + if (!use_softirq) > + rcuc_kthread_dump(); > + > raw_spin_lock_irqsave_rcu_node(rnp, flags); > /* Rewrite if needed in case of slow consoles. */ > if (ULONG_CMP_GE(jiffies, READ_ONCE(rcu_state.jiffies_stall))) > -- > 2.25.1 >