> > > +config RCU_CPU_STALL_CPUTIME > > > > Since the information might change and grow over time, consider > > calling it "ADDITIONAL_INFO" rather than just "CPUTIME". > > Except that I already redirected Zhen Lei from a generic description > to a more specific one. The reason for this is that I already get > complaints about the large volume of output from RCU CPU stall warnings, > which suggests that unless the additional information is quite small, > it should get its own Kconfig option and kernel boot parameter. > > So we well be keeping RCU_CPU_STALL_CPUTIME. No problem - CPUTIME is easier to search for, and will likely always remain as one of the values reported. > But please tie any such patch to an actual use case. After all, if no > one actually uses that additional information, we have irritates untold > numbers of electrons for no purpose. I've been working on fixing intermittent RCU stalls that turned out to be caused by x86-optimized crypto modules holding kernel_fpu_begin()/end() too long and crypto tests not calling cond_resched() often enough, so am interested in features that will help identify and prevent such problems. > Also, some of those functions are on fastpaths, so adding unconditional > instrumentation overhead might result in an objection or three. > > As always, choose wisely! ;-) > > If I don't see anything from you by this coming Friday, I will fold my > usual wordsmithing into the patch. I applied the series to a tree not including any crypto module fixes, but an overnight run didn't trigger any RCU stalls. So, I modified the tcrypt test module to simulate a problem by running a lengthy loop after kernel_fpu_begin(). Here's an example of how the output looks: [ 1816.698380] rcu: INFO: rcu_preempt self-detected stall on CPU [ 1816.704368] rcu: 0-....: (2999 ticks this GP) idle=1dbc/1/0x4000000000000000 softirq=39683/39683 fqs=751 [ 1816.714200] rcu: hardirqs softirqs csw/system [ 1816.719922] rcu: number: 5 10 0 [ 1816.725643] rcu: cputime: 3 0 1492 ==> 1500(ms) [ 1816.732669] (t=3030 jiffies g=89765 q=308 ncpus=56) [ 1816.737857] CPU: 0 PID: 46826 Comm: modprobe Tainted: G W 6.0.0+ #5 [ 1816.745754] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/08/2022 [ 1816.754523] RIP: 0010:rude_sleep_cycles.constprop.0+0x1c/0x30 [tcrypt] [ 1816.761290] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 0f 31 48 c1 e2 20 be ab 90 41 00 48 89 d1 48 c1 e6 0b 48 09 c1 0f 31 <48> c1 e2 20 48 09 c2 48 29 ca 48 39 d6 73 ef c3 cc cc cc cc 48 8b ... That makes me realize what "csw/switch" means: - "csw" context switches applies to the number line - "system" applies the cputime line Maybe this shouldn't be a table? Make it grep-friendly: [ 1816.719922] rcu: half-timeout counts: hardirq =5 softirq=10 csw=0 [ 1816.725643] rcu: half_timeout cputimes (ms): time=1500 hardirq=3 softirq=0 system=1492 The new prints interfere with existing prints, pushing the "t=3030" line further away from its context. That's from a pr_cont. Existing prints after the "self-detect stall" message all start with \t, so they are roughly related to the first line. The new prints should probably do the same. Since the first line ends with \n, the pr_cont will never make it on the same line, so it might be cleaner to use pr_err instead. That way it'll have the "rcu:" prefix of the other lines. That's from: pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name); raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags); print_cpu_stall_info(smp_processor_id()); [that's where this patch adds more prints] raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags); for_each_possible_cpu(cpu) totqlen += rcu_get_n_cbs_cpu(cpu); pr_cont("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n", jiffies - gps, (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus); One other print has similar construction: pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name); ... print_cpu_stall_info(cpu); ... pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n", smp_processor_id(), (long)(jiffies - gps), (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus);