On 2022/11/7 10:10, Elliott, Robert (Servers) wrote: > >>>> +config RCU_CPU_STALL_CPUTIME >>> >>> Since the information might change and grow over time, consider >>> calling it "ADDITIONAL_INFO" rather than just "CPUTIME". >> >> Except that I already redirected Zhen Lei from a generic description >> to a more specific one. The reason for this is that I already get >> complaints about the large volume of output from RCU CPU stall warnings, >> which suggests that unless the additional information is quite small, >> it should get its own Kconfig option and kernel boot parameter. >> >> So we well be keeping RCU_CPU_STALL_CPUTIME. > > No problem - CPUTIME is easier to search for, and will likely > always remain as one of the values reported. > >> But please tie any such patch to an actual use case. After all, if no >> one actually uses that additional information, we have irritates untold >> numbers of electrons for no purpose. > > I've been working on fixing intermittent RCU stalls that turned out to be > caused by x86-optimized crypto modules holding kernel_fpu_begin()/end() > too long and crypto tests not calling cond_resched() often enough, so am > interested in features that will help identify and prevent such > problems. > >> Also, some of those functions are on fastpaths, so adding unconditional >> instrumentation overhead might result in an objection or three. >> >> As always, choose wisely! ;-) >> >> If I don't see anything from you by this coming Friday, I will fold my >> usual wordsmithing into the patch. > > I applied the series to a tree not including any crypto module > fixes, but an overnight run didn't trigger any RCU stalls. So, I > modified the tcrypt test module to simulate a problem by > running a lengthy loop after kernel_fpu_begin(). > Here's an example of how the output looks: > > > [ 1816.698380] rcu: INFO: rcu_preempt self-detected stall on CPU > [ 1816.704368] rcu: 0-....: (2999 ticks this GP) idle=1dbc/1/0x4000000000000000 softirq=39683/39683 fqs=751 > [ 1816.714200] rcu: hardirqs softirqs csw/system > [ 1816.719922] rcu: number: 5 10 0 > [ 1816.725643] rcu: cputime: 3 0 1492 ==> 1500(ms) > [ 1816.732669] (t=3030 jiffies g=89765 q=308 ncpus=56) > [ 1816.737857] CPU: 0 PID: 46826 Comm: modprobe Tainted: G W 6.0.0+ #5 > [ 1816.745754] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 03/08/2022 > [ 1816.754523] RIP: 0010:rude_sleep_cycles.constprop.0+0x1c/0x30 [tcrypt] > [ 1816.761290] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 0f 31 48 c1 e2 20 be ab 90 41 00 48 89 d1 48 c1 e6 0b 48 09 c1 0f 31 <48> c1 e2 20 48 09 c2 48 29 ca 48 39 d6 73 ef c3 cc cc cc cc 48 8b > ... > > That makes me realize what "csw/switch" means: > - "csw" context switches applies to the number line > - "system" applies the cputime line > > Maybe this shouldn't be a table? Make it grep-friendly: > [ 1816.719922] rcu: half-timeout counts: hardirq =5 softirq=10 csw=0 > [ 1816.725643] rcu: half_timeout cputimes (ms): time=1500 hardirq=3 softirq=0 system=1492 I prefer the table. Table look clearer and easier to compare. > > The new prints interfere with existing prints, pushing the "t=3030" line > further away from its context. That's from a pr_cont. Existing prints after > the "self-detect stall" message all start with \t, so they are roughly > related to the first line. The new prints should probably do the same. > > Since the first line ends with \n, the pr_cont will never make it on the > same line, so it might be cleaner to use pr_err instead. That way it'll have > the "rcu:" prefix of the other lines. > > That's from: > pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name); > raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags); > print_cpu_stall_info(smp_processor_id()); > [that's where this patch adds more prints] > raw_spin_unlock_irqrestore_rcu_node(rdp->mynode, flags); > for_each_possible_cpu(cpu) > totqlen += rcu_get_n_cbs_cpu(cpu); > pr_cont("\t(t=%lu jiffies g=%ld q=%lu ncpus=%d)\n", > jiffies - gps, > (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus); > > > One other print has similar construction: > pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name); > ... > print_cpu_stall_info(cpu); > ... > pr_cont("\t(detected by %d, t=%ld jiffies, g=%ld, q=%lu ncpus=%d)\n", > smp_processor_id(), (long)(jiffies - gps), > (long)rcu_seq_current(&rcu_state.gp_seq), totqlen, rcu_state.n_online_cpus); > > > > . > -- Regards, Zhen Lei