On 04/08/2022 18:30, Paul E. McKenney wrote:
> On Thu, Aug 04, 2022 at 04:54:14PM +0200, Dietmar Eggemann wrote:
>> Hi Paul,
>
> Adding the rcu list on CC in case someone with more ARM experience
> than I have has additional insights.

Many thanks for your swift response!

>> one of my colleagues here in Arm approached me with an RCU stall issue
>> on `5.4.0-66 Low Latency (Ubuntu)` (on Arm64 Ampere Altra server) he
>> gets when he tries to bring-up a network card for which he only has the
>> binary module for this kernel version. I tried to help him understanding
>> it but after checking all the kernel config switches and studying the
>> RCU code we still haven't found the culprit here. I was hoping you can
>> give us some advice on this matter.
>
> Do other kernel versions work better? If so, I suggest manually forcing
> an RCU CPU stall warning and bisecting. (But you knew that already!)

Hard to say. IIUC he only has those binaries for this particular version.

>> (1) What he gets when he launches the card is:
>>
>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [15766.032270] rcu: 0-...0: (2 GPs behind) idle=866/0/0x1 softirq=109892/109892 fqs=7186
>> [15766.040260] rcu: 7-...0: (51 ticks this GP) idle=bc6/1/0x4000000000000000 softirq=7973/7975 fqs=7187
>> [15766.049552] rcu: 19-...0: (18 ticks this GP) idle=842/1/0x4000000000000000 softirq=1289/1289 fqs=7187
>> [15766.058931] rcu: 33-...0: (1 GPs behind) idle=36e/1/0x4000000000000000 softirq=132936/132936 fqs=7187
>> [15766.068309] rcu: 37-...0: (2 GPs behind) idle=c86/0/0x1 softirq=117951/117951 fqs=7187
>>
>> w/o any task or stack information (1a). Even the line:
>>
>> [ X.XXXXXX] (detected by X, t=XXX jiffies, g=XXX, q=XX)
>>
>> is missing (1b).
>
> I never have seen this, aside from the usual printk() messages being
> lost due to overrunning the console device. But then you will at least
> sometimes get messages saying that output was lost.

I asked him to check.
He told me that there are no such lines in the log.

[...]

>> My question now is how can this happen and is there a way to convince
>> the system to hand out the missing information?
>> Other folks were suggesting to use kdump kernel w/ panic_on_rcu_stall=1
>> and then use the crash tool. I haven't use this so far.
>
> That makes a lot of sense to me!
>
> You could then look through the rcu_state data structure, including
> the rcu_state.node[] array to see what is stalling the grace period.

I convinced him to go down this path for now to find the issue.

>> I was hoping you can shed some light on this issue.
>
> Another possibility would be to set hbreak breakpoints in gdb.

[...]
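
Since the per-CPU lines from (1) are all we have until the kdump route
yields something, a small script to pull them apart may help when
comparing repeated stall warnings side by side. This is only a sketch:
the regular expression matches the line layout quoted in (1) and nothing
else, and the names used here (StallEntry, parse_stall_lines,
softirq_unchanged) are made up for illustration and do not come from the
kernel.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Matches only the per-CPU line layout quoted in (1); other kernel
# versions may print these lines differently.
STALL_LINE = re.compile(
    r"rcu:\s+(?P<cpu>\d+)-(?P<flags>\S+):\s+"
    r"\((?P<progress>[^)]*)\)\s+"
    r"idle=(?P<idle>\S+)\s+"
    r"softirq=(?P<softirq_first>\d+)/(?P<softirq_second>\d+)\s+"
    r"fqs=(?P<fqs>\d+)"
)


@dataclass
class StallEntry:
    """One per-CPU stall line; field names are illustrative labels."""
    cpu: int
    flags: str
    progress: str
    idle: str
    softirq: Tuple[int, int]
    fqs: int


def parse_stall_lines(log: str) -> List[StallEntry]:
    """Return one StallEntry per per-CPU stall line found in the log text."""
    entries = []
    for m in STALL_LINE.finditer(log):
        entries.append(StallEntry(
            cpu=int(m.group("cpu")),
            flags=m.group("flags"),
            progress=m.group("progress"),
            idle=m.group("idle"),
            softirq=(int(m.group("softirq_first")),
                     int(m.group("softirq_second"))),
            fqs=int(m.group("fqs")),
        ))
    return entries


def softirq_unchanged(entry: StallEntry) -> bool:
    """True if both softirq counters in the line are equal."""
    return entry.softirq[0] == entry.softirq[1]


# The five lines reported in (1), copied from the log.
SAMPLE_LOG = """
[15766.032270] rcu: 0-...0: (2 GPs behind) idle=866/0/0x1 softirq=109892/109892 fqs=7186
[15766.040260] rcu: 7-...0: (51 ticks this GP) idle=bc6/1/0x4000000000000000 softirq=7973/7975 fqs=7187
[15766.049552] rcu: 19-...0: (18 ticks this GP) idle=842/1/0x4000000000000000 softirq=1289/1289 fqs=7187
[15766.058931] rcu: 33-...0: (1 GPs behind) idle=36e/1/0x4000000000000000 softirq=132936/132936 fqs=7187
[15766.068309] rcu: 37-...0: (2 GPs behind) idle=c86/0/0x1 softirq=117951/117951 fqs=7187
"""

if __name__ == "__main__":
    for entry in parse_stall_lines(SAMPLE_LOG):
        print(entry.cpu, entry.progress, entry.idle, entry.softirq,
              "softirq unchanged" if softirq_unchanged(entry) else "")
```

Running it over the logs of consecutive warnings would make it easier to
see whether the second softirq number of a given CPU ever moves. As far
as I understand the stall warning documentation, a value that stays
constant across repeated warnings hints that the RCU softirq handler is
no longer running on that CPU, but please correct me if I got that wrong.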