Re: 5.4.13-rt7 stall on CPU?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2020-06-27 15:30:20 [+0200], Udo van den Heuvel wrote:
> Hello,
Hi,

> Found this in /var/log/messages:
> 
> Jun 25 16:31:39 vuurmuur pppd[1522583]: local  LL address fe80::ed36:3ac4:4115:e23e
> Jun 25 16:31:39 vuurmuur pppd[1522583]: remote LL address fe80::2a8a:1cff:fee0:9484
> Jun 26 04:50:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt self-detected stall on CPU
> Jun 26 04:50:24 vuurmuur kernel: 002: rcu:      2-....: (5336 ticks this GP) idle=f6a/1/0x4000000000000002 softirq=347363113/347363115 fqs=2430
> Jun 26 04:50:24 vuurmuur kernel: 002:   (t=5250 jiffies g=608224341 q=1297)
> Jun 26 04:50:24 vuurmuur kernel: 002: NMI backtrace for cpu 2
…
> Jun 26 04:50:24 vuurmuur kernel: 002: RIP: 0010:__fget_light+0x3d/0x60
> Jun 26 04:50:24 vuurmuur kernel: 002: Code: ca 75 2e 48 8b 50 50 8b 02 39 c7 73 21 89 f9 48 39 c1 48 19 c0 21 c7 48 8b 42 08 48 8d 04 f8 48 8b 00 48 85 c0 74 07 85 70 7c <75> 02 f3 c3 31 c0 c3 ba 01 00 00 00 e8 22 fe ff ff 48 85 c0 74 ee
…
> Jun 26 04:50:24 vuurmuur kernel: 002:  do_select+0x350/0x7a0
> Jun 26 04:50:24 vuurmuur kernel: 002:  core_sys_select+0x1d0/0x380
> Jun 26 04:50:24 vuurmuur kernel: 002:  __x64_sys_pselect6+0x141/0x190
…
> Jun 26 05:03:01 vuurmuur named[1433212]: received control channel command 'flush'
> 
> 
> What went wrong?

ntpq entered into kernel via pselect(). In that syscall it looped at
somepoint and RCU couldn't make any progress. Assuming you have
CONFIG_HZ=250 then it didn't make any progress for 5250/250 = 21
seconds. This stall piled 1297 callbacks up. The situation resolved by
itself later because this "rcu_preempt self-detected stall" did not
appear again.

> How bad is this?
Each callback would free a data structure i.e. give back memory to the
system. Since ntpq lead to a RCU stall, the system could no release
memory. You will run eventually out of memory if this situation does not
get resolved.

> How to avoid?
Can you reproduce this or was this one a time thing?
I *think* this happened within the loop in __fget_files(). This function
is inlined by __fget_light() and the loop has a RCU-section so it would
make sense.
Do you run something at an elevated priority in the system? I don't know
what the other part was doing but somehow one of the file descriptors
(network sockets probably) was about to be closed while the other side
tried to poll() on it.

> Kind regards,
> Udo

Sebastian




[Index of Archives]     [RT Stable]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]

  Powered by Linux