On 2020-06-27 15:30:20 [+0200], Udo van den Heuvel wrote: > Hello, Hi, > Found this in /var/log/messages: > > Jun 25 16:31:39 vuurmuur pppd[1522583]: local LL address fe80::ed36:3ac4:4115:e23e > Jun 25 16:31:39 vuurmuur pppd[1522583]: remote LL address fe80::2a8a:1cff:fee0:9484 > Jun 26 04:50:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt self-detected stall on CPU > Jun 26 04:50:24 vuurmuur kernel: 002: rcu: 2-....: (5336 ticks this GP) idle=f6a/1/0x4000000000000002 softirq=347363113/347363115 fqs=2430 > Jun 26 04:50:24 vuurmuur kernel: 002: (t=5250 jiffies g=608224341 q=1297) > Jun 26 04:50:24 vuurmuur kernel: 002: NMI backtrace for cpu 2 … > Jun 26 04:50:24 vuurmuur kernel: 002: RIP: 0010:__fget_light+0x3d/0x60 > Jun 26 04:50:24 vuurmuur kernel: 002: Code: ca 75 2e 48 8b 50 50 8b 02 39 c7 73 21 89 f9 48 39 c1 48 19 c0 21 c7 48 8b 42 08 48 8d 04 f8 48 8b 00 48 85 c0 74 07 85 70 7c <75> 02 f3 c3 31 c0 c3 ba 01 00 00 00 e8 22 fe ff ff 48 85 c0 74 ee … > Jun 26 04:50:24 vuurmuur kernel: 002: do_select+0x350/0x7a0 > Jun 26 04:50:24 vuurmuur kernel: 002: core_sys_select+0x1d0/0x380 > Jun 26 04:50:24 vuurmuur kernel: 002: __x64_sys_pselect6+0x141/0x190 … > Jun 26 05:03:01 vuurmuur named[1433212]: received control channel command 'flush' > > > What went wrong? ntpq entered into kernel via pselect(). In that syscall it looped at somepoint and RCU couldn't make any progress. Assuming you have CONFIG_HZ=250 then it didn't make any progress for 5250/250 = 21 seconds. This stall piled 1297 callbacks up. The situation resolved by itself later because this "rcu_preempt self-detected stall" did not appear again. > How bad is this? Each callback would free a data structure i.e. give back memory to the system. Since ntpq lead to a RCU stall, the system could no release memory. You will run eventually out of memory if this situation does not get resolved. > How to avoid? Can you reproduce this or was this one a time thing? I *think* this happened within the loop in __fget_files(). This function is inlined by __fget_light() and the loop has a RCU-section so it would make sense. Do you run something at an elevated priority in the system? I don't know what the other part was doing but somehow one of the file descriptors (network sockets probably) was about to be closed while the other side tried to poll() on it. > Kind regards, > Udo Sebastian