On 7/21/23 18:08, Paul E. McKenney wrote:
> On Fri, Jul 21, 2023 at 03:20:48PM -0400, Joel Fernandes wrote:
>> (Trimming the CC list a bit)
>>
>> On 7/21/23 08:13, Joel Fernandes wrote:
>>> On 7/20/23 15:47, Paul E. McKenney wrote:
>>>> On Thu, Jul 20, 2023 at 03:32:35PM -0400, Joel Fernandes wrote:
>>>>> On 7/20/23 15:04, Paul E. McKenney wrote:
>>>>>> On Thu, Jul 20, 2023 at 12:31:13PM -0400, Joel Fernandes wrote:
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> On Thu, Jul 20, 2023 at 11:55 AM Paul E. McKenney
>>>>>>> <paulmck@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Thu, Jul 20, 2023 at 01:27:14PM +0000, Joel Fernandes wrote:
>>>>> [...]
>>>>>>>>>
>>>>>>>>> So likely RCU boosting is failing:
>>>>>>>>>
>>>>>>>>> The full TREE03 splat:
>>>>>>>>> [ 54.243588] ------------[ cut here ]------------
>>>>>>>>> [ 54.244547] rcu-torture: rcu_torture_boost started
>>>>> [...]
>>>>>>>>> [ 54.300499] RIP: 0010:rcu_torture_stats_print+0x5b2/0x620
>>>>> [...]
>>>>>>>>> [ 2169.481783] rcu_torture_writer: rtort_pipe_count: 9
>>>>>>>>>
>>>>>>>>> However, if we are to believe the '9', it appears the object did
>>>>>>>>> make it quite some way toward the end of the pipe array, but not
>>>>>>>>> all the way to the free pool.
>>>>>>>>
>>>>>>>> This is from this if/for statement, correct?
>>>>>>>>
>>>>>>>>	stutter_waited = stutter_wait("rcu_torture_writer");
>>>>>>>>	if (stutter_waited &&
>>>>>>>>	    !atomic_read(&rcu_fwd_cb_nodelay) &&
>>>>>>>>	    !cur_ops->slow_gps &&
>>>>>>>>	    !torture_must_stop() &&
>>>>>>>>	    boot_ended)
>>>>>>>>		for (i = 0; i < ARRAY_SIZE(rcu_tortures); i++)
>>>>>>>>			if (list_empty(&rcu_tortures[i].rtort_free) &&
>>>>>>>>			    rcu_access_pointer(rcu_torture_current) !=
>>>>>>>>			    &rcu_tortures[i]) {
>>>>>>>>				tracing_off();
>>>>>>>>				show_rcu_gp_kthreads();
>>>>>>>>				WARN(1, "%s: rtort_pipe_count: %d\n",
>>>>>>>>				     __func__, rcu_tortures[i].rtort_pipe_count);
>>>>>>>>				rcu_ftrace_dump(DUMP_ALL);
>>>>>>>>			}
>>>>>>>
>>>>>>> Yes, that's right.
>>>>>>>
>>>>>>>> If so, this happens when there was a stutter wait, but RCU grace
>>>>>>>> periods failed to clear out the backlog during the several seconds
>>>>>>>> that rcutorture was forced idle.  This might be related to the RCU
>>>>>>>> priority boosting failure, in which a preempted reader persisted
>>>>>>>> across the stutter interval.
>>>>>>>
>>>>>>> When RCU is operating normally, shouldn't the
>>>>>>> "(list_empty(&rcu_tortures[i].rtort_free)" check not run until the
>>>>>>> preempted reader unblocks and exits its RCU read-side critical
>>>>>>> section?
>>>>>>
>>>>>> Yes, but not just "until", but rather "long after".  If RCU is doing
>>>>>> grace periods correctly, an active reader on a given rcu_tortures[]
>>>>>> element will prevent .rtort_pipe_count from exceeding the value 2.
>>>>>
>>>>> Ah ok, so the rtort_pipe_count being 9 is a sign that RCU isn't making
>>>>> progress, which is what keeps the object off the free list.
>>>>
>>>> Yes, though RCU is -just- -barely- too slow, as one more grace period
>>>> would have done it.
>>>>
>>>>>> The element will not be put on a list until .rtort_pipe_count is equal
>>>>>> to RCU_TORTURE_PIPE_LEN, which is 10.
>>>>>>
>>>>>> This warning usually appears when something is holding up the
>>>>>> grace-period kthread.  Historically, this has included deadlocks,
>>>>>> missed timers, and whatever else can prevent the grace-period kthread
>>>>>> from running.
>>>>>
>>>>> Makes sense.
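
Just to make sure I have the pipe mechanics straight, here is my mental
model as a hand-written sketch; the struct and function names are made
up and this is simplified, not the actual rcutorture code. Each grace
period advances every outstanding object one slot along the pipe, and
only on reaching RCU_TORTURE_PIPE_LEN does the object go back to the
free list, so a count stuck at 9 means we came up exactly one grace
period short before the post-stutter check ran:

	#define RCU_TORTURE_PIPE_LEN 10

	struct rt_obj {
		int  rtort_pipe_count;	/* grace periods this object has seen so far */
		bool on_free_list;	/* stands in for list_add() in the real code */
	};

	/* Invoked once per grace period for each outstanding object. */
	static bool pipe_update_one(struct rt_obj *rp)
	{
		if (rp->rtort_pipe_count < RCU_TORTURE_PIPE_LEN)
			rp->rtort_pipe_count++;
		if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) {
			rp->on_free_list = true;
			return true;	/* object can be reused */
		}
		return false;		/* still in flight, needs more GPs */
	}
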
>>>>>>> One thing that confuses me: in the case of
>>>>>>> "cur_ops->deferred_free(old_rp);", the earlier do-while loop may exit
>>>>>>> before the async callbacks can finish.  So what prevents the
>>>>>>> "(list_empty(&rcu_tortures[i].rtort_free)" check from happening
>>>>>>> before grace periods happen?  Thanks for any clarification.
>>>>>>
>>>>>> We only enter this code if the stutter_wait() actually waited, and by
>>>>>> default this function will wait about five seconds.  Since the
>>>>>> rcutorture testing goes idle during this time period (or is supposed
>>>>>> to!), if things are working properly, knocking off ten grace periods
>>>>>> during that time should be pretty much a given.
>>>>>
>>>>> Sure, makes sense. And this is not Lazy-RCU, so 5 seconds should be
>>>>> plenty ;). I think I was subconsciously expecting an rcu_barrier()
>>>>> somewhere in the code before those checks, but that's not needed, as
>>>>> you pointed out that the stutter should be giving RCU enough time to
>>>>> make progress.
>>>>
>>>> And there might need to be a call_rcu_hurry() in there somewhere,
>>>> now that you mention it.  Which would pretty much defeat any sort of
>>>> lazy-RCU-callback testing in rcutorture, but testing of laziness might
>>>> need to be separate anyway.
>>>>
>>>>> So hmm, the count being 9 means that not enough RCU grace periods have
>>>>> passed for the rcu_torture object in question, thus keeping it
>>>>> allocated. The GP kthread not getting CPU time can do that indeed, or
>>>>> perhaps something else stalling RCU, like a preempted reader, lengthy
>>>>> preemption disabling on a CPU, and so forth. I'll try to collect a
>>>>> trace when it happens.
>>>>
>>>> Looking forward to seeing what you come up with!
>>>
>>> So far I have found this: before the crash, GPs took about 50ms; during
>>> the crash, it took 5 seconds before the warning fired, which aligns
>>> with what you mentioned about the stutter period.
>>>
>>> The GP that never completed is at this line:
>>>
>>> [ 2816.041082] <...>-13 13d.... 1237016139us : rcu_grace_period: rcu_sched 144681 start
>>>
>>> And the fqs loop got a "dti" for CPUs:
>>> 1
>>> 2
>>> 12
>>>
>>> And I see cpuqs for:
>>> 13
>>> 6
>>> 10
>>> 5
>>> 4
>>> 11
>>>
>>> No idea what happened to 3, 8, 9, 14, 15. Maybe the "dti" for those did
>>> not show up in the trace?
>>>
>>> However, I see that CPU 7 did this:
>>>
>>> [ 2816.205277] <idle>-0 7d.... 1237016284us : sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_torture_rea next_pid=149 next_prio=139
>>>
>>> and then did this about 3 seconds later:
>>>
>>> [ 2819.394111] rcu_tort-149 7dNs.. 1237025699us : rcu_grace_period: rcu_sched 144677 cpuend
>>> [ 2819.412745] rcu_tort-149 7dNs.. 1237025699us : rcu_grace_period: rcu_sched 144681 cpustart
>>>
>>> Which means it never switched out of the CPU all this while, nor did it
>>> report a cpuqs. It did at least notice the new GP started. Welcome to
>>> the party, mate ;)
>>>
>>> That points to the issue, I think. I wonder if the emergency provisions
>>> for forcing quiescent states on NOHZ_FULL CPUs kicked in. ISTR we had
>>> issues with that in the past, where we had to fix the tick path to
>>> report a QS.
>>>
>>> I'll add some more traces, especially around preempted readers and the
>>> emergency provisions for forcing a QS, and see if I can dig up more
>>> information.
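
For those extra traces, beyond the rcu_grace_period and sched_switch
events already enabled above, the plan is mostly to drop quick
trace_printk() markers into whichever path looks suspect and read them
back out of the trace buffer. A purely illustrative sketch; the helper
name, its placement, and the fields printed are all made up:

	/* Hypothetical marker dropped into a suspect path (debug only). */
	static void mark_suspect_path(const char *what)
	{
		trace_printk("%s: cpu=%d preempt_count=0x%x\n",
			     what, smp_processor_id(), preempt_count());
	}
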
>> In another instance, I caught a stack trace of CPU 5 when it did not
>> check in with RCU for 3 seconds or so:
>>
>> [ 1127.067889]  kmalloc_trace+0x25/0x90
>> [ 1127.072823]  rcu_torture_fwd_prog+0x3d8/0xa60
>> [ 1127.078749]  ? __pfx_rcu_torture_fwd_prog+0x10/0x10
>> [ 1127.085468]  ? kthread+0xcb/0xf0
>> [ 1127.090040]  ? __pfx_rcu_torture_fwd_prog+0x10/0x10
>> [ 1127.096626]  kthread+0xcb/0xf0
>> [ 1127.100803]  ? __pfx_kthread+0x10/0x10
>> [ 1127.107207]  ret_from_fork+0x2c/0x50
>> [ 1127.113662]  </TASK>
>> [ 1127.117818] Kernel panic - not syncing: kernel: panic_on_warn set ...
>>
>> It appears it was in rcu_torture_fwd_prog() at the time, and it got
>> stuck there for 2 seconds.
>>
>> Interestingly, I did find that the scheduler tick was still reaching
>> the CPU and even running code on it:
>>
>> [ 5858.463102] rcu_tort-174 5d.h.. 1131149320us : rcu_utilization: Start scheduler-tick
>> [ 5858.472903] rcu_tort-174 5d.h.. 1131149320us : rcu_utilization: End scheduler-tick
>>
>> So I could add some traces there to figure out what it is up to. ISTR
>> we also send IPIs to these CPUs? So maybe adding some traces there too
>> is in order.
>>
>> Also, earlier in dmesg it did this:
>>
>> [ 1124.825504] rcu_torture_fwd_prog n_max_cbs: 44181
>> [ 1124.832158] rcu_torture_fwd_prog: Starting forward-progress test 0
>> [ 1124.840977] rcu_torture_fwd_prog_cr: Starting forward-progress test 0
>> [ 1126.726621] ------------[ cut here ]------------
>> [ 1126.733296] rcutorture_oom_notify invoked upon OOM during forward-progress testing.
>>
>> Before that "cut here" there is a 2-second gap, which tells me it got
>> stuck there.
>>
>> So this function rcu_torture_fwd_prog() is keeping the CPU unresponsive
>> from an RCU perspective for some reason? Because I never saw the
>> "rcu_torture_fwd_prog_cr: Waiting for CBs" message before it panicked.
>
> Thank you for digging into this!
>
> Yes, rcu_torture_fwd_prog() is supposed to monopolize CPUs.  But it
> is supposed to respect stuttering.  It is quite possible that the
> synchronization is insufficient.  Or that the "supposed to" never
> made it into the code.
>
>> Collecting traces again.. ;) thanks,
>
> ;-)  ;-)  ;-)
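
On the "supposed to respect stuttering" point: agreed, and that is where
I will look first. For reference, the shape I would expect inside the
tight callback-flooding loops is roughly the below. This is a
hand-written sketch, not the actual rcutorture code, and
fwd_prog_do_one_unit() is a made-up stand-in for whatever work the real
loop does per iteration:

	/*
	 * Sketch of a forward-progress flood loop that periodically yields
	 * the CPU and honors stuttering, so that a non-preemptible kernel
	 * is not stuck in kernel mode for seconds at a time.
	 */
	static void fwd_prog_flood_sketch(unsigned long stop_at)
	{
		while (time_before(jiffies, stop_at) && !torture_must_stop()) {
			fwd_prog_do_one_unit();	/* hypothetical per-iteration work */

			/* Give the scheduler (and the GP kthread) a chance to run. */
			if (need_resched() || !(jiffies & 0xf))
				cond_resched();

			/* Honor stuttering so the system really does go idle. */
			stutter_wait("rcu_torture_fwd_prog");
		}
	}

I will check whether the real loops do something equivalent, or whether
the stutter and cond_resched() calls only happen at the outer level.
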
Coming back to the stall itself: I narrowed it down to the timer softirq
taking too long and doing this for 4 seconds:

[ 6781.411834] ksoftirq-12 0dNs.. 206231531us : sched_wakeup: comm=rcu_torture_rea pid=153 prio=139 target_cpu=000
[ 6781.424854] ksoftirq-12 0dNs.. 206231531us : sched_waking: comm=rcu_torture_rea pid=164 prio=139 target_cpu=006
[ 6781.437650] ksoftirq-12 0dNs.. 206231533us : sched_wakeup: comm=rcu_torture_rea pid=164 prio=139 target_cpu=000
[ 6781.451019] ksoftirq-12 0dNs.. 206231533us : sched_waking: comm=rcu_torture_rea pid=162 prio=139 target_cpu=002
[ 6781.463825] ksoftirq-12 0dNs.. 206231535us : sched_wakeup: comm=rcu_torture_rea pid=162 prio=139 target_cpu=000
[ 6781.476290] ksoftirq-12 0dNs.. 206231535us : sched_waking: comm=rcu_torture_wri pid=143 prio=139 target_cpu=004
[ 6781.489257] ksoftirq-12 0dNs.. 206231537us : sched_wakeup: comm=rcu_torture_wri pid=143 prio=139 target_cpu=000
[ 6781.502559] ksoftirq-12 0dNs.. 206231538us : sched_waking: comm=rcu_torture_rea pid=154 prio=139 target_cpu=006
[ 6781.515093] ksoftirq-12 0dNs.. 206231539us : sched_wakeup: comm=rcu_torture_rea pid=154 prio=139 target_cpu=000
[ 6781.527622] ksoftirq-12 0dNs.. 206231539us : sched_waking: comm=rcu_torture_fak pid=149 prio=139 target_cpu=003
[ 6781.539979] ksoftirq-12 0dNs.. 206231541us : sched_wakeup: comm=rcu_torture_fak pid=149 prio=139 target_cpu=000
[ 6781.552717] ksoftirq-12 0dNs.. 206231541us : sched_waking: comm=rcu_torture_fak pid=145 prio=139 target_cpu=005
[ 6781.565074] ksoftirq-12 0dNs.. 206231543us : sched_wakeup: comm=rcu_torture_fak pid=145 prio=139 target_cpu=000
[ 6781.577580] ksoftirq-12 0dNs.. 206231543us : sched_waking: comm=rcu_torture_rea pid=166 prio=139 target_cpu=003
[ 6781.590211] ksoftirq-12 0dNs.. 206231547us : sched_wakeup: comm=rcu_torture_rea pid=166 prio=139 target_cpu=000

And I caught the softirq exit, which told me it is coming from the timer
softirq:

[ 6781.603150] ksoftirq-12 0.Ns.. 206231552us : __do_softirq: softirq exit: name: TIMER, action handler: run_timer_softirq

Maybe something to do with the stuttering calls, or a lack of
cond_resched() in the torture code somewhere? Unfortunately I was not
able to get a stack trace showing what is doing the wakeups.

This is at least one of the issues. I do believe the fwd-prog functions
taking too long in the kernel without doing a cond_resched() is another
issue, but I need to reduce the RCU stall timeout to 2 seconds to see
those stalls happen; otherwise I think they just get "hidden".

Both of these issues appear to be threads spending too much time in
kernel mode on non-preemptible kernels. The RCU GP kthread was able to
get CPU time, so that wasn't the issue in these instances. Thoughts?

The next plan of action is to get sched_waking stack traces, since I
have a very reliable repro of this now (rough sketch of the approach in
the P.S. below).

thanks,

 - Joel
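
P.S. For the sched_waking stack traces, the rough idea is a temporary
debug hack along these lines. The helper name, its placement (somewhere
in the wakeup path), and the comm filter are all made up for
illustration, and this is not something I intend to merge:

	/*
	 * Dump the waker's stack into the ftrace buffer whenever one of
	 * the rcu_torture_* kthreads is woken from softirq context, so we
	 * can see which timer callback is doing all of these wakeups.
	 */
	static inline void debug_trace_torture_wakeup(struct task_struct *p)
	{
		if (in_serving_softirq() &&
		    !strncmp(p->comm, "rcu_torture", 11)) {
			trace_printk("waking %s/%d from softirq\n",
				     p->comm, p->pid);
			trace_dump_stack(0);
		}
	}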