On Thu, Jul 20, 2023 at 03:32:35PM -0400, Joel Fernandes wrote:
> On 7/20/23 15:04, Paul E. McKenney wrote:
> > On Thu, Jul 20, 2023 at 12:31:13PM -0400, Joel Fernandes wrote:
> >> Hi Paul,
> >>
> >> On Thu, Jul 20, 2023 at 11:55 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>
> >>> On Thu, Jul 20, 2023 at 01:27:14PM +0000, Joel Fernandes wrote:
> [...]
> >>>>
> >>>> So likely RCU boosting is failing:
> >>>>
> >>>> The full TREE03 splat:
> >>>> [   54.243588] ------------[ cut here ]------------
> >>>> [   54.244547] rcu-torture: rcu_torture_boost started
> [...]
> >>>> [   54.300499] RIP: 0010:rcu_torture_stats_print+0x5b2/0x620
> [...]
> >>>> [ 2169.481783] rcu_torture_writer: rtort_pipe_count: 9
> >>>>
> >>>> However, if we are to believe the '9', it appears the object did make it
> >>>> quite some way till the end of the pipe array, but not into the free pool.
> >>>
> >>> This is from this if/for statement, correct?
> >>>
> >>>	stutter_waited = stutter_wait("rcu_torture_writer");
> >>>	if (stutter_waited &&
> >>>	    !atomic_read(&rcu_fwd_cb_nodelay) &&
> >>>	    !cur_ops->slow_gps &&
> >>>	    !torture_must_stop() &&
> >>>	    boot_ended)
> >>>		for (i = 0; i < ARRAY_SIZE(rcu_tortures); i++)
> >>>			if (list_empty(&rcu_tortures[i].rtort_free) &&
> >>>			    rcu_access_pointer(rcu_torture_current) !=
> >>>			    &rcu_tortures[i]) {
> >>>				tracing_off();
> >>>				show_rcu_gp_kthreads();
> >>>				WARN(1, "%s: rtort_pipe_count:
> >>>				rcu_ftrace_dump(DUMP_ALL);
> >>>			}
> >>
> >> Yes, that's right.
> >>
> >>> If so, this happens when there was a stutter wait, but RCU grace
> >>> periods failed to clear out the backlog during the several seconds that
> >>> rcutorture was forced idle.  This might be related to the RCU priority
> >>> boosting failure, in which a preempted reader persisted across the
> >>> stutter interval.
> >>
> >> When RCU is operating normally, shouldn't the check
> >> "(list_empty(&rcu_tortures[i].rtort_free)" not run until the preempted
> >> reader unblocks and exits its RCU read-side critical section?
> >
> > Yes, but not just "until", but rather "long after".  If RCU is doing
> > grace periods correctly, an active reader on a given rcu_tortures[]
> > element will prevent .rtort_pipe_count from exceeding the value 2.
>
> Ah ok, so the rtort_pipe_count being 9 is a sign RCU isn't making progress,
> thus making it absent from the free list.

Yes, though RCU is -just- -barely- too slow, as one more grace period
would have done it.

> > The element will not be put on a list until .rtort_pipe_count is equal
> > to RCU_TORTURE_PIPE_LEN, which is 10.
> >
> > This warning usually appears when something is holding up the grace-period
> > kthread.  Historically, this has included deadlocks, missed timers,
> > and whatever else can prevent the grace-period kthread from running.
>
> Makes sense.
>
> >> One thing that confuses me: in the case of
> >> "cur_ops->deferred_free(old_rp);", the earlier do-while loop may exit
> >> before the async callbacks can finish.  So what prevents the
> >> "(list_empty(&rcu_tortures[i].rtort_free)" check from happening before
> >> grace periods happen?  Thanks for any clarification.
> >
> > We only enter this code if the stutter_wait() actually waited, and by
> > default this function will wait about five seconds.  Since the rcutorture
> > testing goes idle during this time period (or is supposed to!), if things
> > are working properly, knocking off ten grace periods during that time
> > should be pretty much a given.
>
> Sure, makes sense.  And this is not Lazy-RCU, so 5 seconds should be plenty
> ;).  I think I was subconsciously expecting an rcu_barrier() somewhere in
> the code before those checks, but that's not needed, as you pointed out
> that the stutter should give RCU enough time to make progress.
And there might need to be a call_rcu_hurry() in there somewhere, now
that you mention it.  Which would pretty much defeat any sort of
lazy-RCU-callback testing in rcutorture, but testing of laziness might
need to be separate anyway.

> So hmm, the count being 9 means that not enough RCU grace periods have
> passed for the rcu_torture object in question, thus keeping it always
> allocated.  The GP thread not getting CPU time can do that indeed, or
> perhaps something else stalling RCU, like a preempted reader, lengthy
> preemption disabling on a CPU, and so forth.  I'll try to collect a
> trace when it happens.

Looking forward to seeing what you come up with!

							Thanx, Paul