On Mon, Oct 18, 2021 at 07:42:42PM +0200, Frederic Weisbecker wrote:
> On Mon, Oct 18, 2021 at 09:18:14AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 18, 2021 at 01:32:59PM +0200, Frederic Weisbecker wrote:
> > > When an rdp is in the process of (de-)offloading, rcu_core() and the
> > > nocb kthreads can process callbacks at the same time. This leaves many
> > > possible scenarios leading to an rcu barrier to execute before
> > > the preceding callbacks. Here is one such example:
> > >
> > >             CPU 0                              CPU 1
> > >        --------------                     ---------------
> > >      call_rcu(callbacks1)
> > >      call_rcu(callbacks2)
> > >      // move callbacks1 and callbacks2 on the done list
> > >      rcu_advance_callbacks()
> > >      call_rcu(callbacks3)
> > >      rcu_barrier_func()
> > >          rcu_segcblist_entrain(...)
> > >                                           nocb_cb_wait()
> > >                                               rcu_do_batch()
> > >                                                   callbacks1()
> > >                                                   cond_resched_tasks_rcu_qs()
> > >      // move callbacks3 and rcu_barrier_callback()
> > >      // on the done list
> > >      rcu_advance_callbacks()
> > >      rcu_core()
> > >          rcu_do_batch()
> > >              callbacks3()
> > >              rcu_barrier_callback()
> > >                                                   //MISORDERING
> > >                                                   callbacks2()
> > >
> > > Fix this with preventing two concurrent rcu_do_batch() on a same rdp
> > > as long as an rcu barrier callback is pending somewhere.
> > >
> > > Reported-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
> > > Cc: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> > > Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
> > > Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> > > Cc: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
> > > Cc: Uladzislau Rezki <urezki@xxxxxxxxx>
> >
> > Yow!
> >
> > But how does the (de-)offloading procedure's acquisition of
> > rcu_state.barrier_mutex play into this? In theory, that mutex was
> > supposed to prevent these sorts of scenarios. In practice, it sounds
> > like the shortcomings in this theory should be fully explained so that
> > we don't get similar bugs in the future. ;-)
>
> I think you're right. The real issue is something I wanted to
> fix next: RCU_SEGCBLIST_RCU_CORE isn't cleared when nocb is enabled on
> boot so rcu_core() always run concurrently with nocb kthreads in TREE04,
> without holding rcu_barrier mutex of course (I mean with the latest patchset).

That would do it!

> Ok forget this patch, I'm testing again with simply clearing
> RCU_SEGCBLIST_RCU_CORE on boot.

Sounds good, looking forward to it!

							Thanx, Paul
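
For anyone following along, the interleaving in the quoted changelog can be replayed with a toy userspace model. This is only a sketch under assumptions, not kernel code: every name below (`toy_list`, `toy_enqueue`, `toy_extract`, `toy_invoke`, `toy_replay`) is hypothetical, the two CPUs are simulated by running the steps sequentially in the order shown in the diagram, and callbacks are represented by single characters ('1', '2', '3', and 'B' for the barrier callback).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_MAX 8

struct toy_list {
	char cbs[TOY_MAX];	/* one character per callback, in queue order */
	size_t len;
};

/* Append one callback to a list. */
static void toy_enqueue(struct toy_list *l, char cb)
{
	assert(l->len < TOY_MAX);
	l->cbs[l->len++] = cb;
}

/* Move every callback from src to the tail of dst, leaving src empty. */
static void toy_extract(struct toy_list *dst, struct toy_list *src)
{
	assert(dst->len + src->len <= TOY_MAX);
	memcpy(dst->cbs + dst->len, src->cbs, src->len);
	dst->len += src->len;
	src->len = 0;
}

/* "Invoke" up to max callbacks from the head of batch, recording them in log. */
static void toy_invoke(struct toy_list *batch, size_t max, char *log)
{
	size_t n = batch->len < max ? batch->len : max;
	size_t used = strlen(log);

	memcpy(log + used, batch->cbs, n);
	log[used + n] = '\0';
	memmove(batch->cbs, batch->cbs + n, batch->len - n);
	batch->len -= n;
}

/* Replay the diagram; log must have room for TOY_MAX + 1 bytes. */
void toy_replay(char *log)
{
	struct toy_list pending = {0}, done = {0};
	struct toy_list nocb_batch = {0}, core_batch = {0};

	log[0] = '\0';

	/* CPU 0: queue callbacks1 and callbacks2, then advance them to done. */
	toy_enqueue(&pending, '1');
	toy_enqueue(&pending, '2');
	toy_extract(&done, &pending);

	/* CPU 0: queue callbacks3 and entrain the barrier callback. */
	toy_enqueue(&pending, '3');
	toy_enqueue(&pending, 'B');

	/* CPU 1: the nocb side takes the done list, runs callbacks1, yields. */
	toy_extract(&nocb_batch, &done);
	toy_invoke(&nocb_batch, 1, log);

	/* CPU 0: advance callbacks3 and the barrier, then run a second batch. */
	toy_extract(&done, &pending);
	toy_extract(&core_batch, &done);
	toy_invoke(&core_batch, TOY_MAX, log);

	/* CPU 1: resumes its older batch and only now runs callbacks2. */
	toy_invoke(&nocb_batch, TOY_MAX, log);
}
```

Calling `toy_replay()` on a buffer of `TOY_MAX + 1` bytes leaves the string "13B2" in it: the barrier marker lands before callback 2 even though callback 2 was queued first. In this model the misordering needs two batches in flight on the same list at once, so it disappears if the second `toy_extract()` into `core_batch` is not allowed to happen while `nocb_batch` is still non-empty.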