On Sat, Jun 08, 2024 at 08:25:53PM -0700, Paul E. McKenney wrote:
> There is a grace period in progress ("read state: 1") and that grace
> period is the last one that has been requested ("gp state: 573/576").
>
> Had there been callbacks pending, there would have been a warning from
> "if (WARN_ON(rcu_segcblist_n_cbs(&sdp->srcu_cblist)))", so srcu_barrier()
> having no effect is expected behavior.  Which also suggests that the
> unfinished grace period was started by start_poll_synchronize_srcu().

I'm surprised that srcu_barrier() has no effect; I would have expected
the underlying machinery to be the same for explicit callbacks/barriers
as well as polling, so I think I'm missing something.

It sounds like something's not getting kicked, and if you say
srcu_barrier() is expected to have no effect then that seems to imply
there's something else I should be calling?

> Could you please try something like this just before the call to
> cleanup_srcu_struct()?
>
> 	WARN_ON_ONCE(poll_state_synchronize_srcu(&c->btree_trans_barrier, ck->btree_trans_barrier_seq));

Added, I'll check the results in the morning but they'll be here:
https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs-testing

> If there is some chance that start_poll_synchronize_srcu() was never
> ever invoked, this check will of course need some additional help.

start_poll_synchronize_srcu() is the only thing that version of my code
uses.

> I am curious about your use of ULONG_CMP_GE() on return values from
> different calls to start_poll_synchronize_srcu(), but that is not urgent.

The freelists are intended to be in the order in which they can be
reclaimed - is that not actually a sequence number?

I'm actually in the process of redoing (and simplifying) that code.
Basically, the code is supposed to be tracking objects pending freeing
in exactly the same manner in which RCU tracks pending callbacks -
except that by doing it ourselves we can allocate from those pending
lists and not be hosed if reclaim is delayed because of an srcu lock
held too long.

As an aside - I've been considering ripping that out and just freeing
objects via call_srcu(), it would definitely simplify things, but some
workloads cycle through a _lot_ of these objects and memory reclaim
stalling is a real concern. And after I redo it, it should be if
anything slightly more efficient than freeing objects via call_srcu()
like normal (elimination of indirect function calls), so perhaps a
technique we'll want to keep in mind.
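
To make the scheme above concrete, here is a rough userspace sketch. It
is not the bcachefs code: struct fake_gp, fake_poll_state(),
pending_free() and pending_alloc() are hypothetical names, and
fake_poll_state() only stands in for poll_state_synchronize_srcu().
It also assumes that cookies are ordered like sequence numbers, which
is exactly the open question in the quoted text above. SEQ_CMP_GE()
copies the wraparound-safe comparison that ULONG_CMP_GE() does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Same wraparound-safe comparison that the kernel's ULONG_CMP_GE() does. */
#define SEQ_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

struct obj {
	struct obj	*next;
	unsigned long	free_seq;	/* cookie recorded when the object was freed */
};

/* FIFO of objects waiting out their grace period, oldest at the head. */
struct pending_list {
	struct obj	*head, *tail;
};

/* Hypothetical stand-in for SRCU state: the newest completed cookie. */
struct fake_gp {
	unsigned long	completed;
};

/* Stand-in for poll_state_synchronize_srcu(): has this cookie's GP ended? */
static int fake_poll_state(const struct fake_gp *gp, unsigned long cookie)
{
	return SEQ_CMP_GE(gp->completed, cookie);
}

/* "Free" an object: queue it at the tail, tagged with its cookie. */
static void pending_free(struct pending_list *l, struct obj *o,
			 unsigned long cookie)
{
	/* Assumption: later cookies never compare older than earlier ones. */
	assert(!l->tail || SEQ_CMP_GE(cookie, l->tail->free_seq));

	o->free_seq = cookie;
	o->next = NULL;
	if (l->tail)
		l->tail->next = o;
	else
		l->head = o;
	l->tail = o;
}

/*
 * Allocate from the pending list: only the head needs checking, because
 * the list is kept in reclaim order. No callback, no indirect call.
 */
static struct obj *pending_alloc(struct pending_list *l,
				 const struct fake_gp *gp)
{
	struct obj *o = l->head;

	if (!o || !fake_poll_state(gp, o->free_seq))
		return NULL;

	l->head = o->next;
	if (!l->head)
		l->tail = NULL;
	o->next = NULL;
	return o;
}
```

If a grace period stalls, pending_alloc() just returns NULL and the
caller falls back to a normal allocation; nothing blocks waiting for a
callback to run. The ordering assert in pending_free() is the part that
breaks if cookies turn out not to be comparable this way.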