On Thu, Jun 13, 2024 at 08:22:19PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 13, 2024 at 02:32:27PM -0400, Kent Overstreet wrote:

[ . . . ]

After sleeping on it, backing up a level of abstraction, and thus
perhaps responding a bit more constructively...

And also, thank you very much for looking into this!

> > commit 3bb5d34d6d6e2d5433cce5d248fd81211afedec0
> > Author: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > Date:   Mon Jun 10 20:47:03 2024 -0400
> > 
> >     bcachefs: pending_rcu_items
> > 
> >     Generic (?) data structure for explicitly tracking pending SRCU items
> > 
> >     - Can we make this generic across other RCU flavors?

It should be possible, at least for current RCU and SRCU.  I think I owe
you SRCU variants of a couple of existing RCU functions.  Later today!

> >     - Do we need to add a fallback to linked lists when darray allocation
> >       fails?

In general, as in kfree_rcu(), yes.  In your specific case here, maybe not.

> >     - We wish to avoid the overhead of start_poll_synchronize_srcu() if we
> >       already have items for a given sequence number, but it appears
> >       get_poll_synchronize_srcu(); start_poll_synchronize_srcu() can
> >       actually return different sequence numbers within the same read-side
> >       critical section.  Is this the correct way of doing this?

You can always buffer up items that have not yet been assigned a sequence
number, then after you have enough or after enough time has passed, do a
single get_poll_synchronize_srcu() for the group.

> >     - Do we need to ensure the rearming RCU callback is correctly flushed
> >       when exiting?  (Probably)

I believe so, and setting a flag then doing an srcu_barrier() should
suffice.

> >     Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > 
> > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > index ebed60a43647..46f7fa549d59 100644
> > --- a/fs/bcachefs/btree_key_cache.c
> > +++ b/fs/bcachefs/btree_key_cache.c
> > @@ -83,18 +83,109 @@ static bool bkey_cached_evict(struct btree_key_cache *c,
> >  	return ret;
> >  }
> >  
> > -static void __bkey_cached_free(struct rcu_head *rcu)
> > +#define RCU_SEQ_SKIP_SHIFT	2
> > +
> > +static void pending_rcu_items_exit(struct pending_rcu_items *pending)
> > +{
> > +	darray_exit(&pending->objs[0]);
> > +	darray_exit(&pending->objs[1]);
> > +}
> > +
> > +static void pending_rcu_items_init(struct pending_rcu_items *pending,
> > +				   struct srcu_struct *srcu,
> > +				   pending_rcu_item_process_fn process)
> > +{
> > +	memset(pending, 0, sizeof(*pending));
> > +	pending->srcu		= srcu;
> > +	pending->process	= process;
> > +	spin_lock_init(&pending->lock);
> > +}
> > +
> > +static bool pending_rcu_has_items(struct pending_rcu_items *pending)
> > +{
> > +	return pending->objs[0].nr || pending->objs[1].nr;
> > +}
> > +
> > +static void pending_rcu_items_process(struct pending_rcu_items *pending)
> >  {
> > -	struct bkey_cached *ck = container_of(rcu, struct bkey_cached, rcu);
> > +	while (pending_rcu_has_items(pending) &&
> > +	       poll_state_synchronize_srcu(pending->srcu, pending->seq)) {
> > +		darray_for_each(pending->objs[0], i)
> > +			pending->process(pending, *i);

Given that you have interrupts and/or bh disabled, there needs to be a
third set of callbacks, namely those whose grace periods have already
elapsed.  Unioning one of the ->objs[] sets onto this third set, leaving
the source ->objs[] set empty, really should be O(1), without the need
for memory allocation.  I might be missing something, but at first
glance darray does not support this.
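
Not a concrete proposal, just to show the shape of what I mean: if each
object carried an embedded list node (which of course gives up the
pointer-array representation that darray provides), moving an entire
expired set onto a "ready" list is a constant-time splice.  A very rough
and completely untested sketch, with all names invented for illustration:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct pending_rcu_set {
		unsigned long		seq;	/* cookie this set is waiting on */
		struct list_head	objs;	/* objects waiting on ->seq; each
						 * object would need an embedded
						 * struct list_head of its own */
	};

	struct pending_rcu_items_sketch {
		spinlock_t		lock;
		struct pending_rcu_set	sets[2];
		struct list_head	ready;	/* grace period already elapsed */
	};

	/* Caller holds ->lock.  O(1) no matter how many objects move. */
	static void pending_rcu_splice_ready(struct pending_rcu_items_sketch *p,
					     struct pending_rcu_set *set)
	{
		list_splice_tail_init(&set->objs, &p->ready);
	}

The ->ready list can then be drained outside the lock, however many
objects it happens to hold.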
The pages-of-pointers structure named kvfree_rcu_bulk_data, along with
the associated structures in kernel/rcu/tree.c, does this work, but
(1) is open-coded and (2) has to deal with the full range of what the
kernel can throw at it.  It falls back to an embedded rcu_head structure,
which answers your second question in the commit log.  But perhaps this
code could be generalized?  Or maybe darray can be appropriately extended?

And one thing that I missed the first time around is that you are using
a single ->seq for two sets of objects (->objs[0] and ->objs[1]).  Each
set needs its own sequence number.

> > +		darray_exit(&pending->objs[0]);
> > +		swap(pending->objs[0], pending->objs[1]);
> > +		pending->seq += 1 << RCU_SEQ_SKIP_SHIFT;
> > +	}
> > +}
> > +
> > +static void pending_rcu_items_rcu_cb(struct rcu_head *rcu)
> > +{
> > +	struct pending_rcu_items *pending =
> > +		container_of(rcu, struct pending_rcu_items, rcu);
> > +
> > +	unsigned long flags;
> > +	spin_lock_irqsave(&pending->lock, flags);
> > +	pending_rcu_items_process(pending);
> 
> SRCU callbacks execute in softirq context.  You therefore might need to
> push this call to pending_rcu_items_process() to a workqueue handler,
> at least if there are very many objects in need of processing.

Plus you have irqs disabled, which I somehow missed on the first pass.

Again, please union the objects whose grace period has ended into a set
that is ready to be processed, and then do the processing in a workqueue
handler, kthread, or similar.  This avoids causing real-time latency
problems, or, in extreme cases, softlockups or RCU CPU stall warnings.

> > -	kmem_cache_free(bch2_key_cache, ck);
> > +	pending->rcu_armed = pending_rcu_has_items(pending);
> > +	if (pending->rcu_armed)
> > +		call_srcu(pending->srcu, &pending->rcu, pending_rcu_items_rcu_cb);
> > +	spin_unlock_irqrestore(&pending->lock, flags);
> > +}
> > +
> > +static void pending_rcu_item_enqueue(struct pending_rcu_items *pending, void *obj)
> > +{
> > +	darray_voidp preallocated = {};
> > +retry:
> > +	spin_lock_irq(&pending->lock);
> > +	pending_rcu_items_process(pending);
> 
> Suppose that poll_state_synchronize_srcu() returns false so that
> start_poll_synchronize_srcu() does nothing...
> 
> ... then we get vCPU preemption here so that some number of SRCU grace
> periods elapse (which looks to be limited to just one of them, but that
> is enough because we might already have requested the next one courtesy of
> the previous call to either call_srcu() or start_poll_synchronize_srcu()) ...
> 
> > +	unsigned long seq = start_poll_synchronize_srcu(pending->srcu);
> 
> ... so that we get a seq that is 8 or 12 larger than pending->seq ...
> 
> > +	if (!pending_rcu_has_items(pending))
> > +		pending->seq = seq;
> 
> ... and we don't update pending->seq because there are still items there ...
> 
> > +	unsigned idx = (seq - pending->seq) >> RCU_SEQ_SKIP_SHIFT;
> > +	BUG_ON(idx > ARRAY_SIZE(pending->objs));

And more on this later today.

							Thanx, Paul

> ... then can't this BUG_ON() trigger?
> 
> And why this ugly reliance on the cookie format?  Why not just have
> the pending_rcu_items structure's ->seq be a two-element array?
> 
> Then all you need to do is to check each ->seq[] element to see if it
> is equal to seq, while remembering the index of the first that is empty.
> If you find a ->seq[] match, just enqueue into the corresponding ->objs[]
> element.  Otherwise, you are guaranteed to have found an empty ->objs[],
> at least if you move the start_poll_synchronize_srcu() to precede
> the pending_rcu_items_process().
> 
> *Much* looser coupling with SRCU and rather simpler.
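
To make that concrete, here is a completely untested sketch of the
bookkeeping I have in mind, assuming that ->seq becomes a two-element
array paralleling ->objs[] (the helper name is invented, and the push
itself plus the preallocation dance stay just as they are in your patch):

	/*
	 * Pick the ->objs[] set for an object whose grace-period cookie
	 * is @seq.  Caller holds ->lock and has already called
	 * start_poll_synchronize_srcu() to obtain @seq.
	 */
	static unsigned pending_rcu_pick_set(struct pending_rcu_items *pending,
					     unsigned long seq)
	{
		int empty = -1;
		unsigned i;

		for (i = 0; i < ARRAY_SIZE(pending->objs); i++) {
			if (pending->objs[i].nr && pending->seq[i] == seq)
				return i;	/* another object for this grace period */
			if (empty < 0 && !pending->objs[i].nr)
				empty = i;	/* remember the first unused set */
		}

		/*
		 * No match, so an empty set must exist, provided that
		 * start_poll_synchronize_srcu() was invoked before
		 * pending_rcu_items_process() on the enqueue path.
		 */
		BUG_ON(empty < 0);
		pending->seq[empty] = seq;
		return empty;
	}

No arithmetic on the cookie, so no dependence on its format.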
> > +	darray_voidp *objs = pending->objs + idx;
> > +
> > +	if (!darray_room(*objs)) {
> > +		if (preallocated.size > objs->size) {
> > +			memcpy(preallocated.data,
> > +			       objs->data,
> > +			       sizeof(objs->data[0]) * objs->nr);
> > +			swap(preallocated.data, objs->data);
> > +			swap(preallocated.size, objs->size);
> > +		} else if (darray_make_room_gfp(objs, 1, GFP_ATOMIC|__GFP_NOWARN)) {
> > +			spin_unlock_irq(&pending->lock);
> > +			darray_exit(&preallocated);
> > +			darray_make_room_gfp(&preallocated, objs->nr + 1, GFP_KERNEL|__GFP_NOFAIL);
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	BUG_ON(darray_push(objs, obj)); /* preallocated */
> > +
> > +	if (!pending->rcu_armed)
> > +		call_srcu(pending->srcu, &pending->rcu, pending_rcu_items_rcu_cb);
> 
> The start_poll_synchronize_srcu() above might have started a new
> SRCU grace period.  If so, this call_srcu() will start another one.
> Which should not be a problem.  If it does prove to be, you might use
> get_state_synchronize_srcu() instead, depending.
> 
> On the other hand, if this code is prone to bkey_cached_free() floods,
> the current combination of start_poll_synchronize_srcu() and call_srcu()
> might avoid OOM, given that this function checks for pending objects
> whose grace period has ended.
> 
> > +	pending->rcu_armed = true;
> > +	spin_unlock_irq(&pending->lock);
> > +
> > +	darray_exit(&preallocated);
> > +}
> > +
> > +static void __bkey_cached_free(struct pending_rcu_items *pending, void *obj)
> > +{
> > +	struct bch_fs *c = container_of(pending->srcu, struct bch_fs, btree_trans_barrier);
> > +
> > +	this_cpu_dec(*c->btree_key_cache.nr_pending);
> > +	kmem_cache_free(bch2_key_cache, obj);
> >  }
> >  
> >  static void bkey_cached_free(struct btree_key_cache *bc,
> >  			     struct bkey_cached *ck)
> >  {
> > -	struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
> > -
> >  	kfree(ck->k);
> >  	ck->k		= NULL;
> >  	ck->u64s	= 0;
> > 
> > @@ -102,7 +193,15 @@ static void bkey_cached_free(struct btree_key_cache *bc,
> >  	six_unlock_write(&ck->c.lock);
> >  	six_unlock_intent(&ck->c.lock);
> >  
> > -	call_srcu(&c->btree_trans_barrier, &ck->rcu, __bkey_cached_free);
> > +	preempt_disable();
> > +	struct pending_rcu_items *pending = this_cpu_ptr(bc->pending);
> > +	preempt_enable();
> > +	/*
> > +	 * pending is protected by a lock, we don't need preemption disabled -
> > +	 * and it expects to run in sleepable context:
> > +	 */
> > +	pending_rcu_item_enqueue(pending, ck);
> > +	this_cpu_inc(*bc->nr_pending);
> >  }
> >  
> >  static struct bkey_cached *
> > @@ -778,6 +877,13 @@ void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
> >  
> >  	if (bc->table_init_done)
> >  		rhashtable_destroy(&bc->table);
> > +
> > +	free_percpu(bc->nr_pending);
> > +
> > +	int cpu;
> > +	for_each_possible_cpu(cpu)
> > +		pending_rcu_items_exit(per_cpu_ptr(bc->pending, cpu));
> > +	free_percpu(bc->pending);
> >  }
> >  
> >  void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
> > @@ -789,6 +895,20 @@ int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
> >  	struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
> >  	struct shrinker *shrink;
> >  
> > +#ifdef __KERNEL__
> > +	bc->pending = alloc_percpu(struct pending_rcu_items);
> > +	if (!bc->pending)
> > +		return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> > +#endif
> > +	int cpu;
> > +	for_each_possible_cpu(cpu)
> > +		pending_rcu_items_init(per_cpu_ptr(bc->pending, cpu),
> > +				       &c->btree_trans_barrier, __bkey_cached_free);
> > +
> > +	bc->nr_pending = alloc_percpu(size_t);
> > +	if (!bc->nr_pending)
> > +		return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> > +
> >  	if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
> >  		return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> >  
> > @@ -815,13 +935,15 @@ void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *
> >  	prt_printf(out, "keys:\t%lu\r\n",	atomic_long_read(&bc->nr_keys));
> >  	prt_printf(out, "dirty:\t%lu\r\n",	atomic_long_read(&bc->nr_dirty));
> >  	prt_printf(out, "table size:\t%u\r\n",	bc->table.tbl->size);
> > -
> > -	prt_printf(out, "\nshrinker:\n");
> > +	prt_newline(out);
> > +	prt_printf(out, "shrinker:\n");
> >  	prt_printf(out, "requested_to_free:\t%lu\r\n", bc->requested_to_free);
> >  	prt_printf(out, "freed:\t%lu\r\n", bc->freed);
> >  	prt_printf(out, "skipped_dirty:\t%lu\r\n", bc->skipped_dirty);
> >  	prt_printf(out, "skipped_accessed:\t%lu\r\n", bc->skipped_accessed);
> >  	prt_printf(out, "skipped_lock_fail:\t%lu\r\n", bc->skipped_lock_fail);
> > +	prt_newline(out);
> > +	prt_printf(out, "pending:\t%lu\r\n", per_cpu_sum(bc->nr_pending));
> >  }
> >  
> >  void bch2_btree_key_cache_exit(void)
> > diff --git a/fs/bcachefs/btree_key_cache_types.h b/fs/bcachefs/btree_key_cache_types.h
> > index e026c65f54e1..8d11cbac4c37 100644
> > --- a/fs/bcachefs/btree_key_cache_types.h
> > +++ b/fs/bcachefs/btree_key_cache_types.h
> > @@ -2,12 +2,29 @@
> >  #ifndef _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
> >  #define _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
> >  
> > +#include "darray.h"
> > +
> > +struct pending_rcu_items;
> > +typedef void (*pending_rcu_item_process_fn)(struct pending_rcu_items *, void *);
> > +
> > +struct pending_rcu_items {
> > +	struct srcu_struct	*srcu;
> > +	pending_rcu_item_process_fn process;
> > +	spinlock_t		lock;
> > +	bool			rcu_armed;
> > +	unsigned long		seq;
> > +	darray_voidp		objs[2];
> > +	struct rcu_head		rcu;
> > +};
> > +
> >  struct btree_key_cache {
> >  	struct rhashtable table;
> >  	bool table_init_done;
> >  
> >  	struct shrinker *shrink;
> >  	unsigned shrink_iter;
> > +	struct pending_rcu_items __percpu *pending;
> > +	size_t __percpu *nr_pending;
> >  
> >  	atomic_long_t nr_keys;
> >  	atomic_long_t nr_dirty;
> > diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
> > index 67f3efd9a007..56046f09ac65 100644
> > --- a/fs/bcachefs/btree_types.h
> > +++ b/fs/bcachefs/btree_types.h
> > @@ -396,8 +396,6 @@ struct bkey_cached {
> >  	u64			seq;
> >  
> >  	struct bkey_i		*k;
> > -
> > -	struct rcu_head		rcu;
> >  };
> >  
> >  static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
> > diff --git a/fs/bcachefs/darray.h b/fs/bcachefs/darray.h
> > index 4b340d13caac..cdace10f4572 100644
> > --- a/fs/bcachefs/darray.h
> > +++ b/fs/bcachefs/darray.h
> > @@ -19,6 +19,7 @@ struct {						\
> >  
> >  #define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
> >  
> > +typedef DARRAY(void *)	darray_voidp;
> >  typedef DARRAY(char)	darray_char;
> >  typedef DARRAY(char *)	darray_str;
> > 