On Thu, Jun 13, 2024 at 08:22:19PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 13, 2024 at 02:32:27PM -0400, Kent Overstreet wrote:

[ . . . ]

After sleeping on it, backing up a level of abstraction, and thus
perhaps responding a bit more constructively...

And also, thank you very much for looking into this!

> > commit 3bb5d34d6d6e2d5433cce5d248fd81211afedec0
> > Author: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > Date:   Mon Jun 10 20:47:03 2024 -0400
> > 
> >     bcachefs: pending_rcu_items
> > 
> >     Generic (?) data structure for explicitly tracking pending SRCU items
> > 
> >     - Can we make this generic across other RCU flavors?

It should be possible, at least for current RCU and SRCU.  I think I owe
you SRCU variants of a couple of existing RCU functions.  Later today!

> >     - Do we need to add a fallback to linked lists when darray allocation
> >       fails?

In general, as in kfree_rcu(), yes.  In your specific case here, maybe not.

> >     - We wish to avoid the overhead of start_poll_synchronize_srcu() if we
> >       already have items for a given sequence number, but it appears
> >       get_poll_synchronize_srcu(); start_poll_synchronize_srcu() can
> >       actually return different sequence numbers within the same read-side
> >       critical section.  Is this the correct way of doing this?

You can always buffer up items that have not yet been assigned a sequence
number, then after you have enough or after enough time has passed, do a
single get_poll_synchronize_srcu() for the group.

> >     - Do we need to ensure the rearming RCU callback is correctly flushed
> >       when exiting?  (Probably)

I believe so, and setting a flag then doing an srcu_barrier() should
suffice.

> >     Signed-off-by: Kent Overstreet <kent.overstreet@xxxxxxxxx>
> > 
> > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > index ebed60a43647..46f7fa549d59 100644
> > --- a/fs/bcachefs/btree_key_cache.c
> > +++ b/fs/bcachefs/btree_key_cache.c
> > @@ -83,18 +83,109 @@ static bool bkey_cached_evict(struct btree_key_cache *c,
> >  	return ret;
> >  }
> >  
> > -static void __bkey_cached_free(struct rcu_head *rcu)
> > +#define RCU_SEQ_SKIP_SHIFT	2
> > +
> > +static void pending_rcu_items_exit(struct pending_rcu_items *pending)
> > +{
> > +	darray_exit(&pending->objs[0]);
> > +	darray_exit(&pending->objs[1]);
> > +}
> > +
> > +static void pending_rcu_items_init(struct pending_rcu_items *pending,
> > +				   struct srcu_struct *srcu,
> > +				   pending_rcu_item_process_fn process)
> > +{
> > +	memset(pending, 0, sizeof(*pending));
> > +	pending->srcu		= srcu;
> > +	pending->process	= process;
> > +	spin_lock_init(&pending->lock);
> > +}
> > +
> > +static bool pending_rcu_has_items(struct pending_rcu_items *pending)
> > +{
> > +	return pending->objs[0].nr || pending->objs[1].nr;
> > +}
> > +
> > +static void pending_rcu_items_process(struct pending_rcu_items *pending)
> >  {
> > -	struct bkey_cached *ck = container_of(rcu, struct bkey_cached, rcu);
> > +	while (pending_rcu_has_items(pending) &&
> > +	       poll_state_synchronize_srcu(pending->srcu, pending->seq)) {
> > +		darray_for_each(pending->objs[0], i)
> > +			pending->process(pending, *i);

Given that you have interrupts and/or bh disabled, there needs to be a
third set of callbacks, namely those whose grace periods have already
elapsed.  Unioning one of the ->objs[] sets onto this third set, leaving
the source ->objs[] set empty, really should be O(1), without the need
for memory allocation.  I might be missing something, but at first
glance darray does not support this.
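
Not a concrete proposal, just to show the shape of what I mean: if each
object carried an embedded list node (which of course gives up the
pointer-array representation that darray provides), moving an entire
expired set onto a "ready" list is a constant-time splice.  A very rough
and completely untested sketch, with all names invented for illustration:

	#include <linux/list.h>
	#include <linux/spinlock.h>

	struct pending_rcu_set {
		unsigned long		seq;	/* cookie this set is waiting on */
		struct list_head	objs;	/* objects waiting on ->seq; each
						 * object would need an embedded
						 * struct list_head of its own */
	};

	struct pending_rcu_items_sketch {
		spinlock_t		lock;
		struct pending_rcu_set	sets[2];
		struct list_head	ready;	/* grace period already elapsed */
	};

	/* Caller holds ->lock.  O(1) no matter how many objects move. */
	static void pending_rcu_splice_ready(struct pending_rcu_items_sketch *p,
					     struct pending_rcu_set *set)
	{
		list_splice_tail_init(&set->objs, &p->ready);
	}

The ->ready list can then be drained outside the lock, however many
objects it happens to hold.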
The pages-of-pointers structure named kvfree_rcu_bulk_data, along with
the associated structures in kernel/rcu/tree.c, does this work, but
(1) is open-coded and (2) has to deal with the full range of what the
kernel can throw at it.  It falls back to an embedded rcu_head structure,
which answers your second question in the commit log.  But perhaps this
code could be generalized?  Or maybe darray can be appropriately extended?

And one thing that I missed the first time around is that you are using
a single ->seq for two sets of objects (->objs[0] and ->objs[1]).  Each
set needs its own sequence number.

> > +		darray_exit(&pending->objs[0]);
> > +		swap(pending->objs[0], pending->objs[1]);
> > +		pending->seq += 1 << RCU_SEQ_SKIP_SHIFT;
> > +	}
> > +}
> > +
> > +static void pending_rcu_items_rcu_cb(struct rcu_head *rcu)
> > +{
> > +	struct pending_rcu_items *pending =
> > +		container_of(rcu, struct pending_rcu_items, rcu);
> > +
> > +	unsigned long flags;
> > +	spin_lock_irqsave(&pending->lock, flags);
> > +	pending_rcu_items_process(pending);
> 
> SRCU callbacks execute in softirq context.  You therefore might need to
> push this call to pending_rcu_items_process() to a workqueue handler,
> at least if there are very many objects in need of processing.

Plus you have irqs disabled, which I somehow missed on the first pass.

Again, please union the objects whose grace period has ended into a set
that is ready to be processed, and then do the processing in a workqueue
handler, kthread, or similar.  This avoids causing real-time latency
problems, or, in extreme cases, softlockups or RCU CPU stall warnings.

> > -	kmem_cache_free(bch2_key_cache, ck);
> > +	pending->rcu_armed = pending_rcu_has_items(pending);
> > +	if (pending->rcu_armed)
> > +		call_srcu(pending->srcu, &pending->rcu, pending_rcu_items_rcu_cb);
> > +	spin_unlock_irqrestore(&pending->lock, flags);
> > +}
> > +
> > +static void pending_rcu_item_enqueue(struct pending_rcu_items *pending, void *obj)
> > +{
> > +	darray_voidp preallocated = {};
> > +retry:
> > +	spin_lock_irq(&pending->lock);
> > +	pending_rcu_items_process(pending);
> 
> Suppose that poll_state_synchronize_srcu() returns false so that
> start_poll_synchronize_srcu() does nothing...
> 
> ... then we get vCPU preemption here so that some number of SRCU grace
> periods elapse (which looks to be limited to just one of them, but that
> is enough because we might already have requested the next one courtesy of
> the previous call to either call_srcu() or start_poll_synchronize_srcu()) ...
> 
> > +	unsigned long seq = start_poll_synchronize_srcu(pending->srcu);
> 
> ... so that we get a seq that is 8 or 12 larger than pending->seq ...
> 
> > +	if (!pending_rcu_has_items(pending))
> > +		pending->seq = seq;
> 
> ... and we don't update pending->seq because there are still items there ...
> 
> > +	unsigned idx = (seq - pending->seq) >> RCU_SEQ_SKIP_SHIFT;
> > +	BUG_ON(idx > ARRAY_SIZE(pending->objs));

And more on this later today.

							Thanx, Paul

> ... then can't this BUG_ON() trigger?
> 
> And why this ugly reliance on the cookie format?  Why not just have
> the pending_rcu_items structure's ->seq be a two-element array?
> 
> Then all you need to do is to check each ->seq[] element to see if it
> is equal to seq, while remembering the index of the first that is empty.
> If you find a ->seq[] match, just enqueue into the corresponding ->objs[]
> element.  Otherwise, you are guaranteed to have found an empty ->objs[],
> at least if you move the start_poll_synchronize_srcu() to precede
> the pending_rcu_items_process().
> 
> *Much* looser coupling with SRCU and rather simpler.
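
To make that concrete, here is a completely untested sketch of the
bookkeeping I have in mind, assuming that ->seq becomes a two-element
array paralleling ->objs[] (the helper name is invented, and the push
itself plus the preallocation dance stay just as they are in your patch):

	/*
	 * Pick the ->objs[] set for an object whose grace-period cookie
	 * is @seq.  Caller holds ->lock and has already called
	 * start_poll_synchronize_srcu() to obtain @seq.
	 */
	static unsigned pending_rcu_pick_set(struct pending_rcu_items *pending,
					     unsigned long seq)
	{
		int empty = -1;
		unsigned i;

		for (i = 0; i < ARRAY_SIZE(pending->objs); i++) {
			if (pending->objs[i].nr && pending->seq[i] == seq)
				return i;	/* another object for this grace period */
			if (empty < 0 && !pending->objs[i].nr)
				empty = i;	/* remember the first unused set */
		}

		/*
		 * No match, so an empty set must exist, provided that
		 * start_poll_synchronize_srcu() was invoked before
		 * pending_rcu_items_process() on the enqueue path.
		 */
		BUG_ON(empty < 0);
		pending->seq[empty] = seq;
		return empty;
	}

No arithmetic on the cookie, so no dependence on its format.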
> > +	darray_voidp *objs = pending->objs + idx;
> > +
> > +	if (!darray_room(*objs)) {
> > +		if (preallocated.size > objs->size) {
> > +			memcpy(preallocated.data,
> > +			       objs->data,
> > +			       sizeof(objs->data[0]) * objs->nr);
> > +			swap(preallocated.data, objs->data);
> > +			swap(preallocated.size, objs->size);
> > +		} else if (darray_make_room_gfp(objs, 1, GFP_ATOMIC|__GFP_NOWARN)) {
> > +			spin_unlock_irq(&pending->lock);
> > +			darray_exit(&preallocated);
> > +			darray_make_room_gfp(&preallocated, objs->nr + 1, GFP_KERNEL|__GFP_NOFAIL);
> > +			goto retry;
> > +		}
> > +	}
> > +
> > +	BUG_ON(darray_push(objs, obj)); /* preallocated */
> > +
> > +	if (!pending->rcu_armed)
> > +		call_srcu(pending->srcu, &pending->rcu, pending_rcu_items_rcu_cb);
> 
> The start_poll_synchronize_srcu() above might have started a new
> SRCU grace period.  If so, this call_srcu() will start another one.
> Which should not be a problem.  If it does prove to be, you might use
> get_state_synchronize_srcu() instead, depending.
> 
> On the other hand, if this code is prone to bkey_cached_free() floods,
> the current combination of start_poll_synchronize_srcu() and call_srcu()
> might avoid OOM, given that this function checks for pending objects
> whose grace period has ended.
> 
> > +	pending->rcu_armed = true;
> > +	spin_unlock_irq(&pending->lock);
> > +
> > +	darray_exit(&preallocated);
> > +}
> > +
> > +static void __bkey_cached_free(struct pending_rcu_items *pending, void *obj)
> > +{
> > +	struct bch_fs *c = container_of(pending->srcu, struct bch_fs, btree_trans_barrier);
> > +
> > +	this_cpu_dec(*c->btree_key_cache.nr_pending);
> > +	kmem_cache_free(bch2_key_cache, obj);
> >  }
> >  
> >  static void bkey_cached_free(struct btree_key_cache *bc,
> >  			     struct bkey_cached *ck)
> >  {
> > -	struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
> > -
> >  	kfree(ck->k);
> >  	ck->k		= NULL;
> >  	ck->u64s	= 0;
> > 
> > @@ -102,7 +193,15 @@ static void bkey_cached_free(struct btree_key_cache *bc,
> >  	six_unlock_write(&ck->c.lock);
> >  	six_unlock_intent(&ck->c.lock);
> >  
> > -	call_srcu(&c->btree_trans_barrier, &ck->rcu, __bkey_cached_free);
> > +	preempt_disable();
> > +	struct pending_rcu_items *pending = this_cpu_ptr(bc->pending);
> > +	preempt_enable();
> > +	/*
> > +	 * pending is protected by a lock, we don't need preemption disabled -
> > +	 * and it expects to run in sleepable context:
> > +	 */
> > +	pending_rcu_item_enqueue(pending, ck);
> > +	this_cpu_inc(*bc->nr_pending);
> >  }
> >  
> >  static struct bkey_cached *
> > @@ -778,6 +877,13 @@ void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
> >  
> >  	if (bc->table_init_done)
> >  		rhashtable_destroy(&bc->table);
> > +
> > +	free_percpu(bc->nr_pending);
> > +
> > +	int cpu;
> > +	for_each_possible_cpu(cpu)
> > +		pending_rcu_items_exit(per_cpu_ptr(bc->pending, cpu));
> > +	free_percpu(bc->pending);
> >  }
> >  
> >  void bch2_fs_btree_key_cache_init_early(struct btree_key_cache *c)
> > @@ -789,6 +895,20 @@ int bch2_fs_btree_key_cache_init(struct btree_key_cache *bc)
> >  	struct bch_fs *c = container_of(bc, struct bch_fs, btree_key_cache);
> >  	struct shrinker *shrink;
> >  
> > +#ifdef __KERNEL__
> > +	bc->pending = alloc_percpu(struct pending_rcu_items);
> > +	if (!bc->pending)
> > +		return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> > +#endif
> > +	int cpu;
> > +	for_each_possible_cpu(cpu)
> > +		pending_rcu_items_init(per_cpu_ptr(bc->pending, cpu),
> > +				       &c->btree_trans_barrier, __bkey_cached_free);
> > +
> > +	bc->nr_pending = alloc_percpu(size_t);
> > +	if (!bc->nr_pending)
> > +		return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> > +
> >  	if (rhashtable_init(&bc->table, &bch2_btree_key_cache_params))
> >  		return -BCH_ERR_ENOMEM_fs_btree_cache_init;
> >  
> > @@ -815,13 +935,15 @@ void bch2_btree_key_cache_to_text(struct printbuf *out, struct btree_key_cache *
> >  	prt_printf(out, "keys:\t%lu\r\n",	atomic_long_read(&bc->nr_keys));
> >  	prt_printf(out, "dirty:\t%lu\r\n",	atomic_long_read(&bc->nr_dirty));
> >  	prt_printf(out, "table size:\t%u\r\n",	bc->table.tbl->size);
> > -
> > -	prt_printf(out, "\nshrinker:\n");
> > +	prt_newline(out);
> > +	prt_printf(out, "shrinker:\n");
> >  	prt_printf(out, "requested_to_free:\t%lu\r\n", bc->requested_to_free);
> >  	prt_printf(out, "freed:\t%lu\r\n", bc->freed);
> >  	prt_printf(out, "skipped_dirty:\t%lu\r\n", bc->skipped_dirty);
> >  	prt_printf(out, "skipped_accessed:\t%lu\r\n", bc->skipped_accessed);
> >  	prt_printf(out, "skipped_lock_fail:\t%lu\r\n", bc->skipped_lock_fail);
> > +	prt_newline(out);
> > +	prt_printf(out, "pending:\t%lu\r\n", per_cpu_sum(bc->nr_pending));
> >  }
> >  
> >  void bch2_btree_key_cache_exit(void)
> > diff --git a/fs/bcachefs/btree_key_cache_types.h b/fs/bcachefs/btree_key_cache_types.h
> > index e026c65f54e1..8d11cbac4c37 100644
> > --- a/fs/bcachefs/btree_key_cache_types.h
> > +++ b/fs/bcachefs/btree_key_cache_types.h
> > @@ -2,12 +2,29 @@
> >  #ifndef _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
> >  #define _BCACHEFS_BTREE_KEY_CACHE_TYPES_H
> >  
> > +#include "darray.h"
> > +
> > +struct pending_rcu_items;
> > +typedef void (*pending_rcu_item_process_fn)(struct pending_rcu_items *, void *);
> > +
> > +struct pending_rcu_items {
> > +	struct srcu_struct	*srcu;
> > +	pending_rcu_item_process_fn process;
> > +	spinlock_t		lock;
> > +	bool			rcu_armed;
> > +	unsigned long		seq;
> > +	darray_voidp		objs[2];
> > +	struct rcu_head		rcu;
> > +};
> > +
> >  struct btree_key_cache {
> >  	struct rhashtable table;
> >  	bool table_init_done;
> >  
> >  	struct shrinker *shrink;
> >  	unsigned shrink_iter;
> > +	struct pending_rcu_items __percpu *pending;
> > +	size_t __percpu *nr_pending;
> >  
> >  	atomic_long_t nr_keys;
> >  	atomic_long_t nr_dirty;
> > diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
> > index 67f3efd9a007..56046f09ac65 100644
> > --- a/fs/bcachefs/btree_types.h
> > +++ b/fs/bcachefs/btree_types.h
> > @@ -396,8 +396,6 @@ struct bkey_cached {
> >  	u64			seq;
> >  
> >  	struct bkey_i		*k;
> > -
> > -	struct rcu_head		rcu;
> >  };
> >  
> >  static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
> > diff --git a/fs/bcachefs/darray.h b/fs/bcachefs/darray.h
> > index 4b340d13caac..cdace10f4572 100644
> > --- a/fs/bcachefs/darray.h
> > +++ b/fs/bcachefs/darray.h
> > @@ -19,6 +19,7 @@ struct {						\
> >  
> >  #define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
> >  
> > +typedef DARRAY(void *)	darray_voidp;
> >  typedef DARRAY(char)	darray_char;
> >  typedef DARRAY(char *)	darray_str;
> > 