On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote: > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote: > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote: > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote: > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote: > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote: > > > > > > > > > > > > > Right, background inactivation does not improve performance - it's > > > > > > > necessary to get the transactions out of the evict() path. All we > > > > > > > wanted was to ensure that there were no performance degradations as > > > > > > > a result of background inactivation, not that it was faster. > > > > > > > > > > > > > > If you want to confirm that there is an increase in cold cache > > > > > > > access when the batch size is increased, cpu profiles with 'perf > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread > > > > > > > where I mention those things to Paul. > > > > > > > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid? > > > > > > I'm not asking for that to happen this cycle and for backports Ian's > > > > > > patch is obviously fine. > > > > > > > > > > Yes, but not in the near term. > > > > > > > > > > > What I really want to avoid is the situation when we are stuck with > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing > > > > > > reused inodes would hurt XFS too badly. And the benchmarks in this > > > > > > thread do look like that. > > > > > > > > > > The simplest way I think is to have the XFS inode allocation track > > > > > "busy inodes" in the same way we track "busy extents". A busy extent > > > > > is an extent that has been freed by the user, but is not yet marked > > > > > free in the journal/on disk. If we try to reallocate that busy > > > > > extent, we either select a different free extent to allocate, or if > > > > > we can't find any we force the journal to disk, wait for it to > > > > > complete (hence unbusying the extents) and retry the allocation > > > > > again. > > > > > > > > > > We can do something similar for inode allocation - it's actually a > > > > > lockless tag lookup on the radix tree entry for the candidate inode > > > > > number. If we find the reclaimable radix tree tag set, the we select > > > > > a different inode. If we can't allocate a new inode, then we kick > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be > > > > > recycled this time. > > > > > > > > I'm starting to poke around this area since it's become clear that the > > > > currently proposed scheme just involves too much latency (unless Paul > > > > chimes in with his expedited grace period variant, at which point I will > > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will > > > > have pretty much the same pattern of behavior as this patch: one > > > > synchronize_rcu() per batch. > > > > > > Apologies for being slow, but there have been some distractions. > > > One of the distractions was trying to put together atheoretically > > > attractive but massively overcomplicated implementation of > > > poll_state_synchronize_rcu_expedited(). It currently looks like a > > > somewhat suboptimal but much simpler approach is available. This > > > assumes that XFS is not in the picture until after both the scheduler > > > and workqueues are operational. > > > > > > > No worries.. I don't think that would be a roadblock for us. ;) > > > > > And yes, the complicated version might prove necessary, but let's > > > see if this whole thing is even useful first. ;-) > > > > > > > Indeed. This patch only really requires a single poll/sync pair of > > calls, so assuming the expedited grace period usage plays nice enough > > with typical !expedited usage elsewhere in the kernel for some basic > > tests, it would be fairly trivial to port this over and at least get an > > idea of what the worst case behavior might be with expedited grace > > periods, whether it satisfies the existing latency requirements, etc. > > > > Brian > > > > > In the meantime, if you want to look at an extremely unbaked view, > > > here you go: > > > > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing > > And here is a version that passes moderate rcutorture testing. So no > obvious bugs. Probably a few non-obvious ones, though! ;-) > > This commit is on -rcu's "dev" branch along with this rcutorture > addition: > > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives") > > I will carry these in -rcu's "dev" branch until at least the upcoming > merge window, fixing bugs as and when they becom apparent. If I don't > hear otherwise by that time, I will create a tag for it and leave > it behind. > > The backport to v5.17-rc2 just requires removing: > > mutex_init(&rnp->boost_kthread_mutex); > > From rcu_init_one(). This line is added by this -rcu commit: > > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting") And with some alleged fixes of issues Neeraj found when reviewing this, perhaps most notably the ability to run on real-time kernels booted with rcupdate.rcu_normal=1. This version passes reasonably heavy-duty rcutorture testing. Must mean bugs in rcutorture... :-/ f93fa07011bd ("EXP rcu: Add polled expedited grace-period primitives") Again, please let me know how it goes! Thanx, Paul ------------------------------------------------------------------------ commit f93fa07011bd2460f222e570d17968baff21fa90 Author: Paul E. McKenney <paulmck@xxxxxxxxxx> Date: Mon Jan 31 16:55:52 2022 -0800 EXP rcu: Add polled expedited grace-period primitives This is an experimental proof of concept of polled expedited grace-period functions. These functions are get_state_synchronize_rcu_expedited(), start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(), and cond_synchronize_rcu_expedited(), which are similar to get_state_synchronize_rcu(), start_poll_synchronize_rcu(), poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively. One limitation is that start_poll_synchronize_rcu_expedited() cannot be invoked before workqueues are initialized. Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@xxxxxxxxxx/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@xxxxxxxxxx> Cc: Dave Chinner <david@xxxxxxxxxxxxx> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Cc: Ian Kent <raven@xxxxxxxxxx> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 858f4d429946d..ca139b4b2d25f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate) might_sleep(); } +static inline unsigned long get_state_synchronize_rcu_expedited(void) +{ + return get_state_synchronize_rcu(); +} + +static inline unsigned long start_poll_synchronize_rcu_expedited(void) +{ + return start_poll_synchronize_rcu(); +} + +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + return poll_state_synchronize_rcu(oldstate); +} + +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + cond_synchronize_rcu(oldstate); +} + extern void rcu_barrier(void); static inline void synchronize_rcu_expedited(void) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 76665db179fa1..eb774e9be21bf 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu); void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); +unsigned long get_state_synchronize_rcu_expedited(void); +unsigned long start_poll_synchronize_rcu_expedited(void); +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate); +void cond_synchronize_rcu_expedited(unsigned long oldstate); unsigned long get_state_synchronize_rcu(void); unsigned long start_poll_synchronize_rcu(void); bool poll_state_synchronize_rcu(unsigned long oldstate); diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 24b5f2c2de87b..5b61cf20c91e9 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -23,6 +23,13 @@ #define RCU_SEQ_CTR_SHIFT 2 #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1) +/* + * Low-order bit definitions for polled grace-period APIs. + */ +#define RCU_GET_STATE_FROM_EXPEDITED 0x1 +#define RCU_GET_STATE_USE_NORMAL 0x2 +#define RCU_GET_STATE_BAD_FOR_NORMAL (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL) + /* * Return the counter portion of a sequence number previously returned * by rcu_seq_snap() or rcu_seq_current(). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e6ad532cffe78..135d5e2bce879 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); */ bool poll_state_synchronize_rcu(unsigned long oldstate) { - if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + if (rcu_seq_done(&rcu_state.gp_seq, oldstate) && + !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) { smp_mb(); /* Ensure GP ends before subsequent accesses. */ return true; } @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); */ void cond_synchronize_rcu(unsigned long oldstate) { - if (!poll_state_synchronize_rcu(oldstate)) + if (!poll_state_synchronize_rcu(oldstate) || + WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) synchronize_rcu(); } EXPORT_SYMBOL_GPL(cond_synchronize_rcu); @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void) init_waitqueue_head(&rnp->exp_wq[3]); spin_lock_init(&rnp->exp_lock); mutex_init(&rnp->boost_kthread_mutex); + raw_spin_lock_init(&rnp->exp_poll_lock); + rnp->exp_seq_poll_rq = 0x1; + INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp); } } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 926673ebe355f..19fc9acce3ce2 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -128,6 +128,10 @@ struct rcu_node { wait_queue_head_t exp_wq[4]; struct rcu_exp_work rew; bool exp_need_flush; /* Need to flush workitem? */ + raw_spinlock_t exp_poll_lock; + /* Lock and data for polled expedited grace periods. */ + unsigned long exp_seq_poll_rq; + struct work_struct exp_poll_wq; } ____cacheline_internodealigned_in_smp; /* @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp); static void check_cpu_stall(struct rcu_data *rdp); static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, const unsigned long gpssdelay); + +/* Forward declarations for tree_exp.h. */ +static void sync_rcu_do_polled_gp(struct work_struct *wp); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 1a45667402260..4041988086830 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void) destroy_work_on_stack(&rew.rew_work); } EXPORT_SYMBOL_GPL(synchronize_rcu_expedited); + +/** + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + */ +unsigned long get_state_synchronize_rcu_expedited(void) +{ + if (rcu_gp_is_normal()) + return get_state_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + // Any prior manipulation of RCU-protected data must happen + // before the load from ->expedited_sequence, and this ordering is + // provided by rcu_exp_gp_seq_snap(). + return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited); + +/* + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited + * RCU grace periods that it needs. + */ +static void sync_rcu_do_polled_gp(struct work_struct *wp) +{ + unsigned long flags; + struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq); + unsigned long s; + + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + if (s & 0x1) + return; + while (!sync_exp_work_done(s)) + synchronize_rcu_expedited(); + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + s = rnp->exp_seq_poll_rq; + if (!(s & 0x1) && !sync_exp_work_done(s)) + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + else + rnp->exp_seq_poll_rq |= 0x1; + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); +} + +/** + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period + * + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited() + * or poll_state_synchronize_rcu_expedited(), allowing them to determine + * whether or not a full expedited grace period has elapsed in the meantime. + * If the needed grace period is not already slated to start, initiates + * that grace period. + */ + +unsigned long start_poll_synchronize_rcu_expedited(void) +{ + unsigned long flags; + struct rcu_data *rdp; + struct rcu_node *rnp; + unsigned long s; + + if (rcu_gp_is_normal()) + return start_poll_synchronize_rcu() | + RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL; + + s = rcu_exp_gp_seq_snap(); + rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id()); + rnp = rdp->mynode; + raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags); + if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) { + rnp->exp_seq_poll_rq = s; + queue_work(rcu_gp_wq, &rnp->exp_poll_wq); + } + raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags); + + return s | RCU_GET_STATE_FROM_EXPEDITED; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited); + +/** + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier call + * from which oldstate was obtained, return @true, otherwise return @false. + * If @false is returned, it is the caller's responsibility to invoke + * this function later on until it does return @true. Alternatively, + * the caller can explicitly wait for a grace period, for example, by + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly + * invoking synchronize_rcu_expedited(). + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (several hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. + */ +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (oldstate & RCU_GET_STATE_USE_NORMAL) + return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL); + if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK)) + return false; + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited); + +/** + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period + * + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited() + * + * If a full expedited RCU grace period has elapsed since the earlier + * call from which oldstate was obtained, just return. Otherwise, invoke + * synchronize_rcu_expedited() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. But + * counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!), + * so waiting for one additional grace period should be just fine. + * + * This function provides the same memory-ordering guarantees that would + * be provided by a synchronize_rcu_expedited() that was invoked at the + * call to the function that provided @oldstate, and that returned at the + * end of this function. + */ +void cond_synchronize_rcu_expedited(unsigned long oldstate) +{ + WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED)); + if (poll_state_synchronize_rcu_expedited(oldstate)) + return; + if (oldstate & RCU_GET_STATE_USE_NORMAL) + synchronize_rcu(); + else + synchronize_rcu_expedited(); +} +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited);