On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote:
> On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote:
> > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote:
> > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote:
> > > > > >
> > > > > > Right, background inactivation does not improve performance - it's
> > > > > > necessary to get the transactions out of the evict() path. All we
> > > > > > wanted was to ensure that there were no performance degradations as
> > > > > > a result of background inactivation, not that it was faster.
> > > > > >
> > > > > > If you want to confirm that there is an increase in cold cache
> > > > > > access when the batch size is increased, cpu profiles with 'perf
> > > > > > top'/'perf record/report' and CPU cache performance metric reporting
> > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread
> > > > > > where I mention those things to Paul.
> > > > >
> > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid?
> > > > > I'm not asking for that to happen this cycle and for backports Ian's
> > > > > patch is obviously fine.
> > > >
> > > > Yes, but not in the near term.
> > > >
> > > > > What I really want to avoid is the situation when we are stuck with
> > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing
> > > > > reused inodes would hurt XFS too badly. And the benchmarks in this
> > > > > thread do look like that.
> > > >
> > > > The simplest way I think is to have the XFS inode allocation track
> > > > "busy inodes" in the same way we track "busy extents". A busy extent
> > > > is an extent that has been freed by the user, but is not yet marked
> > > > free in the journal/on disk. If we try to reallocate that busy
> > > > extent, we either select a different free extent to allocate, or if
> > > > we can't find any we force the journal to disk, wait for it to
> > > > complete (hence unbusying the extents) and retry the allocation
> > > > again.
> > > >
> > > > We can do something similar for inode allocation - it's actually a
> > > > lockless tag lookup on the radix tree entry for the candidate inode
> > > > number. If we find the reclaimable radix tree tag set, then we select
> > > > a different inode. If we can't allocate a new inode, then we kick
> > > > synchronize_rcu() and retry the allocation, allowing inodes to be
> > > > recycled this time.
> > >
> > > I'm starting to poke around this area since it's become clear that the
> > > currently proposed scheme just involves too much latency (unless Paul
> > > chimes in with his expedited grace period variant, at which point I will
> > > revisit) in the fast allocation/recycle path. ISTM so far that a simple
> > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will
> > > have pretty much the same pattern of behavior as this patch: one
> > > synchronize_rcu() per batch.
> >
> > Apologies for being slow, but there have been some distractions. One of
> > the distractions was trying to put together a theoretically attractive
> > but massively overcomplicated implementation of
> > poll_state_synchronize_rcu_expedited(). It currently looks like a
> > somewhat suboptimal but much simpler approach is available. This
> > assumes that XFS is not in the picture until after both the scheduler
> > and workqueues are operational.
>
> No worries.. I don't think that would be a roadblock for us. ;)
>
> > And yes, the complicated version might prove necessary, but let's
> > see if this whole thing is even useful first. ;-)
>
> Indeed. This patch only really requires a single poll/sync pair of
> calls, so assuming the expedited grace period usage plays nice enough
> with typical !expedited usage elsewhere in the kernel for some basic
> tests, it would be fairly trivial to port this over and at least get an
> idea of what the worst case behavior might be with expedited grace
> periods, whether it satisfies the existing latency requirements, etc.
>
> Brian
>
> > In the meantime, if you want to look at an extremely unbaked view,
> > here you go:
> >
> > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
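For illustration only, the "skip busy inodes, sync rcu if unsuccessful"
allocation pattern described above might look roughly like the sketch
below.  Every helper name here is hypothetical (this is not the real XFS
API); the point is only the shape of the retry:

        /*
         * Hypothetical sketch: pick an inode number whose radix-tree entry
         * is not tagged reclaimable ("busy").  If only busy inodes remain,
         * wait out a grace period, after which recycling one is safe.
         */
        static int example_alloc_ino(struct example_pag *pag, unsigned long *inop)
        {
                /* Lockless radix-tree tag lookup, skipping reclaimable-tagged inodes. */
                if (example_find_ino(pag, inop, false /* no recycling */))
                        return 0;
                if (!example_have_free_inos(pag))
                        return -ENOSPC;
                /* Only busy inodes left: wait for the grace period, then recycle. */
                synchronize_rcu();
                return example_find_ino(pag, inop, true /* recycling OK */) ? 0 : -ENOSPC;
        }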
And here is a version that passes moderate rcutorture testing.  So no
obvious bugs.  Probably a few non-obvious ones, though!  ;-)

This commit is on -rcu's "dev" branch along with this rcutorture
addition:

cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives")

I will carry these in -rcu's "dev" branch until at least the upcoming
merge window, fixing bugs as and when they become apparent.  If I don't
hear otherwise by that time, I will create a tag for it and leave it
behind.

The backport to v5.17-rc2 just requires removing:

        mutex_init(&rnp->boost_kthread_mutex);

from rcu_init_one().  This line is added by this -rcu commit:

02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting")

Please let me know how it goes!

                                                        Thanx, Paul

------------------------------------------------------------------------

commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date:   Mon Jan 31 16:55:52 2022 -0800

    EXP rcu: Add polled expedited grace-period primitives

    This is an experimental proof of concept of polled expedited
    grace-period functions.  These functions are
    get_state_synchronize_rcu_expedited(), start_poll_synchronize_rcu_expedited(),
    poll_state_synchronize_rcu_expedited(), and cond_synchronize_rcu_expedited(),
    which are similar to get_state_synchronize_rcu(), start_poll_synchronize_rcu(),
    poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively.

    One limitation is that start_poll_synchronize_rcu_expedited() cannot
    be invoked before workqueues are initialized.

    Cc: Brian Foster <bfoster@xxxxxxxxxx>
    Cc: Dave Chinner <david@xxxxxxxxxxxxx>
    Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
    Cc: Ian Kent <raven@xxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
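As a rough illustration only (the my_batch structure and its hooks are
made up for this sketch, not part of the patch), the intended pairing is
one cookie per batch, along the lines of the single poll/sync pair that
Brian mentions above:

        struct my_batch {
                unsigned long rcu_cookie;  /* cookie from *_synchronize_rcu_expedited() */
                /* ... objects waiting to be reused ... */
        };

        static void my_batch_retire(struct my_batch *b)
        {
                /* Snapshot the state and start an expedited grace period if needed. */
                b->rcu_cookie = start_poll_synchronize_rcu_expedited();
        }

        static bool my_batch_ready(struct my_batch *b)
        {
                /* Cheap check: true once the expedited grace period has elapsed. */
                return poll_state_synchronize_rcu_expedited(b->rcu_cookie);
        }

        static void my_batch_reuse(struct my_batch *b)
        {
                /* Blocks only if the grace period has not yet completed. */
                cond_synchronize_rcu_expedited(b->rcu_cookie);
                /* ... objects in b may now be recycled ... */
        }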
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 858f4d429946d..ca139b4b2d25f 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate)
         might_sleep();
 }
 
+static inline unsigned long get_state_synchronize_rcu_expedited(void)
+{
+        return get_state_synchronize_rcu();
+}
+
+static inline unsigned long start_poll_synchronize_rcu_expedited(void)
+{
+        return start_poll_synchronize_rcu();
+}
+
+static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate)
+{
+        return poll_state_synchronize_rcu(oldstate);
+}
+
+static inline void cond_synchronize_rcu_expedited(unsigned long oldstate)
+{
+        cond_synchronize_rcu(oldstate);
+}
+
 extern void rcu_barrier(void);
 
 static inline void synchronize_rcu_expedited(void)
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 76665db179fa1..eb774e9be21bf 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu);
 void rcu_momentary_dyntick_idle(void);
 void kfree_rcu_scheduler_running(void);
 bool rcu_gp_might_be_stalled(void);
+unsigned long get_state_synchronize_rcu_expedited(void);
+unsigned long start_poll_synchronize_rcu_expedited(void);
+bool poll_state_synchronize_rcu_expedited(unsigned long oldstate);
+void cond_synchronize_rcu_expedited(unsigned long oldstate);
 unsigned long get_state_synchronize_rcu(void);
 unsigned long start_poll_synchronize_rcu(void);
 bool poll_state_synchronize_rcu(unsigned long oldstate);
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 24b5f2c2de87b..5b61cf20c91e9 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -23,6 +23,13 @@
 #define RCU_SEQ_CTR_SHIFT        2
 #define RCU_SEQ_STATE_MASK        ((1 << RCU_SEQ_CTR_SHIFT) - 1)
 
+/*
+ * Low-order bit definitions for polled grace-period APIs.
+ */
+#define RCU_GET_STATE_FROM_EXPEDITED        0x1
+#define RCU_GET_STATE_USE_NORMAL        0x2
+#define RCU_GET_STATE_BAD_FOR_NORMAL        (RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL)
+
 /*
  * Return the counter portion of a sequence number previously returned
  * by rcu_seq_snap() or rcu_seq_current().
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e6ad532cffe78..5de36abcd7da1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
  */
 bool poll_state_synchronize_rcu(unsigned long oldstate)
 {
-        if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) {
+        if (rcu_seq_done(&rcu_state.gp_seq, oldstate) &&
+            !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) {
                 smp_mb(); /* Ensure GP ends before subsequent accesses. */
                 return true;
         }
@@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu);
  */
 void cond_synchronize_rcu(unsigned long oldstate)
 {
-        if (!poll_state_synchronize_rcu(oldstate))
+        if (!poll_state_synchronize_rcu(oldstate) &&
+            !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL))
                 synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
@@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void)
                         init_waitqueue_head(&rnp->exp_wq[3]);
                         spin_lock_init(&rnp->exp_lock);
                         mutex_init(&rnp->boost_kthread_mutex);
+                        raw_spin_lock_init(&rnp->exp_poll_lock);
+                        rnp->exp_seq_poll_rq = 0x1;
+                        INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp);
                 }
         }
 
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 926673ebe355f..19fc9acce3ce2 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -128,6 +128,10 @@ struct rcu_node {
         wait_queue_head_t exp_wq[4];
         struct rcu_exp_work rew;
         bool exp_need_flush;        /* Need to flush workitem? */
+        raw_spinlock_t exp_poll_lock;
+                /* Lock and data for polled expedited grace periods. */
+        unsigned long exp_seq_poll_rq;
+        struct work_struct exp_poll_wq;
 } ____cacheline_internodealigned_in_smp;
 
 /*
@@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp);
 static void check_cpu_stall(struct rcu_data *rdp);
 static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
                                      const unsigned long gpssdelay);
+
+/* Forward declarations for tree_exp.h. */
+static void sync_rcu_do_polled_gp(struct work_struct *wp);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 1a45667402260..728896f374fee 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void)
         destroy_work_on_stack(&rew.rew_work);
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
+
+/**
+ * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state
+ *
+ * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited()
+ * or poll_state_synchronize_rcu_expedited(), allowing them to determine
+ * whether or not a full expedited grace period has elapsed in the meantime.
+ */
+unsigned long get_state_synchronize_rcu_expedited(void)
+{
+        if (rcu_gp_is_normal())
+                return get_state_synchronize_rcu() |
+                       RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL;
+
+        // Any prior manipulation of RCU-protected data must happen
+        // before the load from ->expedited_sequence.
+        smp_mb(); /* ^^^ */
+        return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED;
+}
+EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited);
+
+/*
+ * Ensure that start_poll_synchronize_rcu_expedited() has the expedited
+ * RCU grace periods that it needs.
+ */
+static void sync_rcu_do_polled_gp(struct work_struct *wp)
+{
+        unsigned long flags;
+        struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq);
+        unsigned long s;
+
+        raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
+        s = rnp->exp_seq_poll_rq;
+        rnp->exp_seq_poll_rq |= 0x1;
+        raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
+        if (s & 0x1)
+                return;
+        while (!sync_exp_work_done(s))
+                synchronize_rcu_expedited();
+        raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
+        s = rnp->exp_seq_poll_rq;
+        if (!(s & 0x1) && !sync_exp_work_done(s))
+                queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
+        else
+                rnp->exp_seq_poll_rq |= 0x1;
+        raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
+}
+
+/**
+ * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period
+ *
+ * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited()
+ * or poll_state_synchronize_rcu_expedited(), allowing them to determine
+ * whether or not a full expedited grace period has elapsed in the meantime.
+ * If the needed grace period is not already slated to start, initiates
+ * that grace period.
+ */
+
+unsigned long start_poll_synchronize_rcu_expedited(void)
+{
+        unsigned long flags;
+        struct rcu_data *rdp;
+        struct rcu_node *rnp;
+        unsigned long s;
+
+        if (rcu_gp_is_normal())
+                return start_poll_synchronize_rcu() |
+                       RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL;
+
+        s = rcu_exp_gp_seq_snap();
+        rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id());
+        rnp = rdp->mynode;
+        raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
+        if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) {
+                rnp->exp_seq_poll_rq = s;
+                queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
+        }
+        raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
+
+        return s | RCU_GET_STATE_FROM_EXPEDITED;
+}
+EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited);
+
+/**
+ * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period
+ *
+ * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited()
+ *
+ * If a full expedited RCU grace period has elapsed since the earlier call
+ * from which oldstate was obtained, return @true, otherwise return @false.
+ * If @false is returned, it is the caller's responsibility to invoke
+ * this function later on until it does return @true.  Alternatively,
+ * the caller can explicitly wait for a grace period, for example, by
+ * passing @oldstate to cond_synchronize_rcu_expedited() or by directly
+ * invoking synchronize_rcu_expedited().
+ *
+ * Yes, this function does not take counter wrap into account.
+ * But counter wrap is harmless.  If the counter wraps, we have waited for
+ * more than 2 billion grace periods (and way more on a 64-bit system!).
+ * Those needing to keep oldstate values for very long time periods
+ * (several hours even on 32-bit systems) should check them occasionally
+ * and either refresh them or set a flag indicating that the grace period
+ * has completed.
+ *
+ * This function provides the same memory-ordering guarantees that would
+ * be provided by a synchronize_rcu_expedited() that was invoked at the
+ * call to the function that provided @oldstate, and that returned at the
+ * end of this function.
+ */
+bool poll_state_synchronize_rcu_expedited(unsigned long oldstate)
+{
+        WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED));
+        if (oldstate & RCU_GET_STATE_USE_NORMAL)
+                return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL);
+        if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK))
+                return false;
+        smp_mb(); /* Ensure GP ends before subsequent accesses. */
+        return true;
+}
+EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited);
+
+/**
+ * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period
+ *
+ * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited()
+ *
+ * If a full expedited RCU grace period has elapsed since the earlier
+ * call from which oldstate was obtained, just return.  Otherwise, invoke
+ * synchronize_rcu_expedited() to wait for a full grace period.
+ *
+ * Yes, this function does not take counter wrap into account.  But
+ * counter wrap is harmless.  If the counter wraps, we have waited for
+ * more than 2 billion grace periods (and way more on a 64-bit system!),
+ * so waiting for one additional grace period should be just fine.
+ *
+ * This function provides the same memory-ordering guarantees that would
+ * be provided by a synchronize_rcu_expedited() that was invoked at the
+ * call to the function that provided @oldstate, and that returned at the
+ * end of this function.
+ */
+void cond_synchronize_rcu_expedited(unsigned long oldstate)
+{
+        WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED));
+        if (poll_state_synchronize_rcu_expedited(oldstate))
+                return;
+        if (oldstate & RCU_GET_STATE_USE_NORMAL)
+                synchronize_rcu_expedited();
+        else
+                synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited);
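One usage note: because the expedited cookies carry the low-order
RCU_GET_STATE_FROM_EXPEDITED (and possibly RCU_GET_STATE_USE_NORMAL)
bits, a cookie obtained from the *_expedited() functions should be
handed back only to poll_state_synchronize_rcu_expedited() or
cond_synchronize_rcu_expedited(); the WARN_ON_ONCE() checks added above
to poll_state_synchronize_rcu() and cond_synchronize_rcu() are there to
catch that mismatch.  Illustrative only:

        unsigned long cookie = get_state_synchronize_rcu_expedited();

        /* OK: an expedited cookie paired with the expedited poller. */
        if (poll_state_synchronize_rcu_expedited(cookie))
                return;

        /* Wrong: can trip the WARN_ON_ONCE() now present in poll_state_synchronize_rcu(). */
        poll_state_synchronize_rcu(cookie);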