Re: [PATCH] xfs: require an rcu grace period before inode recycle

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Feb 07, 2022 at 08:30:03AM -0500, Brian Foster wrote:
> On Tue, Feb 01, 2022 at 02:00:28PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 31, 2022 at 08:22:43AM -0500, Brian Foster wrote:
> > > On Fri, Jan 28, 2022 at 01:39:11PM -0800, Paul E. McKenney wrote:
> > > > On Thu, Jan 27, 2022 at 02:01:25PM -0500, Brian Foster wrote:
> > > > > On Thu, Jan 27, 2022 at 04:26:09PM +1100, Dave Chinner wrote:
> > > > > > On Thu, Jan 27, 2022 at 04:19:34AM +0000, Al Viro wrote:
> > > > > > > On Wed, Jan 26, 2022 at 09:45:51AM +1100, Dave Chinner wrote:
> > > > > > > 
> > > > > > > > Right, background inactivation does not improve performance - it's
> > > > > > > > necessary to get the transactions out of the evict() path. All we
> > > > > > > > wanted was to ensure that there were no performance degradations as
> > > > > > > > a result of background inactivation, not that it was faster.
> > > > > > > > 
> > > > > > > > If you want to confirm that there is an increase in cold cache
> > > > > > > > access when the batch size is increased, cpu profiles with 'perf
> > > > > > > > top'/'perf record/report' and CPU cache performance metric reporting
> > > > > > > > via 'perf stat -dddd' are your friend. See elsewhere in the thread
> > > > > > > > where I mention those things to Paul.
> > > > > > > 
> > > > > > > Dave, do you see a plausible way to eventually drop Ian's bandaid?
> > > > > > > I'm not asking for that to happen this cycle and for backports Ian's
> > > > > > > patch is obviously fine.
> > > > > > 
> > > > > > Yes, but not in the near term.
> > > > > > 
> > > > > > > What I really want to avoid is the situation when we are stuck with
> > > > > > > keeping that bandaid in fs/namei.c, since all ways to avoid seeing
> > > > > > > reused inodes would hurt XFS too badly.  And the benchmarks in this
> > > > > > > thread do look like that.
> > > > > > 
> > > > > > The simplest way I think is to have the XFS inode allocation track
> > > > > > "busy inodes" in the same way we track "busy extents". A busy extent
> > > > > > is an extent that has been freed by the user, but is not yet marked
> > > > > > free in the journal/on disk. If we try to reallocate that busy
> > > > > > extent, we either select a different free extent to allocate, or if
> > > > > > we can't find any we force the journal to disk, wait for it to
> > > > > > complete (hence unbusying the extents) and retry the allocation
> > > > > > again.
> > > > > > 
> > > > > > We can do something similar for inode allocation - it's actually a
> > > > > > lockless tag lookup on the radix tree entry for the candidate inode
> > > > > > number. If we find the reclaimable radix tree tag set, the we select
> > > > > > a different inode. If we can't allocate a new inode, then we kick
> > > > > > synchronize_rcu() and retry the allocation, allowing inodes to be
> > > > > > recycled this time.
> > > > > 
> > > > > I'm starting to poke around this area since it's become clear that the
> > > > > currently proposed scheme just involves too much latency (unless Paul
> > > > > chimes in with his expedited grace period variant, at which point I will
> > > > > revisit) in the fast allocation/recycle path. ISTM so far that a simple
> > > > > "skip inodes in the radix tree, sync rcu if unsuccessful" algorithm will
> > > > > have pretty much the same pattern of behavior as this patch: one
> > > > > synchronize_rcu() per batch.
> > > > 
> > > > Apologies for being slow, but there have been some distractions.
> > > > One of the distractions was trying to put together atheoretically
> > > > attractive but massively overcomplicated implementation of
> > > > poll_state_synchronize_rcu_expedited().  It currently looks like a
> > > > somewhat suboptimal but much simpler approach is available.  This
> > > > assumes that XFS is not in the picture until after both the scheduler
> > > > and workqueues are operational.
> > > > 
> > > 
> > > No worries.. I don't think that would be a roadblock for us. ;)
> > > 
> > > > And yes, the complicated version might prove necessary, but let's
> > > > see if this whole thing is even useful first.  ;-)
> > > > 
> > > 
> > > Indeed. This patch only really requires a single poll/sync pair of
> > > calls, so assuming the expedited grace period usage plays nice enough
> > > with typical !expedited usage elsewhere in the kernel for some basic
> > > tests, it would be fairly trivial to port this over and at least get an
> > > idea of what the worst case behavior might be with expedited grace
> > > periods, whether it satisfies the existing latency requirements, etc.
> > > 
> > > Brian
> > > 
> > > > In the meantime, if you want to look at an extremely unbaked view,
> > > > here you go:
> > > > 
> > > > https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing
> > 
> > And here is a version that passes moderate rcutorture testing.  So no
> > obvious bugs.  Probably a few non-obvious ones, though!  ;-)
> > 
> > This commit is on -rcu's "dev" branch along with this rcutorture
> > addition:
> > 
> > cd7bd64af59f ("EXP rcutorture: Test polled expedited grace-period primitives")
> > 
> > I will carry these in -rcu's "dev" branch until at least the upcoming
> > merge window, fixing bugs as and when they becom apparent.  If I don't
> > hear otherwise by that time, I will create a tag for it and leave
> > it behind.
> > 
> > The backport to v5.17-rc2 just requires removing:
> > 
> > 	mutex_init(&rnp->boost_kthread_mutex);
> > 
> > From rcu_init_one().  This line is added by this -rcu commit:
> > 
> > 02a50b09c31f ("rcu: Add mutex for rcu boost kthread spawning and affinity setting")
> > 
> > Please let me know how it goes!
> > 
> 
> Thanks Paul. I gave this a whirl with a ported variant of this patch on
> top. There is definitely a notable improvement with the expedited grace
> periods. A few quick runs of the same batched alloc/free test (i.e. 10
> sample) I had run against the original version:
> 
> batch	baseline	baseline+bg	test	test+bg
> 
> 1	889954		210075		552911	25540
> 4	879540		212740		575356	24624
> 8	924928		213568		496992	26080
> 16	922960		211504		518496	24592
> 32	844832		219744		524672	28608
> 64	579968		196544		358720	24128
> 128	667392		195840		397696	22400
> 256	624896		197888		376320	31232
> 512	572928		204800		382464	46080
> 1024	549888		174080		379904	73728
> 2048	522240		174080		350208	106496
> 4096	536576		167936		360448	131072
> 
> So this shows a major improvement in the case where the system is
> otherwise idle. We still aren't quite at the baseline numbers, but
> that's not really the goal here because those numbers are partly driven
> by the fact that we unsafely reuse recently freed inodes in cases where
> proper behavior would be to allocate new inode chunks for a period of
> time. The core test numbers are much closer to the single threaded
> allocation rate (55k-65k inodes/sec) on this setup, so that is quite
> positive.
> 
> The "bg" variants are the same tests with 64 tasks doing unrelated
> pathwalk listings on a kernel source tree (on separate storage)
> concurrently in the background. The purpose of this was just to generate
> background (rcu) activity in the form of pathname lookups and whatnot
> and see how that impacts the results. This clearly affects both kernels,
> but the test kernel drops down closer to numbers reminiscent of the
> non-expedited grace period variant. Note that this impact seems to scale
> with increased background workload. With a similar test running only 8
> background tasks, the test kernel is pretty consistently in the
> 225k-250k (per 10s) range across the set of batch sizes. That's about
> half the core test rate, so still not as terrible as the original
> variant. ;)
> 
> In any event, this probably requires some thought/discussion (and more
> testing) on whether this is considered an acceptable change or whether
> we want to explore options to mitigate this further. I am still playing
> with some ideas to potentially mitigate grace period latency, so it
> might be worth seeing if anything useful falls out of that as well.
> Thoughts appreciated...

So this fixes a bug, but results in many 10s of percent performance
degradation?  Ouch...

Another approach is to use SLAB_TYPESAFE_BY_RCU.  This allows immediate
reuse of freed memory, but also requires pointer traversals to the memory
to do a revalidation operation.  (Sorry, no free lunch here!)

							Thanx, Paul

> Brian
> 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit dd896a86aebc5b225ceee13fcf1375c7542a5e2d
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Date:   Mon Jan 31 16:55:52 2022 -0800
> > 
> >     EXP rcu: Add polled expedited grace-period primitives
> >     
> >     This is an experimental proof of concept of polled expedited grace-period
> >     functions.  These functions are get_state_synchronize_rcu_expedited(),
> >     start_poll_synchronize_rcu_expedited(), poll_state_synchronize_rcu_expedited(),
> >     and cond_synchronize_rcu_expedited(), which are similar to
> >     get_state_synchronize_rcu(), start_poll_synchronize_rcu(),
> >     poll_state_synchronize_rcu(), and cond_synchronize_rcu(), respectively.
> >     
> >     One limitation is that start_poll_synchronize_rcu_expedited() cannot
> >     be invoked before workqueues are initialized.
> >     
> >     Cc: Brian Foster <bfoster@xxxxxxxxxx>
> >     Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> >     Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> >     Cc: Ian Kent <raven@xxxxxxxxxx>
> >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > 
> > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > index 858f4d429946d..ca139b4b2d25f 100644
> > --- a/include/linux/rcutiny.h
> > +++ b/include/linux/rcutiny.h
> > @@ -23,6 +23,26 @@ static inline void cond_synchronize_rcu(unsigned long oldstate)
> >  	might_sleep();
> >  }
> >  
> > +static inline unsigned long get_state_synchronize_rcu_expedited(void)
> > +{
> > +	return get_state_synchronize_rcu();
> > +}
> > +
> > +static inline unsigned long start_poll_synchronize_rcu_expedited(void)
> > +{
> > +	return start_poll_synchronize_rcu();
> > +}
> > +
> > +static inline bool poll_state_synchronize_rcu_expedited(unsigned long oldstate)
> > +{
> > +	return poll_state_synchronize_rcu(oldstate);
> > +}
> > +
> > +static inline void cond_synchronize_rcu_expedited(unsigned long oldstate)
> > +{
> > +	cond_synchronize_rcu(oldstate);
> > +}
> > +
> >  extern void rcu_barrier(void);
> >  
> >  static inline void synchronize_rcu_expedited(void)
> > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > index 76665db179fa1..eb774e9be21bf 100644
> > --- a/include/linux/rcutree.h
> > +++ b/include/linux/rcutree.h
> > @@ -40,6 +40,10 @@ bool rcu_eqs_special_set(int cpu);
> >  void rcu_momentary_dyntick_idle(void);
> >  void kfree_rcu_scheduler_running(void);
> >  bool rcu_gp_might_be_stalled(void);
> > +unsigned long get_state_synchronize_rcu_expedited(void);
> > +unsigned long start_poll_synchronize_rcu_expedited(void);
> > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate);
> > +void cond_synchronize_rcu_expedited(unsigned long oldstate);
> >  unsigned long get_state_synchronize_rcu(void);
> >  unsigned long start_poll_synchronize_rcu(void);
> >  bool poll_state_synchronize_rcu(unsigned long oldstate);
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 24b5f2c2de87b..5b61cf20c91e9 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -23,6 +23,13 @@
> >  #define RCU_SEQ_CTR_SHIFT	2
> >  #define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)
> >  
> > +/*
> > + * Low-order bit definitions for polled grace-period APIs.
> > + */
> > +#define RCU_GET_STATE_FROM_EXPEDITED	0x1
> > +#define RCU_GET_STATE_USE_NORMAL	0x2
> > +#define RCU_GET_STATE_BAD_FOR_NORMAL	(RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL)
> > +
> >  /*
> >   * Return the counter portion of a sequence number previously returned
> >   * by rcu_seq_snap() or rcu_seq_current().
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index e6ad532cffe78..5de36abcd7da1 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3871,7 +3871,8 @@ EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu);
> >   */
> >  bool poll_state_synchronize_rcu(unsigned long oldstate)
> >  {
> > -	if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) {
> > +	if (rcu_seq_done(&rcu_state.gp_seq, oldstate) &&
> > +	    !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL)) {
> >  		smp_mb(); /* Ensure GP ends before subsequent accesses. */
> >  		return true;
> >  	}
> > @@ -3900,7 +3901,8 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu);
> >   */
> >  void cond_synchronize_rcu(unsigned long oldstate)
> >  {
> > -	if (!poll_state_synchronize_rcu(oldstate))
> > +	if (!poll_state_synchronize_rcu(oldstate) &&
> > +	    !WARN_ON_ONCE(oldstate & RCU_GET_STATE_BAD_FOR_NORMAL))
> >  		synchronize_rcu();
> >  }
> >  EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
> > @@ -4593,6 +4595,9 @@ static void __init rcu_init_one(void)
> >  			init_waitqueue_head(&rnp->exp_wq[3]);
> >  			spin_lock_init(&rnp->exp_lock);
> >  			mutex_init(&rnp->boost_kthread_mutex);
> > +			raw_spin_lock_init(&rnp->exp_poll_lock);
> > +			rnp->exp_seq_poll_rq = 0x1;
> > +			INIT_WORK(&rnp->exp_poll_wq, sync_rcu_do_polled_gp);
> >  		}
> >  	}
> >  
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 926673ebe355f..19fc9acce3ce2 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -128,6 +128,10 @@ struct rcu_node {
> >  	wait_queue_head_t exp_wq[4];
> >  	struct rcu_exp_work rew;
> >  	bool exp_need_flush;	/* Need to flush workitem? */
> > +	raw_spinlock_t exp_poll_lock;
> > +				/* Lock and data for polled expedited grace periods. */
> > +	unsigned long exp_seq_poll_rq;
> > +	struct work_struct exp_poll_wq;
> >  } ____cacheline_internodealigned_in_smp;
> >  
> >  /*
> > @@ -476,3 +480,6 @@ static void rcu_iw_handler(struct irq_work *iwp);
> >  static void check_cpu_stall(struct rcu_data *rdp);
> >  static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp,
> >  				     const unsigned long gpssdelay);
> > +
> > +/* Forward declarations for tree_exp.h. */
> > +static void sync_rcu_do_polled_gp(struct work_struct *wp);
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 1a45667402260..728896f374fee 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -871,3 +871,154 @@ void synchronize_rcu_expedited(void)
> >  		destroy_work_on_stack(&rew.rew_work);
> >  }
> >  EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
> > +
> > +/**
> > + * get_state_synchronize_rcu_expedited - Snapshot current expedited RCU state
> > + *
> > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited()
> > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine
> > + * whether or not a full expedited grace period has elapsed in the meantime.
> > + */
> > +unsigned long get_state_synchronize_rcu_expedited(void)
> > +{
> > +	if (rcu_gp_is_normal())
> > +	return get_state_synchronize_rcu() |
> > +	       RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL;
> > +
> > +	// Any prior manipulation of RCU-protected data must happen
> > +	// before the load from ->expedited_sequence.
> > +	smp_mb();  /* ^^^ */
> > +	return rcu_exp_gp_seq_snap() | RCU_GET_STATE_FROM_EXPEDITED;
> > +}
> > +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_expedited);
> > +
> > +/*
> > + * Ensure that start_poll_synchronize_rcu_expedited() has the expedited
> > + * RCU grace periods that it needs.
> > + */
> > +static void sync_rcu_do_polled_gp(struct work_struct *wp)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_node *rnp = container_of(wp, struct rcu_node, exp_poll_wq);
> > +	unsigned long s;
> > +
> > +	raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
> > +	s = rnp->exp_seq_poll_rq;
> > +	rnp->exp_seq_poll_rq |= 0x1;
> > +	raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
> > +	if (s & 0x1)
> > +		return;
> > +	while (!sync_exp_work_done(s))
> > +		synchronize_rcu_expedited();
> > +	raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
> > +	s = rnp->exp_seq_poll_rq;
> > +	if (!(s & 0x1) && !sync_exp_work_done(s))
> > +		queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
> > +	else
> > +		rnp->exp_seq_poll_rq |= 0x1;
> > +	raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
> > +}
> > +
> > +/**
> > + * start_poll_synchronize_rcu_expedited - Snapshot current expedited RCU state and start grace period
> > + *
> > + * Returns a cookie to pass to a call to cond_synchronize_rcu_expedited()
> > + * or poll_state_synchronize_rcu_expedited(), allowing them to determine
> > + * whether or not a full expedited grace period has elapsed in the meantime.
> > + * If the needed grace period is not already slated to start, initiates
> > + * that grace period.
> > + */
> > +
> > +unsigned long start_poll_synchronize_rcu_expedited(void)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_data *rdp;
> > +	struct rcu_node *rnp;
> > +	unsigned long s;
> > +
> > +	if (rcu_gp_is_normal())
> > +		return start_poll_synchronize_rcu_expedited() |
> > +		       RCU_GET_STATE_FROM_EXPEDITED | RCU_GET_STATE_USE_NORMAL;
> > +
> > +	s = rcu_exp_gp_seq_snap();
> > +	rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id());
> > +	rnp = rdp->mynode;
> > +	raw_spin_lock_irqsave(&rnp->exp_poll_lock, flags);
> > +	if ((rnp->exp_seq_poll_rq & 0x1) || ULONG_CMP_LT(rnp->exp_seq_poll_rq, s)) {
> > +		rnp->exp_seq_poll_rq = s;
> > +		queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
> > +	}
> > +	raw_spin_unlock_irqrestore(&rnp->exp_poll_lock, flags);
> > +
> > +	return s | RCU_GET_STATE_FROM_EXPEDITED;
> > +}
> > +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu_expedited);
> > +
> > +/**
> > + * poll_state_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period
> > + *
> > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited()
> > + *
> > + * If a full expedited RCU grace period has elapsed since the earlier call
> > + * from which oldstate was obtained, return @true, otherwise return @false.
> > + * If @false is returned, it is the caller's responsibility to invoke
> > + * this function later on until it does return @true.  Alternatively,
> > + * the caller can explicitly wait for a grace period, for example, by
> > + * passing @oldstate to cond_synchronize_rcu_expedited() or by directly
> > + * invoking synchronize_rcu_expedited().
> > + *
> > + * Yes, this function does not take counter wrap into account.
> > + * But counter wrap is harmless.  If the counter wraps, we have waited for
> > + * more than 2 billion grace periods (and way more on a 64-bit system!).
> > + * Those needing to keep oldstate values for very long time periods
> > + * (several hours even on 32-bit systems) should check them occasionally
> > + * and either refresh them or set a flag indicating that the grace period
> > + * has completed.
> > + *
> > + * This function provides the same memory-ordering guarantees that would
> > + * be provided by a synchronize_rcu_expedited() that was invoked at the
> > + * call to the function that provided @oldstate, and that returned at the
> > + * end of this function.
> > + */
> > +bool poll_state_synchronize_rcu_expedited(unsigned long oldstate)
> > +{
> > +	WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED));
> > +	if (oldstate & RCU_GET_STATE_USE_NORMAL)
> > +		return poll_state_synchronize_rcu(oldstate & ~RCU_GET_STATE_BAD_FOR_NORMAL);
> > +	if (!rcu_exp_gp_seq_done(oldstate & ~RCU_SEQ_STATE_MASK))
> > +		return false;
> > +	smp_mb(); /* Ensure GP ends before subsequent accesses. */
> > +	return true;
> > +}
> > +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu_expedited);
> > +
> > +/**
> > + * cond_synchronize_rcu_expedited - Conditionally wait for an expedited RCU grace period
> > + *
> > + * @oldstate: value from get_state_synchronize_rcu_expedited() or start_poll_synchronize_rcu_expedited()
> > + *
> > + * If a full expedited RCU grace period has elapsed since the earlier
> > + * call from which oldstate was obtained, just return.  Otherwise, invoke
> > + * synchronize_rcu_expedited() to wait for a full grace period.
> > + *
> > + * Yes, this function does not take counter wrap into account.  But
> > + * counter wrap is harmless.  If the counter wraps, we have waited for
> > + * more than 2 billion grace periods (and way more on a 64-bit system!),
> > + * so waiting for one additional grace period should be just fine.
> > + *
> > + * This function provides the same memory-ordering guarantees that would
> > + * be provided by a synchronize_rcu_expedited() that was invoked at the
> > + * call to the function that provided @oldstate, and that returned at the
> > + * end of this function.
> > + */
> > +void cond_synchronize_rcu_expedited(unsigned long oldstate)
> > +{
> > +	WARN_ON_ONCE(!(oldstate & RCU_GET_STATE_FROM_EXPEDITED));
> > +	if (poll_state_synchronize_rcu_expedited(oldstate))
> > +		return;
> > +	if (oldstate & RCU_GET_STATE_USE_NORMAL)
> > +		synchronize_rcu_expedited();
> > +	else
> > +		synchronize_rcu();
> > +}
> > +EXPORT_SYMBOL_GPL(cond_synchronize_rcu_expedited);
> > 
> 



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux