On Fri, Aug 25, 2023 at 10:12:49AM +1000, Dave Chinner wrote:
> On Thu, Aug 24, 2023 at 04:21:25PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > Directly track which CPUs have contributed to the inodegc percpu lists
> > instead of trusting the cpu online mask.  This eliminates a theoretical
> > problem where the inodegc flush functions might fail to flush a CPU's
> > inodes if that CPU happened to be dying at exactly the same time.  Most
> > likely nobody's noticed this because the CPU dead hook moves the percpu
> > inodegc list to another CPU and schedules that worker immediately.  But
> > it's quite possible that this is a subtle race leading to UAF if the
> > inodegc flush were part of an unmount.
> > 
> > Further benefits: This reduces the overhead of the inodegc flush code
> > slightly by allowing us to ignore CPUs that have empty lists.  Better
> > yet, it reduces our dependence on the cpu online masks, which have been
> > the cause of confusion and drama lately.
> > 
> > Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_icache.c |   60 +++++++++++----------------------------------------
> >  fs/xfs/xfs_icache.h |    1 -
> >  fs/xfs/xfs_mount.h  |    6 +++--
> >  fs/xfs/xfs_super.c  |    4 +--
> >  4 files changed, 18 insertions(+), 53 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index e541f5c0bc25..7fd876e94ecb 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -443,7 +443,7 @@ xfs_inodegc_queue_all(
> >  	int			cpu;
> >  	bool			ret = false;
> >  
> > -	for_each_online_cpu(cpu) {
> > +	for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> >  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> >  		if (!llist_empty(&gc->list)) {
> >  			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > @@ -463,7 +463,7 @@ xfs_inodegc_wait_all(
> >  	int			error = 0;
> >  
> >  	flush_workqueue(mp->m_inodegc_wq);
> > -	for_each_online_cpu(cpu) {
> > +	for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> >  		struct xfs_inodegc	*gc;
> >  
> >  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> > @@ -1845,10 +1845,12 @@ xfs_inodegc_worker(
> >  						struct xfs_inodegc, work);
> >  	struct llist_node	*node = llist_del_all(&gc->list);
> >  	struct xfs_inode	*ip, *n;
> > +	struct xfs_mount	*mp = gc->mp;
> >  	unsigned int		nofs_flag;
> >  
> >  	ASSERT(gc->cpu == smp_processor_id());
> >  
> > +	cpumask_test_and_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
> 
> Why does this need to be a test-and-clear operation? If it is set,
> we clear it. If it is not set, clearing it is a no-op. Hence we
> don't need to test whether the bit is set first. Also,
> cpumask_clear_cpu() uses clear_bit(), which is an atomic operation,
> so clearing the bit isn't going to race with any other updates.

Aha, I didn't realize that clear_bit is atomic.  Oh, I guess
atomic_bitops.txt mentions that an RMW operation is atomic if there's a
separate '__' version.  Subtle. :(
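
(Jotting the convention down so I don't trip over it again -- a rough
sketch of my reading of Documentation/atomic_bitops.txt, not a quote of
the actual implementations:

	/*
	 * Plain RMW bitop: atomic, so concurrent updates to other bits
	 * in the same word can't be lost.  This is what
	 * cpumask_clear_cpu() uses.
	 */
	clear_bit(nr, addr);

	/*
	 * '__'-prefixed variant: non-atomic, only safe when the caller
	 * already serializes all access to the word, e.g. under a lock.
	 */
	__clear_bit(nr, addr);

so cpumask_clear_cpu() on its own is already safe here and the
test_and_ variant doesn't buy us anything.)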

> As it is, we probably want acquire semantics for the gc structure
> here (see below), so I think this likely should be:
> 
> 	/*
> 	 * Clear the cpu mask bit and ensure that we have seen the
> 	 * latest update of the gc structure associated with this
> 	 * CPU. This matches with the release semantics used when
> 	 * setting the cpumask bit in xfs_inodegc_queue.
> 	 */
> 	cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
> 	smp_mb__after_atomic();

<nod>

> >  	WRITE_ONCE(gc->items, 0);
> >  
> >  	if (!node)
> > @@ -1862,7 +1864,7 @@ xfs_inodegc_worker(
> >  	nofs_flag = memalloc_nofs_save();
> >  
> >  	ip = llist_entry(node, struct xfs_inode, i_gclist);
> > -	trace_xfs_inodegc_worker(ip->i_mount, READ_ONCE(gc->shrinker_hits));
> > +	trace_xfs_inodegc_worker(mp, READ_ONCE(gc->shrinker_hits));
> >  
> >  	WRITE_ONCE(gc->shrinker_hits, 0);
> >  	llist_for_each_entry_safe(ip, n, node, i_gclist) {
> > @@ -2057,6 +2059,7 @@ xfs_inodegc_queue(
> >  	struct xfs_inodegc	*gc;
> >  	int			items;
> >  	unsigned int		shrinker_hits;
> > +	unsigned int		cpu_nr;
> >  	unsigned long		queue_delay = 1;
> >  
> >  	trace_xfs_inode_set_need_inactive(ip);
> > @@ -2064,12 +2067,16 @@ xfs_inodegc_queue(
> >  	ip->i_flags |= XFS_NEED_INACTIVE;
> >  	spin_unlock(&ip->i_flags_lock);
> >  
> > -	gc = get_cpu_ptr(mp->m_inodegc);
> > +	cpu_nr = get_cpu();
> > +	gc = this_cpu_ptr(mp->m_inodegc);
> >  	llist_add(&ip->i_gclist, &gc->list);
> >  	items = READ_ONCE(gc->items);
> >  	WRITE_ONCE(gc->items, items + 1);
> >  	shrinker_hits = READ_ONCE(gc->shrinker_hits);
> >  
> > +	if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> > +		cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
> > +
> >  	/*
> >  	 * We queue the work while holding the current CPU so that the work
> >  	 * is scheduled to run on this CPU.
> 
> I think we need release/acquire memory ordering on this atomic bit
> set now. i.e. to guarantee that if the worker sees the cpumask bit
> set (with acquire semantics), it will always see the latest item
> added to the list. i.e.
> 
> 	/*
> 	 * Ensure the list add is always seen by anyone that
> 	 * find the cpumask bit set. This effectively gives
> 	 * the cpumask bit set operation release ordering semantics.
> 	 */
> 	smp_mb__before_atomic();
> 	if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> 		cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);

<nod>

> Also, same comment about put_cpu() vs put_cpu_var() as the last patch.

<nod>

> Otherwise this seems sane.

Thanks!

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx