Re: [PATCH 1/2] xfs: flush inodegc before swapon

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 5 Feb 2025 14:05:41 -0800

On Thu, Feb 06, 2025 at 08:53:59AM +1100, Dave Chinner wrote:
> On Wed, Feb 05, 2025 at 01:16:59PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 06, 2025 at 08:08:14AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 05, 2025 at 05:28:00PM +0100, Christoph Hellwig wrote:
> > > > Fix the brand new xfstest that tries to swapon on a recently unshared
> > > > file and use the chance to document the other bit of magic in this
> > > > function.
> > > 
> > > You haven't documented the magic at all - I have no clue what the
> > > bug being fixed is nor how adding an inodegc flush fixes anything
> > > to do with swap file activation....
> > > 
> > > > Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> > > > ---
> > > >  fs/xfs/xfs_aops.c | 18 +++++++++++++++++-
> > > >  1 file changed, 17 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index 69b8c2d1937d..c792297aa0a3 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -21,6 +21,7 @@
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_zone_alloc.h"
> > > >  #include "xfs_rtgroup.h"
> > > > +#include "xfs_icache.h"
> > > >  
> > > >  struct xfs_writepage_ctx {
> > > >  	struct iomap_writepage_ctx ctx;
> > > > @@ -685,7 +686,22 @@ xfs_iomap_swapfile_activate(
> > > >  	struct file			*swap_file,
> > > >  	sector_t			*span)
> > > >  {
> > > > -	sis->bdev = xfs_inode_buftarg(XFS_I(file_inode(swap_file)))->bt_bdev;
> > > > +	struct xfs_inode		*ip = XFS_I(file_inode(swap_file));
> > > > +
> > > > +	/*
> > > > +	 * Ensure inode GC has finished to remove unmapped extents, as the
> > > > +	 * reflink bit is only cleared once all previously shared extents
> > > > +	 * are unmapped.  Otherwise swapon could incorrectly fail on a
> > > > +	 * very recently unshared file.
> > > > +	 */
> > > > +	xfs_inodegc_flush(ip->i_mount);
> > > 
> > > The comment doesn't explain what this actually fixes. Inodes that
> > > are processed by inodegc *must* be unreferenced by the VFS, so it
> > > is not clear exactly what this is actually doing.
> > > 
> > > I'm guessing that the test in question is doing something like this:
> > > 
> > > 	file2 = clone(file1)
> > > 	unlink(file1)
> > > 	swapon(file2)
> > > 
> > > and so the swap file activation is racing with the background
> > > inactivation and extent removal of file1?
> > 
> > Yes, I think hch is referring to this:
> > https://lore.kernel.org/fstests/2c9ff99c2bcaec4412b0903e03949d5a3ad0d817.1736783467.git.fdmanana@xxxxxxxx/
> > 
> > > But in that case, the extents are being removed from file1, and at
> > > no time does that remove the reflink bit on file2. i.e. even if the
> > > inactivation of file1 results in all the extents in file2 no longer
> > > being shared, that only results in refcountbt updates and it does
> > > not get propagated back to file2's inode. i.e. file2 will still be
> > > marked as a reflink file containing shared extents.
> > 
> > Right, but the (iomap) swapfile activation code only errors out if the
> > filesystem gives it a mapping that is marked as shared.  So the reflink
> > flag isn't relevant here.
> > 
> > How about this for a better comment:
> > 
> > "Ensure inode GC has finished so that unlinked clones of this file have
> > been truncated and inactivated fully.  This is to ensure that walking
> > the swap file does not find any shared extents."
> 
> Even talking about it in terms of "inodegc" seems like
> misdirection to me. Now I understand what this flush is working
> around, it is clear to me that swapon could race the same way with
> any other operation that removes extents from cloned files (e.g.
> hole punch, truncate, etc).
> 
> However, from a user perspective, the only one that matters -right
> now- is unlink because of the deferred processing of extent removal.
> 
> But even that isn't a guarantee - if something else has that cloned
> file open, then the unlinked inode won't be queued for inodegc
> and so swapon will still fail regardless of the inodegc flush.
> 
> Hence I think this needs to explain the race with extent removal and
> cloned files, then explain that the inodegc flush is a workaround
> that applies only to a specific corner case w.r.t. unlinking clones
> before swapon is run. 
> 
> Something like:
> 
> /*
>  * Swap file activation is can race against concurrent shared extent

"..can race..."

>  * removal in files that have been cloned. If this happens,
>  * iomap_swapfile_iter() can fail because it encountered a shared
>  * extent even though an operation is in progress to remove those
>  * shared extents.
>  *
>  * This race becomes problematic when we defer extent removal
>  * operations beyond the end of a syscall (i.e. use async background
>  * processing algorithms). Users think the extents are no longer
>  * shared, but iomap_swapfile_iter() still sees them as shared
>  * because the refcountbt entries for the extents being removed have
>  * not yet been updated. Hence the swapon call fails unexpectedly.
>  *
>  * The race condition is currently most obvious from the unlink()
>  * operation as extent removal is deferred until after the last
>  * reference to the inode goes away. We then process the extent
>  * removal asynchronously, which triggers the "syscall completed but
>  * work not done" condition mentioned above. To close this race
>  * window, we need to flush any pending inodegc operations to ensure
>  * they have updated the refcountbt records before we try to map the
>  * swapfile.

Yes, this is a good explanation.

>  */
> 
> This explains the race condition we are working around, and it gives
> enough information to document that any other refcountbt updates we
> defer to background processing (either removals or inserts!) are
> going to need to be synchronised here.

There shouldn't be any refcount increments involving the swapfile
because the mm already took IOLOCK_EXCL for us.  But yes, there could
someday be more asynchronous decrements elsewhere in the filesystem.
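
For anyone following along, the race described above can be sketched as a
small userspace toy model.  This is not kernel code: struct toy_fs and every
toy_* function are hypothetical names made up for illustration, the
"refcountbt" is just an array of per-extent owner counts, and the -1 return
stands in for whatever error the real swapfile walk reports.  It only shows
why a flush before the walk helps in the unlink case, and why it does not
help when something else still holds the unlinked clone open.

```c
#include <stdbool.h>

#define TOY_NEXTENTS 4

/*
 * Toy stand-in for the filesystem state that matters here.
 * refcount[i] > 1 means extent i is still recorded as shared.
 */
struct toy_fs {
	int	refcount[TOY_NEXTENTS];
	bool	pending_gc;	/* unlinked clone queued for background work */
};

/* Cloning the file gives every extent a second owner. */
static void toy_clone(struct toy_fs *fs)
{
	for (int i = 0; i < TOY_NEXTENTS; i++)
		fs->refcount[i]++;
}

/*
 * Unlinking the clone returns right away; the extent removal is only
 * queued.  If the clone is held open elsewhere, nothing is queued at all.
 */
static void toy_unlink_clone(struct toy_fs *fs, bool held_open)
{
	if (!held_open)
		fs->pending_gc = true;
}

/* Stand-in for the flush: run any queued extent removal to completion. */
static void toy_gc_flush(struct toy_fs *fs)
{
	if (!fs->pending_gc)
		return;
	for (int i = 0; i < TOY_NEXTENTS; i++)
		fs->refcount[i]--;
	fs->pending_gc = false;
}

/* Stand-in for the swapfile walk: any shared extent fails activation. */
static int toy_swapon(struct toy_fs *fs, bool flush_first)
{
	if (flush_first)
		toy_gc_flush(fs);
	for (int i = 0; i < TOY_NEXTENTS; i++) {
		if (fs->refcount[i] > 1)
			return -1;
	}
	return 0;
}
```

Starting from a file whose extents all have a refcount of 1, calling
toy_clone() and then toy_unlink_clone(&fs, false) makes toy_swapon(&fs, false)
return -1, while toy_swapon(&fs, true) returns 0.  With
toy_unlink_clone(&fs, true) the flush has nothing to process and
toy_swapon(&fs, true) still returns -1, which is the "held open" corner case
mentioned earlier in the thread.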

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>