On Thu, Feb 06, 2025 at 08:53:59AM +1100, Dave Chinner wrote: > On Wed, Feb 05, 2025 at 01:16:59PM -0800, Darrick J. Wong wrote: > > On Thu, Feb 06, 2025 at 08:08:14AM +1100, Dave Chinner wrote: > > > On Wed, Feb 05, 2025 at 05:28:00PM +0100, Christoph Hellwig wrote: > > > > Fix the brand new xfstest that tries to swapon on a recently unshared > > > > file and use the chance to document the other bit of magic in this > > > > function. > > > > > > You haven't documented the magic at all - I have no clue what the > > > bug being fixed is nor how adding an inodegc flush fixes anything > > > to do with swap file activation.... > > > > > > > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > > > > --- > > > > fs/xfs/xfs_aops.c | 18 +++++++++++++++++- > > > > 1 file changed, 17 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > > > > index 69b8c2d1937d..c792297aa0a3 100644 > > > > --- a/fs/xfs/xfs_aops.c > > > > +++ b/fs/xfs/xfs_aops.c > > > > @@ -21,6 +21,7 @@ > > > > #include "xfs_error.h" > > > > #include "xfs_zone_alloc.h" > > > > #include "xfs_rtgroup.h" > > > > +#include "xfs_icache.h" > > > > > > > > struct xfs_writepage_ctx { > > > > struct iomap_writepage_ctx ctx; > > > > @@ -685,7 +686,22 @@ xfs_iomap_swapfile_activate( > > > > struct file *swap_file, > > > > sector_t *span) > > > > { > > > > - sis->bdev = xfs_inode_buftarg(XFS_I(file_inode(swap_file)))->bt_bdev; > > > > + struct xfs_inode *ip = XFS_I(file_inode(swap_file)); > > > > + > > > > + /* > > > > + * Ensure inode GC has finished to remove unmapped extents, as the > > > > + * reflink bit is only cleared once all previously shared extents > > > > + * are unmapped. Otherwise swapon could incorrectly fail on a > > > > + * very recently unshare file. > > > > + */ > > > > + xfs_inodegc_flush(ip->i_mount); > > > > > > The comment doesn't explains what this actually fixes. Inodes that > > > are processed by inodegc *must* be unreferenced by the VFS, so it > > > is not clear exactly what this is actually doing. > > > > > > I'm guessing that the test in question is doing something like this: > > > > > > file2 = clone(file1) > > > unlink(file1) > > > swapon(file2) > > > > > > and so the swap file activation is racing with the background > > > inactivation and extent removal of file1? > > > > Yes, I think hch is referring to this: > > https://lore.kernel.org/fstests/2c9ff99c2bcaec4412b0903e03949d5a3ad0d817.1736783467.git.fdmanana@xxxxxxxx/ > > > > > But in that case, the extents are being removed from file1, and at > > > no time does that remove the reflink bit on file2. i.e. even if the > > > inactivation of file1 results in all the extents in file2 no longer > > > being shared, that only results in refcountbt updates and it does > > > not get propagated back to file2's inode. i.e. file2 will still be > > > marked as a reflink file containing shared extents. > > > > Right, but the (iomap) swapfile activation code only errors out if the > > filesystem gives it a mapping that is marked as shared. So the reflink > > flag isn't relevant here. > > > > How about this for a better comment: > > > > "Ensure inode GC has finished so that unlinked clones of this file have > > been truncated and inactivated fully. This is to ensure that walking > > the swap file does not find any shared extents." > > Even talking about it in terms on "inodegc" seems like > misdirection to me. Now I understand what this flush is working > around, it is clear to me that swapon could race the same way with > any other operation that removes extents from cloned files (e.g. > hole punch, truncate, etc). > > however, from a user perspective, the only one that matters -right > now- is unlink because of the deferred processing of extent removal. > > But even that isn't a guarantee - if something else has that cloned > file open, then the unlinked inode won't be queued for inodegc > and so swapon will still fail regardless of the inodegc flush. > > Hence I think this needs to explain the race with extent removal and > cloned files, then explain that the inodegc flush is a workaround > that applies only to a specific corner case w.r.t. unlinking clones > before swapon is run. > > Something like: > > /* > * Swap file activation is can race against concurrent shared extent "..can race..." > * removal in files that have been cloned. If this happens, > * iomap_swapfile_iter() can fail because it encountered a shared > * extent even though an operation is in progress to remove those > * shared extents. > * > * This race becomes problematic when we defer extent removal > * operations beyond the end of a syscall (i.e. use async background > * processing algorithms). Users think the extents are no longer > * shared, but iomap_swapfile_iter() still sees them as shared > * because the refcountbt entries for the extents being removed have > * not yet been updated. Hence the swapon call fails unexpectedly. > * > * The race condition is currently most obvious from the unlink() > * operation as extent removal is deferred until after the last > * reference to the inode goes away. We then process the extent > * removal asynchronously, hence triggers the "syscall completed but > * work not done" condition mentioned above. To close this race > * window, we need to flush any pending inodegc operations to ensure > * they have updated the refcountbt records before we try to map the > * swapfile. Yes, this is a good explanation. > */ > > This explains the race condition we are working around, and it gives > enough information to document that any other refcountbt updates we > defer to background processing (either removals or inserts!) are > going to need to be synchronised here. There shouldn't be any refcount increments involving the swapfile because the mm already took IOLOCK_EXCL for us. But yes, there could someday be more asynchronous decrements elsewhere in the filesystem. --D > > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >