Re: [PATCH 1/3] xfs: invalidate block device page cache during unmount

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Nov 29, 2022 at 04:23:22PM +1100, Dave Chinner wrote:
> On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > Every now and then I see fstests failures on aarch64 (64k pages) that
> > trigger on the following sequence:
> > 
> > mkfs.xfs $dev
> > mount $dev $mnt
> > touch $mnt/a
> > umount $mnt
> > xfs_db -c 'path /a' -c 'print' $dev
> > 
> > 99% of the time this succeeds, but every now and then xfs_db cannot find
> > /a and fails.  This turns out to be a race involving udev/blkid, the
> > page cache for the block device, and the xfs_db process.
> > 
> > udev is triggered whenever anyone closes a block device or unmounts it.
> > The default udev rules invoke blkid to read the fs super and create
> > symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
> > through the page cache.
> > 
> > xfs_db also uses buffered reads to examine metadata.  There is no
> > coordination between xfs_db and udev, which means that they can run
> > concurrently.  Note there is no coordination between the kernel and
> > blkid either.
> > 
> > On a system with 64k pages, the page cache can cache the superblock and
> > the root inode (and hence the root dir) with the same 64k page.  If
> > udev spawns blkid after the mkfs and the system is busy enough that it
> > is still running when xfs_db starts up, they'll both read from the same
> > page in the pagecache.
> > 
> > The unmount writes updated inode metadata to disk directly.  The XFS
> > buffer cache does not use the bdev pagecache, nor does it invalidate the
> > pagecache on umount.  If the above scenario occurs, the pagecache no
> > longer reflects what's on disk, xfs_db reads the stale metadata, and
> > fails to find /a.  Most of the time this succeeds because closing a bdev
> > invalidates the page cache, but when processes race, everyone loses.
> > 
> > Fix the problem by invalidating the bdev pagecache after flushing the
> > bdev, so that xfs_db will see up to date metadata.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_buf.c |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index dde346450952..54c774af6e1c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1945,6 +1945,7 @@ xfs_free_buftarg(
> >  	list_lru_destroy(&btp->bt_lru);
> >  
> >  	blkdev_issue_flush(btp->bt_bdev);
> > +	invalidate_bdev(btp->bt_bdev);
> >  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
> >  
> >  	kmem_free(btp);
> 
> Looks OK and because XFS has multiple block devices we have to do
> this invalidation for each bdev.
> 
> Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> However: this does not look to be an XFS specific problem.  If we
> look at reconfigure_super(), when it completes a remount-ro
> operation it calls invalidate_bdev() because:
> 
>        /*
>          * Some filesystems modify their metadata via some other path than the
>          * bdev buffer cache (eg. use a private mapping, or directories in
>          * pagecache, etc). Also file data modifications go via their own
>          * mappings. So If we try to mount readonly then copy the filesystem
>          * from bdev, we could get stale data, so invalidate it to give a best
>          * effort at coherency.
>          */
>         if (remount_ro && sb->s_bdev)
>                 invalidate_bdev(sb->s_bdev);
> 
> This is pretty much the same problem as this patch avoids for XFS in
> the unmount path, yes? Shouldn't we be adding a call to
> invalidate_bdev(sb->s_bdev) after the fs->kill_sb() call in
> deactivate_locked_super() so that this problem goes away for all
> filesystems?

I'm not sure this applies to everyone -- AFAICT, ext2/4 still write
everything through the bdev page cache, which means that the
invalidation isn't necessary there, except for perhaps the MMP block.

Years ago I remember Andreas rolling his eyes at how the kernel would
usually drop the whole pagecache between umount and e2fsck starting.
But I guess that's *usually* what we get anyways, so adding an
invalidation everywhere for the long tail of simple bdev filesystems
wouldn't hurt much.  Hmm.  Ok, I'm more convinced now.

I'll ask on the ext4 concall this week, and in the meantime try to
figure out what's the deal with btrfs.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux