On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > Every now and then I see fstests failures on aarch64 (64k pages) that > trigger on the following sequence: > > mkfs.xfs $dev > mount $dev $mnt > touch $mnt/a > umount $mnt > xfs_db -c 'path /a' -c 'print' $dev > > 99% of the time this succeeds, but every now and then xfs_db cannot find > /a and fails. This turns out to be a race involving udev/blkid, the > page cache for the block device, and the xfs_db process. > > udev is triggered whenever anyone closes a block device or unmounts it. > The default udev rules invoke blkid to read the fs super and create > symlinks to the bdev under /dev/disk. For this, it uses buffered reads > through the page cache. > > xfs_db also uses buffered reads to examine metadata. There is no > coordination between xfs_db and udev, which means that they can run > concurrently. Note there is no coordination between the kernel and > blkid either. > > On a system with 64k pages, the page cache can cache the superblock and > the root inode (and hence the root dir) with the same 64k page. If > udev spawns blkid after the mkfs and the system is busy enough that it > is still running when xfs_db starts up, they'll both read from the same > page in the pagecache. > > The unmount writes updated inode metadata to disk directly. The XFS > buffer cache does not use the bdev pagecache, nor does it invalidate the > pagecache on umount. If the above scenario occurs, the pagecache no > longer reflects what's on disk, xfs_db reads the stale metadata, and > fails to find /a. Most of the time this succeeds because closing a bdev > invalidates the page cache, but when processes race, everyone loses. > > Fix the problem by invalidating the bdev pagecache after flushing the > bdev, so that xfs_db will see up to date metadata. > > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx> > --- > fs/xfs/xfs_buf.c | 1 + > 1 file changed, 1 insertion(+) > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index dde346450952..54c774af6e1c 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -1945,6 +1945,7 @@ xfs_free_buftarg( > list_lru_destroy(&btp->bt_lru); > > blkdev_issue_flush(btp->bt_bdev); > + invalidate_bdev(btp->bt_bdev); > fs_put_dax(btp->bt_daxdev, btp->bt_mount); > > kmem_free(btp); Looks OK and because XFS has multiple block devices we have to do this invalidation for each bdev. Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx> However: this does not look to be an XFS specific problem. If we look at reconfigure_super(), when it completes a remount-ro operation it calls invalidate_bdev() because: /* * Some filesystems modify their metadata via some other path than the * bdev buffer cache (eg. use a private mapping, or directories in * pagecache, etc). Also file data modifications go via their own * mappings. So If we try to mount readonly then copy the filesystem * from bdev, we could get stale data, so invalidate it to give a best * effort at coherency. */ if (remount_ro && sb->s_bdev) invalidate_bdev(sb->s_bdev); This is pretty much the same problem as this patch avoids for XFS in the unmount path, yes? Shouldn't we be adding a call to invalidate_bdev(sb->s_bdev) after the fs->kill_sb() call in deactivate_locked_super() so that this problem goes away for all filesystems? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx