On Thu, Jun 13, 2024 at 11:55:37AM +0200, Christian Brauner wrote:
> On Wed, Jun 12, 2024 at 03:47:16PM GMT, Theodore Ts'o wrote:
> > On Wed, Jun 12, 2024 at 01:25:07PM +0200, Christian Brauner wrote:
> > > I've been trying to reproduce this with pmem yesterday and wasn't able to.
> > >
> > > What's the kernel config and test config that's used?
> >
> > The kernel config can be found here:
> >
> > https://github.com/tytso/xfstests-bld/blob/master/kernel-build/kernel-configs/config-6.1
> >
> > Drop it into .config in the build directory of any kernel sources
> > newer than 6.1, and then run "make olddefconfig". This is all
> > automated in the install-kconfig script which I use:
> >
> > https://github.com/tytso/xfstests-bld/blob/master/kernel-build/install-kconfig
> >
> > The VM has 4 CPUs and 26GiB of memory, and the kernel is booted with
> > the command-line options "memmap=4G!9G memmap=9G!14G", which set up
> > fake /dev/pmem0 and /dev/pmem1 devices backed by RAM. This is my poor
> > engineer's way of testing DAX without needing access to expensive VMs
> > with pmem. :-)
> >
> > I'm assuming this is a timing-dependent bug which is easiest to
> > trigger on fast devices, so a ramdisk might also work. FWIW, I also
> > see failures relatively frequently using the ext4/nojournal
> > configuration on an SSD-backed cloud block device (GCE's Persistent
> > Disk SSD product).
> >
> > As a result, if you grab my xfstests-bld repo from github and then
> > run "qemu-xfstests -c ext4/nojournal -C 20 generic/085", it should
> > also reproduce. See Documentation/kvm-quickstart.md for more details.

> Thanks, Ted! Ok, I think I figured it out.
>
> P1
> dm_suspend()
> -> bdev_freeze()
>    mutex_lock(&bdev->bd_fsfreeze_mutex);
>    atomic_inc_return(&bdev->bd_fsfreeze_count); // == 1
>    mutex_unlock(&bdev->bd_fsfreeze_mutex);
>
>        P2                                         P3
>        setup_bdev_super()
>        bdev_file_open_by_dev();
>        atomic_read(&bdev->bd_fsfreeze_count); // != 0
>
>                                                   bdev_thaw()
>                                                   mutex_lock(&bdev->bd_fsfreeze_mutex);
>                                                   atomic_dec_return(&bdev->bd_fsfreeze_count); // == 0
>                                                   mutex_unlock(&bdev->bd_fsfreeze_mutex);
>                                                   bd_holder_lock();
>                                                   // grab passive reference on sb via sb->s_count
>                                                   bd_holder_unlock();
>                                                   // Sleep to be woken when superblock ready or dead
>        bdev_fput()
>        bd_holder_lock()
>        // yield bdev
>        bd_holder_unlock()
>
>        deactivate_locked_super()
>        // notify that superblock is dead
>                                                   // get woken and see that superblock is dead; fail
>
> In words this basically means that P1 calls dm_suspend() which calls
> into bdev_freeze() before the block device has been claimed by the
> filesystem. This brings bdev->bd_fsfreeze_count to 1 and no call into
> fs_bdev_freeze() is required.
>
> Now P2 tries to mount that frozen block device. It claims it and checks
> bdev->bd_fsfreeze_count. As it's elevated it aborts mounting, holding
> sb->s_umount all the time ofc.
>
> In the meantime P3 calls dm_resume(). It sees that the block device is
> already claimed by a filesystem and calls into fs_bdev_thaw().
>
> It takes a passive reference and realizes that the filesystem isn't
> ready yet. So P3 puts itself to sleep to wait for the filesystem to
> become ready.
>
> P2 puts the last active reference to the filesystem and marks it as
> dying.
>
> Now P3 gets woken, sees that the filesystem is dying and
> get_bdev_super() fails.
>
> So Darrick is correct about the fix but the reasoning is a bit
> different. :)
>
> Patch appended and on #vfs.fixes.
> From 35224b919d6778ca5dd11f76659ae849594bd2bf Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@xxxxxxxxxx>
> Date: Thu, 13 Jun 2024 11:38:14 +0200
> Subject: [PATCH] fs: don't misleadingly warn during thaw operations
>
> The block device may have been frozen before it was claimed by a
> filesystem. Concurrently another process might try to mount that
> frozen block device and has temporarily claimed the block device for
> that purpose, causing a concurrent fs_bdev_thaw() to end up here. The
> mounter is already about to abort mounting because they still saw an
> elevated bdev->bd_fsfreeze_count, so get_bdev_super() will return
> NULL in that case.
>
> For example, P1 calls dm_suspend() which calls into bdev_freeze() before
> the block device has been claimed by the filesystem. This brings
> bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is
> required.
>
> Now P2 tries to mount that frozen block device. It claims it and checks
> bdev->bd_fsfreeze_count. As it's elevated it aborts mounting.
>
> In the meantime P3 calls dm_resume(). It sees that the block device is
> already claimed by a filesystem and calls into fs_bdev_thaw().
>
> It takes a passive reference and realizes that the filesystem isn't
> ready yet. So P3 puts itself to sleep to wait for the filesystem to
> become ready.
>
> P2 puts the last active reference to the filesystem and marks it as
> dying. Now P3 gets woken, sees that the filesystem is dying and
> get_bdev_super() fails.

Wow, that's twisty. But it makes sense to me, so

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> Fixes: 49ef8832fb1a ("bdev: implement freeze and thaw holder operations")
> Cc: <stable@xxxxxxxxxxxxxxx>
> Reported-by: Theodore Ts'o <tytso@xxxxxxx>
> Link: https://lore.kernel.org/r/20240611085210.GA1838544@xxxxxxx
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> ---
>  fs/super.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index b72f1d288e95..095ba793e10c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1502,8 +1502,17 @@ static int fs_bdev_thaw(struct block_device *bdev)
>
>  	lockdep_assert_held(&bdev->bd_fsfreeze_mutex);
>
> +	/*
> +	 * The block device may have been frozen before it was claimed by a
> +	 * filesystem. Concurrently another process might try to mount that
> +	 * frozen block device and has temporarily claimed the block device for
> +	 * that purpose causing a concurrent fs_bdev_thaw() to end up here. The
> +	 * mounter is already about to abort mounting because they still saw an
> +	 * elevated bdev->bd_fsfreeze_count so get_bdev_super() will return
> +	 * NULL in that case.
> +	 */
>  	sb = get_bdev_super(bdev);
> -	if (WARN_ON_ONCE(!sb))
> +	if (!sb)
>  		return -EINVAL;
>
>  	if (sb->s_op->thaw_super)
> --
> 2.43.0
>