On Tue, Jun 11, 2024 at 09:37:01AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 11, 2024 at 09:52:10AM +0100, Theodore Ts'o wrote:
> > Hi, I've recently found a flaky test, generic/085 on 6.10-rc2 and
> > fs-next. It's failing on both ext4 and xfs, and it reproduces more
> > easily with the dax config:
> > 
> > xfs/4k: 20 tests, 1 failures, 137 seconds
> > Flaky: generic/085: 5% (1/20)
> > xfs/dax: 20 tests, 11 failures, 71 seconds
> > Flaky: generic/085: 55% (11/20)
> > ext4/4k: 20 tests, 111 seconds
> > ext4/dax: 20 tests, 8 failures, 69 seconds
> > Flaky: generic/085: 40% (8/20)
> > Totals: 80 tests, 0 skipped, 20 failures, 0 errors, 388s
> > 
> > The failure is caused by a WARN_ON in fs_bdev_thaw() in fs/super.c:
> > 
> > static int fs_bdev_thaw(struct block_device *bdev)
> > {
> > 	...
> > 	sb = get_bdev_super(bdev);
> > 	if (WARN_ON_ONCE(!sb))
> > 		return -EINVAL;
> > 
> > The generic/085 test exercises races between the fs freeze/unfreeze
> > and mount/umount code paths, so this appears to be either a
> > VFS-level or block device layer bug. Modulo the warning, it looks
> > relatively harmless, so I'll just exclude generic/085 from my test
> > appliance, at least for now. Hopefully someone will have a chance
> > to take a look at it?
> 
> I think this can happen if fs_bdev_thaw races with unmount?
> 
> Let's say that the _umount $lvdev in the second loop in generic/085
> starts the unmount process, which clears SB_ACTIVE from the
> super_block. Then the first loop tries to freeze the bdev (and
> fails), and immediately tries to thaw the bdev. The thaw code calls
> fs_bdev_thaw because the unmount process is still running & so the
> fs is still holding the bdev. But get_bdev_super sees that SB_ACTIVE
> has been cleared from the super_block so it returns NULL, which
> trips the warning.
> 
> If that's correct, then I think the WARN_ON_ONCE should go away.

I've been trying to reproduce this with pmem yesterday and wasn't
able to.

SB_ACTIVE is cleared in generic_shutdown_super(). If we're in there
we know that there are no active references to the superblock
anymore. That includes freeze requests:

* Freezes are nestable from kernel and userspace but all nested
  freezers share a single active reference in sb->s_active.
* The nested freeze requests are counted in
  sb->s_writers.freeze_{kcount,ucount}.
* The last thaw request (sb->s_writers.freeze_kcount +
  sb->s_writers.freeze_ucount == 0) releases the sb->s_active
  reference.
* Nested freezes from the block layer via bdev_freeze() are
  additionally counted in bdev->bd_fsfreeze_count, protected by
  bdev->bd_fsfreeze_mutex.

The device mapper suspend logic that generic/085 uses relies on
bdev_freeze() and bdev_thaw() from the block layer. So all those dm
freezes should be counted in bdev->bd_fsfreeze_count. And since
device mapper has logic to ensure that only a single freeze request
is ever made, bdev->bd_fsfreeze_count in this test should be 1.

So when a bdev_thaw() request comes via dm_suspend():

* bdev_thaw() is called and encounters bdev->bd_fsfreeze_count == 1.
* As there aren't any fs-initiated freezes we know that
  sb->s_writers.freeze_kcount == 0 and
  sb->s_writers.freeze_ucount == 1 == bdev->bd_fsfreeze_count.
* When fs_bdev_thaw() is called the superblock is still valid and we
  hold at least one active reference, taken during the bdev_freeze()
  request.
* get_bdev_super() tries to grab an active reference to the
  superblock but fails. That can indeed happen because SB_ACTIVE has
  been cleared.
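To make the bookkeeping above concrete, here's a minimal user-space
model of the counting rules. The structs and model_* functions are
simplified stand-ins for sb->s_active,
sb->s_writers.freeze_{kcount,ucount} and bdev->bd_fsfreeze_count; it's
illustrative only, not the kernel implementation:

/*
 * Minimal user-space model of the freeze bookkeeping described above.
 * The structs are simplified stand-ins for struct super_block and
 * struct block_device, not the kernel definitions.
 */
#include <assert.h>
#include <stdio.h>

struct model_sb {
	int s_active;		/* active references to the sb */
	int freeze_kcount;	/* nested kernel-initiated freezes */
	int freeze_ucount;	/* nested userspace/bdev freezes */
};

struct model_bdev {
	int bd_fsfreeze_count;	/* block layer freeze nesting */
	struct model_sb *sb;
};

/* All nested freezers share a single active reference. */
static void model_bdev_freeze(struct model_bdev *bdev)
{
	if (bdev->bd_fsfreeze_count++ > 0)
		return;		/* already frozen via the block layer */
	if (bdev->sb->freeze_kcount + bdev->sb->freeze_ucount == 0)
		bdev->sb->s_active++;	/* first freezer pins the sb */
	bdev->sb->freeze_ucount++;
}

/* The last thaw request releases the shared active reference. */
static void model_bdev_thaw(struct model_bdev *bdev)
{
	if (--bdev->bd_fsfreeze_count > 0)
		return;
	bdev->sb->freeze_ucount--;
	if (bdev->sb->freeze_kcount + bdev->sb->freeze_ucount == 0)
		bdev->sb->s_active--;
}

int main(void)
{
	struct model_sb sb = { .s_active = 1 };	/* ref held by the mount */
	struct model_bdev bdev = { .sb = &sb };

	model_bdev_freeze(&bdev);	/* dm suspend freezes the fs */
	assert(sb.s_active == 2);	/* freeze pinned the superblock */
	model_bdev_thaw(&bdev);		/* dm thaws it again */
	assert(sb.s_active == 1);	/* only the mount reference left */
	printf("s_active=%d after freeze/thaw cycle\n", sb.s_active);
	return 0;
}

In the generic/085 scenario the single dm freeze keeps s_active
pinned above zero for the whole freeze/thaw window, which is exactly
why get_bdev_super() failing during the thaw is surprising.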
But for SB_ACTIVE to be cleared while a freeze is outstanding we
must've dropped the last active reference somewhere, forgotten to
take it during the original freeze, miscounted bdev->bd_fsfreeze_count,
or missed a nested freeze in sb->s_writers.freeze_{kcount,ucount}.

What's the kernel config and test config that's used?
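In the meantime, something like the below might help narrow down
which counter is off. It's an untested debug sketch (dump_thaw_state()
is a made-up helper meant to be called just before the WARN_ON_ONCE()
in fs_bdev_thaw()), not a proper patch:

/*
 * Untested debug sketch, not a proper patch: dump the superblock-side
 * freeze bookkeeping before fs_bdev_thaw() trips its WARN_ON_ONCE().
 * Reading bdev->bd_holder without bd_holder_lock is racy and only
 * acceptable for a debug printk like this.
 */
static void dump_thaw_state(struct block_device *bdev)
{
	struct super_block *sb = bdev->bd_holder;

	if (!sb) {
		pr_warn("fs_bdev_thaw: %pg has no holder\n", bdev);
		return;
	}
	pr_warn("fs_bdev_thaw: %pg s_active=%d freeze_kcount=%d freeze_ucount=%d SB_ACTIVE=%d SB_DYING=%d\n",
		bdev, atomic_read(&sb->s_active),
		sb->s_writers.freeze_kcount, sb->s_writers.freeze_ucount,
		!!(sb->s_flags & SB_ACTIVE), !!(sb->s_flags & SB_DYING));
}

If the counters look sane but SB_ACTIVE/SB_DYING show the superblock
mid-shutdown, that would point at the umount race Darrick described
rather than a miscount.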