On 02 Dec 2021 at 21:15, Josef Bacik wrote:
> On Thu, Dec 02, 2021 at 02:45:24PM +0000, Filipe Manana wrote:
>> On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
>> > We've been seeing transient errors with any test that uses a dm device
>> > for the entirety of the time that we've been running nightly xfstests
>>
>> I have been having it on my test vms forever as well.
>> It's really annoying, but fortunately it doesn't happen too often.
>>
>
> Yeah, before this change we'd fail ~6 tests on every configuration on every
> overnight run.  With this change we fail 0.  It's rare, but with our
> continual testing it happens sooooooo much.
>
>> > runs.  This turns out to be because sometimes we get EBUSY while trying
>> > to create our new dm device.  Generally this is because the test comes
>> > right after another test that messes with the dm device, and thus we
>> > still have udev messing around with the device when DM tries to O_EXCL
>> > the block device.
>> >
>> > Add a UDEV_SETTLE_PROG before creating the device to make sure we can
>> > create our new dm device without getting this transient error.
>>
>> I suspect this might only make it seem the problem goes away but does not
>> really fix it.
>>
>> I say that for 2 reasons:
>>
>> 1) All tests that use dm end up calling _dmsetup_remove(), for example
>> through _log_writes_remove() or _cleanup_flakey(). Normally those are
>> called in the _cleanup() function, which ensures it's done even if the
>> test fails for some reason.
>>
>> So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().
>>
>> And I've seen the EBUSY failure happen even when the previous tests did
>> not use any dm device;
>>
>> 2) Some tests fail after creating the dm device and using it. For example
>> btrfs/206 often fails when it tries to fsck the filesystem:
>>
>> btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
>>     --- tests/btrfs/206.out	2020-10-16 23:13:46.554152652 +0100
>>     +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad	2021-12-01 21:09:46.317632589 +0000
>>     @@ -3,3 +3,5 @@
>>      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>      wrote 8192/8192 bytes at offset 0
>>      XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>     +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>     +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
>>     ...
>>
>> (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad' to see the entire diff)
>>
>> In the .full file I got:
>>
>> (...)
>> replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
>> replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
>> replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
>> _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>> *** fsck.btrfs output ***
>> ERROR: cannot open device '/dev/sdc': Device or resource busy
>> ERROR: cannot open file system
>> Opening filesystem to check...
>> *** end fsck.btrfs output
>> *** mount output ***
>>
>> The EBUSY failure is not when the test starts, but somewhere in the middle
>> of the replay loop when it calls fsck, or when it ends and the fstests
>> framework calls fsck.
>>
>> I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.
>>
>> So to me this suggests 2 things:
>>
>> 1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem
>> with btrfs/206 (and other tests) - the problem is fsck failing to open the
>> scratch device after it called _log_writes_remove() -> _dmsetup_remove(),
>> and not a failure to create the dm device;
>>
>> 2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
>> another UDEV_SETTLE_PROG there:
>>
>> diff --git a/common/rc b/common/rc
>> index 8e351f17..22b34677 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -4563,6 +4563,7 @@ _dmsetup_remove()
>>  	$UDEV_SETTLE_PROG >/dev/null 2>&1
>>  	$DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
>>  	$DMSETUP_PROG mknodes >/dev/null 2>&1
>> +	$UDEV_SETTLE_PROG >/dev/null 2>&1
>>  }
>>
>>  _dmsetup_create()
>>
>> I can't say if that change to _dmsetup_remove() is correct, or why it's
>> needed, as I really haven't spent time trying to figure out why the issue
>> happens.
>>
>
> I actually tried a few iterations before I settled on this one, but I was
> only trying to reproduce the EBUSY when creating the flakey device, I hadn't
> seen it with fsck.  So I originally started with your change, but it didn't
> fix the problem.  Then I did both, UDEV_SETTLE at the end of remove and at
> the beginning of create, and the problem went away, and then I removed the
> one from remove and the problem was still gone.
>
> But since I've made this change I also have been investigating another
> problem where we'll get EBUSY at mkfs time when we use SCRATCH_DEV_POOL.  I
> have a test patch in our staging branch to make sure it actually fixes it,
> but I have to add this to the start of _scratch_pool_mkfs as well.
>
> It turns out that udev is doing this thing where it writes to
> /sys/block/whatever/uevent to make sure that a KOBJ_CHANGE event gets sent
> out for the device.
>
> This is racing with the test doing a mount.  So the mount gets O_EXCL, which
> means the uevent doesn't get emitted until umount.  This would explain what
> you're seeing, we umount, we get the KOBJ_CHANGE event once the O_EXCL is
> dropped, udev runs, and then fsck gets an EBUSY.

I started debugging this issue late last week. Among several tests which
failed, xfs/033 was failing once in a while because the umount syscall
returned -EBUSY. Debugging this further led me to the following:

1. Mounting an fs causes udev to invoke xfs_spaceman (via udisksd =>
   xfs_info). This causes the per-cpu counter at mount->mnt_pcp->mnt_count
   to increase by 1.

2. An unmount at this stage causes the umount syscall to fail since the
   refcount on struct mount is greater than 2.

I changed my test disks from the regular /dev/sd* devices to loop devices. I
then made udev ignore loop devices with the following change in
/lib/udev/rules.d/60-persistent-storage.rules,

KERNEL!="sr*|loop*", IMPORT{builtin}="blkid"

With this change, xfs/033 executed 100 times without failure. However, other
tests which use device mapper devices are still failing arbitrarily.
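
In case it helps anyone trying to reproduce this, the loop device setup I
used was roughly the following (the image paths and sizes are illustrative,
not my exact config):

# Back the TEST and SCRATCH devices with sparse files instead of /dev/sd*.
truncate -s 10G /xfstests/test.img
truncate -s 10G /xfstests/scratch.img
TEST_DEV=$(losetup --find --show /xfstests/test.img)
SCRATCH_DEV=$(losetup --find --show /xfstests/scratch.img)

# After adding loop* to the KERNEL!= match in 60-persistent-storage.rules,
# reload the udev rules so the exclusion takes effect.
udevadm control --reload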
Looks like a simple mount operation causes udev to indirectly invoke
xfs_spaceman, as shown by the following perf script output:

bash 50034 [002] 2541.601278: sched:sched_process_fork: comm=bash pid=50034 child_comm=bash child_pid=50044
mount 50044 [003] 2541.601851: sched:sched_process_exec: filename=/usr/bin/mount pid=50044 old_pid=50044
systemd-udevd 173 [000] 2541.620525: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50048
systemd-udevd 173 [000] 2541.621071: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50051
systemd-udevd 173 [000] 2541.621562: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50052
systemd-udevd 173 [000] 2541.621971: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50053
systemd-udevd 173 [000] 2541.622657: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50054
udisksd 588 [000] 2541.675852: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50058
xfs_admin 50058 [000] 2541.676645: sched:sched_process_exec: filename=/usr/sbin/xfs_admin pid=50058 old_pid=50058
xfs_admin 50058 [000] 2541.677317: sched:sched_process_fork: comm=xfs_admin pid=50058 child_comm=xfs_admin child_pid=50059
xfs_db 50059 [000] 2541.677569: sched:sched_process_exec: filename=/usr/sbin/xfs_db pid=50059 old_pid=50059
udisksd 588 [003] 2541.687075: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50060
xfs_info 50060 [001] 2541.687843: sched:sched_process_exec: filename=/usr/sbin/xfs_info pid=50060 old_pid=50060
xfs_info 50060 [001] 2541.688362: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50061
xfs_info 50061 [003] 2541.688615: sched:sched_process_fork: comm=xfs_info pid=50061 child_comm=xfs_info child_pid=50062
xfs_info 50062 [001] 2541.688810: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50063
xfs_info 50062 [001] 2541.688944: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50064
losetup 50063 [003] 2541.689058: sched:sched_process_exec: filename=/usr/sbin/losetup pid=50063 old_pid=50063
tail 50064 [000] 2541.689169: sched:sched_process_exec: filename=/usr/bin/tail pid=50064 old_pid=50064
xfs_info 50060 [001] 2541.690200: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50065
findmnt 50065 [003] 2541.690501: sched:sched_process_exec: filename=/usr/bin/findmnt pid=50065 old_pid=50065
xfs_info 50060 [001] 2541.692154: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50066
xfs_spaceman 50066 [003] 2541.692345: sched:sched_process_exec: filename=/usr/sbin/xfs_spaceman pid=50066 old_pid=50066

xfs_spaceman in turn causes the refcount on struct mount to increase. Hence
an unmount operation can fail with -EBUSY.

>
> This is a very long email to say that udev is causing spurious failures
> because of behavior I don't entirely understand.  We're going to need to
> sprinkle in UDEV_SETTLE_PROG in different places to kill all of these
> different scenarios.
>
> What do we think is best here?  Put UDEV_SETTLE_PROG at the start of any
> function that needs to do O_EXCL?  So this would mean
>
> _dmsetup_create
> _dmsetup_remove
> *_mkfs
> *_mount
> *_check
>
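For reference, applying that idea to _dmsetup_create() would look something
like the snippet below. This is just a rough sketch written here for
illustration (including the error handling and comments); the exact function
body in your patch may differ:

_dmsetup_create()
{
	# Wait for any pending udev processing of the underlying device to
	# finish before dmsetup tries to open it with O_EXCL.
	$UDEV_SETTLE_PROG >/dev/null 2>&1
	$DMSETUP_PROG create "$@" >>$seqres.full 2>&1 || return 1
	$DMSETUP_PROG mknodes >/dev/null 2>&1
	# Settle again so the new dm node exists before the test uses it.
	$UDEV_SETTLE_PROG >/dev/null 2>&1
}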
To be honest, I don't understand udev well enough. But as pointed out above,
a simple invocation of mount causes xfs_spaceman to be executed (indirectly)
by udevd. Hence, maybe executing UDEV_SETTLE_PROG is not sufficient.

> That would be safest, and I can certainly do that.  My initial approach was
> just to do it where it was problematic, but the debugging I did yesterday
> around btrfs/223 failures and the fact that udev is queueing up events that
> get delivered at some point in the future makes it kind of hard to handle
> on a case-by-case basis.  Thanks,
>
> Josef

--
chandan