Re: [PATCH] xfstests 311: test fsync with dm flakey V2

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 26 Apr 2013 08:45:56 +1000

On Thu, Apr 25, 2013 at 10:12:56AM -0400, Josef Bacik wrote:
> This test sets up a dm flakey target and then runs my fsync tester I've been
> using to verify btrfs's fsync() is working properly.  It will create a dm flakey
> device, mount it, run my test, make the flakey device start dropping writes, and
> then unmount the fs.  Then we mount it back up and make sure the md5sums match
> and then run fsck on the device to make sure we got a consistent fs.  I used the
> output from a run on BTRFS since it's the only one that passes this test
> properly.  I verified each test manually to make sure they were in fact valid
> files.  XFS and Ext4 both fail this test in one way or another.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@xxxxxxxxxxxx>
> ---
> V1->V2
> -make _test_check_fs take an argument on whether or not to force an exit, this is
> because if we failed to fsck we'd leave the dmflakey device around which was
> super annoying.

Why doesn't the standard trap command in the test code catch
the exit case and run the _cleanup function and unmount it?

FWIW, if this change is actually necessary, then it needs to be in a
separate patch.

> -fixed the drop caches bug (thanks Zach!)
> -fixed the output since XFS has a bug with that particular test, it leaves a 0
> length file behind which isn't right. 

What test, what bug, and why have you changed the test to work
around it?

.....
> @@ -478,7 +485,7 @@ _scratch_mkfs_ext4()
>  {
>  	local tmp_dir=/tmp/
>  
> -	/sbin/mkfs -t $FSTYP -- $MKFS_OPTIONS $* $SCRATCH_DEV \
> +	/sbin/mkfs -t $FSTYP -- -F $MKFS_OPTIONS $* $SCRATCH_DEV \
>  			2>$tmp_dir.mkfserr 1>$tmp_dir.mkfsstd
>  	local mkfs_status=$?

That seems like an unrelated bug fix?

> @@ -1041,6 +1048,27 @@ _require_command()
>      [ -n "$1" -a -x "$1" ] || _notrun "$_cmd utility required, skipped this test"
>  }
>  
> +# this test requires the device mapper flakey target
> +#
> +_require_dm_flakey()
> +{
> +    if [ "$HOSTOS" != "Linux" ]
> +    then
> +	_notrun "This test requires linux for dm flakey support"
> +    fi

No need to check this - any test that uses dm-flakey should have a
"_supported_os Linux" line in it.

> +    $DMSETUP_PROG targets | grep flakey >/dev/null 2>&1
> +    if [ $? -eq 0 ]
> +    then
> +	:
> +    else
> +	_notrun "This test requires dm flakey support"
> +    fi

[ $? -ne 0 ] && _notrun "This test requires dm flakey support"
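
With the HOSTOS check gone as well, the whole helper shrinks to a couple of
lines. A rough sketch follows, folding the status check into the pipeline
with || rather than testing $? separately. The _notrun and fake_dmsetup
definitions are stand-ins I made up so the snippet runs outside the harness
(the real _notrun lives in common/rc, and the stub's output line is invented):

```shell
# Stand-in for the common/rc helper so the sketch runs outside xfstests.
_notrun()
{
	echo "[not run] $*"
	exit 0
}

# Hypothetical stub that pretends the flakey target is available.
fake_dmsetup()
{
	echo "flakey           v1.3.0"
}
DMSETUP_PROG=fake_dmsetup

# this test requires the device mapper flakey target
#
_require_dm_flakey()
{
	$DMSETUP_PROG targets 2>/dev/null | grep flakey >/dev/null 2>&1 || \
		_notrun "This test requires dm flakey support"
}
```

The || form also leaves the function returning 0 when the target is present,
which the "[ ... ] && _notrun" form as the last statement would not.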

<snip all the "force exit" mess>

<snip the fsync-tester.c code>

FWIW, rebooting the machine at the end of the test should not be the
default behaviour of fsync-tester.c...

> diff --git a/tests/generic/311 b/tests/generic/311
> new file mode 100755
> index 0000000..3f7abe2
> --- /dev/null
> +++ b/tests/generic/311
> @@ -0,0 +1,177 @@
> +#! /bin/bash
> +# FS QA Test No. 311
> +#
> +#Verify a file systems fsync is working properly.  This won't catch problems
> +#with blockdev flushing, but at the very least it makes sure the file system is
> +#doing the right thing with fsync logically.

How is this different to any of the other tests that test fsync() is
working properly? Please describe what aspect of fsync is being
tested....

> +# creator
> +owner=jbacik@xxxxxxxxxxxx

We don't need to add these any more.

> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +status=1	# failure is the default!
> +
> +_cleanup()
> +{
> +	$UMOUNT_PROG $SCRATCH_MNT > /dev/null 2>&1
> +	$DMSETUP_PROG remove flakey-test > /dev/null 2>&1
> +}
> +
> +_cleanup

If we are called with a mounted scratch device, then something has
gone badly wrong somewhere else - tests *always* start with
unmounted test and scratch devices.

What I suspect has gone wrong here is that you've removed the line:

trap "_cleanup; exit \$status" 0 1 2 3 15

That will trigger the cleanup function whenever the test exits. I'd
say its absence is the cause of the problem that led you to add the
"force exit" crap to the filesystem check functions...
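
To illustrate what the trap buys you, here is a minimal standalone sketch.
run_demo and its "fail" argument are made up for the demo, the test body
runs in a subshell only so that the snippet does not exit the calling shell,
and _cleanup just prints a message instead of unmounting anything:

```shell
# Hypothetical demo wrapper: the body is a subshell so "exit" stays contained.
run_demo()
(
	status=1	# failure is the default!

	_cleanup()
	{
		# the real test would unmount and remove the dm device here
		echo "cleanup ran"
	}
	trap "_cleanup; exit \$status" 0 1 2 3 15

	echo "test body"
	# Early exit on error: status is still 1, so the trap reports failure.
	[ "$1" = "fail" ] && exit

	status=0
	exit
)
```

Both "run_demo" and "run_demo fail" print the cleanup message; only the
first exits with 0, because the trap is what turns $status into the exit
value.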

> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_need_to_be_root
> +_require_scratch
> +_require_dm_flakey
> +
> +[ -x $here/src/fsync-tester ] || _notrun "fsync-tester not build"
> +
> +rm -f $seqres.full
> +BLK_DEV_SIZE=`blockdev --getsz $SCRATCH_DEV`
> +FLAKEY_DEV=/dev/mapper/flakey-test
> +SEED=1
> +testfile=$SCRATCH_MNT/$seq.fsync
> +
> +_mount_flakey()
> +{
> +	_scratch_options mount
> +	_mount -t $FSTYP $SCRATCH_OPTIONS $MOUNT_OPTIONS $SELINUX_MOUNT_OPTIONS $* $FLAKEY_DEV $SCRATCH_MNT
> +}

Why do you need to open code all this? We typically don't do
this for loopback device mounts, so I'm not sure it is necessary
here. Indeed, if you are testing on the flakey device, you don't
want XFS using external log or realtime devices, so really just
something like:

mount -t $FSTYP $MOUNT_OPTIONS $FLAKEY_DEV $SCRATCH_MNT

will suffice...

> +_unmount_flakey()
> +{
> +	$UMOUNT_PROG $FLAKEY_DEV
> +}

Empty line after this needed. FWIW, why use $FLAKEY_DEV here and not
$SCRATCH_MNT like in the cleanup function?

> +_drop_writes()
> +{
> +	$DMSETUP_PROG suspend flakey-test

That freezes the filesystem, right?

> +	if [ $? -ne 0 ]; then
> +		echo "failed to suspend flakey-test"
> +		_unmount_flakey
> +		_cleanup
> +		exit
> +	fi

With a properly functioning trap that calls _cleanup(), this can be
replaced with:

	[ $? -ne 0 ] && _fatal "failed to suspend flakey-test"

Same for all the other error cases.
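
Roughly, the error path then looks like the sketch below. The _fatal here is
a stand-in modelled on the common/rc helper, DMSETUP_PROG=false is a stub
that makes every dmsetup command fail, and run_drop_writes is a made-up
subshell wrapper so the snippet can be run on its own:

```shell
# Stand-in for the common/rc helper so the sketch runs outside xfstests.
_fatal()
{
	echo "$*"
	status=1
	exit 1
}

_cleanup()
{
	# the real test would unmount and remove flakey-test here
	echo "cleanup ran"
}

# Hypothetical stub: a dmsetup whose every command fails.
DMSETUP_PROG=false

_drop_writes()
{
	$DMSETUP_PROG suspend flakey-test
	[ $? -ne 0 ] && _fatal "failed to suspend flakey-test"
	echo "not reached when suspend fails"
}

# Demo wrapper: subshell body so the exit does not leave the calling shell.
run_drop_writes()
(
	status=1
	trap "_cleanup; exit \$status" 0 1 2 3 15
	_drop_writes
)
```

No explicit _unmount_flakey or _cleanup call is needed in the error branch;
the exit inside _fatal fires the trap, which does the teardown.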

> +	$DMSETUP_PROG load flakey-test --table "0 $BLK_DEV_SIZE flakey $SCRATCH_DEV 0 0 180 1 drop_writes"

Given that there are 2 different table configurations, perhaps
defining them as variables will make it more obvious. e.g.

FLAKEY_TABLE="0 $BLK_DEV_SIZE flakey $SCRATCH_DEV 0 0 180 0"
FLAKEY_WRITE_TABLE="0 $BLK_DEV_SIZE flakey $SCRATCH_DEV 0 0 180 1 drop_writes"

> +_run_test()
> +{
> +	test_num=$1
> +	extra=""
> +
> +	[ $2 -eq 1 ] && extra="-d"

I'm assuming that $2 == 1 means "use direct IO" given it is not
actually documented? Perhaps "extra" is not such a good name?

> +	$here/src/fsync-tester -s $SEED -r -t $test_num $extra $testfile
> +	if [ $? -ne 0 ]; then
> +		_unmount_flakey
> +		_cleanup
> +		exit
> +	fi
> +
> +	_md5_checksum $testfile
> +	_drop_writes
> +	_unmount_flakey

So, _drop_writes suspends the dm-flakey device, freezes the
filesystem, turns off writes then thaws the filesystem, right?

If so, doesn't that mean you're not actually testing fsync() as the
freeze will effectively sync the entire filesystem before you start
dropping writes?

I can see why you want to stop unmount from writing back metadata to
simulate a crash, but if you've already frozen the filesystem then
writeback has already occurred before you stop the writes. So I
can't see how this is actually testing fsync - what it appears to be
testing is the filesystem freeze code...

[ This is precisely the issue that the XFS shutdown ioctls deal with:
they trigger an immediate forced shutdown of the filesystem that
prevents *any* further writes from being issued by the filesystem. No
sync operations get in the way and change the state of the filesystem
after the fsync call, so we know that what is on disk is what was
written by the sync/fsync calls being tested.

This is how we test sync/fsync in other XFS tests (e.g.
xfs/137-140), and this is the reason why us XFS people have
suggested that other filesystems should implement the ioctls for
this functionality rather than try to invent new ways of trying
to stop filesystems from writing back dirty metadata for fsync/sync
testing....

Besides, if a corruption is detected, you need a method of stopping
all dirty metadata from being written back in the filesystem to
prevent propagation of the corruption.  These ioctls should just be
an interface into that mechanism. ]

> +_cleanup
> +status=0
> +exit

No need to call _cleanup if you have a functioning trap. And, more
importantly, the only reason that status variable exists is so that
the trap function can ensure a correct exit value from the test.....

> diff --git a/tests/generic/group b/tests/generic/group
> index eb52833..a0830c1 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -113,3 +113,4 @@
>  308 auto quick
>  309 auto quick
>  310 auto
> +311 auto

How long does this take to run? It seems like the quick group would
be appropriate if it takes less than a minute. Also, fsync tests
fall under the categories of "metadata" and "log", so those groups
should probably be added, too.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
