Re: [PATCH v3] btrfs: Add a test for dead looping balance after balance cancel

Filipe Manana <fdmanana@xxxxxxxxx> · Wed, 20 May 2020 13:38:11 +0100

On Wed, May 20, 2020 at 12:32 PM Qu Wenruo <quwenruo.btrfs@xxxxxxx> wrote:
>
>
>
> On 2020/5/20 下午6:02, Qu Wenruo wrote:
> >
> >
> > On 2020/5/20 下午5:29, Filipe Manana wrote:
> >> On Wed, May 20, 2020 at 8:59 AM Qu Wenruo <wqu@xxxxxxxx> wrote:
> >>>
> >>> Test if canceling a running balance can cause later balance to dead
> >>> loop.
> >>>
> >>> The ifx is titled "btrfs: relocation: Clear the DEAD_RELOC_TREE bit for
> >>>  orphan roots to prevent runaway balance".
> >>>
> >>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
> >>> ---
> >>> Changelog:
> >>> v2:
> >>> - Remove lsof debug output
> >>> v3:
> >>> - Remove ps debug output
> >>> ---
> >>>  tests/btrfs/213     | 64 +++++++++++++++++++++++++++++++++++++++++++++
> >>>  tests/btrfs/213.out |  2 ++
> >>>  tests/btrfs/group   |  1 +
> >>>  3 files changed, 67 insertions(+)
> >>>  create mode 100755 tests/btrfs/213
> >>>  create mode 100644 tests/btrfs/213.out
> >>>
> >>> diff --git a/tests/btrfs/213 b/tests/btrfs/213
> >>> new file mode 100755
> >>> index 00000000..f56b0279
> >>> --- /dev/null
> >>> +++ b/tests/btrfs/213
> >>> @@ -0,0 +1,64 @@
> >>> +#! /bin/bash
> >>> +# SPDX-License-Identifier: GPL-2.0
> >>> +# Copyright (C) 2020 SUSE Linux Products GmbH. All Rights Reserved.
> >>> +#
> >>> +# FS QA Test 213
> >>> +#
> >>> +# Test if canceling a running balance can lead to dead looping balance
> >>> +#
> >>> +seq=`basename $0`
> >>> +seqres=$RESULT_DIR/$seq
> >>> +echo "QA output created by $seq"
> >>> +
> >>> +here=`pwd`
> >>> +tmp=/tmp/$$
> >>> +status=1       # failure is the default!
> >>> +trap "_cleanup; exit \$status" 0 1 2 3 15
> >>> +
> >>> +_cleanup()
> >>> +{
> >>> +       cd /
> >>> +       rm -f $tmp.*
> >>> +}
> >>> +
> >>> +# get standard environment, filters and checks
> >>> +. ./common/rc
> >>> +. ./common/filter
> >>> +
> >>> +# remove previous $seqres.full before test
> >>> +rm -f $seqres.full
> >>> +
> >>> +# Modify as appropriate.
> >>> +_supported_fs btrfs
> >>> +_supported_os Linux
> >>> +_require_scratch
> >>> +_require_command "$KILLALL_PROG" killall
> >>> +
> >>> +_scratch_mkfs >> $seqres.full
> >>> +_scratch_mount
> >>> +
> >>> +runtime=4
> >>> +
> >>> +# Create enough IO so that we need around $runtime seconds to relocate it
> >>> +dd if=/dev/zero bs=1M of="$SCRATCH_MNT/file" oflag=sync status=none \
> >>> +       &> /dev/null &
> >>> +dd_pid=$!
> >>> +sleep $runtime
> >>> +"$KILLALL_PROG" -q dd &> /dev/null
> >>
> >> Do we really need the killall program? There's only one dd process.
> >>
> >> We should also kill the dd process at _cleanup(), as killing the test
> >> during the sleep above will result in the dd process not being killed.
> >
> > The main problem here is, I can't find a good way to kill dd.
> > plain 'kill $dd_pid' doesn't seem to kill it properly, as my debug
> > lsof/ps still shows dd running and failed to unmount the fs.
> >
> > 'kill -KILL $dd_pid' kills it well, but causes extra output for the
> > terminated dd.
> >
> > Only 'killall -q dd' works as expected.
> >
> > Any good advice on this?
>
> Oh, the fstests "kindly" overrides dd, "kill $dd_pid" only kills the
> child bash running that dd function, not the real dd command.
> I'll use xfs_io to ensure we get no extra wrapper.

Indeed, I was doing this in the test:

ps -eo pid,ppid,pgid,comm > /tmp/ps.txt
my_pid=$$
echo "dd_pid = $dd_pid mypid = $my_pid" >> /tmp/ps.txt
kill -9 $dd_pid
wait $dd_pid

And was noticing that the ps.txt file had:

23467   937 23467 check
...
23840 23467 23467 213
...
24034 23840 23467 213
24039 24034 23467 dd
...
dd_pid = 24034 mypid = 23840

So dd_pid matched a bash process running the test, which is a child of
the bash process that invoked dd.

Before digging further I saw your reply and checked that common/rc
provides a function named dd() which is a wrapper to dd and it invokes
dd twice, whence you needed killall.
Interesting, wasn't aware of that wrapper.

Thanks.

>
> Thanks,
> Qu
>
> >
> >>
> >>> +wait $dd_pid
> >>> +
> >>> +# Now balance should take at least $runtime seconds, we can cancel it at
> >>> +# $runtime/2 to ensure a success cancel.
> >>> +$BTRFS_UTIL_PROG balance start -d --bg "$SCRATCH_MNT"
> >>> +sleep $(($runtime / 2))
> >>> +$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT"
> >>> +
> >>> +# Now check if we can finish relocating metadata, which should finish very
> >>> +# quickly
> >>> +$BTRFS_UTIL_PROG balance start -m "$SCRATCH_MNT" >> $seqres.full
> >>
> >> Why redirect this one to $seqres.full and not the other balance? What
> >> kind of useful information it provides?
> >
> > Not really much, just an indicator to show that the balance finishes as
> > expected.
> > And we don't want to output it golden output, as mkfs change may create
> > more metadata chunks and cause difference.
> >
> > For other ones, like the one to be canceled, we don't really care that much.
> >
> > Thanks,
> > Qu
> >>
> >>> +
> >>> +echo "Silence is golden"
> >>> +
> >>> +# success, all done
> >>> +status=0
> >>> +exit
> >>> diff --git a/tests/btrfs/213.out b/tests/btrfs/213.out
> >>> new file mode 100644
> >>> index 00000000..bd8f2430
> >>> --- /dev/null
> >>> +++ b/tests/btrfs/213.out
> >>> @@ -0,0 +1,2 @@
> >>> +QA output created by 213
> >>> +Silence is golden
> >>> diff --git a/tests/btrfs/group b/tests/btrfs/group
> >>> index 8d65bddd..fe4d5bb3 100644
> >>> --- a/tests/btrfs/group
> >>> +++ b/tests/btrfs/group
> >>> @@ -215,3 +215,4 @@
> >>>  210 auto quick qgroup snapshot
> >>>  211 auto quick log prealloc
> >>>  212 auto balance dangerous
> >>> +213 auto fast balance dangerous
> >>
> >> fast -> quick
> >>
> >> Thanks.
> >>
> >>> --
> >>> 2.26.2
> >>>
> >>
> >>
> >
>

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”