On Fri, Sep 20, 2019 at 05:18:00PM +0800, kaixuxia wrote:
> 
> On 2019/9/19 20:26, Brian Foster wrote:
> > On Thu, Sep 19, 2019 at 08:14:10PM +0800, kaixuxia wrote:
> >>
> >>
> >> On 2019/9/19 18:47, Brian Foster wrote:
> >>> On Thu, Sep 19, 2019 at 05:08:04PM +0800, kaixuxia wrote:
> >>>>
> >>>>
> >>>> On 2019/9/18 21:59, Brian Foster wrote:
> >>>>> On Wed, Sep 18, 2019 at 07:49:22PM +0800, kaixuxia wrote:
> >>>>>> There is an ABBA deadlock between the AGI and AGF when performing
> >>>>>> rename() with the RENAME_WHITEOUT flag. Add this test case to make
> >>>>>> sure the rename() call works well.
> >>>>>>
> >>>>>> Signed-off-by: kaixuxia <kaixuxia@xxxxxxxxxxx>
> >>>>>> ---
> >>>>>
> >>>>> FYI, for some reason your patch series isn't threaded on the mailing
> >>>>> list. I thought git send-email did this by default. Assuming you're not
> >>>>> explicitly using --no-thread, you might have to use the --thread option
> >>>>> so this gets posted as a proper series.
> >>>>>
> >>>> Yeah, thanks!
> >>>>>>  tests/xfs/512     | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  tests/xfs/512.out |  2 ++
> >>>>>>  tests/xfs/group   |  1 +
> >>>>>>  3 files changed, 99 insertions(+)
> >>>>>>  create mode 100755 tests/xfs/512
> >>>>>>  create mode 100644 tests/xfs/512.out
> >>>>>>
> >>>>>> diff --git a/tests/xfs/512 b/tests/xfs/512
> >>>>>> new file mode 100755
> >>>>>> index 0000000..a2089f0
> >>>>>> --- /dev/null
> >>>>>> +++ b/tests/xfs/512
> >>>>>> @@ -0,0 +1,96 @@
> >>>>>> +#! /bin/bash
> >>>>>> +# SPDX-License-Identifier: GPL-2.0
> >>>>>> +# Copyright (c) 2019 Tencent.  All Rights Reserved.
> >>>>>> +#
> >>>>>> +# FS QA Test 512
> >>>>>> +#
> >>>>>> +# Test the ABBA deadlock case between the AGI and AGF when performing
> >>>>>> +# a rename operation with the RENAME_WHITEOUT flag.
> >>>>>> +#
> >>>>>> +seq=`basename $0`
> >>>>>> +seqres=$RESULT_DIR/$seq
> >>>>>> +echo "QA output created by $seq"
> >>>>>> +
> >>>>>> +here=`pwd`
> >>>>>> +tmp=/tmp/$$
> >>>>>> +status=1	# failure is the default!
> >>>>>> +trap "_cleanup; exit \$status" 0 1 2 3 15
> >>>>>> +
> >>>>>> +_cleanup()
> >>>>>> +{
> >>>>>> +	cd /
> >>>>>> +	rm -f $tmp.*
> >>>>>> +}
> >>>>>> +
> >>>>>> +# get standard environment, filters and checks
> >>>>>> +. ./common/rc
> >>>>>> +. ./common/filter
> >>>>>> +. ./common/renameat2
> >>>>>> +
> >>>>>> +rm -f $seqres.full
> >>>>>> +
> >>>>>> +# real QA test starts here
> >>>>>> +_supported_fs xfs
> >>>>>> +_supported_os Linux
> >>>>>> +# A single AG will cause the default xfs_repair run to fail. This
> >>>>>> +# test needs a single-AG fs, so skip the check.
> >>>>>> +_require_scratch_nocheck
> >>>>>> +_requires_renameat2 whiteout
> >>>>>> +
> >>>>>> +filter_enospc() {
> >>>>>> +	sed -e '/^.*No space left on device.*/d'
> >>>>>> +}
> >>>>>> +
> >>>>>> +create_file()
> >>>>>> +{
> >>>>>> +	local target_dir=$1
> >>>>>> +	local files_count=$2
> >>>>>> +
> >>>>>> +	for i in $(seq 1 $files_count); do
> >>>>>> +		( echo > $target_dir/f$i ) 2>&1 | filter_enospc
> >>>>>> +	done
> >>>>>> +}
> >>>>>> +
> >>>>>> +rename_whiteout()
> >>>>>> +{
> >>>>>> +	local target_dir=$1
> >>>>>> +	local files_count=$2
> >>>>>> +
> >>>>>> +	# a long filename increases the chance that target_dp has to
> >>>>>> +	# allocate new blocks (acquiring the AGF lock) to store it
> >>>>>> +	longnamepre=`$PERL_PROG -e 'print "a"x200;'`
> >>>>>> +
> >>>>>> +	# now try to do the rename with the RENAME_WHITEOUT flag
> >>>>>> +	for i in $(seq 1 $files_count); do
> >>>>>> +		src/renameat2 -w $SCRATCH_MNT/f$i $target_dir/$longnamepre$i >/dev/null 2>&1
> >>>>>> +	done
> >>>>>> +}
> >>>>>> +
> >>>>>> +_scratch_mkfs_xfs -d agcount=1 >> $seqres.full 2>&1 ||
> >>>>>> +	_fail "mkfs failed"
> >>>>>
> >>>>> This appears to be the only XFS specific bit. Could it be
> >>>>> conditionalized using FSTYP such that this test could go under
> >>>>> tests/generic?
> >>>>>
> >>>> OK, I'll move this test to tests/generic by using FSTYP.
> >>>>
> >>>>>> +_scratch_mount
> >>>>>> +
> >>>>>> +# set the rename and create file counts
> >>>>>> +file_count=50000
> >>>>>> +
> >>>>>> +# create the directories needed for the create and rename operations
> >>>>>> +createdir=$SCRATCH_MNT/createdir
> >>>>>> +mkdir $createdir
> >>>>>> +renamedir=$SCRATCH_MNT/renamedir
> >>>>>> +mkdir $renamedir
> >>>>>> +
> >>>>>> +# create many small files for the rename with RENAME_WHITEOUT
> >>>>>> +create_file $SCRATCH_MNT $file_count
> >>>>>> +
> >>>>>> +# run creates and whiteout renames concurrently to hit the deadlock
> >>>>>> +rename_whiteout $renamedir $file_count &
> >>>>>> +create_file $createdir $file_count &
> >>>>>> +
> >>>>>
> >>>>> When I ran this test I noticed that the rename_whiteout task completed
> >>>>> renaming the 50k files before the create_file task created even 30k of
> >>>>> the 50k files. There's no risk of deadlock once one of these tasks
> >>>>> completes, right? If so, that seems like something that could be fixed
> >>>>> up.
> >>>>>
> >>>>> Beyond that though, the test itself ran for almost 19 minutes on a vm
> >>>>> with the deadlock fix. That seems like overkill to me for a test that's
> >>>>> so narrowly focused on a particular bug that it's unlikely to fail in
> >>>>> the future. If we can't find a way to get this down to a reasonable time
> >>>>> while still reproducing the deadlock, I'm kind of wondering if there's a
> >>>>> better approach to get more rename coverage from existing tests. For
> >>>>> example, could we add this support to fsstress and see if any of the
> >>>>> existing stress tests might trigger the original problem? Even if we
> >>>>> needed to add a new rename/create focused fsstress test, that might at
> >>>>> least be general enough to provide broader coverage.
> >>>>>
> >>>> Yeah, the rename_whiteout task runs faster than the create_file task,
> >>>> so maybe we can set two different file counts for them to reduce the
> >>>> test run time. This test ran for 380s on my vm with the fixed kernel,
> >>>> but we still need to find a way to reduce the run time, like the 19
> >>>> minutes case.
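
For illustration, the early-exit fixup hinted at above might look
something like the following untested sketch, reusing the helpers from
the patch (once the rename worker is done the deadlock can no longer
trigger, so letting the create workload keep running is wasted time):

  rename_whiteout $renamedir $file_count &
  rename_pid=$!
  create_file $createdir $file_count &
  create_pid=$!

  # reap the rename worker, then stop the now-pointless create worker
  wait $rename_pid
  kill $create_pid >/dev/null 2>&1	# best-effort; it may have finished
  wait $create_pid 2>/dev/null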
> >>>> Actually, in most cases, the deadlock happened when the
> >>>> rename_whiteout task had completed renaming hundreds of files. 50000
> >>>> is set just because this test takes 380s on my vm, which is
> >>>> acceptable, and the reproduce possibility is near 100%. So maybe we
> >>>> can choose a proper file count to make the test run faster. Of
> >>>> course, I'll also try to use fsstress and the TIME_FACTOR if they
> >>>> can help to reduce the run time.
> >>>>
> >>>
> >>> I think using different file counts as such is too unpredictable across
> >>> different test environments. If we end up with something like the
> >>> current test, I'd rather see explicit logic in the test to terminate the
> >>> workload thread when the rename thread completes. This probably would
> >>> have knocked 2-3 minutes off the slow runtime I reported above.
> >>>
> >>> That aside, I think the fsstress approach is preferable because there is
> >>> at least potential to avoid the need for a new test. The relevant
> >>> questions to me are:
> >>>
> >>> 1.) If you add renameat2 support to fsstress, do any of the existing
> >>> fsstress tests reproduce the original problem?
> >>
> >> Not sure about this; I need to research whether any of the existing
> >> fsstress tests can reproduce the problem.
> >
> > Right, but this will require some work to fsstress to add renameat2
> > support. To Eryu's earlier point, however, that is probably a useful
> > patch regardless of what approach we take below.
> 
> Yeah, if we take the fsstress approach to reproduce the deadlock we need
> to add renameat2 support to fsstress. The deadlock is a corner case and
> needs some special settings for a higher reproduce possibility, such as
> a smaller bsize and agcount, a longer rename target filename, and all
> the rename calls having the same target_dp to acquire the AGF lock
> (xfs_dir_createname()), so we may need a separate test using customized
> parameters (limited to rename(whiteout) and creates). If not, the
> possibility would be lower and we would also need a longer run time.
> 
> > 
> >>>
> >>> 2.) If not, can fsstress reproduce the problem using customized
> >>> parameters (i.e., limited to rename and creates)? If so, we may still
> >>> need a separate test, but it would be trivial in that it just invokes
> >>> fsstress with particular flags for a period of time.
> >>>
> >>> 3.) If not, then we need to find a way for this test to run quicker. At
> >>> this point, I'm curious how long it takes for this test to reproduce the
> >>> problem (on a broken kernel) on average once the setup portion
> >>> completes. More than a minute or two, for example, or tens of minutes..
> >>> etc.?
> >>>
> >> About five minutes with a 50000 file count on a broken kernel to
> >> reproduce the deadlock on my vm, and most of that time is spent
> >> preparing the 50000 empty files for the rename operation.
> >>
> > 
> > Ok, so how much time is that outside of the file creation part? You
> > could add some timestamps to the test to figure that out if necessary.
> > How many files were renamed before the deadlock occurred?
> > 
> About one or two minutes outside of the file creation on my vm.
> The number of files renamed before the deadlock occurred is not fixed;
> the most common range is from 500 to 10000 files. Of course, this range
> may change on different test environments...
> 

Ok. So FWIW a 10k file count and immediate exit after the renameat task
completes is a less noticeable 3m45s on my low end vm.
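
If renameat2 support did land in fsstress, the "trivial" wrapper test
from question 2 could presumably boil down to a single invocation along
these lines (a sketch only: the "rwhiteout" op name is hypothetical and
does not exist in fsstress today, while -z zeroes all default op
frequencies so that only creates and whiteout renames run):

  $FSSTRESS_PROG -z -f creat=50 -f rwhiteout=50 \
	-d $SCRATCH_MNT/fsstress -p 4 -n 10000 >> $seqres.full 2>&1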
I'd still prefer to see an fsstress test if possible, but the 10k file
count with an immediate exit might be a reasonable fallback option.

Brian

> > Brian
> > 
> >> An example where the deadlock happened when renaming 2729 files.
> >> Call trace:
> >> root  31829 ... D+ ... /renamedir/aaaaaaaaaaaaaaaaaaa...aaaaaaaaaaaaaaa2729
> >> # cat /proc/31829/stack
> >> [<0>] xfs_buf_lock+0x34/0xf0 [xfs]
> >> [<0>] xfs_buf_find+0x215/0x6c0 [xfs]
> >> [<0>] xfs_buf_get_map+0x37/0x230 [xfs]
> >> [<0>] xfs_buf_read_map+0x29/0x190 [xfs]
> >> [<0>] xfs_trans_read_buf_map+0x13d/0x520 [xfs]
> >> [<0>] xfs_read_agi+0xa8/0x160 [xfs]
> >> [<0>] xfs_iunlink_remove+0x6f/0x2a0 [xfs]
> >> [<0>] xfs_rename+0x57a/0xae0 [xfs]
> >> [<0>] xfs_vn_rename+0xe4/0x150 [xfs]
> >> [<0>] vfs_rename+0x1f4/0x7b0
> >> [<0>] do_renameat2+0x431/0x4c0
> >> [<0>] __x64_sys_renameat2+0x20/0x30
> >> [<0>] do_syscall_64+0x49/0x120
> >> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> 
> >>> Brian
> >>> 
> >>>>> Alternatively, what if this test ran a create/rename workload (on a
> >>>>> smaller fileset) for a fixed time of a minute or two and then exited? I
> >>>>> think it would be a reasonable compromise if the test still reproduced
> >>>>> on some smaller frequency, it's just not clear to me how effective such
> >>>>> a test would be without actually trying it. Maybe Eryu has additional
> >>>>> thoughts..
> >>>>>
> >>>>> Brian
> >>>>>
> >>>>>> +wait
> >>>>>> +echo Silence is golden
> >>>>>> +
> >>>>>> +# Failure comes in the form of a deadlock.
> >>>>>> +
> >>>>>> +# success, all done
> >>>>>> +status=0
> >>>>>> +exit
> >>>>>> diff --git a/tests/xfs/512.out b/tests/xfs/512.out
> >>>>>> new file mode 100644
> >>>>>> index 0000000..0aabdef
> >>>>>> --- /dev/null
> >>>>>> +++ b/tests/xfs/512.out
> >>>>>> @@ -0,0 +1,2 @@
> >>>>>> +QA output created by 512
> >>>>>> +Silence is golden
> >>>>>> diff --git a/tests/xfs/group b/tests/xfs/group
> >>>>>> index a7ad300..ed250d6 100644
> >>>>>> --- a/tests/xfs/group
> >>>>>> +++ b/tests/xfs/group
> >>>>>> @@ -509,3 +509,4 @@
> >>>>>>  509 auto ioctl
> >>>>>>  510 auto ioctl quick
> >>>>>>  511 auto quick quota
> >>>>>> +512 auto rename
> >>>>>> -- 
> >>>>>> 1.8.3.1
> >>>>>> 
> >>>>>> -- 
> >>>>>> kaixuxia
> >>>> 
> >>>> -- 
> >>>> kaixuxia
> >> 
> >> -- 
> >> kaixuxia
> 
> -- 
> kaixuxia
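
For completeness, the fixed-time variant floated earlier in the thread
(run the create/rename workload for a bounded interval, then stop
regardless of how many files were processed) could be sketched as
below, again reusing the patch's helpers and the standard TIME_FACTOR
scaling; this is an untested illustration, not a tested reproducer:

  # bound the workload by wall-clock time rather than by file count
  runtime=$((120 * TIME_FACTOR))
  end=$(( $(date +%s) + runtime ))

  create_file $createdir $file_count &
  create_pid=$!
  rename_whiteout $renamedir $file_count &
  rename_pid=$!

  while [ $(date +%s) -lt $end ]; do
	# stop polling early if the rename worker already finished
	kill -0 $rename_pid 2>/dev/null || break
	sleep 1
  done
  kill $create_pid $rename_pid >/dev/null 2>&1
  wait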