Re: [PATCH v2] xfstests 293: test xfs direct IO nested transaction deadlock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jan 18, 2013 at 11:28:18AM +1100, Dave Chinner wrote:
> On Thu, Jan 17, 2013 at 03:27:49PM +0800, Eryu Guan wrote:
> > Regression test case for commit:
> > 
> > 437a255 xfs: fix direct IO nested transaction deadlock.
> > 
> > Signed-off-by: Eryu Guan <eguan@xxxxxxxxxx>
> 
> Couple of things:
> 
> > +_scratch_mkfs_xfs -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b >/dev/null 2>&1
> > +_scratch_mount >/dev/null 2>&1
> > +
> > +TESTDIR="$SCRATCH_MNT/testdir"
> > +mkdir -p $TESTDIR
> 
> That's far too similar to TEST_DIR - the mount point for the test
> device.  Reading the test makes me wonder why fsstress is beig run
> on the test device, while the freezes are being done on the scratch
> device.
> 
> Other tests use $testdir, which is fine because the difference in
> case makes it obvious that it's not the test device mount point. I'd
> prefer you cal it something like "STRESS_DIR", though, so it's
> obvious what it is being used for...

Make sense, I'll change it to STRESS_DIR
> 
> > +$FSSTRESS_PROG -d $TESTDIR -n 100 -p 1000 $FSSTRESS_AVOID >/dev/null 2>&1 &
> > +
> > +# Freeze/unfreeze file system randomly
> > +echo "Start freeze/unfreeze randomly" | tee -a $seq.full
> > +LOOP=10
> > +while [ $LOOP -gt 0 ];do
> > +	TIMEOUT=`expr $RANDOM % 5`
> > +	sleep $TIMEOUT
> > +	echo "* Freeze file system after sleeping $TIMEOUT seconds" >>$seq.full
> > +	xfs_freeze -f $SCRATCH_MNT
> > +	if [ $? -ne 0 ];then
> > +		echo " - Error: freeze filesystem failed" | tee -a $seq.full
> > +	fi
> > +	TIMEOUT=`expr $RANDOM % 3`
> > +	sleep $TIMEOUT
> > +	echo "* Unfreeze file system after sleeping $TIMEOUT seconds" >>$seq.full
> > +	xfs_freeze -u $SCRATCH_MNT
> > +	if [ $? -ne 0 ];then
> > +		echo " - Error: unfreeze filesystem failed" | tee -a $seq.full
> > +	fi
> > +	((LOOP--))
> 
> Does that sort of decrement construct work on older versions of
> bash?

I've tried it on RHEL5.8 and it worked as expected, bash version is 3.2.25,
so I guess that's fine. But I think it's better to use the expression

	let LOOP=$LOOP-1 

which many other tests used to decrease counter.
> 
> > +done
> > +echo "Test done" | tee -a $seq.full
> > +killall -q $FSSTRESS_PROG
> > +sync
> 
> No need for the sync, _check_scratch_fs will do it as part of the
> unmount. FWIW you probably need a "wait" call here to wait for all
> the fsstress processes to exit otherwise the unmount will fail.
> 
> In fact, I bet that's what the sync call does - many of the 1000
> fsstress processes will be blocked waiting for queued sys_sync()
> operations to complete as the kernel serialises them and fsstress
> issues them faster than they can be completed. queuing another sync
> will effectively wait for them all to complete and die. I often saw
> 700-800 of the fsstress processes stuck like this when testing the
> fix for the bug this test exposed...

Yes, that's what the sync call is for.

After more testings, I found that unmount succeeds without any wait/sync
call on upstream kernel(3.8-rc3), sync is needed on RHEL6 kernel to make
unmount success, wait won't help.

I have three proposals:
1. do not kill any fsstress processes, call wait to wait for them to die
2. kill fsstress and call wait (as what you suggested, but would fail on
   RHEL6 kernel, another RHEL bug needed?)
3. kill fsstress and call sync(leave the sync untouched), this will make
   it pass on both upstream and RHEL kernels

Any other better ways to go?

Thanks for the review!

Eryu

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs


[Index of Archives]     [Linux XFS Devel]     [Linux Filesystem Development]     [Filesystem Testing]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux