Re: [PATCH 2/2] fstests: btrfs/011: Handle finished scrub/replace operation gracefully

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Wed, 18 Sep 2019 15:57:28 +0800

On 2019/9/18 下午3:54, Anand Jain wrote:
> On 18/9/19 2:56 PM, Qu Wenruo wrote:
>> [BUG]
>> When btrfs/011 is executed on a fast enough system (fully memory backed
>> VM, with test device has unsafe cache mode), the test can fail like
>> this:
>>
>>    btrfs/011 43s ... [failed, exit status 1]- output mismatch (see
>> /home/adam/xfstests-dev/results//btrfs/011.out.bad)
>>      --- tests/btrfs/011.out     2019-07-22 14:13:44.643333326 +0800
>>      +++ /home/adam/xfstests-dev/results//btrfs/011.out.bad     
>> 2019-09-18 14:49:28.308798022 +0800
>>      @@ -1,3 +1,4 @@
>>       QA output created by 011
>>       *** test btrfs replace
>>      -*** done
>>      +failed: '/usr/bin/btrfs replace cancel /mnt/scratch'
>>      +(see /home/adam/xfstests-dev/results//btrfs/011.full for details)
>>      ...
>>
>> [CAUSE]
>> Looking into the full output, it shows:
>>    ...
>>    Replace from /dev/mapper/test-scratch1 to /dev/mapper/test-scratch2
>>
>>    # /usr/bin/btrfs replace start -f /dev/mapper/test-scratch1
>> /dev/mapper/test-scratch2 /mnt/scratch
>>    # /usr/bin/btrfs replace cancel /mnt/scratch
>>    INFO: ioctl(DEV_REPLACE_CANCEL)"/mnt/scratch": not started
>>    failed: '/usr/bin/btrfs replace cancel /mnt/scratch'
>>
>> So this means the replace is already finished before we cancel it.
>> For fast system, it's very common.
>>
>> [FIX]
>> Instead of using _run_btrfs_util_prog which requires 0 as return value,
>> we just call "$BTRFS_UTIL_PROG replace cancel" and ignore all its
>> stderr/stdout, and completely rely on "$BTRFS_UTIL_PROG replace status"
>> output to verify the work.
>>
>> Furthermore if we finished replac before cancelling it, we should
>> replace again to switch the device back, or after the test case, btrfs
>> check will fail as there is no valid btrfs on that replaced device.
>>
>> Signed-off-by: Qu Wenruo <wqu@xxxxxxxx>
>> ---
>>   tests/btrfs/011 | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/tests/btrfs/011 b/tests/btrfs/011
>> index 89bb4d11..858b00e8 100755
>> --- a/tests/btrfs/011
>> +++ b/tests/btrfs/011
>> @@ -148,13 +148,25 @@ btrfs_replace_test()
>>           # background the replace operation (no '-B' option given)
>>           _run_btrfs_util_prog replace start -f $replace_options
>> $source_dev $target_dev $SCRATCH_MNT
>>           sleep 1
>> -        _run_btrfs_util_prog replace cancel $SCRATCH_MNT
>> +        # 1s is enough for fast system to finish replace, so here we
>> +        # ignore all the output, completely rely on later status
>> +        # output to determine
>> +        $BTRFS_UTIL_PROG replace cancel $SCRATCH_MNT &> /dev/null
>>             # 'replace status' waits for the replace operation to finish
>>           # before the status is printed
>>           $BTRFS_UTIL_PROG replace status $SCRATCH_MNT > $tmp.tmp 2>&1
>>           cat $tmp.tmp >> $seqres.full
>> -        grep -q canceled $tmp.tmp || _fail "btrfs replace status
>> (canceled) failed"
>> +        grep -q -e canceled -e finished $tmp.tmp ||\
>> +            _fail "btrfs replace status (canceled) failed"
>> +
>> +        # If replace finished before cancel, replace them back or
>> +        # the final fsck after test case will fail as there is no btrfs
>> +        # on the $source_dev anymore
>> +        if grep -q -e finished $tmp.tmp ; then
>> +            $BTRFS_UTIL_PROG replace start -Bf $replace_options \
>> +                $target_dev $source_dev $SCRATCH_MNT
>> +        fi
>>       else
>>           if [ "${quick}Q" = "thoroughQ" ]; then
>>               # On current hardware, the thorough test runs
>>
> 
> Faced the same problem before. But I don't have a good solution
> to fix. Because the idea of the test case appears to test the replace
> cancel. This change makes it not to error if replace is not running
> in the first place.

Then what about _notrun if we find that replace/scrub finished too early?

BTW, if we really want to test all regular scrub/replace with
cancel/finish combination, we should split this test into 4 test cases,
then we can completely afford to skip 2 while still allow the finish
cases to run...

Thanks,
Qu

> 
> Thanks, Anand
> 

Attachment:
signature.asc

Description: OpenPGP digital signature