Re: [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem

Edwin Török <edvin.torok@xxxxxxxxxx> · Fri, 22 Mar 2019 14:42:27 +0000

On 21/03/2019 20:23, Darrick J. Wong wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely losing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
> 
> How is this different from generic/475?  Is there something specific to
> thin pools here (vs. using dm-error to simulate the errors)?

When I tried generic/475, it hung during unmount and never reached the data corruption part.
Thanks for the suggestion: dm-error would be a better fit than dm-thin; see below.

On 21/03/2019 21:26, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely losing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
> 
> Exactly what IO errors is dm-thinp generating here? If you run it
> out of space, then it triggers ENOSPC, not EIO. That's very, very
> different to iSCSI throwing random EIO errors..

I agree that dm-error would be a better starting point than dm-thin for this test.
I'll try to modify the test to use it and see whether I can get it to run to completion without hanging and still reproduce the corruption issue.
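
A rough sketch of the dm-error idea, not the final test: a device-mapper
device is switched between a "linear" table (I/O passes through) and an
"error" table (every I/O fails with EIO). The helper names working_table,
error_table and flap_device are hypothetical, and the real test would
presumably reuse the xfstests dm-error helpers rather than call dmsetup
directly.

```shell
# Hypothetical helpers; these two only print dm table lines.
working_table() {
    # $1 = backing block device, $2 = size in 512-byte sectors
    echo "0 $2 linear $1 0"
}

error_table() {
    # $1 = size in 512-byte sectors
    echo "0 $1 error"
}

# Flip an existing dm device between the two tables. Needs root and a
# real device, so it is only defined here, never called.
# $1 = dm name, $2 = backing device, $3 = sectors, $4 = rounds
flap_device() {
    i=0
    while [ "$i" -lt "$4" ]; do
        dmsetup suspend --nolockfs "$1"
        dmsetup load "$1" --table "$(error_table "$3")"
        dmsetup resume "$1"
        sleep 1
        dmsetup suspend --nolockfs "$1"
        dmsetup load "$1" --table "$(working_table "$2" "$3")"
        dmsetup resume "$1"
        sleep 1
        i=$((i + 1))
    done
}
```

Unlike running a thin pool out of space, the error target returns EIO, which
is closer to what an unreliable iSCSI connection produces.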

On 21/03/2019 21:26, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> +# now remount the filesystem without triggering IO errors,
>> +# and check that the filesystem is not corrupt
>> +_dmthin_cycle_mount
>> +# ls --color makes ls stat each file, which finds the corruption
> 
> Not sure it always does - ISTR that in the past if the dtype
> returned indicated the type of file, then ls would omit the stat
> just for the purposes of coloring....
> 
> And, realistically, the way we find /filesystem/ corruption is to
> run fsck/repair, not iterate the directory structure.

I don't disagree. However, GFS2's fsck is very noisy and complains about inconsistencies
even on a filesystem where I can otherwise list and read each entry correctly.
I wanted to draw a clear distinction between that noise and the corruption actually observed, so that the two bugs
can be fixed independently.

Perhaps the test should first do an 'ls/stat', and if that is fine then unmount and run the filesystem check as usual.
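
A minimal sketch of the 'ls/stat' step that does not depend on whether
ls --color decides to stat each entry: walk the tree with find and stat
every entry explicitly. The helper name stat_tree is hypothetical.

```shell
# Hypothetical helper: stat every entry under directory $1.
# Returns non-zero if the walk fails or any stat call fails.
stat_tree() {
    find "$1" -exec stat {} + >/dev/null
}

# Intended use inside the test (not run here):
#   stat_tree "$SCRATCH_MNT" || _fail "stat failed after remount"
```

If this passes, the test would then unmount and run the usual filesystem
check, so a stat failure and a noisy fsck show up as separate results.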

> If we are
> looking for missing files, then we dump the directory structure to
> the golden output file or dump it before/after errors and compare
> that they are the same.
> 
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> 
> If corruption is not found on the first pass, why would the next 2
> passes find anything different?

Indeed, I'll drop them.
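
For the before/after comparison of the directory structure suggested above,
a rough sketch could look like this. The helper name list_tree is
hypothetical, and this assumes nothing is meant to create or remove files
between the two snapshots.

```shell
# Hypothetical helper: print a sorted list of all paths under $1,
# relative to $1, so two snapshots can be compared as plain strings.
list_tree() {
    (cd "$1" && find . | LC_ALL=C sort)
}

# Intended use inside the test (not run here):
#   before=$(list_tree "$SCRATCH_MNT")
#   ... inject I/O errors, then remount ...
#   after=$(list_tree "$SCRATCH_MNT")
#   [ "$before" = "$after" ] || _fail "directory structure changed"
```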

Thanks,
--Edwin