On 21/03/2019 20:23, Darrick J. Wong wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely loosing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
>
> How is this different from generic/475? Is there something specific to
> thin pools here (vs. using dm-error to simulate the errors)?

When I tried generic/475 it hung in unmount and never reached the data
corruption part.
Thanks for the suggestion, dm-error would be better than dm-thin, see below.

On 21/03/2019 21:26, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> Based on tests/generic/347.
>>
>> In our lab we've found that if multiple iSCSI connection errors are
>> detected (without completely loosing the iSCSI connection) then the GFS2
>> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
>> Add a test that explicitly checks for this by simulating I/O errors
>> deterministically with dm-thin.
>
> Exactly what IO errors is dm-thinp generating here? If you run it
> out of space, then it triggers ENOSPC, not EIO. That's very, very
> different to iSCSI throwing random EIO errors..

I agree that dm-error would be a better starting place than dm-thin for
this test. I'll try to modify it, see if I can get it to finish running
without hanging, and reproduce the corruption issue.
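
To make the ENOSPC-versus-EIO point concrete: the device-mapper "error"
target fails every I/O to the mapped range with EIO, which is closer to
what a flaky iSCSI link produces than a thin pool running out of space.
A minimal sketch of the two table lines involved follows. The helper
names, the /dev/sdX device and the sizes are made up for illustration,
and nothing here touches a real device:

```shell
#!/bin/bash
# Hypothetical helpers: print device-mapper table lines for a mapping of
# a given length in 512-byte sectors. They only build strings.

# Pass-through table: I/O is forwarded to the backing device at offset 0.
linear_table() {
	local sectors=$1 backing_dev=$2
	echo "0 $sectors linear $backing_dev 0"
}

# Failing table: the dm "error" target fails every I/O with EIO.
error_table() {
	local sectors=$1
	echo "0 $sectors error"
}

# Example with made-up values. On a real system the length would come
# from "blockdev --getsz" and the table would be loaded with dmsetup.
linear_table 2097152 /dev/sdX
error_table 2097152
```

Swapping the first table for the second on a live mapping is what turns
I/O errors on, and swapping back turns them off again. xfstests already
wraps this in common/dmerror (for example _dmerror_init and
_dmerror_load_error_table), so a real test would use those helpers
rather than hand-built tables.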
On 21/03/2019 21:26, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
>> +# now remount the filesystem without triggering IO errors,
>> +# and check that the filesystem is not corrupt
>> +_dmthin_cycle_mount
>> +# ls --color makes ls stat each file, which finds the corruption
>
> Not sure it always does - ISTR that in the past if the dtype
> returned indicated the type of file, then it ls would omit the stat
> just for the purposes of coloring....
>
> And, realistically, the way we find /filesystem/ corruption is to
> run fsck/repair, not iterate the directory structure.

I don't disagree. However, GFS2's fsck is very noisy and complains about
inconsistencies even on a filesystem where I can otherwise list and read
each entry correctly.
I wanted to make a clear distinction between that noise and the actual
corruption observed, so that the two bugs can be fixed independently.
Perhaps the test should first do an 'ls/stat' and, if that is fine,
then unmount and run the filesystem check as usual.

> If we are
> looking for missing files, then we dump the directory structure to
> the golden output file or dump it before/after errors and compare
> that they are the same.
>
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
>
> If corruption is not found on the first pass, why would the next 2
> passes find anything different?

Indeed, I'll drop them.

Thanks,
--Edwin
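
For reference, a rough sketch of the before/after comparison suggested
above. It assumes GNU find and stat. The helper name snapshot_tree is
hypothetical, and a throwaway temporary directory stands in for
$SCRATCH_MNT. Calling stat on every entry forces a real inode lookup,
so it does not depend on whether ls decides to skip the stat:

```shell
#!/bin/bash
# Hypothetical helper: print a sorted listing of every entry under a
# directory with its name, file type and size. stat is run on each
# entry, so the result does not rely on the directory entry's d_type.
snapshot_tree() {
	local dir=$1
	(cd "$dir" && find . -exec stat -c '%n %F %s' {} + | sort)
}

# Demonstration on a temporary directory standing in for $SCRATCH_MNT.
demo_dir=$(mktemp -d)
echo data > "$demo_dir/file1"
mkdir "$demo_dir/subdir"

before=$(snapshot_tree "$demo_dir")
# ... the error injection and remount cycle would go here ...
after=$(snapshot_tree "$demo_dir")

if [ "$before" = "$after" ]; then
	echo "directory tree unchanged"
else
	echo "directory tree differs"
	diff <(echo "$before") <(echo "$after")
fi
rm -rf "$demo_dir"
```

In a real test the two snapshots would be taken on the mounted scratch
filesystem, and the filesystem check would still run after unmount as a
separate step.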