Re: [PATCH] Add new tests/generic/536: intermittent I/O errors must not corrupt a filesystem

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 22 Mar 2019 08:26:41 +1100

On Thu, Mar 21, 2019 at 10:30:46AM +0000, Edwin Török wrote:
> Based on tests/generic/347.
> 
> In our lab we've found that if multiple iSCSI connection errors are
> detected (without completely losing the iSCSI connection) then the GFS2
> filesystem becomes corrupt due to differences in filesystem and device blocksizes.
> Add a test that explicitly checks for this by simulating I/O errors
> deterministically with dm-thin.

Exactly what IO errors is dm-thinp generating here? If you run it
out of space, then it triggers ENOSPC, not EIO. That's very, very
different to iSCSI throwing random EIO errors.

.....
> +# now remount the filesystem without triggering IO errors,
> +# and check that the filesystem is not corrupt
> +_dmthin_cycle_mount
> +# ls --color makes ls stat each file, which finds the corruption

Not sure it always does - ISTR that in the past if the d_type
returned indicated the type of file, then ls would omit the stat
just for the purposes of coloring...

And, realistically, the way we find /filesystem/ corruption is to
run fsck/repair, not iterate the directory structure. If we are
looking for missing files, then we dump the directory structure to
the golden output file or dump it before/after errors and compare
that they are the same.
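
A minimal sketch of the before/after comparison, assuming plain shell
and GNU find/stat rather than any fstests helper. dump_tree is a
hypothetical name, and the temporary directory only stands in for
$SCRATCH_MNT. It would complement an fsck/repair run, not replace it:

```shell
#!/bin/sh
# Hypothetical helper (not part of fstests): print a sorted listing of
# every entry under $1 with its type and size. stat is invoked on each
# entry explicitly, so nothing depends on what ls decides to do.
dump_tree() {
    (cd "$1" && find . -exec stat -c '%n %F %s' {} + | sort)
}

# Stand-in for the scratch mount point, for illustration only.
mnt=$(mktemp -d)
mkdir "$mnt/dir"
echo data > "$mnt/dir/file"

before=$(dump_tree "$mnt")
# In a real test, the I/O error injection and the remount cycle would
# go here.
after=$(dump_tree "$mnt")

if [ "$before" = "$after" ]; then
    echo "directory tree unchanged"
else
    echo "directory tree differs"
fi

rm -rf "$mnt"
```

Because the listing is sorted and uses paths relative to the mount
point, the same output could instead be written to the golden output
file.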

> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"
> +ls --color=always $SCRATCH_MNT/ >/dev/null || _fail "Failed to list filesystem after remount"

If corruption is not found on the first pass, why would the next 2
passes find anything different?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx