Re: [Lsf-pc] [LSF/MM TOPIC] Working towards better power fail testing

On 01/06/2015 03:53 AM, Jan Kara wrote:
> On Tue 06-01-15 08:47:55, Dave Chinner wrote:
>>> As things stand now the other devs are loath to touch any remotely
>>> exotic fs call, but that hardly seems ideal.  Hopefully a common
>>> framework for powerfail testing can improve on this.  Perhaps there
>>> are other ways we can make it easier to tell what is (well) tested,
>>> and conversely ensure that those tests are well-aligned with what
>>> real users are doing...

>> We don't actually need power failure (or even device failure)
>> infrastructure to test data integrity on failure. Filesystems just
>> need a shutdown method that stops any IO from being issued once the
>> shutdown flag is set. XFS has this and it's used by xfstests via the
>> "godown" utility to shut the filesystem down in various
>> circumstances. We've been using this for data integrity and log
>> recovery testing in xfstests for many years.
>>
>> Hence we know that if the device behaves correctly w.r.t. cache flushes
>> and FUA then the filesystem will behave correctly on power loss. We
>> don't need a device power fail simulator to tell us violating
>> fundamental architectural assumptions will corrupt filesystems....
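(For the curious: from userspace this boils down to the XFS_IOC_GOINGDOWN
ioctl. A minimal sketch, assuming xfs_io's expert-mode shutdown command,
which wraps the same ioctl as xfstests' godown helper; device and mount
paths here are placeholders:

    # force the fs into the shutdown state; all further IO submission fails
    xfs_io -x -c "shutdown" /mnt/scratch
    # cycle the mount; log recovery runs here, and the integrity checks
    # are done against the recovered filesystem
    umount /mnt/scratch
    mount /dev/sdb1 /mnt/scratch

See xfs_io(8) for the variant that flushes the log before shutting down.)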
> I think that an fs ioctl cannot easily simulate the situation where
> on-device volatile caches aren't properly flushed in all the necessary
> cases (we had bugs like this in ext3/4 in the past which were hit by
> real users).


Agreed. My dm thing was meant to expose problems where we do not wait on IO properly before writing our super, a problem we've had at least twice so far. I wanted something nice and simple that would quickly expose this kind of bug.

> I also think that simulating the device failure in a different layer is
> simpler than checking for a superblock flag in all the places where the
> filesystem submits IO (e.g. ext4 doesn't have a dedicated buffer layer
> like xfs has, and we rely on the flusher thread to flush committed
> metadata to its final location on disk, so the writeback path completely
> avoids ext4 code - it's a generic writeback of the block device mapping).
> So I like the solution with the dm target more than a fs ioctl, although
> I agree that it's more clumsy from the xfstests perspective.


So I'm working on support in xfstests' fsx to emit the proper dm messages when it does an fsync, so we can easily build a test that stress tests fsync in all the horrible ways that fsx works. Building tests around the dm target I've written is pretty simple; you just do something like

create device
mkfs device
mark the mkfs in the log
mount device
do your operations
unmount
replay log in whichever way you want and verify the contents
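In shell terms that flow looks roughly like the sketch below. (A sketch only: it assumes the names from my tree - a "log-writes" dm target driven by dmsetup plus a "replay-log" utility - and placeholder devices; all of it is subject to change.)

    # $DEV is the real device, $LOGDEV holds the write log
    dmsetup create logwrites \
        --table "0 $(blockdev --getsz $DEV) log-writes $DEV $LOGDEV"

    mkfs.xfs -f /dev/mapper/logwrites
    # drop a named mark into the log so replay can stop right after mkfs
    dmsetup message logwrites 0 mark mkfs

    mount /dev/mapper/logwrites /mnt/scratch
    # ... do your operations (fsx, fsstress, whatever) ...
    umount /mnt/scratch
    dmsetup remove logwrites

    # sanity check: replay only up to the mkfs mark onto the real device
    replay-log --log $LOGDEV --replay $DEV --end-mark mkfs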

The replay thing is accomplished by the library and some helper functions in xfstests, so it's no more awkward than what we do with dm flakey, and it gives us a bit more reproducibility and lets us check more esoteric failure conditions.
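For example, with fsx dropping a mark at every fsync, a test can replay up to each mark in turn and verify that everything fsync'ed before that point survived. Roughly (hypothetical mark names; the xfstests helpers wrap these steps):

    # replay up to the mark fsx emitted at its Nth fsync
    replay-log --log $LOGDEV --replay $DEV --end-mark fsync.3
    mount $DEV /mnt/scratch
    # ... compare file contents against what fsx recorded at that point ...
    umount /mnt/scratch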

Like Jan says, we all do things differently; we are all our own little snowflakes. I feel like a dm target is a nice solution where we can impose a certain set of rules in very little code and all agree that it's correct, and then build tests around that. Then our current fs'es will be well tested, and any new fs'es will be equally well tested, all without having to add fs-specific code that could be buggy. Thanks,

Josef


