On Wed, Aug 16, 2017 at 3:06 PM, Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
...
>
> Sorry I was travelling yesterday so I couldn't give this my full attention.
> Everything you guys do is already accomplished with dm-log-writes. If you
> look at the example scripts I've provided
>
> https://github.com/josefbacik/log-writes/blob/master/replay-individual-faster.sh
> https://github.com/josefbacik/log-writes/blob/master/replay-fsck-wrapper.sh
>
> The first initiates the replay, and points at the second script to run
> after each entry is replayed. The whole point of this stuff was to make
> it as flexible as possible. The way we use it is to replay, create a
> snapshot of the replay, mount, unmount, fsck, delete the snapshot and
> carry on to the next position in the log.
>
> There is nothing keeping us from generating random crash points; this has
> been something on my list of things to do forever. All that would be
> required would be to hold the entries between flush/fua events in memory,
> and then replay them in whatever order you deemed fit. That's the only
> functionality missing from my replay-log stuff that CrashMonkey has.
>
> The other part of this is getting user space applications to do more
> thorough checking of the consistency they expect, which I implemented
> here:
>
> https://github.com/josefbacik/fstests/commit/70d41e17164b2afc9a3f2ae532f084bf64cb4a07
>
> fsx will randomly do operations to a file, and every time it fsync()'s it
> saves its state and marks the log. Then we can go back and replay the log
> to the mark and md5sum the file to make sure it matches the saved state.
> This infrastructure was meant to be as simple as possible so the
> possibilities for crash consistency testing would be endless. One of the
> next areas we plan to use this at Facebook is application consistency, so
> we can replay the fs and verify the application works in whatever state
> the fs is at any given point.
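[The replay/snapshot/fsck loop Josef describes above might be sketched roughly as follows. This is not the actual replay-fsck-wrapper.sh; the device names, snapshot size, COW device and mount point are all illustrative, and DRY_RUN defaults to 1 so the script only prints the commands (set DRY_RUN=0 and run as root against real devices to execute them):]

```shell
# Per-entry check sketch: snapshot the replay device, mount, unmount,
# fsck, delete the snapshot, then move on to the next log position.
REPLAY_DEV=${REPLAY_DEV:-/dev/mapper/replay}  # device the log is replayed onto
COW_DEV=${COW_DEV:-/dev/loop1}                # COW store for the snapshot
SNAP=replay-snap
MNT=${MNT:-/mnt/replay}
SIZE_SECTORS=${SIZE_SECTORS:-2097152}         # size of REPLAY_DEV in 512b sectors

run() {
	if [ "${DRY_RUN:-1}" = 1 ]; then
		echo "$@"
	else
		"$@"
	fi
}

# dm snapshot table: <start> <len> snapshot <origin> <cow> <P|N> <chunksize>
run dmsetup create "$SNAP" --table \
	"0 $SIZE_SECTORS snapshot $REPLAY_DEV $COW_DEV P 8"
run mount "/dev/mapper/$SNAP" "$MNT"
run umount "$MNT"
run fsck -f -n "/dev/mapper/$SNAP"
run dmsetup remove "$SNAP"
```

[Checking the snapshot rather than the replay device itself is what lets the replay carry on from the same position afterwards: any damage done by mount/fsck lands in the throwaway COW store.]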
Josef,

FYI, while testing your patches I found that on my system (Ubuntu 16.04)
fsx was always generating the same pseudo-random sequence, even though the
printed seed was different. Replacing initstate()/setstate() with srandom()
in fsx fixed the problem for me.

When I further mixed the pid into the randomized seed, thus generating a
different sequence of events in each of the 4 parallel fsx invocations, I
started getting checksum failures on replay. I will continue to investigate
this phenomenon.

BTW, I am not sure whether it is best to use a randomized or a constant
seed for an xfstest. What is the common practice, if any?

> 3) My patches need to actually be pushed into upstream fstests. This
> would be the largest win because then all the fs developers would be
> running the tests by default.
>

FYI, I rebased your patch, added some minor cleanups, and tested over xfs:
https://github.com/amir73il/xfstests/commits/dm-log-writes

replay-log is still an external dependency, but I intend to import it as an
xfstests src test program.

I also intend to split your patch into several smaller patches:
- infrastructure
- fsx fixes
- generic test

When done with this, I will try to import the fsstress/replay test to
xfstests. For now, I will leave the btrfs-specific tests out of my work.
It should be trivial to add them once the basic infrastructure has been
merged.

I noticed that if SCRATCH_DEV is a dm target itself (linear), then
log-writes target creation fails. Is that by design? Can it be fixed? If
not, the test would have to require_scratch_not_dm_target or so.

Please let me know if you have any other tips or pointers for me.

Thanks,
Amir.
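[The pid-mixing idea Amir describes can be illustrated with a small sketch. fsx itself is C and the actual fix uses srandom()/random(); here awk's srand()/rand() stands in for those calls, and the variable names are illustrative, not from fsx:]

```shell
# XOR the pid into the base seed so that parallel invocations of the
# same program get different pseudo-random sequences, while a fixed
# (base_seed, pid) pair still reproduces the same sequence.
base_seed=${base_seed:-42}
seed=$(( base_seed ^ $$ ))        # each process derives its own seed
awk -v seed="$seed" 'BEGIN {
	srand(seed)
	for (i = 0; i < 4; i++)
		printf "%d\n", int(rand() * 1000000)
}'
```

[Note the reproducibility trade-off Amir raises: with the pid mixed in, a failing run can only be replayed exactly if the test also logs the derived seed, which is one argument for a constant seed in an xfstest.]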