Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash

"Theodore Ts'o" <tytso@xxxxxxx> · Mon, 16 Oct 2017 20:09:25 -0400

On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote:
> 
> The disk image SHOULD reflect a state on a disk after the power was
> cut in the middle of mounted fs. Then power came back on, filesystem
> was mounted, journal recovered, then filesystem was cleanly unmounted.
> At this stage, I don't expect there should be anything interesting in the
> journal.

I suspect what Ashlie was hoping for was a file system image *before*
the file system was remounted and the journal replayed (and then
truncated).  That would allow for an analysis of image right after the
simulated power cut, so it could be seen what was in the journal.

The only way to get that is to modify the test so that it aborts
before the file system is remounted.  I did some investigations where
I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the
file system was remounted to gather debugging information.  That's how
I tracked down the problem.  Unfortunately I never bothered to grab
full file system snapshot, so I can't give Ashlie what she's hoping
for.

> I believe umount call should be blocked until all writes have been flushed
> out to flakey device.

That is correct.

> Ted explained that the bug related to very specific timing of flusher
> thread vs. fallocate thread.
> I was under the impression that CrashMonkey can only reorder writes
> between recorded FLUSH requests, so I am not really sure how you intent to
> modify CrashMonkey to catch this bug.

The real issue is that what CrashMonkey is testing is given a test
trace with N CACHE FLUSH operations, given a random X such that:

       1 <= X < N

If of the writes before the Xth CACHE FLUSH are completed, and a
random set of writes between the Xth and (X+1)th CACHE FLUSH are
completed, is the file system still consistent after a journal replay.

That's a fine thing to test, although you can probably do that more
efficiently by simply looking at all of the metadata writes between
the Xth and X+1th CACHE FLUSH.  Those writes must be effective no-ops
after the journal is replayed up to the Xth cache flush.  Which is to
say, the writes must either (a) be to a data block, or (b) the
contents of the writes must match either (a) the most recent journal
entry for that block (up to the Xth cache flush), or (b) the current
state of the disk.

So if you are willing to assume knowledge of what is stored in the
journal and how ext4 works, it should be possible to implement
CrashMonkey much more effectively.

The problem that this bug exposed is different sort of problem.  To
find this bug, given the I/O stream, you can simply examine the file
system state after each journal commit.  (e.g., after each CACHE
FLUSH).  And just make sure the file system state is consistent.
There is no need to include some random set of writes from the last
commit epoch.

The sort of searching of the test space new CrashMonkey' would have to
test can't be done just by looking at the io traces.  Instead, a
workload consists of a series of micro-transactions (jbd2 handles)
which are assigned to a set of journal transactions.  Normally, which
handles get assigned to a given transaction is based on timing (we
close a transaction every 5 seconds), or based on the size of the
transaction (we limit the number of blocks in a transaction), or based
file system operations --- e.g., a fsync() will cause a transaction to
close.

This CrashMoney' would have to explore a different set of transaction
boundaries (e.g., which handles are assigned to a transaction before a
transaction closes), and whether the file system is consistent at each
transaction boundary given a each possible assignment of handles to
transactions.

It's doable, but it would have to be done by logging the data passed
to the jbd2 logging layer, and checking file system consistency at
each handle boundary.

							- Ted