Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes

Theodore Tso <tytso@xxxxxxx> · Sat, 17 May 2008 09:43:44 -0400

On Fri, May 16, 2008 at 05:35:52PM -0700, Andrew Morton wrote:
> Journal wrapping could cause the commit block to get written before its
> data blocks.  We could improve that by changing the journal layout a bit:
> wrap the entire commit rather than just its tail.  Such a change might
> be intrusive though.

Or we backport jbd2's checksum support in the commit block to jbd;
with the checksum support, if the commit block is written out-of-order
with the rest of the blocks in the commit, the commit will simply not
be recognized as valid, and we'll use the previous commit block as the
last valid commit.
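
To make the checksum idea concrete, here is a minimal userspace sketch.
It is not the jbd2 code: the struct layout, the tiny BLOCK_SIZE, and the
names commit_record, checksum_blocks() and commit_is_valid() are all
made up for illustration.  The point is only that a commit record whose
checksum covers the transaction's blocks fails validation at recovery
time if one of those blocks never reached the disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16	/* tiny blocks keep the example readable */

/* Hypothetical commit record; NOT the real jbd2 on-disk layout. */
struct commit_record {
	uint32_t sequence;	/* transaction ID */
	uint32_t checksum;	/* checksum over the transaction's blocks */
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); start with crc = 0. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	crc = ~crc;
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Checksum nblocks contiguous blocks starting at data. */
static uint32_t checksum_blocks(const uint8_t *data, size_t nblocks)
{
	assert(data != NULL);
	return crc32_update(0, data, nblocks * BLOCK_SIZE);
}

/*
 * Recovery-time check: returns 1 only if the blocks found on disk match
 * the checksum stored in the commit record.
 */
static int commit_is_valid(const struct commit_record *c,
			   const uint8_t *data, size_t nblocks)
{
	return checksum_blocks(data, nblocks) == c->checksum;
}
```

If the check fails, recovery would stop at the previous commit, which is
the behavior described above.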

So with ext4, the only problems we should actually have w/o a barrier
are:

	* Writes that were supposed to happen only *after* the journal
          commit is written get reordered *before* the journal commit.
          (But normally these writes, while _allowed_ after a journal
          commit, are not forced by the kernel.)

	* In data=ordered mode, data blocks that *should* have been
          written out before the journal commit get reordered to
          *after* the journal commit.

And in both cases, the crash has to happen *very* shortly after the
commit record has been forced out.

Thinking about this some more, the most likely way I can think of some
problems happening would be an unexpected power failure that happened
exactly right as an unmount (or filesystem snapshot) was taking place.
That's one of the few cases I can think of where a journal commit
write is followed immediately by the metadata writes.  And in
data=ordered mode, that sequence would be data writes, followed
immediately by journal writes, followed immediately by metadata
writes.

So if you want to demonstrate that this really *could* happen in real
life, without an artificially contrived, horribly fragmented journal
inode, here's a worst case scenario I would try to arrange.

(1) Pre-fill the disk 100% with some string, such as "my secret love
letters".

(2) In data=ordered mode, unpack a very large tarball, ideally with
the files and directories in a maximally pessimized order, so that
files are created in directory #1, then directory #6, then directory #4, then
directory #1, then directory #2, etc.  (AKA the anti-Reiser4 benchmark
tarball, because Hans would do kernel unpack benchmarks using
specially prepared tarballs that were in an optimal order for a
particular reiser4 filesystem hash; this is the exact opposite.  :-)

(3) With lots of dirty files in the page cache, (and for extra fun,
try this with ext4's unstable patch queue with delayed allocation
enabled), unmount the filesystem ---- and crash the system in the
middle of the unmount.

(4) Check to see if the filesystem metadata checks out cleanly using
e2fsck -f.

(5) Check all of the files on the disk to see if any of them contain
the string, "my secret love letters".
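
For step (5), a rough sketch of the scan in userspace C.  The helper
names buffer_contains() and stream_contains() are made up for this
example; running grep over the files would do the same job.  The only
subtle part is carrying a few bytes over between reads so that a marker
straddling a chunk boundary is not missed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if marker occurs anywhere in buf[0..len), else 0. */
static int buffer_contains(const unsigned char *buf, size_t len,
			   const char *marker)
{
	size_t mlen = strlen(marker);

	assert(mlen > 0);
	for (size_t i = 0; i + mlen <= len; i++)
		if (memcmp(buf + i, marker, mlen) == 0)
			return 1;
	return 0;
}

/*
 * Scan a stream in fixed-size chunks.  The last mlen-1 bytes of each
 * chunk are kept and prepended to the next read, so a marker that
 * straddles a chunk boundary is still found.
 */
static int stream_contains(FILE *fp, const char *marker)
{
	unsigned char chunk[4096];
	size_t mlen = strlen(marker);
	size_t keep = 0;	/* bytes carried over from the previous read */
	size_t got;

	assert(mlen > 0 && mlen <= sizeof(chunk) / 2);
	while ((got = fread(chunk + keep, 1, sizeof(chunk) - keep, fp)) > 0) {
		size_t have = keep + got;

		if (buffer_contains(chunk, have, marker))
			return 1;
		keep = have < mlen - 1 ? have : mlen - 1;
		memmove(chunk, chunk + have - keep, keep);
	}
	return 0;
}
```

Any file unpacked from the tarball that makes stream_contains() return 1
for "my secret love letters" is exposing stale disk contents.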

So all of this is not to argue one way or another about whether or not
barriers are a good idea.  It's really so we (and system
administrators) can make some informed decisions about choices.

One thing which we *can* definitely do is add a flag in the superblock
to change the default mount option to enable barriers on a
per-filesystem basis, settable by tune2fs/mke2fs.
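
A sketch of how such a default could be resolved at mount time.  The
flag name DEFM_BARRIER, its bit value, and the function
barriers_enabled() are hypothetical; nothing here is an existing
superblock field or mount-option parser.  The assumption is simply that
an explicit mount option should win, and the superblock flag only
supplies the default when the admin said nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit in a superblock "default mount options" word. */
#define DEFM_BARRIER	0x0001u

/* What was given on the mount command line, if anything. */
enum barrier_opt { BARRIER_UNSPECIFIED, BARRIER_OFF, BARRIER_ON };

/*
 * Decide whether barriers are enabled for this mount.  An explicit
 * mount option overrides the superblock; otherwise the superblock flag
 * (as it would be set by tune2fs/mke2fs) provides the default.
 */
static int barriers_enabled(uint32_t sb_default_opts,
			    enum barrier_opt mount_opt)
{
	assert(mount_opt == BARRIER_UNSPECIFIED ||
	       mount_opt == BARRIER_OFF || mount_opt == BARRIER_ON);

	if (mount_opt != BARRIER_UNSPECIFIED)
		return mount_opt == BARRIER_ON;
	return (sb_default_opts & DEFM_BARRIER) != 0;
}
```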

Another question is whether we can do better in our implementation of
a barrier, and the way the jbd layer uses barriers.  The way we do it
in the jbd layer is actually pretty bad:

	if (journal->j_flags & JFS_BARRIER) {
		set_buffer_ordered(bh);
		barrier_done = 1;
	}
	ret = sync_dirty_buffer(bh);
	if (barrier_done)
		clear_buffer_ordered(bh);

This means that while we are waiting for the commit record to be written
out, any other writes that are happening via buffer heads (which
includes directory operations) are getting done with strict ordering.
All set_buffer_ordered() does is make the submit_bh() done in
sync_dirty_buffer() be submitted with WRITE_BARRIER instead of WRITE.
So if we fix sync_dirty_buffer() so that there is a
_sync_dirty_buffer() which takes two arguments, we can do something
like this instead:

	ret = _sync_dirty_buffer(bh, WRITE_BARRIER);

That should hopefully reduce the hit on the benchmarks.
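
To illustrate the difference between the two calling styles, here is a
userspace mock.  It is not kernel code: struct mock_bh, mock_submit()
and both mock_sync_* functions are stand-ins invented for this sketch.
In the flag-based style the write mode is state left on the buffer until
someone clears it; in the argument-based style the caller names the mode
for that one submission and nothing lingers.

```c
#include <assert.h>
#include <stddef.h>

enum rw_mode { MODE_WRITE, MODE_WRITE_BARRIER };

/* Mock stand-in for a buffer head; not the kernel's struct. */
struct mock_bh {
	int ordered;		/* stand-in for the "ordered" buffer flag */
	enum rw_mode last_mode;	/* mode used by the most recent submission */
};

/* Mock submission: just records which mode was used. */
static void mock_submit(struct mock_bh *bh, enum rw_mode mode)
{
	bh->last_mode = mode;
}

/* Flag-based style: the mode is derived from state on the buffer. */
static void mock_sync_dirty_buffer(struct mock_bh *bh)
{
	assert(bh != NULL);
	mock_submit(bh, bh->ordered ? MODE_WRITE_BARRIER : MODE_WRITE);
}

/* Argument-based style: the caller picks the mode for this write only. */
static void mock_sync_dirty_buffer_mode(struct mock_bh *bh, enum rw_mode mode)
{
	assert(bh != NULL);
	mock_submit(bh, mode);
}
```

With the second form there is no set/clear pair around the call, so
there is no window in which the buffer carries the ordered flag.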

On disks with real tagged command queuing it might be possible to do
even better by sending the hard drives the real data dependencies,
since in fact a barrier is a stronger guarantee than what we really
need.  Unfortunately, TCQ seems to be getting obsoleted by the dumber
NCQ, where we don't get to make explicit write ordering requests to
the drive (and my drives ignored the ordering requests anyway).

    	       	  	 	     	      - Ted