Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes

Theodore Tso <tytso@xxxxxxx> · Sun, 18 May 2008 20:43:25 -0400

On Sun, May 18, 2008 at 10:03:55PM +0200, Andi Kleen wrote:
> Eric Sandeen <sandeen@xxxxxxxxxx> writes:
> >
> > Right, that was the plan.  I wasn't really going to stand there and pull
> > the plug.  :)  I'd like to get to "out of $NUMBER power-loss events
> > under this usage, I saw $THIS corruption $THISMANY times ..."
> 
> I'm not sure how good such exact numbers would do. Surely power down
> behaviour that would depend on the exact disk/controller/system
> combination? Some might be better at getting data out at 
> power less, some might be worse.

Given how rarely people have reported problems, I think it's a really
good idea to understand what exactly our exposure is for
$COMMON_HARDWARE.  And I suspect the biggest question isn't the
hardware, but the workload.  Here are the questions that I think are
worth asking:

* How often can we get corruption on a common desktop workload?  Given
  that we're mostly kernel developers, and kernbench is probably worst
  case for desktops, that's useful.

* What is the performance hit on a common desktop workload (let's use
  kernbench for consistency).

* How often can we get corruption on a hard-core enterprise
  application with lots of fsync()'s?   (i.e., postmark, et. al)

* What is the performance hit on a an fsync()-heavy workload?

I have a feeling that the likelihood of corruption when running
kernbench is minimal, but the performance hit is probably minimal as
well.  And that the corruption for potential is higher for an
fsync-heavy workload, but that's also where we are seeing the
(reported) 30% hit.

The other thing which we should consider is that I suspect we can do
much better for ext4 given that we have journal checksums.  As Chris
pointed out, right now, with barriers turned on, we are doing this:

write log blocks
flush #1
write commit block
flush #2
write metadata blocks

If we don't mind mixing bh and bio functions, we could change it to
this for ext4 (when journal checksumming is enabled)

write log blocks
write commit block
flush (via submitting an empty barrier block I/O request)
write metadata blocks

This should hopefully reduce the performance hit by half, since we're
eliminating one of the flushes.  Even more interesting would be moving
the flush until right before we attempt to write the metadata blocks,
and allowing data writes which don't require metadata updates through.
That should be safe, even in data=ordered mode.  The point is we
should think about ways that we can optimize barrier mode for ext4.
If we do this, then it may be that people will find it interesting to
mount ext3 filesystems using ext4, even without making any additional
changes, because of the better choices of speed/safety tradeoffs.

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html