Theodore Tso wrote:
On Mon, Aug 24, 2009 at 12:31:19PM -0600, Andreas Dilger wrote:
No one has gotten around to looking at this closely; I think adding a
strategically placed blkdev_issue_flush() will allow us to safely
enable this feature, but it needs careful study.
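For illustration only, a minimal sketch of what a "strategically placed" flush might look like; the helper name and its placement in the commit path are hypothetical, not existing jbd2 code, and this assumes the two-argument blkdev_issue_flush() of kernels from around that time (the signature has changed since).

#include <linux/blkdev.h>
#include <linux/jbd2.h>

/*
 * Hypothetical helper: drain the journal device's volatile write cache
 * after the commit record has been submitted, so an async commit is not
 * treated as durable while the commit block may still be sitting in the
 * drive's cache.
 */
static int flush_journal_device(journal_t *journal)
{
	/* error_sector may be NULL; returns -EOPNOTSUPP if the device
	 * does not support cache flushes. */
	return blkdev_issue_flush(journal->j_dev, NULL);
}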
I don't think that was the issue, but rather that we wanted to have
per-block checksums in order to handle the case where some block in
transaction A is causing a transaction checksum failure, yet transaction
B has already committed and begun checkpointing.
There are multiple problems that are going on here.
Adding per-block checksums is useful in providing better resilience in
the case of write errors in the journal, and providing better error
handling when we detect a checksum error in the commit block.
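To make the idea concrete, here is a purely hypothetical sketch of per-block checksumming; the tag layout and function names are invented for illustration and are not the on-disk jbd2 format (a real format would also store the fields in a fixed byte order).

#include <linux/buffer_head.h>
#include <linux/crc32.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical per-block tag: one of these per data block in the
 * descriptor, instead of a single checksum in the commit block. */
struct demo_block_tag {
	u32	t_blocknr;	/* block # on the filesystem device */
	u32	t_checksum;	/* checksum of the journal copy of the block */
};

static void demo_fill_tag(struct demo_block_tag *tag, u32 blocknr,
			  struct buffer_head *bh)
{
	tag->t_blocknr  = blocknr;
	tag->t_checksum = crc32_be(~0, (unsigned char *)bh->b_data,
				   bh->b_size);
}

/* During recovery: 0 if the journal copy of this block is intact, so one
 * bad block no longer forces us to distrust the whole transaction. */
static int demo_verify_tag(struct demo_block_tag *tag, struct buffer_head *bh)
{
	u32 csum = crc32_be(~0, (unsigned char *)bh->b_data, bh->b_size);

	return (csum == tag->t_checksum) ? 0 : -EIO;
}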
My issue with the async commit is that the checksum it relies on is
basically just a detection mechanism.
Drives will (almost always) write sequential writes to the platter in order.
Async commit lets us send things down out of order, which means that we
have a wider window of "bad state" for any given transaction...
ric
However, even if we add per-block checksums, there is
still the problem that there is logic in the jbd layer where we avoid
reusing certain blocks until we are sure the transaction has
committed. With asynchronous commits, there is no way of knowing when
it is safe to reuse those blocks.
To illustrate, consider a data block for an inode which has just been
deleted. We have code which prevents that block from being reused
until the transaction has committed; but when we use asynchronous
commits, we don't know when that will be. Currently the async commit
code assumes that once we send the commit block to be written, the
commit has happened; this opens up a race between when the commit record
(and all of the associated journal blocks) is actually written to disk
and when a data block gets reused.
Most of the time, this will cause silent corruption of a data file
that was about to be deleted right before the power failure --- but if
the block in question was part of a directory that was in the process
of being deleted, it could result in a filesystem corruption
detectable by e2fsck.
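Here is a sketch of the reuse guard being described, with invented names; nothing below is actual jbd2 code. The point is that the guard has to key off "commit block on stable storage", which async commit does not currently give us.

#include <linux/jbd2.h>
#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical record of an extent freed by a still-recent transaction. */
struct demo_freed_extent {
	unsigned long		block;	/* first freed block */
	unsigned long		count;	/* number of freed blocks */
	tid_t			tid;	/* transaction that freed them */
	struct list_head	list;
};

static bool demo_safe_to_reuse(journal_t *journal,
			       struct demo_freed_extent *fe)
{
	/*
	 * j_commit_sequence is advanced when the commit "finishes", but with
	 * async commit that only means the commit block was submitted, not
	 * that it reached the platter; that gap is exactly the race above.
	 */
	return tid_geq(journal->j_commit_sequence, fe->tid);
}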
Looking at the code, the best we can do in the short term is to write
the commit record where we currently do, but with a barrier requested, and
then wait for the commit block where we currently do. This will provide some
performance improvement, since we won't wait for all of the journal
blocks to be sent to disk before writing the commit record. However,
we still have to wait for the commit record (and all of the blocks
before it) to be committed to stable store before we can mark the
transaction as being truly committed. So it's not a true "async
commit", but it is a benefit of adding journal checksums.
It might be possible in the long term to batch together multiple
transactions in the "committing" state, instead of only allowing one
transaction to be in the committing state, and to prevent blocks from
being reused, or pinned buffer_heads from writing the contents of the
blocks to their final location on disk, until we are 100% sure
the commit block and all of the associated journal blocks really have
made it to disk. However, it's probably not worth doing this, since
the only time this would make a big difference is when we have several
transactions closing within the space of a short time; and in that
case the fsync() requires a barrier operation anyway.
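If anyone does want to experiment with that, the gating test might look something like the following; "stable_tid" is an invented counter that would only be advanced once a flush confirms the commit block and everything before it reached the media. Nothing like this exists in jbd2 today.

#include <linux/jbd2.h>
#include <linux/types.h>

/*
 * Hypothetical: allow checkpoint writeback (and block reuse) for buffers
 * belonging to transaction buffers_tid only once the newest transaction
 * whose commit block is known to be on stable storage has caught up.
 */
static bool demo_may_checkpoint(tid_t stable_tid, tid_t buffers_tid)
{
	return tid_geq(stable_tid, buffers_tid);
}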
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html