On May 25, 2007 17:58 +1000, Neil Brown wrote: > These devices would find it very hard to support BIO_RW_BARRIER. > Doing this would require keeping track of all in-flight requests > (which some, possibly all, of the above don't) and then: > When a BIO_RW_BARRIER request arrives: > wait for all pending writes to complete > call blkdev_issue_flush on all devices > issue the barrier write to the target device(s) > as BIO_RW_BARRIER, > if that is -EOPNOTSUP, re-issue, wait, flush. We noticed when testing the SLES10 kernel (which has barriers enabled by default) that ext3 write throughput went from about 170MB/s to about 130MB/s (on high-end RAID storage using no-op scheduler). The reason (as far as we could tell) is that the barriers are implemented by flushing and waiting for all previosly submitted IOs to finish, but all that ext3/jbd really care about is that the journal blocks are safely on disk. Since the journal blocks are only a small fraction of the total IO in flight, the barrier + write cache ends up being a lot worse than just doing synchronous IO with the write cache disabled because no new IO can be submitted past the barrier, and since that IO is large and contiguous it might complete much faster than the scattered metadata updates that are also being checkpointed to disk from the previous transactions. With jbd there can be both a running and a committing transaction, and multiple checkpointing transactions, and the use of barriers breaks this important optimization. If ext3 used an external journal this problem would be avoided, but then there isn't really a need for barriers in the first place, since the jbd code already will handle the wait for the commit block itself. We've got a pretty-much complete version of the ext3 journal checksumming patch that avoids the need to do the pre-commit barrier, since the checksum can verify at recovery time whether all of the transaction's blocks made it to disk or not (which is what the commit block is all about in the end). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html