On Sat, 14 Aug 2010, Chris Mason wrote:
> On Sat, Aug 14, 2010 at 04:52:10PM +0200, Christoph Hellwig wrote:
> > On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote:
> > > Also, to be clear, the block layer will guarantee that a trim/discard
> > > of block 12345 will not be reordered with respect to a write block
> > > 12345, correct?
> >
> > Right now that is what the hardbarrier does, and that's what we're
> > trying to get rid of.
>
> So btrfs will wait_on_{page/buffer/bio} to meet all ordering
> requirements.  This holds both for transaction commit and for discard.
> Reiserfs has the exception you already know about.
>
> > For XFS we prevent this by something that is
> > called the busy extent list - extents deleted by a transaction are
> > inserted into it (it's actually a rbtree not a list these days),
> > and before we can reuse blocks from it we need to ensure that it
> > is fully committed.  discards only happen off that list and extents
> > are only removed from it once the discard has finished.  I assume
> > other filesystems have a similar mechanism.

Yes, whatever works for XFS and for btrfs will be enough for swap:
all it needs is to wait on completion of the discard, just as you
enforced with BLKDEV_IFL_WAIT, before issuing more writes to the
discarded area - as it already does.

> >
> > > And on SATA devices, where discard requests are not queued requests,
> > > the ata layer will have to do a queue flush *before* the discard is
> > > sent, right?
>
> Another way to say this is we have to be 100% sure that if we write
> something after a discard, that storage will do that write after it does
> the discard.
>
> I'm not actually worried about writes before the discard, because the
> worst case for us is the drive fails to discard something it could have
> (this is the drive's problem).  Cache flushes from the FS will cover the
> case where transaction commits depend on the data going in before the
> discard.

That is a great point.
Swap does not need nor want a queue flush before discard: all that
achieves is interfere with the flow to other partitions.

Can we reason that that queue flush cannot be necessary in any case -
that anything which appears to need it for correctness must actually
be already doing serialization that makes it superfluous?

>
> I care a lot about the write after the discards though.  If the discards
> themselves become async, that's ok too as long as we have some way to do
> end_io processing on them.

Yes, same for swap.

Hugh
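
The ordering rule this thread converges on (no write may be issued to a
range until the discard of that range has completed) can be illustrated
with a toy userspace model. This is only a sketch, not kernel code: the
class and method names (BusyExtents, free, issue_discard, discard_done,
can_write) are hypothetical and do not correspond to XFS, btrfs or swap
internals, and a plain dict stands in for the rbtree mentioned above.

```python
class BusyExtents:
    """Toy model: a freed range stays 'busy' until its discard completes."""

    def __init__(self):
        # (start, length) -> "freed" or "discarding"; hypothetical bookkeeping
        self.busy = {}

    def free(self, start, length):
        # A transaction deleted this extent: it becomes busy at once.
        self.busy[(start, length)] = "freed"

    def issue_discard(self, start, length):
        # Discards are only ever issued for extents on the busy list.
        if self.busy.get((start, length)) != "freed":
            raise ValueError("extent is not on the busy list")
        self.busy[(start, length)] = "discarding"

    def discard_done(self, start, length):
        # Stand-in for end_io processing: only now may the range be reused.
        del self.busy[(start, length)]

    def can_write(self, start, length):
        # A write is allowed only if it overlaps no busy extent.
        end = start + length
        return not any(s < end and start < s + l for (s, l) in self.busy)
```

In this model the filesystem (or swap) does the serialization itself by
checking can_write() before reusing blocks, which is the sense in which
a queue flush before the discard would add nothing for correctness.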