Re: discard and barriers

Hugh Dickins <hughd@xxxxxxxxxx> · Sat, 14 Aug 2010 13:11:02 -0700 (PDT)

On Sat, 14 Aug 2010, Chris Mason wrote:
> On Sat, Aug 14, 2010 at 04:52:10PM +0200, Christoph Hellwig wrote:
> > On Sat, Aug 14, 2010 at 10:14:51AM -0400, Ted Ts'o wrote:
> > > Also, to be clear, the block layer will guarantee that a trim/discard
> > > of block 12345 will not be reordered with respect to a write to
> > > block 12345, correct?
> > 
> > Right now that is what the hardbarrier does, and that's what we're
> > trying to get rid of.
> 
> So btrfs will wait_on_{page/buffer/bio} to meet all ordering
> requirements. This holds both for transaction commit and for discard.
> Reiserfs has the exception you already know about.
> 
> > For XFS we prevent this by something that is
> > called the busy extent list - extents deleted by a transaction are
> > inserted into it (it's actually an rbtree, not a list, these days),
> > and before we can reuse blocks from it we need to ensure that the
> > transaction is fully committed.  Discards only happen off that list,
> > and extents are only removed from it once the discard has finished.
> > I assume other filesystems have a similar mechanism.

Yes, whatever works for XFS and for btrfs will be enough for swap:
all it needs is to wait on completion of the discard, just as you
enforced with BLKDEV_IFL_WAIT, before issuing more writes to the
discarded area - as it already does.
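
To make the rule concrete, here is a minimal userspace sketch of the
"busy until the discard completes" bookkeeping described above.  It is
only a model: Extent, BusyExtents and their methods are hypothetical
names invented for illustration, not XFS, btrfs or swap code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """A contiguous block range [start, start + length). Hypothetical type."""
    start: int
    length: int

    @property
    def end(self):
        return self.start + self.length

    def overlaps(self, other):
        # Two half-open ranges overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


class BusyExtents:
    """Tracks extents that were freed but whose discard has not completed."""

    def __init__(self):
        self._busy = set()

    def mark_busy(self, extent):
        # Called when a transaction frees the extent.
        self._busy.add(extent)

    def discard_done(self, extent):
        # Called only after the discard of this extent has completed.
        self._busy.discard(extent)

    def can_reuse(self, extent):
        # Writes may target an extent only if it overlaps no busy extent.
        return not any(extent.overlaps(b) for b in self._busy)
```

An allocator built on this would call can_reuse() before handing out
blocks, so a write can never be issued to a range whose discard is
still outstanding, without relying on any ordering in the block layer.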

> > 
> > > And on SATA devices, where discard requests are not queued requests,
> > > the ata layer will have to do a queue flush *before* the discard is
> > > sent, right?
> 
> Another way to say this is we have to be 100% sure that if we write
> something after a discard, that storage will do that write after it does
> the discard.
> 
> I'm not actually worried about writes before the discard, because the
> worst case for us is the drive fails to discard something it could have
> (this is the drive's problem).  Cache flushes from the FS will cover the
> case where transaction commits depend on the data going in before the
> discard. 

That is a great point.  Swap neither needs nor wants a queue flush
before discard: all that achieves is to interfere with the flow to other
partitions.  Can we reason that that queue flush cannot be necessary
in any case - that anything which appears to need it for correctness
must actually be already doing serialization that makes it superfluous?

> 
> I care a lot about the write after the discards though.  If the discards
> themselves become async, that's ok too as long as we have some way to do
> end_io processing on them.

Yes, same for swap.
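
As a rough illustration of the async case, the sketch below defers any
write that lands in a range with a discard in flight, and releases it
from the discard's completion callback.  AsyncDiscardGate and the
submit_write callable are assumptions made up for this example; they
do not correspond to any existing kernel interface.

```python
class AsyncDiscardGate:
    """Holds back writes to ranges that have a discard in flight."""

    def __init__(self, submit_write):
        # submit_write(block, data) stands in for issuing a write to storage.
        self._submit_write = submit_write
        # (start, length) of each in-flight discard -> writes deferred on it.
        self._in_flight = {}

    def submit_discard(self, start, length):
        """Record an in-flight discard and return its end_io callback."""
        key = (start, length)
        self._in_flight.setdefault(key, [])
        return lambda: self._end_io(key)

    def write(self, block, data):
        """Issue the write now, or defer it. Returns True if issued now."""
        for (start, length), deferred in self._in_flight.items():
            if start <= block < start + length:
                deferred.append((block, data))
                return False
        self._submit_write(block, data)
        return True

    def _end_io(self, key):
        # The discard has completed: release the writes that waited on it.
        for block, data in self._in_flight.pop(key, []):
            self._submit_write(block, data)
```

The point of the model is that correctness depends only on the end_io
notification: the discard itself can be asynchronous, and writes to
unrelated blocks are never held up.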

Hugh