On Tue, Nov 3, 2015 at 4:50 PM, Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> wrote:
> On Tue, Nov 03, 2015 at 04:04:13PM +1100, Dave Chinner wrote:
>> On Mon, Nov 02, 2015 at 07:53:27PM -0800, Dan Williams wrote:
>> > On Mon, Nov 2, 2015 at 1:44 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> <>
>> > > This comes back to the comments I made w.r.t. the pmem driver
>> > > implementation doing synchronous IO by immediately forcing CPU cache
>> > > flushes and barriers. It's obviously correct, but it looks like
>> > > there's going to be a major performance penalty associated with it.
>> > > This is why I recently suggested that a pmem driver that doesn't do
>> > > CPU cache writeback during IO but does it on REQ_FLUSH is an
>> > > architecture we'll likely have to support.
>> > >
>> >
>> > The only thing we can realistically delay is wmb_pmem(), i.e. the final
>> > sync waiting for data that has *left* the cpu cache. Unless/until we
>> > get an architecturally guaranteed method to write back the entire
>> > cache, or flush the cache by physical-cache-way, we're stuck with
>> > either non-temporal cycles or looping on potentially huge virtual
>> > address ranges.
>>
>> I'm missing something: why won't flushing the address range returned
>> by bdev_direct_access() during an fsync operation work? i.e. we're
>> working with exactly the same address as dax_clear_blocks() and
>> dax_do_io() use, so why can't we look up that address and flush it
>> from fsync?
>
> I could be wrong, but I don't see a reason why DAX can't use the strategy
> of writing data and marking it dirty in one step and then flushing later
> in response to fsync/msync. I think this could be used everywhere we
> write or zero data - dax_clear_blocks(), dax_io(), etc. (I believe that
> lots of the block zeroing code will go away once we have the XFS and ext4
> patches in place that guarantee we will only get written and zeroed
> extents from the filesystem in response to get_block().) I think the PMEM
> driver, lacking the ability to mark things as dirty in the radix tree,
> etc., will need to keep doing things synchronously.

Not without numbers showing the relative performance of dirtying cache
followed by flushing vs. non-temporal stores + pcommit.

> Hmm... if we go this path, though, is that an argument against moving the
> zeroing from DAX down into the driver? True, with BRD it makes things
> nice and efficient because you can zero and never flush, and the driver
> knows there's nothing else to do.
>
> For PMEM, though, you lose the ability to zero the data and then queue
> the flushing for later, as you would be able to do if you left the
> zeroing code in DAX. The benefit of this is that if you are going to
> immediately re-write the newly zeroed data (which seems common), PMEM
> will end up doing an extra cache flush of the zeroes, only to have them
> overwritten and marked as dirty by DAX. If we leave the zeroing to DAX we
> can mark it dirty once, zero it once, write it once, and flush it once.

Why do we lose the ability to flush later if the driver supports
blkdev_issue_zeroout()?

> This would make us lose the ability to do hardware-assisted flushing in
> the future that requires driver-specific knowledge, though I don't think
> that exists yet.

ioatdma has supported memset() for a while now, but I would prioritize a
non-temporal SIMD implementation first.

> Perhaps we should leave the zeroing in DAX for now to take advantage of
> the single flush, and then move it down if a driver can improve
> performance with hardware-assisted PMEM zeroing?
Not convinced. I think we should implement the driver zeroing solution
and take a look at performance.
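
For illustration, a minimal userspace sketch of the non-temporal zeroing
approach being referred to (an assumption-laden example, not the pmem
driver's actual code: it assumes SSE2 and a 16-byte-aligned buffer, and it
shows only the streaming-store + sfence part, omitting the wmb_pmem()/
pcommit durability step discussed above):

/*
 * Sketch: clear a range with non-temporal stores so the zeroes bypass the
 * CPU cache and need only a store fence, rather than dirtying cache lines
 * that must be flushed later.
 */
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Zero 'len' bytes at 16-byte-aligned 'dst' using streaming stores. */
static void zero_nt(void *dst, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	char *p = dst;
	size_t i;

	for (i = 0; i + 16 <= len; i += 16)
		_mm_stream_si128((__m128i *)(p + i), zero);

	/* Handle any sub-16-byte tail with ordinary stores. */
	for (; i < len; i++)
		p[i] = 0;

	/* Make the streamed stores globally visible before returning. */
	_mm_sfence();
}

int main(void)
{
	size_t len = 4096;
	char *buf = aligned_alloc(16, len);

	if (!buf)
		return 1;
	zero_nt(buf, len);
	printf("first byte after zeroing: %d\n", buf[0]);
	free(buf);
	return 0;
}

Whether this beats dirtying cache lines and flushing them later on fsync is
exactly the measurement being asked for above.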