On Tue 03-11-15 21:46:13, Ross Zwisler wrote: > On Tue, Nov 03, 2015 at 05:02:34PM -0800, Dan Williams wrote: > > > Hmm...if we go this path, though, is that an argument against moving the > > > zeroing from DAX down into the driver? True, with BRD it makes things nice > > > and efficient because you can zero and never flush, and the driver knows > > > there's nothing else to do. > > > > > > For PMEM, though, you lose the ability to zero the data and then queue the > > > flushing for later, as you would be able to do if you left the zeroing code in > > > DAX. The benefit of this is that if you are going to immediately re-write the > > > newly zeroed data (which seems common), PMEM will end up doing an extra cache > > > flush of the zeroes, only to have them overwritten and marked as dirty by DAX. > > > If we leave the zeroing to DAX we can mark it dirty once, zero it once, write > > > it once, and flush it once. > > > > Why do we lose the ability to flush later if the driver supports > > blkdev_issue_zeroout? > > I think that if you implement zeroing in the driver you'd need to also > flush in the driver because you wouldn't have access to the radix tree to > be able to mark entries as dirty so you can flush them later. > > As I think about this more, though, I'm not sure that having the zeroing > flush later could work. I'm guessing that the filesystem must require a > sync point between the zeroing and the subsequent follow-up writes so > that you can sync metadata for the block allocation. Otherwise you could > end up in a situation where you've got your metadata pointing at newly > allocated blocks but the new zeros are still in the processor cache - if > you lose power you've just created an information leak. Dave, Jan, does > this make sense? So the problem you describe does not exist. Thing to keep in mind is that filesystem are designed to work reliably with 'non-persistent' cache in the disk which is common these days. That's why we bother with all that REQ_FLUSH | REQ_FUA and blkdev_issue_flush() stuff after all. Processor cache is exactly that kind of the cache attached to the PMEM storage. And Dave and I try to steer you to a solution that would also treat it equally in DAX filesystems as well :). Now how the problem is currently solved: When we allocate blocks, we just record that information in a transaction in the journal. For DAX case we also submit the IO zeroing those blocks and wait for it. Now if we crash before the transaction gets committed, blocks won't be seen in the inode after a journal recovery and thus no data exposure can happen. As a part of transaction commit, we call blkdev_issue_flush() (or submit REQ_FLUSH request). We expect that to force out all the IO in volatile caches into the persistent storage. So this will also force the zeroing into persistent storage for normal disks and AFAIU if you do zeroing with non-temporal writes in pmem driver and then do wmb_pmem() in response to a flush request we get the same persistency guarantee in pmem case as well. So after a transaction commit we are guaranteed to see zeros in those allocated blocks. So the transaction commit and the corresponding flush request in particular is the sync point you speak about above but the good thing is that in most cases this will happen after real data gets written into those blocks so we save the unnecessary flush. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs