On Tue, Feb 23, 2016 at 10:10:59AM -0700, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 11:06:44PM +1100, Dave Chinner wrote:
> > On Tue, Feb 23, 2016 at 10:07:07AM +0000, Rudoff, Andy wrote:
> > Not to mention that the filesystem will convert and zero much
> > more than just a single cacheline (whole pages at minimum, could
> > be 2MB extents for large pages, etc) so the filesystem may
> > require CPU cache flushes over a much wider range of cachelines
> > than the application realises are dirty and require flushing for
> > data integrity purposes. The filesystem knows about these dirty
> > cache lines, userspace doesn't.
>
> With the current code at least dax_zero_page_range() doesn't rely

dax_clear_sectors(), actually.

> on fsync/msync from userspace to make the zeroes that it writes
> persistent. It does all the necessary flushing and wmb_pmem()
> calls itself.

Yes, that's the current implementation. We don't actually depend on
those semantics, though, and assuming we do is a demonstration of
the problems we're having right now. We could get rid of all the
synchronous cache flushes and just mark the range dirty in the
mapping radix tree and ensure that the cache flushes occur before
the conversion transaction is made durable.

And to make my point even clearer, that "flush data then
transactions" ordering is exactly how fsync is implemented. i.e.
what we've implemented right now is a basic, slow,
easy-to-make-work-correctly brute force solution. That doesn't mean
we always need to implement it this way, or that we are bound by the
way dax_clear_sectors() currently flushes cachelines before it
returns. It's just a simple implementation that provides the
ordering the *filesystem requires* to provide the correct data
integrity semantics to userspace.

pmem cache flushing is a durability mechanism, it's not a data
integrity solution.
We have to flush CPU caches to provide durability, but that alone
is not sufficient to guarantee that application data is complete
and accessible after a crash.

> I agree that this does not address your concern
> about metadata being in sync, though.

Right, and msync/fsync is the only way to guarantee that.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
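
For readers following along from the application side, the msync/fsync
point above can be sketched in userspace. This is only an illustration
under assumptions, not code from this thread: the helper name
store_and_msync() is made up, and it assumes a file on a filesystem
that supports shared mappings (DAX or not). It stores through a
MAP_SHARED mapping and then calls msync(MS_SYNC), leaving it to the
filesystem to order data flushing ahead of whatever metadata it has to
commit.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the kernel tree or this thread): copy
 * len bytes to the start of a file through a shared mapping, then call
 * msync(MS_SYNC) on the mapping.
 * Returns 0 on success, -1 on any failure.
 */
int store_and_msync(const char *path, const void *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* Set the file size so the whole mapping lies within the file. */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Plain CPU stores: the data may still sit in CPU caches here. */
	memcpy(map, buf, len);

	/*
	 * Ask the kernel to make the range durable. Userspace does not
	 * flush cachelines itself; the filesystem decides what data and
	 * metadata must be written, and in which order.
	 */
	int ret = msync(map, len, MS_SYNC);

	munmap(map, len);
	close(fd);
	return ret;
}
```

A call such as store_and_msync("/mnt/pmem/file", "hello", 5), with a
hypothetical path, returns 0 once msync() has succeeded. Note that the
helper truncates the file to len bytes, which is fine for a demo but
not something to do to an existing file you care about.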