Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 24 Feb 2016 08:47:29 +1100

On Tue, Feb 23, 2016 at 10:10:59AM -0700, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 11:06:44PM +1100, Dave Chinner wrote:
> > On Tue, Feb 23, 2016 at 10:07:07AM +0000, Rudoff, Andy wrote:
> > Not to mention that the filesystem will convert and zero much
> > more than just a single cacheline (whole pages at minimum, could
> > be 2MB extents for large pages, etc) so the filesystem may
> > require CPU cache flushes over a much wider range of cachelines
> > than the application realises are dirty and require flushing for
> > data integrity purposes. The filesystem knows about these dirty
> > cache lines, userspace doesn't.
> 
> With the current code at least dax_zero_page_range() doesn't rely

dax_clear_sectors(), actually.

> on fsync/msync from userspace to make the zeroes that it writes
> persistent.  It does all the necessary flushing and wmb_pmem()
> calls itself. 

Yes, that's the current implementation. We don't actually depend on
those semantics, though, and assuming we do is a demonstration of
the problems we're having right now. We could get rid of all the
synchronous cache flushes and just mark the range dirty in the
mapping radix tree and ensure that the cache flushes occur before
the conversion transaction is made durable. And to make my point
even clearer, that "flush data then transactions" ordering is
exactly how fsync is implemented.
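
As a rough illustration of that ordering, here is a toy userspace
model. It is a sketch only, not kernel code: every name in it
(toy_mapping, toy_zero_range(), toy_fsync() and so on) is made up for
the example, and the "flush" is just bookkeeping standing in for real
cache writeback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DIRTY 16

/* Hypothetical toy model: none of these names exist in the kernel. */
struct range { size_t start, len; };

struct toy_mapping {
    struct range dirty[MAX_DIRTY]; /* stands in for dirty tags in the radix tree */
    size_t ndirty;                 /* ranges zeroed but not yet flushed */
    size_t flushed_ranges;         /* ranges written back so far */
    bool txn_durable;              /* has the conversion transaction committed? */
    bool ordering_ok;              /* was all data flushed before the commit? */
};

/* Zero a range: record it as dirty instead of flushing synchronously. */
static void toy_zero_range(struct toy_mapping *m, size_t start, size_t len)
{
    assert(m->ndirty < MAX_DIRTY);
    m->dirty[m->ndirty++] = (struct range){ start, len };
}

/* Step 1: write back every dirty range (a CPU cache flush in the real thing). */
static void toy_flush_dirty(struct toy_mapping *m)
{
    m->flushed_ranges += m->ndirty;
    m->ndirty = 0;
}

/* Step 2: make the transaction durable and note whether anything was still dirty. */
static void toy_commit_txn(struct toy_mapping *m)
{
    m->ordering_ok = (m->ndirty == 0);
    m->txn_durable = true;
}

/* fsync-style ordering: flush data, then transactions. */
static void toy_fsync(struct toy_mapping *m)
{
    toy_flush_dirty(m);
    toy_commit_txn(m);
}
```

The point of the model is that toy_zero_range() returns without
flushing anything. Only the ordering inside toy_fsync() matters.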

i.e. what we've implemented right now is a basic, slow,
easy-to-make-work-correctly brute force solution. That doesn't mean
we always need to implement it this way, or that we are bound by the
way dax_clear_sectors() currently flushes cachelines before it
returns. It's just a simple implementation that provides the
ordering the *filesystem requires* to provide the correct data
integrity semantics to userspace.

pmem cache flushing is a durability mechanism; it's not a data
integrity solution. We have to flush CPU caches to provide
durability, but that alone is not sufficient to guarantee that
application data is complete and accessible after a crash.

> I agree that this does not address your concern
> about metadata being in sync, though.

Right, and msync/fsync is the only way to guarantee that.
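
From the application side that looks something like the following
minimal sketch. store_and_sync() is a hypothetical helper invented for
this example; the system calls it uses are the standard POSIX ones, and
it behaves the same way on an ordinary file as on a DAX mapping.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store msg at the start of path and make it durable. */
static int store_and_sync(const char *path, const char *msg)
{
    size_t len;
    char *p;
    int fd;
    int ret = -1;

    assert(path != NULL && msg != NULL);
    len = strlen(msg) + 1;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /*
     * Extending the file changes its size, which is metadata. The first
     * store into the new range can also trigger block allocation.
     */
    if (ftruncate(fd, (off_t)len) < 0)
        goto out_close;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out_close;

    memcpy(p, msg, len);

    /*
     * Hand the range to the filesystem so it can write back the data
     * and then commit the metadata that makes it reachable.
     */
    if (msync(p, len, MS_SYNC) == 0)
        ret = 0;

    munmap(p, len);
out_close:
    close(fd);
    return ret;
}
```

Flushing the cachelines behind memcpy() from userspace would cover the
data, but not the size change or any allocation that ftruncate() and
the first store caused.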

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
