On Thu, Feb 04, 2016 at 10:15:58AM +0100, Jan Kara wrote: <> > Yes, you are right. filemap_write_and_wait_range() actually doesn't > guarantee data durability. That function only means all dirty data has been > sent to storage and the storage has acknowledged them. This is noop for > PMEM. So we are perfectly fine ignoring calls to > filemap_write_and_wait_range(). What guarantees data durability are only > ->fsync() and ->sync_fs() calls. But some code could get upset by seeing > that filemap_write_and_wait_range() didn't actually get rid of dirty pages > (in some special cases like inode eviction or similar). That's why I'd > choose one of the two options for consistency: > > 1) Treat inode indexes to flush as close to dirty pages as we can - this > means inode is dirty with all the tracking associated with it, radix tree > entries have dirty tag, we get rid of these in ->writepages(). We are close > to this currently. I think we're actually pretty far from this, at least for v4.5. The issue is that I don't think we can safely clear radix tree dirty entries during the DAX code that is called via ->writepages(). To do this correctly we would need to also mark the PTE as clean so that when the userspace process next writes to their mmap mapping we would get a new fault to make the page writable. This would allow us to re-dirty the DAX entry in the radix tree. I implemented code to do this in v2 of my set, but ripped it out in v3: https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2) https://lkml.org/lkml/2015/12/8/583 (DAX fsync v3) The race that compelled this removal is described here: https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html (sorry for all the links) Anyway, for v4.5 I think whatever solution we come up with must be okay with an ever growing list of dirty radix tree entries, as we currently have. Are you aware of a reason why this won't work, or was the cleaning of the radix tree entries just a good goal to have? (And I agree it is a good goal, I just don't know how to do it safely.) > 2) Completely avoid the dirty tracking and writeback code and reimplement > everything in DAX code. > > Because some hybrid between these is IMHO bound to provoke weird (and very > hard to find) bugs. > > > > So revisiting the decision I see two options: > > > > > > 1) Move the DAX flushing code from filemap_write_and_wait() into > > > ->writepages() fs callback. There the filesystem can provide all the > > > information it needs including bdev, get_block callback, or whatever. > > > > This seems fine as long as we add it to ->fsync as well since ->writepages is > > never called in that path, and as long as we are okay with skipping DAX > > writebacks on hole punch, truncate, and block relocation. > > Look at ext4_sync_file() -> filemap_write_and_wait_range() -> > __filemap_fdatawrite_range() -> do_writepages(). Except those nrpages > 0 > checks which would need to be changed. Ah, cool, I missed this path. Thank you for setting me straight. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html