On Sat 06-02-16 23:43:49, Ross Zwisler wrote:
> On Sun, Feb 07, 2016 at 10:40:53AM +1100, Dave Chinner wrote:
> > On Fri, Feb 05, 2016 at 03:25:00PM -0700, Ross Zwisler wrote:
> > > On Thu, Feb 04, 2016 at 09:29:57PM +0100, Jan Kara wrote:
> > > > I think the changes aren't very intrusive so we can feed them in during
> > > > the RC phase and frankly, you have to move to using ->writepages() anyway
> > > > to make sync(2) work reliably.
> > >
> > > I've been looking into this a bit more, and I don't think we actually want
> > > to have DAX flushing in response to sync(2) or syncfs(2). Here's some text
> > > from the BUGS section of the sync(2) man page:
> > >
> > > BUGS
> > >     According to the standard specification (e.g., POSIX.1-2001), sync()
> > >     schedules the writes, but may return before the actual writing is done.
> > >     However, since version 1.3.20 Linux does actually wait. (This still
> > >     does not guarantee data integrity: modern disks have large caches.)
> > >
> > > Based on this I don't believe that it is a requirement that sync and syncfs
> > > actually flush the data durably to media before they return - they just
> > > need to make sure it has been sent to the device, which is always true for
> > > all writes to PMEM via DAX, even without any calls to
> > > dax_writeback_mapping_range().
> >
> > That's an assumption we've already pointed out as being platform dependent,
> > not to mention also being undesirable from a performance point of view
> > (writes are 10x slower into pmem than into the page cache using the same
> > physical memory!).
> >
> > Further, the ordering constraints of modern filesystems mean that they
> > guarantee durability of data written back when sync() is run. i.e. ext4,
> > XFS, btrfs, etc. all ensure that sync guarantees data integrity is
> > maintained across all the data and metadata written back during sync().
> >
> > e.g. for XFS we do file size update transactions at IO completion. sync()
> > triggers writeback of data, then runs a transaction that modifies the file
> > size so that the data is valid on disk. We absolutely need to ensure that
> > this transaction is durable before sync() returns, otherwise we lose that
> > data if a failure occurs immediately after sync() returns because the size
> > update is not on disk.
> >
> > Users are right to complain when data written before a sync() call is not
> > accessible after a crash/reboot because we failed to make it durable.
> > That's why ->sync_fs(wait) is called at the end of the sync()
> > implementation - it enables filesystems to ensure all data and metadata
> > written during the sync processing is on durable storage.
> >
> > IOWs, we can't language-lawyer or weasel-word our way out of providing
> > durability guarantees for sync().
>
> To be clear, we are only talking about user data that was mmaped. All I/Os
> initiated by the filesystem for metadata, and all user-issued I/Os will be
> durable on media before the I/O is completed. Nothing we do or don't do for
> fsync, msync, sync or syncfs() will break filesystem metadata guarantees.

Yes, but be careful here: so far ext4 (and AFAIK XFS as well) has depended on
the fact that blkdev_issue_flush() (or a WRITE_FLUSH request) will flush all
the volatile caches for the storage. We do such calls / writes from the
transaction commit code to make sure that all metadata *and data* writes reach
permanent storage before we make some metadata changes visible. This is
necessary so that we don't expose uninitialized block contents after a power
failure (think of making an i_size increase / block allocation visible after a
power failure without the data being durable). For PMEM, we ignore
blkdev_issue_flush() / WRITE_FLUSH requests on the grounds that writes are
done bypassing caches (the case for metadata IO and IO done via dax_do_io()).
Currently we are fine even for mmap because both ext4 & XFS zero out allocated
blocks using non-cached writes, so even though the latest data needn't be on
persistent storage we won't expose stale data. But it is a catch that may hit
us in the future. So from this POV, flushing caches from ->writepages() would
also make PMEM give the filesystem guarantees more similar to those of
ordinary block storage.
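To make the ordering concrete, here is a minimal user-space analogy of what
transaction commit has to guarantee (just an illustration -- the file names
and the explicit fsync() calls stand in for the journal commit and the
volatile-cache flush, this is not the actual ext4/XFS code):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static void die(const char *msg) { perror(msg); exit(1); }

	int main(void)
	{
		const char data[] = "payload that extends the file";
		int dfd = open("data.img", O_WRONLY | O_CREAT | O_APPEND, 0644);
		int mfd = open("meta.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

		if (dfd < 0 || mfd < 0)
			die("open");

		/* 1) Write the new data blocks. */
		if (write(dfd, data, sizeof(data)) != (ssize_t)sizeof(data))
			die("write data");

		/*
		 * 2) Make them durable first.  In the kernel this is the
		 *    pre-commit writeback plus the volatile-cache flush
		 *    issued from transaction commit.
		 */
		if (fsync(dfd))
			die("fsync data");

		/*
		 * 3) Only now commit the "metadata" that makes the data
		 *    visible -- the analogue of the i_size update / block
		 *    allocation.  If the order were reversed, a crash between
		 *    the two steps could leave a visible size pointing at
		 *    uninitialized blocks.
		 */
		char rec[64];
		int n = snprintf(rec, sizeof(rec), "size += %zu\n", sizeof(data));

		if (write(mfd, rec, n) != n)
			die("write meta");
		if (fsync(mfd))
			die("fsync meta");

		close(dfd);
		close(mfd);
		return 0;
	}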
> I think the question then is: "Do users currently rely on data written to an
> mmap to be durable on media after a sync() or syncfs(), in spite of the BUGS
> section quoted above?"
>
> If the answer for this is "yes", then I agree that we need to enhance the
> current DAX fsync code to cover the sync use case.

IMO the answer is yes.

> However, as stated previously I don't think that it is as easy as just moving
> the call to dax_writeback_mapping_range() to the sync() call path. On my
> system sync() is called every 5 seconds, and flushing every page of every DAX
> mmap that has ever been dirtied every 5 seconds is unreasonable.

It is not sync() that is called but background writeback. But you are correct
that ->writepages() will get called for a dirty inode every 5 seconds.

> If we need to support the sync() use case with our DAX mmap flushing code, I
> think the only way to do that is to clear out radix entries as we flush them,
> so that the sync calls that happen every 5 seconds are only flushing new
> writes.

I agree, and I forgot about the issue that we currently don't clear out dirty
radix tree entries for DAX. I also agree that clearing of written radix tree
entries needs to be implemented before flushing caches from ->writepages() is
practical.

> I think we're actually pretty far from this, at least for v4.5. The issue is
> that I don't think we can safely clear radix tree dirty entries during the
> DAX code that is called via ->writepages(). To do this correctly we would
> need to also mark the PTE as clean so that when the userspace process next
> writes to their mmap mapping we would get a new fault to make the page
> writable. This would allow us to re-dirty the DAX entry in the radix tree.

Yup, so I agree that clearing of radix tree entries is probably not 4.5
material.

> I implemented code to do this in v2 of my set, but ripped it out in v3:
>
> https://lkml.org/lkml/2015/11/13/759 (DAX fsync v2)
> https://lkml.org/lkml/2015/12/8/583 (DAX fsync v3)
>
> The race that compelled this removal is described here:
>
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/004057.html
>
> If this is really where we need to be, that's fine, we'll have to figure out
> some way to defeat the race and get this code into v4.6.

I think we need to find a way to safely clear radix tree entries anyway. And I
agree that without the page lock it is not easy. I see two possibilities here:

1) Protect dax_writeback_mapping_range() with an exclusive
EXT4_I(inode)->i_mmap_sem and provide a new callback called from
wp_page_reuse() to mark the index dirty in the radix tree (that would take the
shared i_mmap_sem). This is relatively expensive but it should work. Or

2) We could devote one bit in the radix tree exceptional entry as a bitlock
equivalent to the page lock (a rough sketch of the idea is below). This will
likely scale better but it is more intrusive.
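For possibility 2), the core idea is just to steal one bit of the exceptional
entry and take it with a cmpxchg before writeback or a fault touches the
entry. A standalone toy version of that (hypothetical ENTRY_LOCK bit, plain
C11 atomics instead of the real radix tree slot helpers, and a trylock instead
of sleeping on a wait queue):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define ENTRY_LOCK	0x1UL	/* hypothetical lock bit stolen from the entry */

	/* One radix tree slot holding an encoded (exceptional) entry. */
	typedef _Atomic uintptr_t slot_t;

	/* Try to take the entry lock; returns false if somebody else holds it. */
	static bool trylock_entry(slot_t *slot)
	{
		uintptr_t old = atomic_load(slot);

		if (old & ENTRY_LOCK)
			return false;
		return atomic_compare_exchange_strong(slot, &old, old | ENTRY_LOCK);
	}

	static void unlock_entry(slot_t *slot)
	{
		atomic_fetch_and(slot, ~ENTRY_LOCK);
	}

	int main(void)
	{
		/* Pretend 0x1000 encodes the sector/size info of a dirty DAX entry. */
		slot_t slot = 0x1000;

		if (trylock_entry(&slot)) {
			/*
			 * With the entry locked, writeback could flush the range and
			 * clear the dirty tag without racing with a page fault that
			 * wants to re-dirty (or free) the same entry.
			 */
			printf("entry locked: %#lx\n", (unsigned long)atomic_load(&slot));
			unlock_entry(&slot);
		}
		return 0;
	}

The real implementation would of course manipulate the slot under the
mapping's tree lock and sleep rather than spin or bail out; the sketch only
shows that a single bit in the entry can play the role the page lock plays for
page cache pages.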
								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR