On Wed, Feb 03, 2016 at 11:46:11AM +0100, Jan Kara wrote: > On Tue 02-02-16 10:34:56, Ross Zwisler wrote: > > On Tue, Feb 02, 2016 at 09:10:24AM -0800, Dan Williams wrote: > > > On Tue, Feb 2, 2016 at 8:46 AM, Jan Kara <jack@xxxxxxx> wrote: > > > > On Tue 02-02-16 08:33:56, Dan Williams wrote: > > > >> On Tue, Feb 2, 2016 at 3:17 AM, Jan Kara <jack@xxxxxxx> wrote: > > > >> [..] > > > >> > I see, thanks for explanation. So I'm OK with changing what is stored in > > > >> > the radix tree to accommodate this use case but my reservation that we IHMO > > > >> > have other more pressing things to fix remains... > > > >> > > > >> We don't need pfns in the radix to support XFS RT configurations. > > > >> Just call get_blocks() again and use the sector, or am I missing > > > >> something? > > > > > > > > You are correct. But if you decide to pay the cost of additional > > > > get_block() call, you only need the dirty tag in the radix tree and nothing > > > > else. So my understanding was that the whole point of games with radix tree > > > > is avoiding this extra get_block() calls for fsync(). > > > > > > > > > > DAX-fsync() is already a potentially expensive operation to cover data > > > durability guarantees for DAX-unaware applications. A DAX-aware > > > application is going to skip fsync, and the get_blocks() cost, to do > > > cache management itself. > > > > > > Willy pointed out some other potential benefits, assuming a suitable > > > replacement for the protections afforded by the block-device > > > percpu_ref counter can be found. However, optimizing for the > > > DAX-unaware-application case seems the wrong motivation to me. > > > > Oh, no, the primary issue with calling get_block() in the fsync path isn't > > performance. It's that we don't have any idea what get_block() function to > > call. > > > > The fault handler calls all come from the filesystem directly, so they can > > easily give us an appropriate get_block() function pointer. But the > > dax_writeback_mapping_range() calls come from the generic code in > > mm/filemap.c, and don't know what get_block() to pass in. > > > > During one iteration I had the calls to dax_writeback_mapping_range() > > happening in the filesystem fsync code so that it could pass in get_block(), > > but Dave Chinner pointed out that this misses other paths in the filesystem > > that need to have things flushed via a call to filemap_write_and_wait_range(). > > Let's clear this up a bit: The problem with using ->fsync() method is that > it doesn't get called for sync(2). We could use ->sync_fs() to flush caches > in case of sync(2) (that's what's happening for normal storage) but the > problem with PMEM is that "flush all cached data" operation effectively > means iterate through all modified pages and we didn't want to implement > this for DAX fsync code. > > So we have decided to do cache flushing for DAX at a different point - mark > inodes which may have writes cached as dirty and use writeback code for the > cache flushing. But looking at it now, we have actually chosen a wrong > place to do the flushing in the writeback path - note that sync(2) writes > data via __writeback_single_inode() -> do_writepages() and thus doesn't > even get to filemap_write_and_wait(). > > So revisiting the decision I see two options: > > 1) Move the DAX flushing code from filemap_write_and_wait() into > ->writepages() fs callback. There the filesystem can provide all the > information it needs including bdev, get_block callback, or whatever. > > 2) Back out even further and implement own tracking and iteration of inodes > to write. > > So far I still think 2) is not worth the complexity (although it would > bring DAX code closer to how things behave with standard storage) so I > would go for 1). Jan, just to clarify, are you proposing this change for v4.5 in the remaining RCs as an alternative to the get_bdev() patch? https://lkml.org/lkml/2016/2/2/941 Or can we move forward with get_bdev(), and try and figure out this new way of calling fsync/msync for v4.6? My main concern here is that changing how the DAX sync code gets called will affect all three filesystems as well as MM, and that it might be too much for RC inclusion... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html