On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote: > On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote: > > Which means applications that should "just work" without > > modification on DAX are now subtly broken and don't actually > > guarantee data is safe after a crash. That's a pretty nasty > > landmine, and goes against *everything* we've claimed about using > > DAX with existing applications. > > > > That's wrong, and needs fixing. > > I agree that we need to fix fsync as well, and that the fsync solution could > be used to implement msync if we choose to go that route. I think we might > want to consider keeping the msync and fsync implementations separate, though, > for two reasons. > > 1) The current msync implementation is much more efficient than what will be > needed for fsync. Fsync will need to call into the filesystem, traverse all > the blocks, get kernel virtual addresses from those and then call > wb_cache_pmem() on those kernel addresses. I think this is a necessary evil > for fsync since you don't have a VMA, but for msync we do and we can just > flush using the user addresses without any fs lookups. Yet you're ignoring the fact that flushing the entire range of the relevant VMAs may not be very efficient. It may be a very large mapping with only a few pages that need flushing from the cache, but you still iterate the mappings flushing GB ranges from the cache at a time. We don't need struct pages to track page dirty state. We already have a method for doing this that does not rely on having a struct page and can be used for efficient lookup of exact dirty ranges. i.e the per-page dirty tag that is kept in the mapping radix tree. It's fine grained, and extremely efficient in terms of lookups to find dirty pages. Indeed, the mapping tree tags were specifically designed to avoid this "fsync doesn't know what range to flush" problem for normal files. That same problem still exists here for msync - these patches are just hitting it with a goddamn massive hammer "because it is easy" rather than attempting to do the flushing efficiently. > 2) I believe that the near-term fsync code will rely on struct pages for > PMEM, which I believe are possible but optional as of Dan's last patch set: > > https://lkml.org/lkml/2015/8/25/841 > > I believe that this means that if we don't have struct pages for PMEM (becuase > ZONE_DEVICE et al. are turned off) fsync won't work. I'd be nice not to lose > msync as well. I don't think this is an either-or situation. If we use struct pages for PMEM, then fsync will work without modification as DAX will need to use struct pages and hence we can insert them in the mapping radix tree directly and use the normal set/clear_page_dirty() calls to track dirty state. It will give us fine grained flush capability and we won't want msync() to be using the big hammer if we can avoid it. If we make the existing pfn-based DAX code track dirty pages via mapping radix tree tags right now, then we allow fsync to work by reusing most of the infrastructure we already have. That means DAX and fsync will work exactly the same regardless of how we index/reference PMEM in future and we won't have to come back and fix it all up again. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html