On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> Jan Kara <jack@xxxxxxx> writes:
>
> > [Added ext4, xfs, and linux-api folks to CC for the interface discussion]
> >
> > On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
> >> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> >> > Hello,
> >> >
> >> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> >> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
> >> > mean time certain customer of ours started poking into /proc/<pid>/smaps
> >> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
> >> > flags, the application just fails to start complaining that DAX support is
> >> > missing in the kernel. The question now is how do we go about this?
> >>
> >> OK naive question from me, how do we want an application to be able to
> >> check if it is running on a DAX mapping?
> >
> > The question from me is: Should application really care? After all DAX is
> > just a caching decision. Sure it affects performance characteristics and
> > memory usage of the kernel but it is not a correctness issue (in particular
> > we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
> > supported for current mapping). And in the future the details of what we do
> > with DAX mapping can change - e.g. I could imagine we might decide to cache
> > writes in DRAM but do direct PMEM access on reads. And all this could be
> > auto-tuned based on media properties. And we don't want to tie our hands by
> > specifying too narrowly how the kernel is going to behave.
>
> For read and write, I would expect the O_DIRECT open flag to still work,
> even for dax-capable persistent memory.  Is that a contentious opinion?

Not contentious at all, because that's the way it currently works.

FYI, XFS decides what to do with a read (and similarly a write) like
this:

	if (IS_DAX(inode))
		ret = xfs_file_dax_read(iocb, to);
	else if (iocb->ki_flags & IOCB_DIRECT)
		ret = xfs_file_dio_aio_read(iocb, to);
	else
		ret = xfs_file_buffered_aio_read(iocb, to);

Neither DAX nor O_DIRECT on pmem uses the page cache - the only
difference between the DAX read/write path and the O_DIRECT read/write
path is where the memcpy() into the user buffer is done. For DAX it's
done in the fsdax layer, for O_DIRECT it's done in the pmem block
driver.

> So, what we're really discussing is the behavior for mmap.

Yes.

> MAP_SYNC
> will certainly ensure that the page cache is not used for writes.  It
> would also be odd for us to decide to cache reads.  The only issue I can
> see is that perhaps the application doesn't want to take a performance
> hit on write faults.  I haven't heard that concern expressed in this
> thread, though.
>
> Just to be clear, this is my understanding of the world:
>
> MAP_SYNC
> - file system guarantees that metadata required to reach faulted-in file
>   data is consistent on media before a write fault is completed.  A
>   side-effect is that the page cache will not be used for
>   writably-mapped pages.

I think you are conflating the current implementation with the API
requirements - MAP_SYNC doesn't guarantee anything about page cache
use. The man page definition simply says "supported only for files
supporting DAX" and that it provides certain data integrity
guarantees. It does not define the implementation.

We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour, because
that's the only way we can currently provide the required behaviour to
userspace. However, if a filesystem can use the page cache to provide
the required functionality, then it's free to do so. i.e. if someone
implements a pmem-based page cache, MAP_SYNC data integrity could be
provided /without DAX/ by any filesystem using that persistent page
cache. i.e. MAP_SYNC really only requires mmap() of CPU addressable
persistent memory - it does not require DAX. Right now, however, the
only way to get this functionality is through a DAX capable filesystem
on DAX capable storage.
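As a concrete illustration of the EOPNOTSUPP behaviour Jan mentioned
above, here's a minimal userspace sketch (mine, not from the thread -
the file path is a placeholder, and it assumes a libc that exposes
MAP_SYNC and MAP_SHARED_VALIDATE in <sys/mman.h>):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* placeholder path - any file on a DAX capable fs */
		int fd = open("/mnt/pmem/file", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/*
		 * MAP_SYNC is only valid together with
		 * MAP_SHARED_VALIDATE; with plain MAP_SHARED the kernel
		 * silently ignores unknown flags and the app would never
		 * see the error.
		 */
		void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				  MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
		if (addr == MAP_FAILED) {
			/*
			 * errno is EOPNOTSUPP if MAP_SYNC cannot be
			 * provided for this mapping.
			 */
			perror("mmap");
			close(fd);
			return 1;
		}

		/*
		 * Stores through addr are durable once the CPU caches
		 * are flushed - no msync()/fsync() needed.
		 */
		munmap(addr, 4096);
		close(fd);
		return 0;
	}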
And, FWIW, a separate persistent write cache is pretty much how NOVA
maintains DAX w/ COW - it COWs new pages in pmem and attaches them to a
special per-inode cache on the clean->dirty transition. Then on data
sync, background writeback or crash recovery, it migrates them from the
cache into the file map proper via atomic metadata pointer swaps. IOWs,
NOVA provides the correct MAP_SYNC semantics by using a separate
persistent per-inode write cache to provide the correct crash recovery
semantics for MAP_SYNC.

> and what I think Dan had proposed:
>
> mmap flag, MAP_DIRECT
> - file system guarantees that page cache will not be used to front storage.
>   storage MUST be directly addressable.  This *almost* implies MAP_SYNC.
>   The subtle difference is that a write fault /may/ not result in metadata
>   being written back to media.

Similar to O_DIRECT, these semantics do not allow userspace apps to
replace msync/fsync with CPU cache flush operations. So any application
that uses this mode still needs to use either MAP_SYNC or issue
msync/fsync for data integrity.

If the app is using MAP_DIRECT, then what do we do if the filesystem
can't provide the required semantics for that specific operation? In
the case of O_DIRECT, we fall back to buffered IO because it has the
same data integrity semantics as O_DIRECT and will always work. It's
just slower and consumes more memory, but the app continues to work
just fine.

Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
without using the page cache seems extremely problematic to me. e.g. an
app already has an open MAP_DIRECT file, and a third party reflinks it
or dedupes it and the fs has to fall back to buffered IO to do COW
operations. This isn't the app's fault - the kernel should just fall
back transparently to using the page cache for the MAP_DIRECT app and
keep working, just like it would if it was using O_DIRECT read/write.

The point I'm trying to make here is that O_DIRECT is a /hint/, not a
guarantee, and it's done that way to prevent applications from being
presented with transient, potentially fatal error cases because a
filesystem implementation can't do a specific operation through the
direct IO path.

IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a guarantee.
Over time we'll end up with filesystems that can guarantee that
MAP_DIRECT always uses DAX, in the same way we have filesystems that
guarantee O_DIRECT will always be O_DIRECT (e.g. XFS). But if we decide
that MAP_DIRECT must guarantee no page cache will ever be used, then we
are basically saying "filesystems won't provide MAP_DIRECT even in
common, useful cases because they can't provide MAP_DIRECT in all
cases." And that doesn't seem like a very good solution to me.

> and this is what I think you were proposing, Jan:
>
> madvise flag, MADV_DIRECT_ACCESS
> - same semantics as MAP_DIRECT, but specified via the madvise system call

Seems to be the equivalent of fcntl(F_SETFL, O_DIRECT).

Makes sense to have both MAP_DIRECT and MADV_DIRECT_ACCESS to me - one
is an init time flag, the other is a run time flag.
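To make that analogy concrete, here's a minimal sketch (mine, not from
the thread) of the run time side on the O_DIRECT end - flipping an
already-open fd to direct IO with fcntl():

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Set O_DIRECT on an existing fd at run time - the same
	 * init-time vs run-time split as MAP_DIRECT vs
	 * MADV_DIRECT_ACCESS. Returns -1 with errno set on failure.
	 */
	static int set_odirect(int fd)
	{
		int flags = fcntl(fd, F_GETFL);

		if (flags < 0)
			return -1;
		return fcntl(fd, F_SETFL, flags | O_DIRECT);
	}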
Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx