On Wed, Oct 17, 2018 at 6:05 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Oct 17, 2018 at 02:44:55PM -0700, Dan Williams wrote:
> > On Wed, Oct 17, 2018 at 2:31 PM Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> > >
> > > Eric Sandeen <sandeen@xxxxxxxxxxx> writes:
> > >
> > > > I've been thinking about the per-inode stuff a bit, and while I don't know
> > > > how to resolve some of the trickier issues, at least the expected behavior
> > > > seems like something we can narrow down and specify.
> > > >
> > > > Because it's an on-disk flag (in xfs today, in any case) it seems that
> > > > the only sane behavior to expect is either/or, i.e.:
> > > >
> > > > Mount option: All files always dax, per-inode flags ignored (or rejected)
> > > > Per-inode: Mount option cannot be specified; only inodes explicitly flagged are dax
> > > >
> > > > Think about it; what would mount-option-plus-per-inode mean? We have
> > > > no "negative" dax flag, so while mount-option-with-flag surely means
> > > > "dax", what the heck does mount-option-without-flag mean, and how is it
> > > > distinguishable from mount option only?
> > > >
> > > > I submit that flags can only have meaning w/o the fs-wide mount option
> > > > enabled, so the question of "should we hard fail mount -o dax for devices
> > > > that cannot support it" seems to be orthogonal to the per-inode question.
> > > >
> > > > i.e. mount -o dax really can only mean "I want dax on everything" and so
> > > > again, I think we probably need to fail the mount if that can't be honored.
> > >
> > > I hate to even open up this can of worms, but what about killing the dax
> > > mount option?
> > >
> > > To quote Christoph:
> > >   How does an application "make use of DAX"? What actual user visible
> > >   semantics are associated with a file that has this flag set?
> > >
> > > We're already talking about making caching decisions automatically, so
> > > does DAX even mean anything at that point?
> > > If the storage and the file system support it, enable it.
> > >
> > > From what we've seen so far, applications want:
> > > 1) to be able to make data persistent from userspace
> > >    For this, we have MAP_SYNC.
> > > 2) to determine whether or not page cache will be used
> > >    For this, we have O_DIRECT for read/write access, and MAP_SYNC for
> > >    mmap access (and maybe a third option coming, we'll see).
> >
> > As Jan has said, it's not safe to assume that 'no page cache' is
> > implied with MAP_SYNC. It's a side effect, not a contract, of the
> > current implementation.
>
> Even MAP_DIRECT shouldn't mean "no page cache". O_DIRECT is a hint,
> not a guarantee, and so it may very well use the page cache if it
> needs to (as I've just explained in detail in a different thread).
>
> > > The only thing users gain from a mount option is the ability to turn OFF
> > > dax. I suppose there might be a use case that wants this, but I'm not
> > > aware of it.
> >
> > I think we're stuck with it as many scripts would break if it ever
> > went completely away. However, we could mark it deprecated / ignored
>
> I don't really care that much about this - it is still marked
> experimental.
>
> That said, deprecation is the best way forward here if we are going
> to remove the mount option. We've done this for other XFS mount
> options recently (e.g. barrier/nobarrier) where the functionality is
> now fully baked into the filesystem and there's no user option to
> control it anymore.
>
> What we really need is a document describing the expected behaviour
> of filesystems on dax-capable storage. Let's nail down exactly what
> we need to do to pull DAX out of the experimental state before we
> start changing things. We've been doing things in a very ad-hoc way
> for a while now, and we're not really converging on an endpoint where we
> can say "we're done, have at it".
>
> I think we need to decide on:
>
>   - default filesystem behaviour on dax-capable block devices
>   - what information about DAX do applications actually need? What
>     makes sense to provide them with that information?
>   - how to provide hints to the kernel for desired behaviour
>       - on-disk inode flags, or something else?
>       - dax/nodax mount options or root dir inode flags become default
>         global hints?
>       - is a single hint flag sufficient or do we also need an
>         explicit "do not use dax" flag?
>   - behaviour of MAP_SYNC w.r.t. non-DAX filesystems that can provide
>     required MAP_SYNC semantics
>   - behaviour of MAP_DIRECT - hint like O_DIRECT or guarantee?
>   - default read/write path behaviour of dax-capable block devices
>       - automatically bypass the pagecache if bdev is capable?
>   - default mmap behaviour on dax capable devices
>       - use dax always?
>   - DAX vs get_user_pages_longterm
>       - turns off DAX dynamically?
>   - how do DAX-enabled filesystems interact with page fault capable
>     hardware? Can we allow DAX in those cases?
>
> I'm sure there's a heap more we need to document and nail down.
> There's a lot of stuff to sort out before we start hammering on
> random bits of code....

Nice, yes, I'll add some more:

- Is MADV_DIRECT_ACCESS a hint or a requirement?
- How does the kernel communicate the effective mode of a mapping
  taking into account madvise(), inode flags, mount options, and / or
  default fs behavior? New madvice() syscall?
- What is the behavior of dax in the presence of reflink'd extents?
  Just failing seems the 'experimental' behavior. What to do about
  page->index when a page belongs to more than 1 file via reflink?
- Is there ever a case to force disable dax operation? To date we've
  only ever thought about interfaces to force *enable* dax operation
- The virtio-pmem use case wants dax mappings but requires an explicit
  fsync() instead of MAP_SYNC to flush software buffers. It's a DAX
  sub-set; should it have its own name?
- DAX operation is loosely tied to block devices. There have been
  discussions of mounting filesystems on /dev/dax devices directly.
  Should we take that to its logical conclusion and support a
  block-layer-less conversion of dax-capable file systems?
- Willy has proposed that the XArray cache file-offset-to-physical-address
  lookups; currently it only tracks dirty mapping state
- The NVDIMM sub-system tracks badblocks, but the filesystem currently
  only finds out about them late, when it attempts dax_direct_access().
  Applications want to be able to list files+offsets that have
  experienced media corruption.

> > provided we had a way for applications to query and override if DAX is
> > enabled. I also think it's important to keep separate the dax-mmap
> > behavior from the dax-read/write behavior. dax-mmap is where an
> > application would make different decisions if it can get a mapping
> > without page cache,
>
> The functionality people keep saying "requires DAX" really doesn't -
> what it really requires is that mmap() exposes filesystem tracked
> pmem in a CPU addressable memory range. DAX is not the only way to
> do that - a filesystem with a pmem-based persistent page cache can
> provide MAP_SYNC semantics to userspace without being a DAX
> filesystem.

*nod*
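
To make that last point concrete, here is a rough sketch of how an
application might ask for MAP_SYNC semantics without caring whether the
filesystem is "DAX" underneath. MAP_SHARED_VALIDATE and MAP_SYNC are the
existing mmap(2) flags; probe_map_sync() and the enum are names made up
for this example, and the fallback #defines assume the asm-generic flag
values, so treat it as an illustration rather than a recommended
interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers (assumption: asm-generic values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/* Hypothetical result type for this sketch. */
enum sync_probe {
	PROBE_SYNC_OK,          /* mapping with MAP_SYNC succeeded */
	PROBE_SYNC_UNSUPPORTED, /* kernel or filesystem refused MAP_SYNC */
	PROBE_ERROR             /* unrelated failure (open, permissions, ...) */
};

/*
 * Try to map the first 'len' bytes of 'path' with MAP_SYNC and report
 * whether the request was honoured. The mapping is torn down again;
 * a real application would keep it.
 */
static enum sync_probe probe_map_sync(const char *path, size_t len)
{
	assert(path != NULL && len > 0);

	int fd = open(path, O_RDWR);
	if (fd < 0)
		return PROBE_ERROR;

	/* MAP_SHARED_VALIDATE makes the kernel reject flags it cannot honour. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	int saved_errno = errno;
	close(fd);

	if (p != MAP_FAILED) {
		munmap(p, len);
		return PROBE_SYNC_OK;
	}

	/*
	 * EOPNOTSUPP: the file cannot provide MAP_SYNC.
	 * EINVAL: kernels that predate MAP_SHARED_VALIDATE reject the
	 * unknown mapping type.
	 */
	if (saved_errno == EOPNOTSUPP || saved_errno == EINVAL)
		return PROBE_SYNC_UNSUPPORTED;

	return PROBE_ERROR;
}
```

Note that a successful probe only says the mapping provides MAP_SYNC
semantics. Per the discussion above, it says nothing about whether the
page cache is involved.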