On Tue, Jun 20, 2017 at 10:18:24PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >> A per-inode
> >> count of the number of live DAX mappings or of the number of struct
> >> file instances that have requested DAX would work here.
> >
> > For what purpose does this serve? The reflink invalidates all the
> > existing mappings, so the next write access causes a fault and then
> > page_mkwrite is called and the shared extent will get COWed....
>
> The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
> right), except that IMO an API that doesn't involve making a change to
> an inode that sticks around would be nice. The inode flag has the
> unfortunate property that, if two different programs each try to set
> the flag, mmap, write, and clear the flag, they'll stomp on each other
> and risk data corruption.
>
> I admit I'm now thoroughly confused as to exactly what XFS does here
> -- does FS_XFLAG_DAX persist across unmount/mount?

Yes, it does. i.e. DAX on XFS does not rely on a naive fs-wide mount
option. You can have applications on pmem filesystems use either DAX
or normal IO based on directory/inode flags. If something doesn't work
with DAX, just remove the DAX flags from the directories/inodes, and
it will safely and transparently switch to page-cache based IO.

<snip>

> Here's the overall point I'm trying to make: unprivileged programs
> that want to write to DAX files with userspace commit mechanisms
> (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
> privilege, and with reasonably clean APIs. Ideally they could do this
> to any file they have write access to.

The privilege argument is irrelevant now - it was /suggested/
initially as a way of preventing people from shooting themselves in
the foot based on the immutable file model. It's clear that's not
desired, and it's not a show stopper.
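For anyone following along, here is a minimal sketch of how the
per-inode flag discussed above is toggled from userspace. It uses the
generic FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls from <linux/fs.h>.
The helper names dax_xflags() and set_dax_flag() are made up for
illustration, and whether a given filesystem or kernel accepts the
change on a particular inode is not something this sketch checks.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR */

#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000  /* fallback for older uapi headers */
#endif

/* Illustrative helper: return xflags with the DAX bit set or cleared. */
uint32_t dax_xflags(uint32_t xflags, int enable)
{
    return enable ? (xflags | FS_XFLAG_DAX)
                  : (xflags & ~(uint32_t)FS_XFLAG_DAX);
}

/*
 * Illustrative wrapper: toggle the DAX flag on an open file or
 * directory. Returns 0 on success, -1 with errno set on failure.
 *
 * This is a plain read-modify-write of inode state, so nothing here
 * serialises two programs doing set/mmap/write/clear on the same file.
 */
int set_dax_flag(int fd, int enable)
{
    struct fsxattr fsx;

    /* Read the current extended attributes so other flags are kept. */
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;

    fsx.fsx_xflags = dax_xflags(fsx.fsx_xflags, enable);

    /* Write the updated flags back to the inode. */
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Because the flag lives in the inode, a change made this way sticks
around until something clears it again, which is the property being
debated above.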
> Programs that want to write to
> mmapped files, DAX or otherwise, without latency spikes due to
> .page_mkwrite should be able to opt in to a heavier weight mechanism.
> But these two issues are someone independent, and I think they should
> be solved separately.

You seem to be calling "fdatasync on every page fault" the
"lightweight" option. That's the brute-force-with-big-hammer solution
- it's most definitely not lightweight, as every page fault has extra
overhead to call ->fsync(). Sure, the API is simple, but the runtime
overhead is significant.

The lightweight *runtime* option is to set up the file in such a way
that there is never any extra overhead at page fault time. This is
what immutable extent maps provide. Indeed, because the mappings never
change, you could use hardware dirty tracking if you wanted, as
there's no need to look up the filesystem to do writeback: everything
needed for writeback was mapped at page fault time. This "map first
and then just write when you need to" is *exactly how swap files
work*.

Even if you are considering the complexity of the APIs, it's hardly
"heavyweight" when it only requires a single call to fallocate()
before mmap() to set up the immutable extents on the file...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx