On Thu, Aug 30, 2018 at 02:49:07PM -0400, Mike Snitzer wrote:
> On Thu, Aug 30 2018 at 5:30am -0400, Jan Kara <jack@xxxxxxx> wrote:
> > Well, changing device from DAX-capable to DAX-incapable is problematic for
> > filesystem on top of it as well. Filesystems simply don't expect this
> > feature of a device can change so they would fail in unexpected ways. Also
> > PFNs from the pmem (DAX-capable) device that are already mapped to user page
> > tables won't magically become unmapped so those processes will still have
> > DAX access to those areas of the device.
....
> As you point out, how are the upper layers (e.g. filesystems) supposed
> to reliably cope with this runtime switch from DAX to non-DAX access?

They can't right now. There are unsolved races between page faults,
invalidations and changing the file operations to/from DAX
dynamically. This is the entire problem facing the dynamic per-inode
DAX on/off flag - if it happens globally to the filesystem without
warning, then the filesystem is screwed.

To support the block device changing between DAX and non-DAX
dynamically, the filesystem needs to first invalidate the entire
filesystem cache, eject all cached inodes from memory, and any cached
metadata that is using DAX, etc., to clear out all the DAX mappings it
has. And it has to do it without racing with new page faults or IO
that might map new DAX pages.

And I'm ignoring the fact that we can't eject referenced inodes
(i.e. open files) from the inode cache and so we currently cannot
safely change the DAX state on such files. That's a blocker right now.

Once we can safely change the DAX state of open files, we've got to
co-ordinate the block device state change with the filesystem - the
filesystem-wide invalidation has to be done before the block device
can start the change of state, and the filesystem must remain
completely stopped until the block device has completed its change of
state.
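As a rough illustration of the ordering constraint described above, here
is a toy Python model. It is not kernel code and does not reflect any
real kernel interface: every class, method and field name in it is
hypothetical, and it only captures the sequencing (stop, invalidate,
change device state, restart) plus the open-file blocker.

```python
class ToyFilesystem:
    """Hypothetical stand-in for a filesystem on a DAX-capable device."""

    def __init__(self):
        self.stopped = False       # True while no new faults/IO may start
        self.dax_mappings = set()  # inode numbers with live DAX mappings
        self.open_inodes = set()   # referenced inodes (i.e. open files)

    def page_fault(self, ino):
        # A fault arriving while the filesystem runs maps a new DAX page.
        if self.stopped:
            raise RuntimeError("filesystem is stopped; fault must wait")
        self.dax_mappings.add(ino)

    def stop_and_invalidate(self):
        # Referenced inodes cannot be ejected, so this is a hard blocker.
        if self.open_inodes:
            raise RuntimeError("cannot eject referenced inodes")
        self.stopped = True        # stop first, so no new mappings appear
        self.dax_mappings.clear()  # then drop every existing DAX mapping

    def restart(self):
        self.stopped = False


class ToyBlockDevice:
    """Hypothetical block device whose DAX capability can be toggled."""

    def __init__(self, dax=True):
        self.dax = dax

    def set_dax(self, fs, enabled):
        # Only legal while the filesystem is stopped and fully invalidated.
        if not fs.stopped or fs.dax_mappings:
            raise RuntimeError("filesystem not quiesced")
        self.dax = enabled


def change_dax_state(fs, dev, enabled):
    # Invalidation completes before the device starts changing state, and
    # the filesystem stays stopped until the device has finished.
    fs.stop_and_invalidate()
    try:
        dev.set_dax(fs, enabled)
    finally:
        fs.restart()
```

In this model, calling change_dax_state() with any entry in
fs.open_inodes raises before the device is touched, which mirrors the
blocker above. The model has no concurrency, so it cannot show the page
fault races themselves.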
So AFAICT this ends up being "stop the world instantly, eject the
world from memory, rebuild the world from scratch, start the world
again".

Freezing the filesystem doesn't stop the world - we can still do read
IO and page faults, so that doesn't prevent page fault races with the
invalidation leaving DAX references in the page cache. Hence we
currently have no valid "stop the world" mechanism in the kernel other
than unmount, which we can't do while there are open files.

What about MAP_SYNC applications? If we turn off DAX with those
applications still running, we silently break them and users won't
know until the system loses power and they see data corruption after
the system comes back. However, applications SEGVing unpredictably
because of "transparent" storage state changes is almost as
unfriendly.

Dynamically changing block device DAX support seems like a
non-starter to me. At least, it's a non-starter until we add a lot
more infrastructure, solve a bunch of really hard problems and define
how active userspace-controlled DAX-only features behave when DAX is
no longer available...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx