On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote: > On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote: > > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote: > >> My other objection is that the syscall intentionally leaks a reference > >> to the file. This means it needs overflow protection and it probably > >> shouldn't ever be allowed to use it without privilege. > > > > We only hold the one reference while S_DAXFILE is set, so I think the > > protection is there, and per Dave's original proposal this requires > > CAP_LINUX_IMMUTABLE. > > > >> Why can't the underlying issue be easily fixed, though? Could > >> .page_mkwrite just make sure that metadata is synced when the FS uses > >> DAX? > > > > Yes, it most definitely could and that idea has been floated. > > > >> On a DAX fs, syncing metadata should be extremely fast. <sigh> This again.... Persistent memory means the *I/O* is fast. It does not mean that *complex filesystem operations* are fast. Don't forget that there's an shitload of CPU that gets burnt to make sure that the metadata is synced correctly. Do that /synchronously/ on *every* write page fault (which, BTW, modify mtime, so will always have dirty metadata to sync) and now you have a serious performance problem with your "fast" DAX access method. And that's before we even consider all the problems with running sync operations in page fault context.... > >> This > >> could be conditioned on an madvise or mmap flag if performance might > >> be an issue. As far as I know, this change alone should be > >> sufficient. > > > > The hang up is that it requires per-fs enabling as it needs to be > > careful to manage mmap_sem vs fs journal locks for example. I know the > > in-development NOVA [1] filesystem is planning to support this out of > > the gate. ext4 would be open to implementing it, but I think xfs is > > cold on the idea. Christoph originally proposed it here [2], before > > Dave went on to propose immutable semantics. > > Hmm. Given a choice between a very clean API that works without > privilege but is awkward to implement on XFS and an awkward-to-use > API, I'd personally choose the former. Yup, you have the choice of a clean kernel API that will be substantially slower than the existing "dirty page" tracking and having the app run fsync() when necessary, or having to do a little more work in a library routine that preallocates a file and sets a flag on it? The apps will use the library API, not the kernel API, so who really cares if there's a few steps to setting up the file state appropriately? > Dave, even with the lock ordering issue, couldn't XFS implement > MAP_PMEM_AWARE by having .page_mkwrite work roughly like this: > > if (metadata is dirty) { > up_write(&mmap_sem); > sync the metadata; > down_write(&mmap_sem); > return 0; /* retry the fault */ > } else { > return whatever success code; > } How do you know that there is dependent filesystem metadata that needs syncing at a level that you can safely manipulate the mmap_sem? And how, exactly, do you do this without races? It'd be trivial to DOS such retryable DAX faults simply by touching the file in a tight loop in a separate process... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx