On Sun, Oct 8, 2017 at 8:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: Thanks for the review Dave. > On Fri, Oct 06, 2017 at 03:35:49PM -0700, Dan Williams wrote: >> MAP_DIRECT is an mmap(2) flag with the following semantics: >> >> MAP_DIRECT >> When specified with MAP_SHARED_VALIDATE, sets up a file lease with the >> same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease >> is broken when a "lease breaker" attempts to write(2), change the block >> map (fallocate), or change the size of the file. Otherwise the mechanism >> of a lease break is identical to the typical lease break case where the >> lease needs to be removed (munmap) within the number of seconds >> specified by /proc/sys/fs/lease-break-time. If the lease holder fails to >> remove the lease in time the kernel will invalidate the mapping and >> force all future accesses to the mapping to trigger SIGBUS. >> >> In addition to lease break timeouts causing faults in the mapping to >> result in SIGBUS, other states of the file will trigger SIGBUS at fault >> time: >> >> * The file is not DAX capable >> * The file has reflinked (copy-on-write) blocks >> * The fault would trigger the filesystem to allocate blocks >> * The fault would trigger the filesystem to perform extent conversion >> >> In other words, MAP_DIRECT expects and enforces a fully allocated file >> where faults can be satisfied without modifying block map metadata. >> >> An unprivileged process may establish a MAP_DIRECT mapping on a file >> whose UID (owner) matches the filesystem UID of the process. A process >> with the CAP_LEASE capability may establish a MAP_DIRECT mapping on >> arbitrary files >> >> ERRORS >> EACCES Beyond the typical mmap(2) conditions that trigger EACCES >> MAP_DIRECT also requires the permission to set a file lease. >> >> EOPNOTSUPP The filesystem explicitly does not support the flag >> >> SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that >> might require block-map updates, or the lease timed out and the >> kernel invalidated the mapping. >> >> Cc: Jan Kara <jack@xxxxxxx> >> Cc: Arnd Bergmann <arnd@xxxxxxxx> >> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx> >> Cc: Christoph Hellwig <hch@xxxxxx> >> Cc: Dave Chinner <david@xxxxxxxxxxxxx> >> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> >> Cc: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> >> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> >> Cc: Jeff Layton <jlayton@xxxxxxxxxxxxxxx> >> Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx> >> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> >> --- >> fs/xfs/Kconfig | 2 - >> fs/xfs/xfs_file.c | 102 +++++++++++++++++++++++++++++++++++++++ >> include/linux/mman.h | 3 + >> include/uapi/asm-generic/mman.h | 1 >> 4 files changed, 106 insertions(+), 2 deletions(-) >> >> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig >> index f62fc6629abb..f8765653a438 100644 >> --- a/fs/xfs/Kconfig >> +++ b/fs/xfs/Kconfig >> @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL >> >> config XFS_LAYOUT >> def_bool y >> - depends on EXPORTFS_BLOCK_OPS >> + depends on EXPORTFS_BLOCK_OPS || FS_DAX >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c >> index ebdd0bd2b261..e35518600e28 100644 >> --- a/fs/xfs/xfs_file.c >> +++ b/fs/xfs/xfs_file.c >> @@ -40,12 +40,22 @@ >> #include "xfs_iomap.h" >> #include "xfs_reflink.h" >> >> +#include <linux/mman.h> >> #include <linux/dcache.h> >> #include <linux/falloc.h> >> #include <linux/pagevec.h> >> +#include <linux/mapdirect.h> >> #include <linux/backing-dev.h> >> >> static const struct vm_operations_struct xfs_file_vm_ops; >> +static const struct vm_operations_struct xfs_file_vm_direct_ops; >> + >> +static inline bool >> +is_xfs_map_direct( >> + struct vm_area_struct *vma) >> +{ >> + return vma->vm_ops == &xfs_file_vm_direct_ops; >> +} > > Namespacing (xfs_vma_is_direct) and whitespace damage. Will fix. > >> >> /* >> * Clear the specified ranges to zero through either the pagecache or DAX. >> @@ -1008,6 +1018,26 @@ xfs_file_llseek( >> return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); >> } >> >> +static int >> +xfs_vma_checks( >> + struct vm_area_struct *vma, >> + struct inode *inode) > > Exactly what are we checking for - function name doesn't tell me, > and there's no comments, either? Ok, I'll improve this. > >> +{ >> + if (!is_xfs_map_direct(vma)) >> + return 0; >> + >> + if (!is_map_direct_valid(vma->vm_private_data)) >> + return VM_FAULT_SIGBUS; >> + >> + if (xfs_is_reflink_inode(XFS_I(inode))) >> + return VM_FAULT_SIGBUS; >> + >> + if (!IS_DAX(inode)) >> + return VM_FAULT_SIGBUS; > > And how do we get is_xfs_map_direct() set to true if we don't have a > DAX inode or the inode has shared extents? So, this was my way of trying to satisfy the request you made here: https://lkml.org/lkml/2017/8/11/876 i.e. allow MAP_DIRECT on non-dax files to enable a use case of freezing the block-map to examine which file extents are linked. If you don't want to use MAP_DIRECT for this, we can move these checks to mmap time. > >> + >> + return 0; >> +} >> + >> /* >> * Locking for serialisation of IO during page faults. This results in a lock >> * ordering of: >> @@ -1024,6 +1054,7 @@ __xfs_filemap_fault( >> enum page_entry_size pe_size, >> bool write_fault) >> { >> + struct vm_area_struct *vma = vmf->vma; >> struct inode *inode = file_inode(vmf->vma->vm_file); > > You missed this vmf->vma.... > > ..... >> >> +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT) >> + >> +STATIC int >> +xfs_file_mmap_validate( >> + struct file *filp, >> + struct vm_area_struct *vma, >> + unsigned long map_flags, >> + int fd) >> +{ >> + struct inode *inode = file_inode(filp); >> + struct xfs_inode *ip = XFS_I(inode); >> + struct map_direct_state *mds; >> + >> + if (map_flags & ~(XFS_MAP_SUPPORTED)) >> + return -EOPNOTSUPP; >> + >> + if ((map_flags & MAP_DIRECT) == 0) >> + return xfs_file_mmap(filp, vma); >> + >> + file_accessed(filp); >> + vma->vm_ops = &xfs_file_vm_direct_ops; >> + if (IS_DAX(inode)) >> + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; > > And if it isn't a DAX inode? what is MAP_DIRECT supposed to do then? In the non-DAX case it just takes the FL_LAYOUT file lease... although we could also just have an fcntl for that purpose. The use case of just freezing the block map does not need a mapping. >> + mds = map_direct_register(fd, vma); >> + if (IS_ERR(mds)) >> + return PTR_ERR(mds); >> + >> + /* flush in-flight faults */ >> + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); >> + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); > > Urk. That's nasty. And why is it even necessary? Please explain why > this is necessary in the comment, because it's not at all obvious to > me... This is related to your other observation about i_mapdcount and adding an iomap_can_allocate() helper. I think I can clean both of these up by using a call to break_layout(inode, false) and bailing in ->iomap_begin() if it returns EWOULDBLOCK. This would also fix the current problem that allocating write-faults don't start the lease break process.