On Fri, Oct 06, 2017 at 03:35:49PM -0700, Dan Williams wrote: > MAP_DIRECT is an mmap(2) flag with the following semantics: > > MAP_DIRECT > When specified with MAP_SHARED_VALIDATE, sets up a file lease with the > same lifetime as the mapping. Unlike a typical F_RDLCK lease this lease > is broken when a "lease breaker" attempts to write(2), change the block > map (fallocate), or change the size of the file. Otherwise the mechanism > of a lease break is identical to the typical lease break case where the > lease needs to be removed (munmap) within the number of seconds > specified by /proc/sys/fs/lease-break-time. If the lease holder fails to > remove the lease in time the kernel will invalidate the mapping and > force all future accesses to the mapping to trigger SIGBUS. > > In addition to lease break timeouts causing faults in the mapping to > result in SIGBUS, other states of the file will trigger SIGBUS at fault > time: > > * The file is not DAX capable > * The file has reflinked (copy-on-write) blocks > * The fault would trigger the filesystem to allocate blocks > * The fault would trigger the filesystem to perform extent conversion > > In other words, MAP_DIRECT expects and enforces a fully allocated file > where faults can be satisfied without modifying block map metadata. > > An unprivileged process may establish a MAP_DIRECT mapping on a file > whose UID (owner) matches the filesystem UID of the process. A process > with the CAP_LEASE capability may establish a MAP_DIRECT mapping on > arbitrary files > > ERRORS > EACCES Beyond the typical mmap(2) conditions that trigger EACCES > MAP_DIRECT also requires the permission to set a file lease. > > EOPNOTSUPP The filesystem explicitly does not support the flag > > SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that > might require block-map updates, or the lease timed out and the > kernel invalidated the mapping. > > Cc: Jan Kara <jack@xxxxxxx> > Cc: Arnd Bergmann <arnd@xxxxxxxx> > Cc: Jeff Moyer <jmoyer@xxxxxxxxxx> > Cc: Christoph Hellwig <hch@xxxxxx> > Cc: Dave Chinner <david@xxxxxxxxxxxxx> > Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> > Cc: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> > Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> > Cc: Jeff Layton <jlayton@xxxxxxxxxxxxxxx> > Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> > --- > fs/xfs/Kconfig | 2 - > fs/xfs/xfs_file.c | 102 +++++++++++++++++++++++++++++++++++++++ > include/linux/mman.h | 3 + > include/uapi/asm-generic/mman.h | 1 > 4 files changed, 106 insertions(+), 2 deletions(-) > > diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig > index f62fc6629abb..f8765653a438 100644 > --- a/fs/xfs/Kconfig > +++ b/fs/xfs/Kconfig > @@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL > > config XFS_LAYOUT > def_bool y > - depends on EXPORTFS_BLOCK_OPS > + depends on EXPORTFS_BLOCK_OPS || FS_DAX > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index ebdd0bd2b261..e35518600e28 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -40,12 +40,22 @@ > #include "xfs_iomap.h" > #include "xfs_reflink.h" > > +#include <linux/mman.h> > #include <linux/dcache.h> > #include <linux/falloc.h> > #include <linux/pagevec.h> > +#include <linux/mapdirect.h> > #include <linux/backing-dev.h> > > static const struct vm_operations_struct xfs_file_vm_ops; > +static const struct vm_operations_struct xfs_file_vm_direct_ops; > + > +static inline bool > +is_xfs_map_direct( > + struct vm_area_struct *vma) > +{ > + return vma->vm_ops == &xfs_file_vm_direct_ops; > +} Namespacing (xfs_vma_is_direct) and whitespace damage. > > /* > * Clear the specified ranges to zero through either the pagecache or DAX. > @@ -1008,6 +1018,26 @@ xfs_file_llseek( > return vfs_setpos(file, offset, inode->i_sb->s_maxbytes); > } > > +static int > +xfs_vma_checks( > + struct vm_area_struct *vma, > + struct inode *inode) Exactly what are we checking for - function name doesn't tell me, and there's no comments, either? > +{ > + if (!is_xfs_map_direct(vma)) > + return 0; > + > + if (!is_map_direct_valid(vma->vm_private_data)) > + return VM_FAULT_SIGBUS; > + > + if (xfs_is_reflink_inode(XFS_I(inode))) > + return VM_FAULT_SIGBUS; > + > + if (!IS_DAX(inode)) > + return VM_FAULT_SIGBUS; And how do we get is_xfs_map_direct() set to true if we don't have a DAX inode or the inode has shared extents? > + > + return 0; > +} > + > /* > * Locking for serialisation of IO during page faults. This results in a lock > * ordering of: > @@ -1024,6 +1054,7 @@ __xfs_filemap_fault( > enum page_entry_size pe_size, > bool write_fault) > { > + struct vm_area_struct *vma = vmf->vma; > struct inode *inode = file_inode(vmf->vma->vm_file); You missed this vmf->vma.... ..... > > +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_DIRECT) > + > +STATIC int > +xfs_file_mmap_validate( > + struct file *filp, > + struct vm_area_struct *vma, > + unsigned long map_flags, > + int fd) > +{ > + struct inode *inode = file_inode(filp); > + struct xfs_inode *ip = XFS_I(inode); > + struct map_direct_state *mds; > + > + if (map_flags & ~(XFS_MAP_SUPPORTED)) > + return -EOPNOTSUPP; > + > + if ((map_flags & MAP_DIRECT) == 0) > + return xfs_file_mmap(filp, vma); > + > + file_accessed(filp); > + vma->vm_ops = &xfs_file_vm_direct_ops; > + if (IS_DAX(inode)) > + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE; And if it isn't a DAX inode? what is MAP_DIRECT supposed to do then? > + mds = map_direct_register(fd, vma); > + if (IS_ERR(mds)) > + return PTR_ERR(mds); > + > + /* flush in-flight faults */ > + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); > + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); Urk. That's nasty. And why is it even necessary? Please explain why this is necessary in the comment, because it's not at all obvious to me... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx