From: Ira Weiny <ira.weiny@xxxxxxxxx>

GUP longterm pins of non-pagecache file system pages (FS DAX) are
currently disallowed because they are unsafe.

The danger of pinning these pages comes from the fact that hole punch
and/or truncate of those files can leave the pages mapped and pinned by
a user space process while DAX has potentially allocated those pages to
other processes.  Attempts to hold those pages in reserve defeat the
purpose of allowing FS truncate/hole punch should the user truly desire
those operations.

That said, most users who are mapping FS DAX pages for long term pin
purposes (such as RDMA) are not going to want to deallocate these pages
while they are in use.  To do so would mean the application would lose
data.  So the use case for allowing these operations on such pages
seems limited.

However, the kernel must protect itself and users from potential
mistakes and/or malicious user space code.

Rather than disable long term pins, as is done now, allow users who
know they are going to be pinning this memory to alert the file system
of this intention.  Furthermore, allow them to be alerted if the pages
they have pinned are going away so that they can react.

Example user space pseudocode for a user using RDMA and reacting to a
lease break of this type would look like this:

lease_break()
{
...
	if (sigio.fd == rdma_fd) {
		ibv_dereg_mr(mr);
		close(rdma_fd);
	}
}

foo()
{
	rdma_fd = open(...);
	/* install the handler before the lease can be broken */
	sigaction(SIGIO, ...  lease_break ...);
	fcntl(rdma_fd, F_SETLEASE, F_LONGTERM);
	ptr = mmap(rdma_fd, ...);
	mr = ibv_reg_mr(ptr, ...);
}

Follow on patches present 2 possible solutions for what to do should an
application not take this lease:

1) Failure to take the lease results in a failure of ibv_reg_mr() (or
   any other pin system call which results in GUP being called).

2) Failure to take the lease results in GUP taking the lease on behalf
   of the user.

In both of these cases a failure to react and unpin the memory of the
file in question will result in a SIGBUS being sent to the application
holding the lease.  This is slightly different behavior from what would
happen if an application were to write to a hole punched area of a
file, but it still seems reasonable given that this operation is not
allowed at all currently.

This patch (1 of X) exports the FL_LONGTERM lease type to user space
(as F_LONGTERM) and implements taking this lease on a file.

Follow on patches implement failing a longterm GUP as well as sending a
SIGBUS.

The last patch in the series removes the restriction of failing
FOLL_LONGTERM for DAX operations.

A follow on series (not yet completed) will remove the FOLL_LONGTERM
restrictions within GUP for calls such as get_user_pages_locked()
because vma access is no longer required.

RFC NOTEs / questions:

Should F_LONGTERM be a "flag" of some sort OR'ed in with F_RDLCK?

It was considered to use F_WRLCK vs F_RDLCK to indicate whether the
user was going to be writing to or reading from the file in question.
However, in the end this does not matter as far as the FS is concerned.
While internally we treat this as an F_RDLCK type, the user should
consider this an F_LONGTERM lease type which has no concept of read or
write.

FL_LAYOUT was not used because an FL_LAYOUT lease break in XFS would
have created a "chicken and the egg" problem.  FL_LONGTERM must be
broken and the ref counts of the devmap pages dropped to 1 before
FL_LAYOUT could be broken.

Not using FL_LAYOUT also makes it very clear we don't have issues
conflicting with NFS code, although I don't think that there would have
been any conflict other than the XFS lease break order.

The name "FL_LONGTERM" is probably not the best name for this feature.
Alternative names are welcome.
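
For reference, the lease setup from the pseudocode above might be
sketched in real C roughly as follows.  This is only an illustration
under stated assumptions, not part of the series: F_LONGTERM is defined
locally with the value from the uapi hunk of this patch, the helper
names take_longterm_lease() and lease_was_broken() are hypothetical,
and the RDMA calls are left out.  F_SETLEASE, F_SETSIG and si_fd are
the existing lease interfaces documented in fcntl(2).

```c
#define _GNU_SOURCE		/* for F_SETLEASE and F_SETSIG */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * ASSUMPTION: F_LONGTERM exists only in this RFC.  The value is copied
 * from the uapi hunk of this patch; a kernel without the series is
 * expected to reject it.
 */
#ifndef F_LONGTERM
#define F_LONGTERM 16
#endif

/* fd whose lease is being broken, or -1; written by the handler only */
static volatile sig_atomic_t broken_fd = -1;

static void lease_break(int sig, siginfo_t *info, void *uctx)
{
	(void)sig;
	(void)uctx;
	broken_fd = info->si_fd;	/* async-signal-safe work only */
}

/* Hypothetical helper: returns 0 on success, -1 with errno set. */
int take_longterm_lease(int fd)
{
	struct sigaction sa;

	assert(fd >= 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = lease_break;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	/* Install the handler before the lease can be broken. */
	if (sigaction(SIGIO, &sa, NULL) == -1)
		return -1;

	/* F_SETSIG (even with SIGIO) makes the kernel fill in si_fd. */
	if (fcntl(fd, F_SETSIG, SIGIO) == -1)
		return -1;

	return fcntl(fd, F_SETLEASE, F_LONGTERM) == -1 ? -1 : 0;
}

/* Hypothetical helper: nonzero once a break for fd was signalled. */
int lease_was_broken(int fd)
{
	return broken_fd == fd;
}
```

A caller would open the DAX file, call take_longterm_lease(fd) before
mmap() and the pin (ibv_reg_mr() in the example above), and, once
lease_was_broken(fd) reports a break, deregister the MR and close the
fd.  On a kernel without this series the F_SETLEASE call should simply
fail, typically with EINVAL.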
---
 fs/locks.c                       | 38 +++++++++++++++++++++++++++-----
 include/linux/fs.h               |  1 +
 include/uapi/asm-generic/fcntl.h |  2 ++
 3 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 4b66ed91fb53..8ea1c5713e6a 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -610,7 +610,8 @@ static const struct lock_manager_operations lease_manager_ops = {
 /*
  * Initialize a lease, use the default lock manager operations
  */
-static int lease_init(struct file *filp, long type, struct file_lock *fl)
+static int lease_init(struct file *filp, long type, unsigned int flags,
+		      struct file_lock *fl)
 {
 	if (assign_type(fl, type) != 0)
 		return -EINVAL;
@@ -620,6 +621,8 @@ static int lease_init(struct file *filp, long type, struct file_lock *fl)
 
 	fl->fl_file = filp;
 	fl->fl_flags = FL_LEASE;
+	if (flags & FL_LONGTERM)
+		fl->fl_flags |= FL_LONGTERM;
 	fl->fl_start = 0;
 	fl->fl_end = OFFSET_MAX;
 	fl->fl_ops = NULL;
@@ -628,7 +631,8 @@ static int lease_init(struct file *filp, long type, struct file_lock *fl)
 }
 
 /* Allocate a file_lock initialised to this type of lease */
-static struct file_lock *lease_alloc(struct file *filp, long type)
+static struct file_lock *lease_alloc(struct file *filp, long type,
+				      unsigned int flags)
 {
 	struct file_lock *fl = locks_alloc_lock();
 	int error = -ENOMEM;
@@ -636,7 +640,7 @@ static struct file_lock *lease_alloc(struct file *filp, long type)
 	if (fl == NULL)
 		return ERR_PTR(error);
 
-	error = lease_init(filp, type, fl);
+	error = lease_init(filp, type, flags, fl);
 	if (error) {
 		locks_free_lock(fl);
 		return ERR_PTR(error);
@@ -1530,6 +1534,10 @@ static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
 {
 	bool rc;
 
+	if ((breaker->fl_flags & FL_LONGTERM) != (lease->fl_flags & FL_LONGTERM)) {
+		rc = false;
+		goto trace;
+	}
 	if ((breaker->fl_flags & FL_LAYOUT) != (lease->fl_flags & FL_LAYOUT)) {
 		rc = false;
 		goto trace;
@@ -1582,7 +1590,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
 	int want_write = (mode & O_ACCMODE) != O_RDONLY;
 	LIST_HEAD(dispose);
 
-	new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK);
+	new_fl = lease_alloc(NULL, want_write ? F_WRLCK : F_RDLCK, 0);
 	if (IS_ERR(new_fl))
 		return PTR_ERR(new_fl);
 	new_fl->fl_flags = type;
@@ -1773,7 +1781,7 @@ check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
 	int ret = 0;
 	struct inode *inode = dentry->d_inode;
 
-	if (flags & FL_LAYOUT)
+	if (flags & FL_LAYOUT || flags & FL_LONGTERM)
 		return 0;
 
 	if ((arg == F_RDLCK) && inode_is_open_for_write(inode))
@@ -2009,8 +2017,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
 	struct file_lock *fl;
 	struct fasync_struct *new;
 	int error;
+	unsigned int flags = 0;
+
+	/*
+	 * NOTE on F_LONGTERM lease
+	 *
+	 * LONGTERM lease types are taken on files which the user knows that
+	 * they will be pinning in memory for some indeterminate amount of
+	 * time.  Such as for use with RDMA.  While we don't know what user
+	 * space is going to do with the file we still use a F_RDLOCK level of
+	 * lease.  This ensures that there are no conflicts between
+	 * 2 users.  The conflict should only come from the File system wanting
+	 * to revoke the lease in break_layout()  And this is done by using
+	 * F_WRLCK in the break code.
+	 */
+	if (arg == F_LONGTERM) {
+		arg = F_RDLCK;
+		flags = FL_LONGTERM;
+	}
 
-	fl = lease_alloc(filp, arg);
+	fl = lease_alloc(filp, arg, flags);
 	if (IS_ERR(fl))
 		return PTR_ERR(fl);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b42df09b04c..ace21c6feb19 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -991,6 +991,7 @@ static inline struct file *get_file(struct file *f)
 #define FL_UNLOCK_PENDING	512	/* Lease is being broken */
 #define FL_OFDLCK	1024	/* lock is "owned" by struct file */
 #define FL_LAYOUT	2048	/* outstanding pNFS layout */
+#define FL_LONGTERM	4096	/* user held pin */
 
 #define FL_CLOSE_POSIX (FL_POSIX | FL_CLOSE)
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..9938ebc24adf 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -174,6 +174,8 @@ struct f_owner_ex {
 #define F_SHLCK		8	/* or 4 */
 #endif
 
+#define F_LONGTERM	16	/* lease to allow longterm GUP */
+
 /* operations for bsd flock(), also used by the kernel implementation */
 #define LOCK_SH		1	/* shared lock */
 #define LOCK_EX		2	/* exclusive lock */
-- 
2.20.1