Re: [PATCH - RFC] VFS: disable new delegations during delegation-breaking operations

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Mon, 30 Sep 2024 14:15:03 +0000

> On Sep 30, 2024, at 10:04 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Mon, 2024-09-16 at 14:34 +1000, NeilBrown wrote:
>> Various operations such as rename and unlink must break any delegations
>> before they proceed.
>> do_dentry_open() and vfs_truncate(), which use break_lease(), increment
>> i_writecount and/or i_readcount which blocks delegations until the
>> counter is decremented, but the various callers of try_break_deleg() do
>> not impose any such barrier.  They hold the inode lock while performing
>> the operation which blocks delegations, but must drop it while waiting
>> for a delegation to be broken, which leaves an opportunity for a new
>> delegation to be added.
>> 
>> nfsd - the only current user of delegations - records any files on which
>> it is called to break a delegation in a manner which blocks further
>> delegations for 30-60 seconds.  This is normally sufficient.  However
>> there is talk of reducing the timeout and it would be best if operations
>> that needed delegations to be blocked used something more definitive
>> than a timer.
>> 
>> This patch adds that definitive blocking by adding a counter to struct
>> file_lock_context of the number of concurrent operations which require
>> delegations to be blocked.  check_conflicting_open() checks that counter
>> when a delegation is requested and denies the delegation if the counter
>> is elevated.
>> 
>> try_break_deleg() now increments that counter when it records the inode
>> as a 'delegated_inode'.
>> 
>> break_deleg_wait() now leaves the inode pointer in *delegated_inode when
>> it signals that the operation should be retried, and then clears it -
>> decrementing the new counter - when the operation has completed, whether
>> successfully or not.  To achieve this we now pass the current error
>> status in to break_deleg_wait().
>> 
>> vfs_rename() now uses two delegated_inode pointers, one for the
>> source and one for the destination in the case of replacement.  This is
>> needed as it may be necessary to block further delegations to both
>> inodes while the rename completes.
>> 
>> The net result is that we no longer depend on the delay that nfsd
>> imposes on new delegation in order for various functions that break
>> delegations to be sure that new delegations won't be added while they wait
>> with the inode unlocked.  This gives more freedom to nfsd to make more
>> subtle choices about when and for how long to block delegations when
>> there is no active contention.
>> 
>> try_break_deleg() is possibly now large enough that it shouldn't be
>> inline.
>> 
> 
> I like this approach. Moving the blocking of new delegations into the
> VFS layer makes sense, and we can do better, more targeted blocking of
> new leases this way.

I have only a little terminology quibble: "blocking" suggests
to my brain that this causes a process to sleep until some
resource is available.

That's not actually what is going on; this is a mechanism to
prevent handing out delegations in some cases. The OPEN is
allowed to proceed without hindrance.

Cork, retard, squelch, stifle... I like squelch, but it's a
little uncommon.

My two cents US.

> I wonder -- do we still need the bloom filter if we do this? We could
> keep track of the time of the last recall in i_flctx as well, and then
> avoid handing them out until some time has elapsed.
> 
> try_break_deleg() (and break_deleg_wait(), for that matter) were
> already a bit large for inlines, so moving them to regular functions
> sounds like a good idea. Maybe we can even have inline fastpath
> wrappers around them that check for i_flctx == NULL and avoid the jmp
> in that case?
> 
>> Signed-off-by: NeilBrown <neilb@xxxxxxx>
>> ---
>> fs/locks.c               | 12 ++++++++++--
>> fs/namei.c               | 32 ++++++++++++++++++++------------
>> fs/open.c                |  8 ++++----
>> fs/posix_acl.c           |  8 ++++----
>> fs/utimes.c              |  4 ++--
>> fs/xattr.c               |  8 ++++----
>> include/linux/filelock.h | 31 ++++++++++++++++++++++++-------
>> include/linux/fs.h       |  3 ++-
>> 8 files changed, 70 insertions(+), 36 deletions(-)
>> 
>> diff --git a/fs/locks.c b/fs/locks.c
>> index e45cad40f8b6..171628094daa 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -191,6 +191,7 @@ locks_get_lock_context(struct inode *inode, int type)
>> INIT_LIST_HEAD(&ctx->flc_flock);
>> INIT_LIST_HEAD(&ctx->flc_posix);
>> INIT_LIST_HEAD(&ctx->flc_lease);
>> + atomic_set(&ctx->flc_deleg_blockers, 0);
>> 
>> /*
>>  * Assign the pointer if it's not already assigned. If it is, then
>> @@ -255,6 +256,7 @@ locks_free_lock_context(struct inode *inode)
>> struct file_lock_context *ctx = locks_inode_context(inode);
>> 
>> if (unlikely(ctx)) {
>> + WARN_ON(atomic_read(&ctx->flc_deleg_blockers) != 0);
>> locks_check_ctx_lists(inode);
>> kmem_cache_free(flctx_cache, ctx);
>> }
>> @@ -1743,9 +1745,15 @@ check_conflicting_open(struct file *filp, const int arg, int flags)
>> 
>> if (flags & FL_LAYOUT)
>> return 0;
>> - if (flags & FL_DELEG)
>> - /* We leave these checks to the caller */
>> + if (flags & FL_DELEG) {
>> + struct file_lock_context *ctx = locks_inode_context(inode);
>> +
>> + if (ctx && atomic_read(&ctx->flc_deleg_blockers) > 0)
>> + return -EAGAIN;
>> +
>> + /* We leave the remaining checks to the caller */
>> return 0;
>> + }
>> 
>> if (arg == F_RDLCK)
>> return inode_is_open_for_write(inode) ? -EAGAIN : 0;
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 5512cb10fa89..3054da90276b 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -4493,8 +4493,8 @@ int do_unlinkat(int dfd, struct filename *name)
>> iput(inode); /* truncate the inode here */
>> inode = NULL;
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> mnt_drop_write(path.mnt);
>> @@ -4764,8 +4764,8 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
>> out_dput:
>> done_path_create(&new_path, new_dentry);
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error) {
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK) {
>> path_put(&old_path);
>> goto retry;
>> }
>> @@ -4848,7 +4848,8 @@ int vfs_rename(struct renamedata *rd)
>> struct inode *old_dir = rd->old_dir, *new_dir = rd->new_dir;
>> struct dentry *old_dentry = rd->old_dentry;
>> struct dentry *new_dentry = rd->new_dentry;
>> - struct inode **delegated_inode = rd->delegated_inode;
>> + struct inode **delegated_inode_old = rd->delegated_inode_old;
>> + struct inode **delegated_inode_new = rd->delegated_inode_new;
>> unsigned int flags = rd->flags;
>> bool is_dir = d_is_dir(old_dentry);
>> struct inode *source = old_dentry->d_inode;
>> @@ -4954,12 +4955,12 @@ int vfs_rename(struct renamedata *rd)
>> goto out;
>> }
>> if (!is_dir) {
>> - error = try_break_deleg(source, delegated_inode);
>> + error = try_break_deleg(source, delegated_inode_old);
>> if (error)
>> goto out;
>> }
>> if (target && !new_is_dir) {
>> - error = try_break_deleg(target, delegated_inode);
>> + error = try_break_deleg(target, delegated_inode_new);
>> if (error)
>> goto out;
>> }
>> @@ -5011,7 +5012,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>> struct path old_path, new_path;
>> struct qstr old_last, new_last;
>> int old_type, new_type;
>> - struct inode *delegated_inode = NULL;
>> + struct inode *delegated_inode_old = NULL;
>> + struct inode *delegated_inode_new = NULL;
>> unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
>> bool should_retry = false;
>> int error = -EINVAL;
>> @@ -5118,7 +5120,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>> rd.new_dir    = new_path.dentry->d_inode;
>> rd.new_dentry    = new_dentry;
>> rd.new_mnt_idmap   = mnt_idmap(new_path.mnt);
>> - rd.delegated_inode = &delegated_inode;
>> + rd.delegated_inode_old = &delegated_inode_old;
>> + rd.delegated_inode_new = &delegated_inode_new;
>> rd.flags    = flags;
>> error = vfs_rename(&rd);
>> exit5:
>> @@ -5128,9 +5131,14 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>> exit3:
>> unlock_rename(new_path.dentry, old_path.dentry);
>> exit_lock_rename:
>> - if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + if (delegated_inode_old) {
>> + error = break_deleg_wait(&delegated_inode_old, error);
>> + if (error == -EWOULDBLOCK)
>> + goto retry_deleg;
>> + }
>> + if (delegated_inode_new) {
>> + error = break_deleg_wait(&delegated_inode_new, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> mnt_drop_write(old_path.mnt);
>> diff --git a/fs/open.c b/fs/open.c
>> index 22adbef7ecc2..6b6d20a68dd8 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -656,8 +656,8 @@ int chmod_common(const struct path *path, umode_t mode)
>> out_unlock:
>> inode_unlock(inode);
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> mnt_drop_write(path->mnt);
>> @@ -795,8 +795,8 @@ int chown_common(const struct path *path, uid_t user, gid_t group)
>>       &delegated_inode);
>> inode_unlock(inode);
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> return error;
>> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
>> index 3f87297dbfdb..5eb3635d1067 100644
>> --- a/fs/posix_acl.c
>> +++ b/fs/posix_acl.c
>> @@ -1143,8 +1143,8 @@ int vfs_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
>> inode_unlock(inode);
>> 
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> 
>> @@ -1251,8 +1251,8 @@ int vfs_remove_acl(struct mnt_idmap *idmap, struct dentry *dentry,
>> inode_unlock(inode);
>> 
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> 
>> diff --git a/fs/utimes.c b/fs/utimes.c
>> index 3701b3946f88..21b7605551dc 100644
>> --- a/fs/utimes.c
>> +++ b/fs/utimes.c
>> @@ -67,8 +67,8 @@ int vfs_utimes(const struct path *path, struct timespec64 *times)
>>       &delegated_inode);
>> inode_unlock(inode);
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> 
>> diff --git a/fs/xattr.c b/fs/xattr.c
>> index 7672ce5486c5..63e0b067dab9 100644
>> --- a/fs/xattr.c
>> +++ b/fs/xattr.c
>> @@ -323,8 +323,8 @@ vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
>> inode_unlock(inode);
>> 
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> if (value != orig_value)
>> @@ -577,8 +577,8 @@ vfs_removexattr(struct mnt_idmap *idmap, struct dentry *dentry,
>> inode_unlock(inode);
>> 
>> if (delegated_inode) {
>> - error = break_deleg_wait(&delegated_inode);
>> - if (!error)
>> + error = break_deleg_wait(&delegated_inode, error);
>> + if (error == -EWOULDBLOCK)
>> goto retry_deleg;
>> }
>> 
>> diff --git a/include/linux/filelock.h b/include/linux/filelock.h
>> index daee999d05f3..66470ba9658c 100644
>> --- a/include/linux/filelock.h
>> +++ b/include/linux/filelock.h
>> @@ -144,6 +144,7 @@ struct file_lock_context {
>> struct list_head flc_flock;
>> struct list_head flc_posix;
>> struct list_head flc_lease;
>> + atomic_t flc_deleg_blockers;
>> };
>> 
>> #ifdef CONFIG_FILE_LOCKING
>> @@ -450,21 +451,37 @@ static inline int try_break_deleg(struct inode *inode, struct inode **delegated_
>> {
>> int ret;
>> 
>> + if (delegated_inode && *delegated_inode) {
>> + if (*delegated_inode == inode)
>> + /* Don't need to count this */
>> + return break_deleg(inode, O_WRONLY|O_NONBLOCK);
>> +
>> + /* inode changed, forget the old one */
>> + atomic_dec(&(*delegated_inode)->i_flctx->flc_deleg_blockers);
>> + iput(*delegated_inode);
>> + *delegated_inode = NULL;
>> + }
>> ret = break_deleg(inode, O_WRONLY|O_NONBLOCK);
>> if (ret == -EWOULDBLOCK && delegated_inode) {
>> *delegated_inode = inode;
>> + atomic_inc(&(*delegated_inode)->i_flctx->flc_deleg_blockers);
>> ihold(inode);
>> }
>> return ret;
>> }
>> 
>> -static inline int break_deleg_wait(struct inode **delegated_inode)
>> +static inline int break_deleg_wait(struct inode **delegated_inode, int ret)
>> {
>> - int ret;
>> -
>> - ret = break_deleg(*delegated_inode, O_WRONLY);
>> - iput(*delegated_inode);
>> - *delegated_inode = NULL;
>> + if (ret == -EWOULDBLOCK) {
>> + ret = break_deleg(*delegated_inode, O_WRONLY);
>> + if (ret == 0)
>> + ret = -EWOULDBLOCK;
>> + }
>> + if (ret != -EWOULDBLOCK) {
>> + atomic_dec(&(*delegated_inode)->i_flctx->flc_deleg_blockers);
>> + iput(*delegated_inode);
>> + *delegated_inode = NULL;
>> + }
>> return ret;
>> }
>> 
>> @@ -494,7 +511,7 @@ static inline int try_break_deleg(struct inode *inode, struct inode **delegated_
>> return 0;
>> }
>> 
>> -static inline int break_deleg_wait(struct inode **delegated_inode)
>> +static inline int break_deleg_wait(struct inode **delegated_inode, int ret)
>> {
>> BUG();
>> return 0;
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6ca11e241a24..50957d9e1c2b 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1902,7 +1902,8 @@ struct renamedata {
>> struct mnt_idmap *new_mnt_idmap;
>> struct inode *new_dir;
>> struct dentry *new_dentry;
>> - struct inode **delegated_inode;
>> + struct inode **delegated_inode_old;
>> + struct inode **delegated_inode_new;
>> unsigned int flags;
>> } __randomize_layout;
>> 
>> 
>> base-commit: 98f7e32f20d28ec452afb208f9cffc08448a2652
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>

--
Chuck Lever