The pid_revalidate() function requires dropping from RCU into REF lookup mode. When many threads are resolving paths within /proc in parallel, this can result in heavy spinlock contention on d_locrkef as each thread tries to grab a reference to the /proc dentry (and drop it shortly thereafter). Allow the pid_revalidate() function to execute under LOOKUP_RCU. When updates must be made to the inode, drop out of RCU and into REF mode. Signed-off-by: Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> --- When running running ~100 parallel instances of "TZ=/etc/localtime ps -fe >/dev/null" on a 100CPU machine, the %sys utilization reaches 90%, and perf shows the following code path as being responsible for heavy contention on the d_lockref spinlock: walk_component() lookup_fast() unlazy_child() lockref_get_not_dead(&nd->path.dentry->d_lockref) By applying this patch, %sys utilization falls to around 60% under the same workload. Although this particular workload is a bit contrived, we have seen some monitoring scripts which produced similarly high %sys time due to this contention. Changes from v3: - Rather than call pid_update_inode() with flags, create proc_inode_needs_update() to determine whether the call can be skipped. - Restore the call to the security hook (see next patch). Changes from v2: - Remove get_pid_task_rcu_user() and get_proc_task_rcu(), since they were unnecessary. - Remove the call to security_task_to_inode(). fs/proc/base.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index b3422cda2a91..4b246e9bd5df 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1968,6 +1968,20 @@ void pid_update_inode(struct task_struct *task, struct inode *inode) security_task_to_inode(task, inode); } +/* See if we can avoid the above call. Assumes RCU lock held */ +static bool pid_inode_needs_update(struct task_struct *task, struct inode *inode) +{ + kuid_t uid; + kgid_t gid; + + if (inode->i_mode & (S_ISUID | S_ISGID)) + return true; + task_dump_owner(task, inode->i_mode, &uid, &gid); + if (!uid_eq(uid, inode->i_uid) || !gid_eq(gid, inode->i_gid)) + return true; + return false; +} + /* * Rewrite the inode's ownerships here because the owning task may have * performed a setuid(), etc. @@ -1977,19 +1991,20 @@ static int pid_revalidate(struct dentry *dentry, unsigned int flags) { struct inode *inode; struct task_struct *task; + int rv = 0; - if (flags & LOOKUP_RCU) - return -ECHILD; - - inode = d_inode(dentry); - task = get_proc_task(inode); - + rcu_read_lock(); + inode = d_inode_rcu(dentry); + task = pid_task(proc_pid(inode), PIDTYPE_PID); if (task) { - pid_update_inode(task, inode); - put_task_struct(task); - return 1; + rv = 1; + if ((flags & LOOKUP_RCU) && pid_inode_needs_update(task, inode)) + rv = -ECHILD; + else if (!(flags & LOOKUP_RCU)) + pid_update_inode(task, inode); } - return 0; + rcu_read_unlock(); + return rv; } static inline bool proc_inode_is_dead(struct inode *inode) -- 2.25.1