The pid_revalidate() function drops from RCU into REF lookup mode. When many threads are resolving paths within /proc in parallel, this can result in heavy spinlock contention on d_lockref as each thread tries to grab a reference to the /proc dentry (and drop it shortly thereafter). Investigation indicates that it is not necessary to drop RCU in pid_revalidate(), as no RCU data is modified and the function never sleeps. So, remove the LOOKUP_RCU check. Signed-off-by: Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> --- Resending this in the hopes of Al picking this up, or else more feedback about how to test for RCU-unsafe code in procfs. When running running ~128 parallel instances of "TZ=/etc/localtime ps -fe >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%, and perf shows the following code path as being responsible for heavy contention on the d_lockref spinlock: walk_component() lookup_fast() d_revalidate() pid_revalidate() // returns -ECHILD unlazy_child() lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention By applying this patch, %sys utilization falls to around 85% under the same workload, and the number of ps processes executed per unit time increases by 3x-4x. Although this particular workload is a bit contrived, we have seen some monitoring scripts which produced similarly high %sys time due to this contention. As a result this patch, several procfs methods which were only called in ref-walk mode could now be called from RCU mode. To ensure that this patch is safe, I audited all the inode get_link and permission() implementations, as well as dentry d_revalidate() implementations, in fs/proc. These methods are called in the following ways: * get_link() receives a NULL dentry pointer when called in RCU mode. * permission() receives MAY_NOT_BLOCK in the mode parameter when called from RCU. * d_revalidate() receives LOOKUP_RCU in flags. There were generally three groups I found. Group (1) are functions which contain a check at the top of the function and return -ECHILD, and so appear to be trivially RCU safe (although this is by dropping out of RCU completely). Group (2) are functions which have no explicit check, but on my audit, I was confident that there were no sleeping function calls, and thus were RCU safe as is. However, I would appreciate any additional review if possible. Group (3) are functions which call security hooks, but which ought to be safe (especially after Al's commits: 23d8f5b684fc ("make dump_common_audit_data() safe to be called from RCU pathwalk") and 2 previous). Group (1): proc_ns_get_link() proc_pid_get_link() map_files_d_revalidate() proc_misc_d_revalidate() tid_fd_revalidate() Group (2): proc_get_link() proc_self_get_link() proc_thread_self_get_link() proc_fd_permission() Group (3): pid_revalidate() -- addressed by my patch, calls security_task_to_inode() proc_pid_permission() -- calls security_ptrace_access_check() proc_map_files_get_link() -- calls security_capable() I've tested this patch by enabling CONFIG_PROVE_RCU to warn on sleeping during RCU, and running heavy procfs-related workloads (like the PS one described above). I would love more input on selinux/audit rules to explore to attempt to catch any other potential issues. Changes in v5: - Al's commits are now in linux-next, resolving proc_pid_permission() issue. - Add NULL check after d_inode_rcu(dentry), because inode may become NULL if we do not hold a reference. Changes in v4: - Simplify by unconditionally calling pid_update_inode() from pid_revalidate, and removing the LOOKUP_RCU check. Changes in v3: - Rather than call pid_update_inode() with flags, create proc_inode_needs_update() to determine whether the call can be skipped. - Restore the call to the security hook Changes in v2: - Remove get_pid_task_rcu_user() and get_proc_task_rcu(), since they were unnecessary. - Remove the call to security_task_to_inode(). fs/proc/base.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index ebea9501afb8..3e105bd05801 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1830,19 +1830,21 @@ static int pid_revalidate(struct dentry *dentry, unsigned int flags) { struct inode *inode; struct task_struct *task; + int ret = 0; - if (flags & LOOKUP_RCU) - return -ECHILD; - - inode = d_inode(dentry); - task = get_proc_task(inode); + rcu_read_lock(); + inode = d_inode_rcu(dentry); + if (!inode) + goto out; + task = pid_task(proc_pid(inode), PIDTYPE_PID); if (task) { pid_update_inode(task, inode); - put_task_struct(task); - return 1; + ret = 1; } - return 0; +out: + rcu_read_unlock(); + return ret; } static inline bool proc_inode_is_dead(struct inode *inode) -- 2.27.0