Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> writes: > The pid_revalidate() function drops from RCU into REF lookup mode. When > many threads are resolving paths within /proc in parallel, this can > result in heavy spinlock contention on d_lockref as each thread tries to > grab a reference to the /proc dentry (and drop it shortly thereafter). > > Investigation indicates that it is not necessary to drop RCU in > pid_revalidate(), as no RCU data is modified and the function never > sleeps. So, remove the LOOKUP_RCU check. > > Signed-off-by: Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> > --- > > Resending this in the hopes of Al picking this up, or else more feedback about > how to test for RCU-unsafe code in procfs. I just wanted to bump this again - I'm very open to suggestions of new workloads, configuration options, or selinux configurations which could help me verify this is fully safe. Thanks, Stephen > When running running ~128 parallel instances of "TZ=/etc/localtime ps -fe > >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%, and perf > shows the following code path as being responsible for heavy contention on > the d_lockref spinlock: > > walk_component() > lookup_fast() > d_revalidate() > pid_revalidate() // returns -ECHILD > unlazy_child() > lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention > > By applying this patch, %sys utilization falls to around 85% under the same > workload, and the number of ps processes executed per unit time increases by > 3x-4x. Although this particular workload is a bit contrived, we have seen some > monitoring scripts which produced similarly high %sys time due to this > contention. > > As a result this patch, several procfs methods which were only called in > ref-walk mode could now be called from RCU mode. To ensure that this patch > is safe, I audited all the inode get_link and permission() implementations, > as well as dentry d_revalidate() implementations, in fs/proc. These methods > are called in the following ways: > > * get_link() receives a NULL dentry pointer when called in RCU mode. > * permission() receives MAY_NOT_BLOCK in the mode parameter when called > from RCU. > * d_revalidate() receives LOOKUP_RCU in flags. > > There were generally three groups I found. Group (1) are functions which > contain a check at the top of the function and return -ECHILD, and so > appear to be trivially RCU safe (although this is by dropping out of RCU > completely). Group (2) are functions which have no explicit check, but > on my audit, I was confident that there were no sleeping function calls, > and thus were RCU safe as is. However, I would appreciate any additional > review if possible. Group (3) are functions which call security hooks, but > which ought to be safe (especially after Al's commits: 23d8f5b684fc ("make > dump_common_audit_data() safe to be called from RCU pathwalk") and 2 > previous). > > Group (1): > proc_ns_get_link() > proc_pid_get_link() > map_files_d_revalidate() > proc_misc_d_revalidate() > tid_fd_revalidate() > > Group (2): > proc_get_link() > proc_self_get_link() > proc_thread_self_get_link() > proc_fd_permission() > > Group (3): > pid_revalidate() -- addressed by my patch, > calls security_task_to_inode() > proc_pid_permission() -- calls security_ptrace_access_check() > proc_map_files_get_link() -- calls security_capable() > > I've tested this patch by enabling CONFIG_PROVE_RCU to warn on sleeping during > RCU, and running heavy procfs-related workloads (like the PS one described > above). I would love more input on selinux/audit rules to explore to attempt to > catch any other potential issues. > > Changes in v5: > - Al's commits are now in linux-next, resolving proc_pid_permission() issue. > - Add NULL check after d_inode_rcu(dentry), because inode may become NULL if > we do not hold a reference. > Changes in v4: > - Simplify by unconditionally calling pid_update_inode() from pid_revalidate, > and removing the LOOKUP_RCU check. > Changes in v3: > - Rather than call pid_update_inode() with flags, create > proc_inode_needs_update() to determine whether the call can be skipped. > - Restore the call to the security hook > Changes in v2: > - Remove get_pid_task_rcu_user() and get_proc_task_rcu(), since they were > unnecessary. > - Remove the call to security_task_to_inode(). > > fs/proc/base.c | 18 ++++++++++-------- > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index ebea9501afb8..3e105bd05801 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1830,19 +1830,21 @@ static int pid_revalidate(struct dentry *dentry, unsigned int flags) > { > struct inode *inode; > struct task_struct *task; > + int ret = 0; > > - if (flags & LOOKUP_RCU) > - return -ECHILD; > - > - inode = d_inode(dentry); > - task = get_proc_task(inode); > + rcu_read_lock(); > + inode = d_inode_rcu(dentry); > + if (!inode) > + goto out; > + task = pid_task(proc_pid(inode), PIDTYPE_PID); > > if (task) { > pid_update_inode(task, inode); > - put_task_struct(task); > - return 1; > + ret = 1; > } > - return 0; > +out: > + rcu_read_unlock(); > + return ret; > } > > static inline bool proc_inode_is_dead(struct inode *inode) > -- > 2.27.0