Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> writes: > The pid_revalidate() function drops from RCU into REF lookup mode. When > many threads are resolving paths within /proc in parallel, this can > result in heavy spinlock contention on d_lockref as each thread tries to > grab a reference to the /proc dentry (and drop it shortly thereafter). > > Investigation indicates that it is not necessary to drop RCU in > pid_revalidate(), as no RCU data is modified and the function never > sleeps. So, remove the LOOKUP_RCU check. > > Signed-off-by: Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> > --- Hello all, I wanted to bring this patch to the surface again in case anyone has an opportunity to review the changes. Thank you, Stephen > > When running running ~100 parallel instances of "TZ=/etc/localtime ps -fe >>/dev/null" on a 100CPU machine, the %sys utilization reaches 90%, and perf > shows the following code path as being responsible for heavy contention on > the d_lockref spinlock: > > walk_component() > lookup_fast() > d_revalidate() > pid_revalidate() // returns -ECHILD > unlazy_child() > lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention > > By applying this patch, %sys utilization falls to around 60% under the same > workload. Although this particular workload is a bit contrived, we have > seen some monitoring scripts which produced similarly high %sys time due to > this contention. > > As a result this patch, several procfs methods which were only called in > ref-walk mode could now be called from RCU mode. To ensure that this patch > is safe, I audited all the inode get_link and permission() implementations, > as well as dentry d_revalidate() implementations, in fs/proc. These methods > are called in the following ways: > > * get_link() receives a NULL dentry pointer when called in RCU mode. > * permission() receives MAY_NOT_BLOCK in the mode parameter when called > from RCU. > * d_revalidate() receives LOOKUP_RCU in flags. > > There were generally three groups I found. Group (1) are functions which > contain a check at the top of the function and return -ECHILD, and so > appear to be trivially RCU safe (although this is by dropping out of RCU > completely). Group (2) are functions which have no explicit check, but > on my audit, I was confident that there were no sleeping function calls, > and thus were RCU safe as is. However, I would appreciate any additional > review if possible. Group (3) are functions which might be be unsafe for some > reason or another. > > Group (1): > proc_ns_get_link() > proc_pid_get_link() > map_files_d_revalidate() > proc_misc_d_revalidate() > tid_fd_revalidate() > > Group (2): > proc_get_link() > proc_self_get_link() > proc_thread_self_get_link() > proc_fd_permission() > > Group (3): > pid_revalidate() -- addressed by my patch > proc_pid_permission() -- addressed by commits by Al > proc_map_files_get_link() -- calls capable() which could audit > > I believe proc_map_files_get_link() is safe despite calling into the audit > framework, however I'm not confident and so I did not include it in Group 2. > proc_pid_permission() calls into the audit code, and is not safe with > LSM_AUDIT_DATA_DENTRY or LSM_AUDIT_DATA_INODE. Al's commits[1] address > these issues. This patch is tested and stable on the workload described > at the beginning of this cover letter, on a system with selinux enabled. > > [1]: 23d8f5b684fc ("make dump_common_audit_data() safe to be called from > RCU pathwalk") and 2 previous > > Changes in v5: > - Al's commits are now in linux-next, resolving proc_pid_permission() issue. > - Add NULL check after d_inode_rcu(dentry), because inode may become NULL if > we do not hold a reference. > Changes in v4: > - Simplify by unconditionally calling pid_update_inode() from pid_revalidate, > and removing the LOOKUP_RCU check. > Changes in v3: > - Rather than call pid_update_inode() with flags, create > proc_inode_needs_update() to determine whether the call can be skipped. > - Restore the call to the security hook (see next patch). > Changes in v2: > - Remove get_pid_task_rcu_user() and get_proc_task_rcu(), since they were > unnecessary. > - Remove the call to security_task_to_inode(). > > fs/proc/base.c | 18 ++++++++++-------- > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index ebea9501afb8..3e105bd05801 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1830,19 +1830,21 @@ static int pid_revalidate(struct dentry *dentry, unsigned int flags) > { > struct inode *inode; > struct task_struct *task; > + int ret = 0; > > - if (flags & LOOKUP_RCU) > - return -ECHILD; > - > - inode = d_inode(dentry); > - task = get_proc_task(inode); > + rcu_read_lock(); > + inode = d_inode_rcu(dentry); > + if (!inode) > + goto out; > + task = pid_task(proc_pid(inode), PIDTYPE_PID); > > if (task) { > pid_update_inode(task, inode); > - put_task_struct(task); > - return 1; > + ret = 1; > } > - return 0; > +out: > + rcu_read_unlock(); > + return ret; > } > > static inline bool proc_inode_is_dead(struct inode *inode) > -- > 2.27.0