On Mon, Oct 04, 2021 at 10:56:28AM -0700, Stephen Brennan wrote:
> Problem Description:
>
> When running ~128 parallel instances of "TZ=/etc/localtime ps
> -fe >/dev/null" on a 128CPU machine, the %sys utilization reaches 97%,
> and perf shows the following code path as being responsible for heavy
> contention on the d_lockref spinlock:
>
>       walk_component()
>         lookup_fast()
>           d_revalidate()
>             pid_revalidate() // returns -ECHILD
>           unlazy_child()
>             lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention
>
> The reason is that pid_revalidate() is triggering a drop from RCU to ref
> path walk mode. All concurrent path lookups thus try to grab a reference
> to the dentry for /proc/, before re-executing pid_revalidate() and then
> stepping into the /proc/$pid directory. Thus there is huge spinlock
> contention. This patch allows pid_revalidate() to execute in RCU mode,
> meaning that the path lookup can successfully enter the /proc/$pid
> directory while still in RCU mode. Later on, the path lookup may still
> drop into ref mode, but the contention will be much reduced at this
> point.
>
> By applying this patch, %sys utilization falls to around 85% under the
> same workload, and the number of ps processes executed per unit time
> increases by 3x-4x. Although this particular workload is a bit
> contrived, we have seen some large collections of eager monitoring
> scripts which produced similarly high %sys time due to contention in the
> /proc directory.

I think it's perhaps also worth noting that this is a performance
regression relative to ... v5.4? v4.14? I forget the details; do you
have those to hand, Stephen?

(Yes, this is a stupid workload. Yes, a customer really does have
this workload.)
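For anyone wanting to poke at this, the workload in the quoted description can probably be approximated with something like the sketch below. It is not the script from the original report: run_workload and its two arguments are hypothetical names, and the per-instance iteration count is an assumption, since the description only gives the ~128 parallel instances and the ps command line.

```sh
#!/bin/sh
# Sketch of the workload from the quoted description: N parallel loops,
# each repeatedly running "TZ=/etc/localtime ps -fe >/dev/null".
# run_workload and its arguments are hypothetical, for illustration only.

run_workload() {
    nprocs=$1        # number of parallel instances (the report used ~128)
    iterations=$2    # ps invocations per instance (assumed; not in the report)

    n=0
    while [ "$n" -lt "$nprocs" ]; do
        (
            # Each background subshell issues its ps calls back to back.
            i=0
            while [ "$i" -lt "$iterations" ]; do
                TZ=/etc/localtime ps -fe >/dev/null
                i=$((i + 1))
            done
        ) &
        n=$((n + 1))
    done

    wait    # block until every background loop has finished
    echo "done: $((nprocs * iterations)) ps invocations issued"
}
```

Sourcing this and calling something like "run_workload 128 100" on a large machine, with perf running alongside, should be enough to see whether the d_lockref contention shows up. The final line only reports how many invocations were issued, not how many succeeded.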