Hi all,

We have a user who has reported performance problems related to (presumably) custom task monitoring on Linux systems when running processes with large numbers of threads. Unfortunately I don't have much information on the practical workload and observations, only that the problem had been narrowed down to excessive readdir latency on the /proc root dir in the presence of large numbers of threads in the associated pid namespace.

This latency boils down to the inefficient pid_namespace walk in the proc_pid_readdir() path. More specifically, every thread/task allocates an associated struct pid, and the procfs next_tgid() implementation walks every pid in the namespace looking for those with an associated PIDTYPE_TGID task to fill into the directory listing.

Given that pids are stored in the idr radix-tree, it seemed fairly logical that this could be improved using an internal tree tag. I started playing around with an approach that tagged and untagged pids based on actual task association (i.e., attach_pid() and friends), but after some thought and feedback came to the realization that this could probably be simplified to just tag the pid once on allocation and allow procfs to use it as a hint for root dir population. This works because post-fork tgid task disassociation (without an exit() and freeing of the pid) seems to be uncommon. The only tool I've seen in my testing so far that leaves around a tagged, non-TGID pid is chronyd, which appears to use a fork() -> setsid() -> fork() pattern where the intermediate task exits but the associated pid hangs around for the lifetime of the process due to the PIDTYPE_SID association.

Therefore, this series implements the tgid tag hinting approach. Patch 1 includes a couple of tweaks to the idr tree to support traditional radix-tree tag propagation. Patch 2 defines the new tag and sets it on pid allocation. Patch 3 updates procfs to use the tag for the readdir pid_namespace traversal.
As far as testing goes, I've thrown this at fstests (not for filesystem testing purposes, but more so because I had the test env handy and it's a longish running task creation workload), LTP and some of the kernel internal tests in tools/testing/selftests (clone, proc, pid_namespace) without any obvious regressions.

From the performance angle, the user who reported this problem has provided some synthetic tools to create dummy tasks/threads and run repeated readdir iterations of /proc, which is what they've been using to compare results on Linux kernels with some $other OS. These tools show a notable improvement in the number of /proc readdir iterations possible per second. For example, on 5.19.0-rc2 running on a mostly idle system with an active 100k thread process, readdirs per second improve from a baseline of ~285 to ~7.3k with the series applied. More detailed getdents() latency numbers are included in the commit log of patch 3.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (3):
  radix-tree: propagate all tags in idr tree
  pid: use idr tag to hint pids associated with group leader tasks
  proc: use idr tgid tag hint to iterate pids in readdir

 fs/proc/base.c      |  2 +-
 include/linux/idr.h | 25 +++++++++++++++++++++++++
 include/linux/pid.h |  2 +-
 kernel/fork.c       |  2 +-
 kernel/pid.c        |  9 ++++++++-
 lib/radix-tree.c    | 26 +++++++++++++++-----------
 6 files changed, 51 insertions(+), 15 deletions(-)

-- 
2.34.1