On Mon, 10 Aug 2020 17:41:32 +0200
Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Aug 11, 2020 at 01:27:00AM +1000, Eugene Lubarsky wrote:
> > On Mon, 10 Aug 2020 17:04:53 +0200
> > Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> And have you benchmarked any of this? Try working with the common
> tools that want this information and see if it actually is noticeable
> (hint, I have been doing that with the readfile work and it's
> surprising what the results are in places...)

Apologies for the delay. Here are some benchmarks with atop.

Patch to atop at: https://github.com/eug48/atop/commits/proc-all
Patch to add /proc/all/schedstat & cpuset is below.
atop is not collecting threads & cmdline here, as /proc/all/ doesn't
support them yet.

10,000 processes, kernel 5.8, nested KVM, 2 cores of an i7-6700HQ @ 2.60GHz

# USE_PROC_ALL=0 ./atop -w test 1 &
# pidstat -p $(pidof atop) 1

01:33:05   %usr %system  %guest   %wait    %CPU   CPU  Command
01:33:06  33.66   33.66    0.00    0.99   67.33     1  atop
01:33:07  33.00   32.00    0.00    2.00   65.00     0  atop
01:33:08  34.00   31.00    0.00    1.00   65.00     0  atop
...
Average:  33.15   32.79    0.00    1.09   65.94     -  atop

# USE_PROC_ALL=1 ./atop -w test 1 &
# pidstat -p $(pidof atop) 1

01:33:33   %usr %system  %guest   %wait    %CPU   CPU  Command
01:33:34  28.00   14.00    0.00    1.00   42.00     1  atop
01:33:35  28.00   14.00    0.00    0.00   42.00     1  atop
01:33:36  26.00   13.00    0.00    0.00   39.00     1  atop
...
Average:  27.08   12.86    0.00    0.35   39.94     -  atop

So CPU usage goes down from ~65% to ~40%.

Data collection times in milliseconds are:

# xsv cat columns proc.csv procall.csv \
>   | xsv stats \
>   | xsv select field,min,max,mean,stddev \
>   | xsv table

field           min  max  mean    stddev
/proc time      558  625  586.59  18.29
/proc/all time  231  262  243.56  8.02

There is still much room for performance optimisation, e.g. the
modified atop uses fgets, which reads 1KB at a time, and seq_file
seems to only return 4KB pages (a rough sketch of what larger reads
might look like is below, before the patch). task_diag should be much
faster still.

I'd imagine this sort of thing would be useful for daemons that
monitor large numbers of processes. I don't run such systems myself;
my initial motivation was frustration with the Kubernetes kubelet
using ~2-4% CPU even with just a couple of containers. Basic profiling
suggests syscalls have a lot to do with it - it is actually reading
loads of tiny cgroup files and enumerating many directories every 10
seconds - but /proc has similar issues and seemed easier to start with.

Anyway, I've read that io_uring could also help here in the near
future, which would be really cool, especially if there were a way to
enumerate directories and read many files regex-style in a single
operation, e.g. /proc/[0-9].*/(stat|statm|io)

> > Currently I'm trying to re-use the existing code in fs/proc that
> > controls which PIDs are visible, but may well be missing
> > something..
>
> Try it out and see if it works correctly. And pid namespaces are not
> the only thing these days from what I recall :)
>

I've tried `unshare --fork --pid --mount-proc cat /proc/all/stat`,
which seems to behave correctly. ptrace flags are handled by the
existing code.
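
For illustration, a rough sketch of the larger-buffer reading
mentioned above - plain read(2) calls into a big buffer instead of
1KB fgets lines. This is not the actual atop change; it just assumes
a /proc/all/stat file as in this series and does minimal error
handling:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t chunk = 64 * 1024;	/* ask for 64KB per read(2) */
	size_t cap = chunk, len = 0, lines = 0, i;
	char *buf = malloc(cap);
	int fd = open("/proc/all/stat", O_RDONLY);
	ssize_t n;

	if (fd < 0 || !buf) {
		perror("setup failed");
		return 1;
	}

	/* Slurp the whole file, growing the buffer as needed. */
	while ((n = read(fd, buf + len, chunk)) > 0) {
		len += n;
		if (cap - len < chunk) {
			char *tmp = realloc(buf, cap * 2);
			if (!tmp)
				break;
			buf = tmp;
			cap *= 2;
		}
	}
	close(fd);

	/* One line per process; just count them instead of real parsing. */
	for (i = 0; i < len; i++)
		if (buf[i] == '\n')
			lines++;
	printf("read %zu bytes, %zu lines\n", len, lines);

	free(buf);
	return 0;
}

Even if seq_file still hands back only ~4KB per read(2), this avoids
the per-line stdio overhead; the same loop would work for the other
/proc/all/ files.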
Best Wishes,
Eugene


From 2ffc2e388f7ce4e3f182c2442823e5f13bae03dd Mon Sep 17 00:00:00 2001
From: Eugene Lubarsky <elubarsky.linux@xxxxxxxxx>
Date: Tue, 25 Aug 2020 12:36:41 +1000
Subject: [RFC PATCH] fs/proc: /proc/all: add schedstat and cpuset

Signed-off-by: Eugene Lubarsky <elubarsky.linux@xxxxxxxxx>
---
 fs/proc/base.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0bba4b3a985e..44d73f1ade4a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3944,6 +3944,36 @@ static int proc_all_io(struct seq_file *m, void *v)
 }
 #endif
 
+#ifdef CONFIG_PROC_PID_CPUSET
+static int proc_all_cpuset(struct seq_file *m, void *v)
+{
+	struct all_iter *iter = (struct all_iter *) v;
+	struct pid_namespace *ns = iter->ns;
+	struct task_struct *task = iter->tgid_iter.task;
+	struct pid *pid = task->thread_pid;
+
+	seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
+	seq_puts(m, " ");
+
+	return proc_cpuset_show(m, ns, pid, task);
+}
+#endif
+
+#ifdef CONFIG_SCHED_INFO
+static int proc_all_schedstat(struct seq_file *m, void *v)
+{
+	struct all_iter *iter = (struct all_iter *) v;
+	struct pid_namespace *ns = iter->ns;
+	struct task_struct *task = iter->tgid_iter.task;
+	struct pid *pid = task->thread_pid;
+
+	seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
+	seq_puts(m, " ");
+
+	return proc_pid_schedstat(m, ns, pid, task);
+}
+#endif
+
 static int proc_all_statx(struct seq_file *m, void *v)
 {
 	struct all_iter *iter = (struct all_iter *) v;
@@ -3990,6 +4020,12 @@ PROC_ALL_OPS(status);
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 	PROC_ALL_OPS(io);
 #endif
+#ifdef CONFIG_SCHED_INFO
+	PROC_ALL_OPS(schedstat);
+#endif
+#ifdef CONFIG_PROC_PID_CPUSET
+	PROC_ALL_OPS(cpuset);
+#endif
 
 #define PROC_ALL_CREATE(NAME) \
 	do { \
@@ -4011,4 +4047,10 @@ void __init proc_all_init(void)
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 	PROC_ALL_CREATE(io);
 #endif
+#ifdef CONFIG_SCHED_INFO
+	PROC_ALL_CREATE(schedstat);
+#endif
+#ifdef CONFIG_PROC_PID_CPUSET
+	PROC_ALL_CREATE(cpuset);
+#endif
 }
-- 
2.25.1