On 8/12/20 1:51 AM, Andrei Vagin wrote:
>
> I rebased the task_diag patches on top of v5.8:
> https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag

Thanks for updating the patches.

>
> /proc/pid files have three major limitations:
> * Requires at least three syscalls per process per file:
>   open(), read(), close()
> * Variety of formats, mostly text based
>   The kernel spends time encoding binary data into a text format, and
>   then tools like top and ps spend time decoding it back into a binary
>   format.
> * Sometimes slow due to extra attributes
>   For example, /proc/PID/smaps contains a lot of useful information
>   about memory mappings and memory consumption for each of them. But
>   even if we don't need the memory consumption fields, the kernel will
>   still spend time collecting this information.

that's what I recall as well.

>
> More details and numbers are in this article:
> https://avagin.github.io/how-fast-is-procfs
>
> The new interface removes only one of these limitations (the
> per-process syscall overhead), while task_diag removes all of them.
>
> And I compared how fast each of these interfaces is.
>
> The test environment:
> CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> RAM: 16GB
> kernel: v5.8 with the task_diag and /proc/all patches
> 100K processes:
> $ ps ax | wc -l
> 10228

100k processes but showing 10k here??

>
> $ time cat /proc/all/status > /dev/null
>
> real    0m0.577s
> user    0m0.017s
> sys     0m0.559s
>
> task_proc_all is used to read /proc/pid/status for all tasks:
> https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c
>
> $ time ./task_proc_all status
> tasks: 100230
>
> real    0m0.924s
> user    0m0.054s
> sys     0m0.858s
>
> /proc/all/status is about 40% faster than /proc/*/status.
>
> Now let's take a look at the perf output:
>
> $ time perf record -g cat /proc/all/status > /dev/null
> $ perf report
> -   98.08%     1.38%  cat  [kernel.vmlinux]  [k] entry_SYSCALL_64
>    - 96.70% entry_SYSCALL_64
>       - do_syscall_64
>          - 94.97% ksys_read
>             - 94.80% vfs_read
>                - 94.58% proc_reg_read
>                   - seq_read
>                      - 87.95% proc_pid_status
>                         + 13.10% seq_put_decimal_ull_width
>                         - 11.69% task_mem
>                            + 9.48% seq_put_decimal_ull_width
>                         + 10.63% seq_printf
>                         - 10.35% cpuset_task_status_allowed
>                            + seq_printf
>                         - 9.84% render_sigset_t
>                              1.61% seq_putc
>                            + 1.61% seq_puts
>                         + 4.99% proc_task_name
>                         + 4.11% seq_puts
>                         - 3.76% render_cap_t
>                              2.38% seq_put_hex_ll
>                            + 1.25% seq_puts
>                           2.64% __task_pid_nr_ns
>                         + 1.54% get_task_mm
>                         + 1.34% __lock_task_sighand
>                         + 0.70% from_kuid_munged
>                           0.61% get_task_cred
>                           0.56% seq_putc
>                           0.52% hugetlb_report_usage
>                           0.52% from_kgid_munged
>                      + 4.30% proc_all_next
>                + 0.82% _copy_to_user
>
> We can see that the kernel spends more than 50% of the time encoding
> binary data into a text format.
>
> Now let's see how fast task_diag is:
>
> $ time ./task_diag_all all -c -q
>
> real    0m0.087s
> user    0m0.001s
> sys     0m0.082s
>
> Maybe we need to resurrect the task_diag series instead of inventing
> another, less effective interface...

I think the netlink message design is the better way to go. As system
sizes continue to increase (> 100 CPUs is common now), you need to be
able to pass the right data to userspace as fast as possible to keep up
with what can be a very dynamic userspace and set of processes. When
you first proposed this idea I was working on systems with >= 1k CPUs,
and the netlink option was able to keep up with a 'make -j N' on those
systems.
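As a rough illustration of why the dump model scales: the userspace
side of a netlink dump is one request followed by batched binary
replies. This is only a sketch under assumptions, not the task_diag
ABI (which never merged); the message type is a placeholder and the
record parsing is left out:

    #include <linux/netlink.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* NETLINK_GENERIC is a stand-in; a real task_diag client
         * would resolve its generic-netlink family id first. */
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (fd < 0)
            return 1;

        struct {
            struct nlmsghdr nlh;
            /* family header + attributes would follow here */
        } req = {
            .nlh = {
                .nlmsg_len   = NLMSG_LENGTH(0),
                .nlmsg_type  = 0,    /* placeholder message type */
                .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
            },
        };

        /* One request covers all tasks... */
        if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0)
            return 1;

        char buf[16384];
        int len;

        /* ...and each recv() returns many packed binary records. */
        while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
            struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

            for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
                if (nlh->nlmsg_type == NLMSG_DONE ||
                    nlh->nlmsg_type == NLMSG_ERROR)
                    goto out;
                /* the binary task record is at NLMSG_DATA(nlh) */
            }
        }
    out:
        close(fd);
        return 0;
    }

No per-task file opens, no text formatting in the kernel, and the
caller only pays for the attribute groups it asked for.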
perf record walking /proc would never finish initializing; I had to add
a "done initializing" message just to know when to start a test. With
the task_diag approach, perf could collect the data in short order and
move on to recording data.
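For contrast, the /proc walk that perf (and ps/top) has to do today
looks roughly like the loop below: three syscalls per task for a single
file, plus the text parsing afterwards. This is only an illustrative
sketch, not the code perf or task_proc_all actually use:

    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        char path[288], buf[4096];
        int tasks = 0;

        if (!proc)
            return 1;

        while ((de = readdir(proc)) != NULL) {
            /* only the numeric entries are PIDs */
            if (de->d_name[0] < '0' || de->d_name[0] > '9')
                continue;

            snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);

            /* open(), read(), close() for every task, for one file */
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                continue;
            while (read(fd, buf, sizeof(buf)) > 0)
                ;    /* and the text still has to be parsed */
            close(fd);
            tasks++;
        }

        closedir(proc);
        printf("tasks: %d\n", tasks);
        return 0;
    }

Multiply that by every file a tool reads per task and by 100k tasks,
and the fixed per-file syscall and formatting cost is exactly where the
time in the profile above goes.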