On 8/12/20 1:51 AM, Andrei Vagin wrote:
>
> I rebased the task_diag patches on top of v5.8:
> https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag

Thanks for updating the patches.

>
> /proc/pid files have three major limitations:
> * Requires at least three syscalls per process per file:
>   open(), read(), close()
> * Variety of formats, mostly text based
>   The kernel spends time encoding binary data into a text format, and
>   then tools like top and ps spend time decoding it back into a binary
>   format.
> * Sometimes slow due to extra attributes
>   For example, /proc/PID/smaps contains a lot of useful information
>   about memory mappings and memory consumption for each of them. But
>   even if we don't need the memory consumption fields, the kernel will
>   still spend time collecting this information.

that's what I recall as well.

>
> More details and numbers are in this article:
> https://avagin.github.io/how-fast-is-procfs
>
> The new interface removes only one of these limitations (the
> per-process syscall overhead), while task_diag removes all of them.
>
> And I compared how fast each of these interfaces is.
>
> The test environment:
> CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> RAM: 16GB
> kernel: v5.8 with the task_diag and /proc/all patches
> 100K processes:
> $ ps ax | wc -l
> 10228

100k processes but showing 10k here??

>
> $ time cat /proc/all/status > /dev/null
>
> real    0m0.577s
> user    0m0.017s
> sys     0m0.559s
>
> task_proc_all is used to read /proc/pid/status for all tasks:
> https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c
>
> $ time ./task_proc_all status
> tasks: 100230
>
> real    0m0.924s
> user    0m0.054s
> sys     0m0.858s
>
> /proc/all/status is about 40% faster than /proc/*/status.
>
> Now let's take a look at the perf output:
>
> $ time perf record -g cat /proc/all/status > /dev/null
> $ perf report
> -   98.08%     1.38%  cat  [kernel.vmlinux]  [k] entry_SYSCALL_64
>    - 96.70% entry_SYSCALL_64
>       - do_syscall_64
>          - 94.97% ksys_read
>             - 94.80% vfs_read
>                - 94.58% proc_reg_read
>                   - seq_read
>                      - 87.95% proc_pid_status
>                         + 13.10% seq_put_decimal_ull_width
>                         - 11.69% task_mem
>                            + 9.48% seq_put_decimal_ull_width
>                         + 10.63% seq_printf
>                         - 10.35% cpuset_task_status_allowed
>                            + seq_printf
>                         - 9.84% render_sigset_t
>                              1.61% seq_putc
>                            + 1.61% seq_puts
>                         + 4.99% proc_task_name
>                         + 4.11% seq_puts
>                         - 3.76% render_cap_t
>                              2.38% seq_put_hex_ll
>                            + 1.25% seq_puts
>                           2.64% __task_pid_nr_ns
>                         + 1.54% get_task_mm
>                         + 1.34% __lock_task_sighand
>                         + 0.70% from_kuid_munged
>                           0.61% get_task_cred
>                           0.56% seq_putc
>                           0.52% hugetlb_report_usage
>                           0.52% from_kgid_munged
>                      + 4.30% proc_all_next
>                + 0.82% _copy_to_user
>
> We can see that the kernel spends more than 50% of the time encoding
> binary data into a text format.
>
> Now let's see how fast task_diag is:
>
> $ time ./task_diag_all all -c -q
>
> real    0m0.087s
> user    0m0.001s
> sys     0m0.082s
>
> Maybe we need to resurrect the task_diag series instead of inventing
> another, less effective interface...

I think the netlink message design is the better way to go. As system
sizes continue to increase (> 100 CPUs is common now), you need to be
able to pass the right data to userspace as fast as possible to keep up
with what can be a very dynamic userspace and set of processes. When
you first proposed this idea I was working on systems with >= 1k CPUs,
and the netlink option was able to keep up with a 'make -j N' on those
systems.
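As a rough illustration of why the dump model scales: the userspace
side of a netlink dump is one request followed by batched binary
replies. This is only a sketch under assumptions, not the task_diag
ABI (which never merged); the message type is a placeholder and the
record parsing is left out:

    #include <linux/netlink.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        /* NETLINK_GENERIC is a stand-in; a real task_diag client
         * would resolve its generic-netlink family id first. */
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (fd < 0)
            return 1;

        struct {
            struct nlmsghdr nlh;
            /* family header + attributes would follow here */
        } req = {
            .nlh = {
                .nlmsg_len   = NLMSG_LENGTH(0),
                .nlmsg_type  = 0,    /* placeholder message type */
                .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
            },
        };

        /* One request covers all tasks... */
        if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0)
            return 1;

        char buf[16384];
        int len;

        /* ...and each recv() returns many packed binary records. */
        while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
            struct nlmsghdr *nlh = (struct nlmsghdr *)buf;

            for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
                if (nlh->nlmsg_type == NLMSG_DONE ||
                    nlh->nlmsg_type == NLMSG_ERROR)
                    goto out;
                /* the binary task record is at NLMSG_DATA(nlh) */
            }
        }
    out:
        close(fd);
        return 0;
    }

No per-task file opens, no text formatting in the kernel, and the
caller only pays for the attribute groups it asked for.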
perf record walking /proc would never finish initializing; I had to add
a "done initializing" message just to know when to start a test. With
the task_diag approach, perf could collect the data in short order and
move on to recording data.
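For contrast, the /proc walk that perf (and ps/top) has to do today
looks roughly like the loop below: three syscalls per task for a single
file, plus the text parsing afterwards. This is only an illustrative
sketch, not the code perf or task_proc_all actually use:

    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        char path[288], buf[4096];
        int tasks = 0;

        if (!proc)
            return 1;

        while ((de = readdir(proc)) != NULL) {
            /* only the numeric entries are PIDs */
            if (de->d_name[0] < '0' || de->d_name[0] > '9')
                continue;

            snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);

            /* open(), read(), close() for every task, for one file */
            int fd = open(path, O_RDONLY);
            if (fd < 0)
                continue;
            while (read(fd, buf, sizeof(buf)) > 0)
                ;    /* and the text still has to be parsed */
            close(fd);
            tasks++;
        }

        closedir(proc);
        printf("tasks: %d\n", tasks);
        return 0;
    }

Multiply that by every file a tool reads per task and by 100k tasks,
and the fixed per-file syscall and formatting cost is exactly where the
time in the profile above goes.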