Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes

Andrei Vagin <avagin@xxxxxxxxx> · Wed, 12 Aug 2020 00:51:35 -0700

On Tue, Aug 11, 2020 at 12:58:47AM +1000, Eugene Lubarsky wrote:
> This is an idea for substantially reducing the number of syscalls needed
> by monitoring tools whilst mostly re-using the existing API.
> 
> The proposed files in this proof-of-concept patch set are:
> 
> * /proc/all/stat
>       A stat line for each process in the existing format.
> 
> * /proc/all/statm
>       statm lines but starting with a PID column.
> 
> * /proc/all/status
>       status info for all processes in the existing format.
> 
> * /proc/all/io
>       The existing /proc/pid/io data but formatted as a single line for
>       each process, similarly to stat/statm, with a PID column added.
> 
> * /proc/all/statx
>       Gathers info from stat, statm and io; the purpose is actually
>       not so much to reduce syscalls but to help userspace be more
>       efficient by not having to store data in e.g. hashtables in order
>       to gather it from separate /proc/all/ files.
> 
>       The format proposed here starts with the unchanged stat line
>       and begins the other info with a few characters, repeating for
>       each process:
> 
>       ...
>       25 (cat) R 1 1 0 0 -1 4194304 185 0 16 0 2 0 0 0 20 ...
>       m 662 188 167 5 0 112 0
>       io 4292 0 12 0 0 0 0
>       ...
> 
> 
> There has been a proposal with some overlapping goals: /proc/task-diag
> (https://github.com/avagin/linux-task-diag), but I'm not sure about
> its current status.
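
For reference, the statx layout quoted above could be consumed in
userspace roughly as follows. This is only a sketch based on the
three-line example in the cover letter: the name parse_statx and the
returned dict layout are made up for illustration, and it assumes each
record is a stat line (starting with the PID) followed by optional "m"
and "io" lines.

```python
def parse_statx(text):
    """Group a /proc/all/statx-style dump into one dict per process.

    Hypothetical helper: the record layout is inferred from the example
    in the cover letter, not from the patch itself.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line[0].isdigit():
            # Stat line. comm may contain spaces or parentheses, so split
            # on the first "(" and the last ")".
            lparen = line.index("(")
            rparen = line.rindex(")")
            records.append({
                "pid": int(line[:lparen]),
                "comm": line[lparen + 1:rparen],
                "stat": line[rparen + 1:].split(),  # kept as strings
                "statm": None,
                "io": None,
            })
        elif line.startswith("m ") and records:
            records[-1]["statm"] = [int(x) for x in line.split()[1:]]
        elif line.startswith("io ") and records:
            records[-1]["io"] = [int(x) for x in line.split()[1:]]
        # Lines with any other prefix are ignored.
    return records
```

Because every "m" and "io" line attaches to the most recent stat line,
the tool never needs a PID-keyed hashtable to join the three sources,
which is the point the cover letter makes about statx.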

I rebased the task_diag patches on top of v5.8:
https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag

/proc/pid files have three major limitations:
* They require at least three syscalls per process per file:
  open(), read(), close().
* They come in a variety of formats, mostly text-based.
  The kernel spends time encoding binary data into text, and then
  tools like top and ps spend time decoding it back into binary.
* They are sometimes slow due to extra attributes.
  For example, /proc/PID/smaps contains a lot of useful information
  about memory mappings and the memory consumption of each of them. But
  even if we don't need the memory consumption fields, the kernel still
  spends time collecting this information.
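
To make the first two points concrete, here is a minimal userspace
sketch of what a monitoring tool has to do with the per-PID files:
three syscalls per process per file, then parsing text back into
integers. It is not taken from the task_diag tree; the name
read_all_status and the proc_root parameter are made up for
illustration, and VmRSS is just an example of a numeric field.

```python
import os

def read_all_status(proc_root="/proc", field="VmRSS"):
    """Return ({pid: value}, syscalls) for one numeric status field.

    Hypothetical helper. `syscalls` counts only the open/read/close
    calls issued for status files that could be opened.
    """
    values = {}
    syscalls = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():              # only numeric entries are PIDs
            continue
        path = os.path.join(proc_root, name, "status")
        try:
            fd = os.open(path, os.O_RDONLY)     # syscall 1: open()
        except OSError:
            continue                            # process already exited
        try:
            data = os.read(fd, 65536)           # syscall 2: read()
        except OSError:
            data = b""
        finally:
            os.close(fd)                        # syscall 3: close()
        syscalls += 3
        for line in data.decode(errors="replace").splitlines():
            key, _, rest = line.partition(":")
            if key == field and rest.split():
                # The kernel printed a number as text; decode it again.
                values[int(name)] = int(rest.split()[0])
                break
    return values, syscalls
```

Calling read_all_status() on a live system returns a syscall count of
three times the number of status files opened, and that cost is paid
again for every additional per-PID file the tool wants to read.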

More details and numbers are in this article:
https://avagin.github.io/how-fast-is-procfs

This new interface removes only one of these limitations (the
per-process syscall overhead), whereas task_diag removes all three.

I also compared how fast each of these interfaces is:

The test environment:
CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
RAM: 16GB
kernel: v5.8 with task_diag and /proc/all patches.
100K processes:
$ ps ax | wc -l
10228

$ time cat /proc/all/status > /dev/null

real	0m0.577s
user	0m0.017s
sys	0m0.559s

task_proc_all is used to read /proc/pid/status for all tasks:
https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c

$ time ./task_proc_all status
tasks: 100230

real	0m0.924s
user	0m0.054s
sys	0m0.858s

Reading /proc/all/status takes about 40% less time than reading
/proc/*/status (0.577s vs. 0.924s).

Now let's take a look at the perf output:

$ time perf record -g cat /proc/all/status > /dev/null
$ perf report
-   98.08%     1.38%  cat      [kernel.vmlinux]  [k] entry_SYSCALL_64
   - 96.70% entry_SYSCALL_64
      - do_syscall_64
         - 94.97% ksys_read
            - 94.80% vfs_read
               - 94.58% proc_reg_read
                  - seq_read
                     - 87.95% proc_pid_status
                        + 13.10% seq_put_decimal_ull_width
                        - 11.69% task_mem
                           + 9.48% seq_put_decimal_ull_width
                        + 10.63% seq_printf
                        - 10.35% cpuset_task_status_allowed
                           + seq_printf
                        - 9.84% render_sigset_t
                             1.61% seq_putc
                           + 1.61% seq_puts
                        + 4.99% proc_task_name
                        + 4.11% seq_puts
                        - 3.76% render_cap_t
                             2.38% seq_put_hex_ll
                           + 1.25% seq_puts
                          2.64% __task_pid_nr_ns
                        + 1.54% get_task_mm
                        + 1.34% __lock_task_sighand
                        + 0.70% from_kuid_munged
                          0.61% get_task_cred
                          0.56% seq_putc
                          0.52% hugetlb_report_usage
                          0.52% from_kgid_munged
                     + 4.30% proc_all_next
                     + 0.82% _copy_to_user 

We can see that the kernel spent more than 50% of the time encoding binary
data into a text format.
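
A rough tally of the perf entries above that look like text formatting
supports that figure. This is only back-of-the-envelope arithmetic:
which entries count as "formatting" is a judgment call, and the
percentages are assumed to be shares of all samples, as perf report
prints by default.

```python
# Percentages copied from the perf report above. Which entries count as
# "text formatting" is an assumption, read off the call graph.
formatting = {
    "seq_put_decimal_ull_width": 13.10,
    "task_mem -> seq_put_decimal_ull_width": 9.48,
    "seq_printf": 10.63,
    "cpuset_task_status_allowed": 10.35,  # only shown child is seq_printf
    "render_sigset_t": 9.84,
    "seq_puts": 4.11,
    "render_cap_t": 3.76,
    "seq_putc": 0.56,
}
proc_pid_status = 87.95  # total share attributed to proc_pid_status

total = sum(formatting.values())
share = total / proc_pid_status
print(f"formatting: {total:.2f}% of samples, {share:.0%} of proc_pid_status")
```

With this selection the formatting helpers add up to roughly 62% of all
samples, which is about 70% of the time spent in proc_pid_status.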

Now let's see how fast task_diag is:

$ time ./task_diag_all all -c -q

real	0m0.087s
user	0m0.001s
sys	0m0.082s
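
Putting the three wall-clock numbers from this mail side by side gives
the ratios below. This is a throwaway calculation, assuming all three
runs covered the same set of roughly 100K tasks; the variable and
function names are made up, and the task count is the one printed by
task_proc_all.

```python
# Wall-clock ("real") times in seconds, copied from the runs above.
timings = {
    "/proc/*/status": 0.924,
    "/proc/all/status": 0.577,
    "task_diag": 0.087,
}
tasks = 100230  # "tasks:" line printed by task_proc_all

def speedup(slow, fast):
    """How many times less wall-clock time `fast` took than `slow`."""
    return timings[slow] / timings[fast]

def usec_per_task(name):
    """Average wall-clock cost per task, in microseconds."""
    return timings[name] / tasks * 1e6

for name in timings:
    print(f"{name:18s} {timings[name]:.3f}s  {usec_per_task(name):5.2f} us/task")
```

By this arithmetic /proc/all/status is roughly 1.6x faster than
/proc/*/status, while task_diag is roughly 6.6x faster than
/proc/all/status and roughly 10.6x faster than /proc/*/status.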

Maybe we need to resurrect the task_diag series instead of inventing
another, less effective interface...

Thanks,
Andrei

> 
> 
> 
> Best Wishes,
> 
> Eugene
> 
> 
> Eugene Lubarsky (5):
>   fs/proc: Introduce /proc/all/stat
>   fs/proc: Introduce /proc/all/statm
>   fs/proc: Introduce /proc/all/status
>   fs/proc: Introduce /proc/all/io
>   fs/proc: Introduce /proc/all/statx
> 
>  fs/proc/base.c     | 215 +++++++++++++++++++++++++++++++++++++++++++--
>  fs/proc/internal.h |   1 +
>  fs/proc/root.c     |   1 +
>  3 files changed, 210 insertions(+), 7 deletions(-)
> 
> -- 
> 2.25.1
>