On Thu, Aug 27, 2020 at 11:09 AM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 8/26/20 10:07 PM, Andrii Nakryiko wrote:
> > On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >> Currently, the task and task_file iterators traverse all tasks by
> >> default. For task_file, by default, all files from all tasks will
> >> be traversed.
> >>
> >> But for a user process, the file table is shared by all threads
> >> of that process. So traversing only the main thread of each
> >> process is enough to reach all files, and this can save a lot of
> >> cpu time if a process has a large number of threads and each
> >> thread has lots of open files.
> >>
> >> This patch implements a customization for the task/task_file
> >> iterators, permitting traversal of only those kernel tasks whose
> >> pid equals their tgid. This covers some kernel threads and the
> >> main threads of user processes, and solves the potential
> >> performance issue above for task_file. The customization may be
> >> useful for the task iterator too, if traversing only main threads
> >> is enough.
> >>
> >> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> >> ---
> >>  include/linux/bpf.h            |  3 ++-
> >>  include/uapi/linux/bpf.h       |  5 ++++
> >>  kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
> >>  tools/include/uapi/linux/bpf.h |  5 ++++
> >>  4 files changed, 43 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index a6131d95e31e..058eb9b0ba78 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >>  int __init bpf_iter_ ## target(args) { return 0; }
> >>
> >>  struct bpf_iter_aux_info {
> >> -       struct bpf_map *map;
> >> +       struct bpf_map *map; /* for iterator traversing map elements */
> >> +       bool main_thread_only; /* for task/task_file iterator */
> >
> > As a user of the task_file iterator I'd hate to make this decision,
> > honestly, especially if I can't prove that all processes share the
> > same file table (I think the clone() syscall allows weird
> > combinations like that, right?). It does make sense for a task/
>
> Right. The clone() syscall permits different kinds of sharing;
> sharing the address space and sharing files are separate. It is
> possible for the address space to be shared while files are not.
> That is why I want to add flags for task_file: if users know they
> have no cases where the address space is shared but files are not,
> they can use main_thread_only to improve performance.

The problem with such options is that for a lot of users it won't be
clear at all when and if those options can be used. E.g., when I
imagine myself building some generic tool utilizing the task_file
bpf_iter that is supposed to run on any server, how do I know whether
it's safe to specify "main_thread_only"? I can't guarantee that. So
I'd be forced to either risk it or fall back to the default and
unnecessarily slow behavior. This is different for the task/
bpf_iter, though, so I support that for task/ only. But you've
already done that split, so thank you! :)
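Just to illustrate the combination I mean, here is a minimal
(untested, hypothetical) userspace sketch: raw clone() treats
CLONE_VM and CLONE_FILES independently, so a thread can share the
address space while keeping a private file table. pthread_create()
happens to pass both flags, but nothing requires that:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

static int thread_fn(void *arg)
{
        /* This fd lives in a file table the group leader cannot
         * see, so visiting only the group leader would miss it.
         */
        int fd = open("/dev/null", O_RDONLY);

        sleep(2);
        if (fd >= 0)
                close(fd);
        return 0;
}

int main(void)
{
        char *stack = malloc(64 * 1024);

        if (!stack)
                return 1;
        /* Same thread group, shared address space, but no
         * CLONE_FILES, so the new thread gets a private fd table.
         */
        if (clone(thread_fn, stack + 64 * 1024,
                  CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL) < 0)
                return 1;
        sleep(1);       /* crude synchronization, enough for a sketch */
        return 0;
}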
>
> > iterator, though, if I need to iterate a user-space process (group
> > of tasks). So can we do this:
> >
> > 1a. Either add a new bpf_iter type process/ (or in kernel lingo
> > task_group/) to iterate only main threads (group_leader in kernel
> > lingo);
> > 1b. Or make this main_thread_only an option for only a task/
> > iterator (and maybe call it "group_leader_only" or something to
> > reflect kernel terminology?)
>
> The following is the kernel pid_type definition:
>
> enum pid_type
> {
>         PIDTYPE_PID,
>         PIDTYPE_TGID,
>         PIDTYPE_PGID,
>         PIDTYPE_SID,
>         PIDTYPE_MAX,
> };
>
> Right now, the task iterator traverses following PIDTYPE_PID, i.e.,
> every task in the system.
>
> To iterate through main threads only, we need to traverse following
> PIDTYPE_TGID.
>
> In the future, it is possible people will want to traverse following
> PIDTYPE_PGID (process group) or PIDTYPE_SID (session).
>
> Or we might have other customizations, e.g., cgroup_id, which could
> be filtered in the bpf program, but in-kernel filtering can
> definitely improve performance.
>
> So I prefer 1b.

Sounds good, but let's use a proper enum, not a set of bit fields. We
can support all 4 from the get-go. There is no need to wait for the
actual use cases to appear, as the iteration semantics are pretty
clear.

>
> Yes, naming is hard.
> The name "main_thread_only" is mostly from a userspace perspective.
> "group_leader_only" might not be good, as it could be confused with
> a possible future process-group option.
>
> >
> > 2. For the task/file iterator, still iterate all tasks, but if a
> > task's files struct is the same as its group_leader's files struct,
> > then go to the next one. This should eliminate tons of duplication
> > from iterating the same files over and over. It would still iterate
> > a bunch of tasks, but compared to the number of files that's
> > generally a much smaller number, so it should still give sizable
> > savings. I don't think we need an extra option for this, tbh; this
> > behavior was my intuitive expectation, so I don't think you'll be
> > breaking any sane user of this iterator.
>
> What you suggested makes sense. For the task_file iterator, we only
> promise to visit all files from all tasks. We can do the necessary
> filtering to remove duplicates in the kernel without breaking that
> promise. I will remove the customization from the task_file iterator.

Thanks for doing that!

>
> >
> > Disclaimer: I haven't had a chance to look through the kernel code
> > much, so I'm sorry if what I'm proposing is impossible or extremely
> > hard/unreliable to implement. But I thought I'd put this idea out
> > there before we decide on this.
> >
> > WDYT?
> >
> > [...]
> >
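P.S. Re: 2. above, the skip I have in mind is roughly the following
(a hand-wavy, untested sketch against kernel/bpf/task_iter.c; the
helper name is made up, and locking and zombie-leader edge cases are
elided):

/* A non-leader task whose files pointer matches its group leader's
 * has nothing new to show: the shared file table is (or will be)
 * visited via the leader itself.
 */
static bool files_seen_via_leader(struct task_struct *task)
{
        return task != task->group_leader &&
               task->files == task->group_leader->files;
}

/* ... and then wherever the task_file iterator advances to the next
 * task, something like:
 *
 *         if (files_seen_via_leader(task))
 *                 continue;
 */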