Re: [PATCH bpf-next v6 1/4] bpf: Parameterize task iterators.

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 25 Aug 2022 14:13:47 -0700

On Wed, Aug 24, 2022 at 5:16 PM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
>
> On Wed, 2022-08-24 at 15:20 -0700, Andrii Nakryiko wrote:
> > On Fri, Aug 19, 2022 at 3:09 PM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
> > >
> > > Allow creating an iterator that loops through resources of one
> > > task/thread.
> > >
> > > People could only create iterators to loop through all resources of
> > > files, vma, and tasks in the system, even though they were
> > > interested
> > > in only the resources of a specific task or process.  Passing the
> > > additional parameters, people can now create an iterator to go
> > > through all resources or only the resources of a task.
> > >
> > > Signed-off-by: Kui-Feng Lee <kuifeng@xxxxxx>
> > > ---
> > >  include/linux/bpf.h            |  25 +++++++
> > >  include/uapi/linux/bpf.h       |   6 ++
> > >  kernel/bpf/task_iter.c         | 116 ++++++++++++++++++++++++++---
> > > ----
> > >  tools/include/uapi/linux/bpf.h |   6 ++
> > >  4 files changed, 129 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 39bd36359c1e..59712dd917d8 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1729,8 +1729,33 @@ int bpf_obj_get_user(const char __user
> > > *pathname, int flags);
> > >         extern int bpf_iter_ ## target(args);                   \
> > >         int __init bpf_iter_ ## target(args) { return 0; }
> > >
> > > +/*
> > > + * The task type of iterators.
> > > + *
> > > + * For BPF task iterators, they can be parameterized with various
> > > + * parameters to visit only some of tasks.
> > > + *
> > > + * BPF_TASK_ITER_ALL (default)
> > > + *     Iterate over resources of every task.
> > > + *
> > > + * BPF_TASK_ITER_TID
> > > + *     Iterate over resources of a task/tid.
> > > + *
> > > + * BPF_TASK_ITER_TGID
> > > + *     Iterate over reosurces of evevry task of a process / task
> > > group.
> >
> > typos: resources, every
> >
> > > + */
> > > +enum bpf_iter_task_type {
> > > +       BPF_TASK_ITER_ALL = 0,
> > > +       BPF_TASK_ITER_TID,
> > > +       BPF_TASK_ITER_TGID,
> > > +};
> > > +
> >
> > [...]
> >
> > >         rcu_read_lock();
> > >  retry:
> > > -       pid = find_ge_pid(*tid, ns);
> > > +       pid = find_ge_pid(*tid, common->ns);
> > >         if (pid) {
> > > -               *tid = pid_nr_ns(pid, ns);
> > > +               *tid = pid_nr_ns(pid, common->ns);
> > >                 task = get_pid_task(pid, PIDTYPE_PID);
> > >                 if (!task) {
> > >                         ++*tid;
> > >                         goto retry;
> > > -               } else if (skip_if_dup_files &&
> > > !thread_group_leader(task) &&
> > > -                          task->files == task->group_leader-
> > > >files) {
> > > +               } else if ((skip_if_dup_files &&
> > > !thread_group_leader(task) &&
> > > +                           task->files == task->group_leader-
> > > >files) ||
> > > +                          (common->type == BPF_TASK_ITER_TGID &&
> > > +                           __task_pid_nr_ns(task, PIDTYPE_TGID,
> > > common->ns) != common->pid)) {
> >
> > it gets super hard to follow this logic, would a simple helper
> > function to calculate this condition (and maybe some comments to
> > explain the logic behind these checks?) make it a bit more readable?
>
> !matched_task(task, common, skip_if_dup_file)?
>
> bool matched_task(struct task_struct *task,
>                   struct bpf_iter_seq_task_common *common,
>                   bool skip_if_dup_file) {
>         /* Should not have the same 'files' if skip_if_dup_file is true
> */
>         bool diff_files_if =
>                 !skip_if_dup_file ||
>                 (thread_group_leader(task) &&
>                 task->file != task->gorup_leader->fies);
>         /* Should have the given tgid if the type is BPF_TASK_ITER_TGI
> */
>         bool have_tgid_if =
>                 common->type != BPF_TASK_ITER_TGID ||
>                 __task_pid_nr_ns(task, PIDTYPE_TGID,
>                 common->ns) == common->pid;
>         return diff_files_if && have_tgid_if;
> }
>
> How about this?
>

Hm... "matched_task" doesn't mean much, tbh, so not really. I wanted
to suggest having a separate helper just for your TGID check and call
it something more meaningful like "task_belongs_to_tgid". Can't come
up with a good name for existing dup_file check, so I'd probably keep
it as is. But also seems like there is same_thread_group() helper in
include/linux/sched/signal.h, so let's look if we can use it, it seems
like it's just comparing signal pointers (probably quite faster than
what you are doing right now).

But looking at this some more made me realize that even if we specify
pid (tgid in kernel terms) we are still going to iterate through all
the tasks, essentially. Is that right? So TGID mode isn't great for
speeding up, we should point out to users that if they want to iterate
files of the process, they probably want to use TID mode and set tid
to pid to use the early termination condition in TID.

It wasn't obvious to me until I re-read this patch like 3 times and
wrote three different replies here :)

But then I also went looking at what procfs doing for
/proc/<pid/task/* dirs. It does seem like there are faster ways to
iterate all threads of a process. See next_tid() which uses
next_thread(), etc. Can you please check those and see if we can have
faster in-process iteration?

> >
> > >                         put_task_struct(task);
> > >                         task = NULL;
> > >                         ++*tid;
> > > @@ -56,7 +73,7 @@ static void *task_seq_start(struct seq_file *seq,
> > > loff_t *pos)
> > >         struct bpf_iter_seq_task_info *info = seq->private;
> > >         struct task_struct *task;
> > >
> > > -       task = task_seq_get_next(info->common.ns, &info->tid,
> > > false);
> > > +       task = task_seq_get_next(&info->common, &info->tid, false);
> > >         if (!task)
> > >                 return NULL;
> > >

[...]