On Mon, 2022-08-01 at 20:30 -0700, Andrii Nakryiko wrote:
> On Mon, Aug 1, 2022 at 4:27 PM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
> >
> > Allow creating an iterator that loops through the resources of one
> > task/thread.
> >
> > People could only create iterators that loop through all the file,
> > vma, and task resources in the system, even if they were interested
> > in only the resources of a specific task or process. By passing the
> > additional parameters, people can now create an iterator that goes
> > through all resources or only the resources of one task.
> >
> > Signed-off-by: Kui-Feng Lee <kuifeng@xxxxxx>
> > ---
> >  include/linux/bpf.h            |  4 ++
> >  include/uapi/linux/bpf.h       | 23 +++++++++
> >  kernel/bpf/task_iter.c         | 93 ++++++++++++++++++++++++++----------
> >  tools/include/uapi/linux/bpf.h | 23 +++++++++
> >  4 files changed, 121 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 11950029284f..3c26dbfc9cef 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1718,6 +1718,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >
> >  struct bpf_iter_aux_info {
> >          struct bpf_map *map;
> > +        struct {
> > +                u32 tid;
> > +                u8 type;
> > +        } task;
> >  };
> >
> >  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index ffcbf79a556b..ed5ba501609f 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -87,10 +87,33 @@ struct bpf_cgroup_storage_key {
> >          __u32   attach_type;    /* program attach type (enum bpf_attach_type) */
> >  };
> >
> > +enum bpf_task_iter_type {
> > +        BPF_TASK_ITER_ALL = 0,
> > +        BPF_TASK_ITER_TID,
> > +};
> > +
> >  union bpf_iter_link_info {
> >          struct {
> >                  __u32   map_fd;
> >          } map;
> > +        /*
> > +         * Parameters of task iterators.
> > +         */
> > +        struct {
> > +                __u32   pid_fd;
>
> I was a bit late to the discussion about pidfd vs plain pid. I think
> we should support both in this API. While pidfd has some nice
> guarantees, like avoiding the risk of accidental PID reuse, in a lot
> of (if not all) the cases where task/task_vma/task_file iterators are
> going to be used this is never a risk, because the pid will usually
> come from some tracing BPF program (kprobe/tp/fentry/etc), as in the
> case of profiling, and will then be used by user space almost
> immediately to query some additional information (fetching the
> relevant vma information for the profiling use case). So the main
> benefit of pidfd is not that relevant for BPF tracing use cases,
> because PIDs are not going to be reused that fast within such a short
> time frame.
>
> But pidfd does have downsides. It requires 2 syscalls (pidfd_open and
> close) for each PID, and it creates a struct file for each such
> active pidfd. So it will have non-trivial overhead for high-frequency
> BPF iterator use cases (imagine querying some simple stats for a big
> set of tasks, frequently: you'll spend more time in pidfd syscalls
> and more resources just keeping the corresponding struct files open
> than actually doing useful BPF work). For simple BPF iter cases it
> will unnecessarily complicate the program flow while giving no
> benefit in return.

That's a good point about the cost of the extra syscalls.
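To make the difference concrete, here is a rough sketch of the two
user-space flows (a sketch only: it assumes this patch's
bpf_iter_link_info.task layout is in linux/bpf.h, and the
BPF_TASK_ITER_PIDFD value suggested above does not exist yet, so it is
defined locally as a placeholder):

#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/bpf.h>

/* Hypothetical: suggested in review, not part of this patch. */
#define BPF_TASK_ITER_PIDFD 2

/* Fill in the task-iterator parameters for one target task. Returns
 * the pidfd the caller must close() after attaching (0 for the TID
 * flow), or -1 on failure.
 */
static int task_iter_set_target(union bpf_iter_link_info *linfo,
                                pid_t pid, bool use_pidfd)
{
        if (use_pidfd) {
                /* pidfd flow: one extra syscall here, a close() later,
                 * and a struct file pinned in the kernel in between.
                 */
                int pidfd = syscall(SYS_pidfd_open, pid, 0);

                if (pidfd < 0)
                        return -1;
                linfo->task.pid_fd = pidfd;
                linfo->task.type = BPF_TASK_ITER_PIDFD;
                return pidfd;
        }
        /* plain TID flow: no extra syscalls, no extra struct file. */
        linfo->task.pid_fd = pid;       /* maybe renamed to "target" */
        linfo->task.type = BPF_TASK_ITER_TID;
        return 0;
}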
> So I propose we support both in the UAPI. Internally we resolve to a
> plain pid/tid either way, so this won't add a maintenance burden.
> Simple cases will stay simple, while more long-lived and/or
> complicated ones will still be supported. We then can have
> BPF_TASK_ITER_PIDFD vs BPF_TASK_ITER_TID to differentiate whether the
> above __u32 pid_fd (which we should probably rename to something more
> generic like "target") is a pid FD or a TID/PID. See also below about
> TID vs PID.
>
> > +                /*
> > +                 * The type of the iterator.
> > +                 *
> > +                 * It can be one of enum bpf_task_iter_type.
> > +                 *
> > +                 * BPF_TASK_ITER_ALL (default)
> > +                 *      The iterator iterates over resources of
> > +                 *      every process.
> > +                 *
> > +                 * BPF_TASK_ITER_TID
> > +                 *      You should also set *pid_fd* to iterate
> > +                 *      over one task.
>
> naming nit: we should decide whether we use TID (thread) and PID
> (process) terminology (more usual for user space), or PID (process ==
> task == user-space thread) and TGID (thread group, i.e. user-space
> process). I haven't investigated much which we use most consistently,
> but I am curious to hear what others think.
>
> Also, I can see use cases where we want to iterate over just the
> specified task (i.e., just the specified thread) vs all the tasks
> that belong to the same process group (i.e., the threads within a
> process). Naming TBD, but we should have BPF_TASK_ITER_TID and
> BPF_TASK_ITER_TGID (or some other naming).

I discussed iterators over the resources of all tasks of a process
with Yonghong. User code should create an iterator for each thread of
the process if necessary. We may add tgid support if it is highly
demanded. In the discussion about using pidfd, people mentioned
extending pidfd to threads if there is a good use case, and that
applies to our case as well. Most of the time, if not always, vma &
files are shared by all threads of a process, so iterating over all
the resources of every thread of a process brings no obvious benefit;
an iterator over the resources of one specific thread, instead of a
whole process, already covers it.

> One might ask why we need a single-task mode if we can always stop
> the iteration from the BPF program, but that is trivial only for
> iter/task, while for iter/task_vma and iter/task_file it becomes
> inconvenient to detect the switch from one task to another. It costs
> us essentially nothing to support this mode, so I advocate doing
> that.
>
> I have similar thoughts about cgroup iteration modes, and actually
> about supporting cgroup_fd as a target for task iterators (which
> would mean iterating over the tasks belonging to the provided
> cgroup(s)), but I'll reply on the cgroup iterator patch first, and we
> can just reuse the same cgroup target specification between
> iter/cgroup and iter/task afterwards.

> > +                 */
> > +                __u8    type;   /* BPF_TASK_ITER_* */
> > +        } task;
> >  };
> >
> > [...]
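For completeness, here is roughly how user space would attach a
task_vma iterator restricted to one thread with the UAPI proposed in
this patch (a sketch only: the my_iter_bpf skeleton and its show_vmas
program are hypothetical placeholders, while bpf_program__attach_iter()
and bpf_iter_create() are existing libbpf APIs):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#include "my_iter.skel.h"       /* hypothetical skeleton */

static int dump_task_vmas(struct my_iter_bpf *skel, pid_t tid)
{
        DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
        union bpf_iter_link_info linfo = {};
        struct bpf_link *link;
        char buf[4096];
        int iter_fd, len;

        /* Parameterize the iterator: walk only one task's VMAs. */
        linfo.task.pid_fd = tid;        /* plain TID in this version */
        linfo.task.type = BPF_TASK_ITER_TID;
        opts.link_info = &linfo;
        opts.link_info_len = sizeof(linfo);

        link = bpf_program__attach_iter(skel->progs.show_vmas, &opts);
        if (!link)
                return -errno;

        iter_fd = bpf_iter_create(bpf_link__fd(link));
        if (iter_fd < 0) {
                bpf_link__destroy(link);
                return iter_fd;
        }

        /* Each read() runs the iterator program over the task's VMAs
         * and returns whatever it emitted via the seq_file.
         */
        while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
                fwrite(buf, 1, len, stdout);

        close(iter_fd);
        bpf_link__destroy(link);
        return 0;
}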