On Tue, 2022-08-02 at 14:17 -0700, Andrii Nakryiko wrote:
> On Tue, Aug 2, 2022 at 9:42 AM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
> > 
> > On Mon, 2022-08-01 at 20:30 -0700, Andrii Nakryiko wrote:
> > > On Mon, Aug 1, 2022 at 4:27 PM Kui-Feng Lee <kuifeng@xxxxxx> wrote:
> > > > 
> > > > Allow creating an iterator that loops through resources of one
> > > > task/thread.
> > > > 
> > > > People could only create iterators to loop through all resources of
> > > > files, vma, and tasks in the system, even though they were interested
> > > > in only the resources of a specific task or process. Passing the
> > > > additional parameters, people can now create an iterator to go
> > > > through all resources or only the resources of a task.
> > > > 
> > > > Signed-off-by: Kui-Feng Lee <kuifeng@xxxxxx>
> > > > ---
> > > >  include/linux/bpf.h            |  4 ++
> > > >  include/uapi/linux/bpf.h       | 23 +++++++++
> > > >  kernel/bpf/task_iter.c         | 93 ++++++++++++++++++++++++++--------
> > > >  tools/include/uapi/linux/bpf.h | 23 +++++++++
> > > >  4 files changed, 121 insertions(+), 22 deletions(-)
> > > > 
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 11950029284f..3c26dbfc9cef 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1718,6 +1718,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> > > > 
> > > >  struct bpf_iter_aux_info {
> > > >         struct bpf_map *map;
> > > > +       struct {
> > > > +               u32 tid;
> > > > +               u8 type;
> > > > +       } task;
> > > >  };
> > > > 
> > > >  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index ffcbf79a556b..ed5ba501609f 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -87,10 +87,33 @@ struct bpf_cgroup_storage_key {
> > > >         __u32   attach_type;            /* program attach type (enum bpf_attach_type) */
> > > >  };
> > > > 
> > > > +enum bpf_task_iter_type {
> > > > +       BPF_TASK_ITER_ALL = 0,
> > > > +       BPF_TASK_ITER_TID,
> > > > +};
> > > > +
> > > >  union bpf_iter_link_info {
> > > >         struct {
> > > >                 __u32   map_fd;
> > > >         } map;
> > > > +       /*
> > > > +        * Parameters of task iterators.
> > > > +        */
> > > > +       struct {
> > > > +               __u32   pid_fd;
> > > 
> > > I was a bit late to the discussion about pidfd vs plain pid. I think
> > > we should support both in this API. While pid_fd has some nice
> > > guarantees like avoiding the risk of accidental PID reuse, in a lot
> > > (if not all) cases where task/task_vma/task_file iterators are going
> > > to be used this is never a risk, because pid will usually come from
> > > some tracing BPF program (kprobe/tp/fentry/etc), like in case of
> > > profiling, and then will be used by user-space almost immediately to
> > > query some additional information (fetching relevant vma information
> > > for profiling use case). So main benefit of pidfd is not that relevant
> > > for BPF tracing use cases, because PIDs are not going to be reused so
> > > fast within such a short time frame.
> > > 
> > > But pidfd does have downsides. It requires 2 syscalls (pidfd_open and
> > > close) for each PID, it creates struct file for each such active
> > > pidfd. So it will have non-trivial overhead for high-frequency BPF
> > > iterator use cases (imagine querying some simple stats for a big set
> > > of tasks, frequently: you'll spend more time in pidfd syscalls and
> > > more resources just keeping corresponding struct file open than
> > > actually doing useful BPF work). For simple BPF iter cases it will
> > > unnecessarily complicate program flow while giving no benefit instead.
> > 
> > It is a good point to have more syscalls.
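
To make that cost concrete: with a pidfd-only interface every sampled
task pays an extra pidfd_open()/close() pair on top of the iterator
syscalls themselves. A rough user-space sketch of the per-task flow
(error handling trimmed, names purely illustrative):

#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Per-task overhead of a pidfd-only API: one pidfd_open() and one
 * close() around the actual iterator work.
 */
static int sample_one_task(pid_t pid)
{
	int pidfd = syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;
	/* ... create and read the task iterator for this pidfd ... */
	close(pidfd);
	return 0;
}
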
> > > So I propose we support both in UAPI. Internally either way we resolve
> > > to plain pid/tid, so this won't cause added maintenance burden. But
> > > simple cases will keep simple, while more long-lived and/or
> > > complicated ones will still be supported. We then can have
> > > BPF_TASK_ITER_PIDFD vs BPF_TASK_ITER_TID to differentiate whether the
> > > above __u32 pid_fd (which we should probably rename to something more
> > > generic like "target") is pid FD or TID/PID. See also below about TID
> > > vs PID.
> > > 
> > > > +       /*
> > > > +        * The type of the iterator.
> > > > +        *
> > > > +        * It can be one of enum bpf_task_iter_type.
> > > > +        *
> > > > +        * BPF_TASK_ITER_ALL (default)
> > > > +        *      The iterator iterates over resources of every process.
> > > > +        *
> > > > +        * BPF_TASK_ITER_TID
> > > > +        *      You should also set *pid_fd* to iterate over one task.
> > > 
> > > naming nit: we should decide whether we use TID (thread) and PID
> > > (process) terminology (more usual for user-space) or PID (process ==
> > > task == user-space thread) and TGID (thread group, i.e. user-space
> > > process). I haven't investigated much what we use most consistently,
> > > but curious to hear what others think.
> > > 
> > > Also I can see use-cases where we want to iterate just specified task
> > > (i.e., just specified thread) vs all the tasks that belong to the same
> > > process group (i.e., thread within process). Naming TBD, but we should
> > > have BPF_TASK_ITER_TID and BPF_TASK_ITER_TGID (or some other naming).
> > 
> > I discussed with Yonghong about iterators over resources of all tasks
> > of a process. User code should create iterators for each thread of the
> > process if necessary. We may add the support of tgid if it is highly
> > demanded.
> > 
> > In a discussion of using pidfd, people mentioned to extend pidfd to
> > threads if there is a good use-case. It also applies to our case.
> > Most of the time, if not always, vma & files are shared by all threads
> > of a process. So, an iteration over all resources of every thread of
> > a process doesn't get obvious benefit. It is also true for an iterator
> > over the resources of a specific thread instead of a process.
> 
> Ok, so two different points here.
> 
> First, TID (thread) vs TGID (process) modes. I'd define TGID mode as:
> a) user specifies some TID and we resolve that to thread group leader
> TID (that is we resolve thread to process), and then iterate all
> threads within that process. For TID (thread) mode, we accept
> specified TID as exactly the thread we iterate (even if it's thread
> group leader, we iterate only that specific thread, not all threads in
> a process).
> 
> Second, about the point that all threads within a process share vma,
> file table, etc. That's true. But you are forgetting about iter/task
> that is iterating just tasks. TGID mode for such use case is very
> useful. For task_vma/task_file we can probably do the same logic we
> have today where if the thread has the same file table or mm_struct as
> thread group leader, we skip such thread when iterating vmas and
> files.

Yes, you are right. An iterator over all tasks of a process is useful.
As in our offline discussion, it is worth supporting pidfd, tid, and
tgid. A pidfd target would work just like a tgid one: the kernel would
simply translate the pidfd into a tgid.
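
To make the translation concrete, the attach path could resolve the
pidfd before the iterator ever runs, roughly along these lines (a
sketch only, not the actual patch; aux->task.tid follows this series'
field naming and the function name is made up):

#include <linux/bpf.h>
#include <linux/err.h>
#include <linux/pid.h>
#include <linux/sched.h>

/* Sketch: resolve a user-supplied pidfd to a plain tgid at link
 * creation time, so the iterator itself only ever sees a tgid.
 */
static int task_iter_resolve_pidfd(struct bpf_iter_aux_info *aux, u32 pidfd)
{
	unsigned int flags;
	struct pid *pid;

	pid = pidfd_get_pid(pidfd, &flags);
	if (IS_ERR(pid))
		return PTR_ERR(pid);

	/* pidfds name thread-group leaders, so this is the tgid. */
	aux->task.tid = pid_nr_ns(pid, task_active_pid_ns(current));
	put_pid(pid);
	return 0;
}
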
> 
> Thoughts?
> 
> > > One might ask why do we need single-task mode if we can always stop
> > > iteration from BPF program, but this is trivial only for iter/task,
> > > while for iter/task_vma and iter/task_file it becomes inconvenient to
> > > detect switch from one task to another. It costs us essentially
> > > nothing to support this mode, so I advocate to do that.
> > > 
> > > I have similar thoughts about cgroup iteration modes and actually
> > > supporting cgroup_fd as target for task iterators (which will mean
> > > iterating tasks belonging to provided cgroup(s)), but I'll reply on
> > > cgroup iterator patch first, and we can just reuse the same cgroup
> > > target specification between iter/cgroup and iter/task afterwards.
> > > 
> > > > +        */
> > > > +               __u8    type;   /* BPF_TASK_ITER_* */
> > > > +       } task;
> > > >  };
> > > 
> > > [...]
> > 
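
For completeness, this is roughly how I picture user space selecting a
single thread once the UAPI above lands, using libbpf's link-creation
opts (a sketch; the task fields and the BPF_TASK_ITER_* values are the
ones proposed in this series and may still be renamed):

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <linux/bpf.h>

/* Sketch: attach an iter/task (or task_vma/task_file) program to a
 * single thread via the proposed task parameters in bpf_iter_link_info.
 */
static int attach_task_iter_for_tid(int prog_fd, __u32 tid)
{
	union bpf_iter_link_info linfo = {};
	LIBBPF_OPTS(bpf_link_create_opts, opts);

	linfo.task.pid_fd = tid;              /* interpreted as a TID ... */
	linfo.task.type = BPF_TASK_ITER_TID;  /* ... because of this type */
	opts.iter_info = &linfo;
	opts.iter_info_len = sizeof(linfo);

	/* The returned link fd is then passed to bpf_iter_create() and
	 * the resulting iterator fd is read() as usual.
	 */
	return bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, &opts);
}

A TGID (or pidfd) mode would use the same slot with a different type
value, so the user-space flow stays identical across the three targets.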