On Thu, Aug 27, 2020 at 11:09 AM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 8/26/20 10:07 PM, Andrii Nakryiko wrote:
> > On Wed, Aug 26, 2020 at 5:07 PM Yonghong Song <yhs@xxxxxx> wrote:
> >>
> >> Currently, the task and task_file iterators traverse all tasks by
> >> default. For task_file, by default, all files from all tasks will
> >> be traversed.
> >>
> >> But for a user process, the file table is shared by all threads
> >> of that process. So traversing only the main thread of each
> >> process is enough to reach all files, and this can save a lot of
> >> cpu time if a process has a large number of threads and each
> >> thread has lots of open files.
> >>
> >> This patch implements a customization for the task/task_file
> >> iterators, permitting traversal of only those kernel tasks whose
> >> pid equals their tgid. This covers some kernel threads and the
> >> main threads of user processes, and solves the potential
> >> performance issue above for task_file. The customization may be
> >> useful for the task iterator too, if traversing only main threads
> >> is enough.
> >>
> >> Signed-off-by: Yonghong Song <yhs@xxxxxx>
> >> ---
> >>  include/linux/bpf.h            |  3 ++-
> >>  include/uapi/linux/bpf.h       |  5 ++++
> >>  kernel/bpf/task_iter.c         | 46 +++++++++++++++++++++++-----------
> >>  tools/include/uapi/linux/bpf.h |  5 ++++
> >>  4 files changed, 43 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> >> index a6131d95e31e..058eb9b0ba78 100644
> >> --- a/include/linux/bpf.h
> >> +++ b/include/linux/bpf.h
> >> @@ -1220,7 +1220,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >>  int __init bpf_iter_ ## target(args) { return 0; }
> >>
> >>  struct bpf_iter_aux_info {
> >> -       struct bpf_map *map;
> >> +       struct bpf_map *map; /* for iterator traversing map elements */
> >> +       bool main_thread_only; /* for task/task_file iterator */
> >
> > As a user of the task_file iterator I'd hate to make this decision,
> > honestly, especially if I can't prove that all processes share the
> > same file table (I think the clone() syscall allows weird
> > combinations like that, right?). It does make sense for a task/
>
> Right. The clone() syscall permits different kinds of sharing;
> sharing the address space and sharing files are separate. It is
> possible for the address space to be shared while files are not.
> That is why I want to add flags for task_file: if users know they
> have no cases where the address space is shared but files are not,
> they can use main_thread_only to improve performance.

The problem with such options is that for a lot of users it won't be
clear at all when and if those options can be used. E.g., when I
imagine myself building some generic tool utilizing the task_file
bpf_iter that is supposed to run on any server, how do I know whether
it's safe to specify "main_thread_only"? I can't guarantee that. So
I'd be forced to either risk it or fall back to the default and
unnecessarily slow behavior. This is different for the task/
bpf_iter, though, so I support that for task/ only. But you've
already done that split, so thank you! :)
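Just to illustrate the combination I mean, here is a minimal
(untested, hypothetical) userspace sketch: raw clone() treats
CLONE_VM and CLONE_FILES independently, so a thread can share the
address space while keeping a private file table. pthread_create()
happens to pass both flags, but nothing requires that:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

static int thread_fn(void *arg)
{
        /* This fd lives in a file table the group leader cannot
         * see, so visiting only the group leader would miss it.
         */
        int fd = open("/dev/null", O_RDONLY);

        sleep(2);
        if (fd >= 0)
                close(fd);
        return 0;
}

int main(void)
{
        char *stack = malloc(64 * 1024);

        if (!stack)
                return 1;
        /* Same thread group, shared address space, but no
         * CLONE_FILES, so the new thread gets a private fd table.
         */
        if (clone(thread_fn, stack + 64 * 1024,
                  CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL) < 0)
                return 1;
        sleep(1);       /* crude synchronization, enough for a sketch */
        return 0;
}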
>
> > iterator, though, if I need to iterate a user-space process (group
> > of tasks). So can we do this:
> >
> > 1a. Either add a new bpf_iter type process/ (or in kernel lingo
> > task_group/) to iterate only main threads (group_leader in kernel
> > lingo);
> > 1b. Or make this main_thread_only an option for only a task/
> > iterator (and maybe call it "group_leader_only" or something to
> > reflect kernel terminology?)
>
> The following is the kernel pid_type definition:
>
> enum pid_type
> {
>         PIDTYPE_PID,
>         PIDTYPE_TGID,
>         PIDTYPE_PGID,
>         PIDTYPE_SID,
>         PIDTYPE_MAX,
> };
>
> Right now, the task iterator traverses following PIDTYPE_PID, i.e.,
> every task in the system.
>
> To iterate through main threads only, we need to traverse following
> PIDTYPE_TGID.
>
> In the future, it is possible people will want to traverse following
> PIDTYPE_PGID (process group) or PIDTYPE_SID (session).
>
> Or we might have other customizations, e.g., cgroup_id, which could
> be filtered in the bpf program, but in-kernel filtering can
> definitely improve performance.
>
> So I prefer 1b.

Sounds good, but let's use a proper enum, not a set of bit fields. We
can support all 4 from the get-go. There is no need to wait for the
actual use cases to appear, as the iteration semantics are pretty
clear.

>
> Yes, naming is hard.
> The name "main_thread_only" is mostly from a userspace perspective.
> "group_leader_only" might not be good, as it could be confused with
> a possible future process-group option.
>
> >
> > 2. For the task/file iterator, still iterate all tasks, but if a
> > task's files struct is the same as its group_leader's files struct,
> > then go to the next one. This should eliminate tons of duplication
> > from iterating the same files over and over. It would still iterate
> > a bunch of tasks, but compared to the number of files that's
> > generally a much smaller number, so it should still give sizable
> > savings. I don't think we need an extra option for this, tbh; this
> > behavior was my intuitive expectation, so I don't think you'll be
> > breaking any sane user of this iterator.
>
> What you suggested makes sense. For the task_file iterator, we only
> promise to visit all files from all tasks. We can do the necessary
> filtering to remove duplicates in the kernel without breaking that
> promise. I will remove the customization from the task_file iterator.

Thanks for doing that!

>
> >
> > Disclaimer: I haven't had a chance to look through the kernel code
> > much, so I'm sorry if what I'm proposing is impossible or extremely
> > hard/unreliable to implement. But I thought I'd put this idea out
> > there before we decide on this.
> >
> > WDYT?
> >
> > [...]
> >
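P.S. Re: 2. above, the skip I have in mind is roughly the following
(a hand-wavy, untested sketch against kernel/bpf/task_iter.c; the
helper name is made up, and locking and zombie-leader edge cases are
elided):

/* A non-leader task whose files pointer matches its group leader's
 * has nothing new to show: the shared file table is (or will be)
 * visited via the leader itself.
 */
static bool files_seen_via_leader(struct task_struct *task)
{
        return task != task->group_leader &&
               task->files == task->group_leader->files;
}

/* ... and then wherever the task_file iterator advances to the next
 * task, something like:
 *
 *         if (files_seen_via_leader(task))
 *                 continue;
 */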