Re: [PATCH bpf-next 1/3] bpf: Parameterize task iterators.

Yonghong Song <yhs@xxxxxx> · Thu, 28 Jul 2022 10:08:44 -0700

On 7/28/22 9:40 AM, Kui-Feng Lee wrote:
On Thu, 2022-07-28 at 18:22 +0200, Kumar Kartikeya Dwivedi wrote:
On Thu, 28 Jul 2022 at 17:16, Kui-Feng Lee <kuifeng@xxxxxx> wrote:

On Thu, 2022-07-28 at 10:47 +0200, Kumar Kartikeya Dwivedi wrote:
On Thu, 28 Jul 2022 at 07:25, Kui-Feng Lee <kuifeng@xxxxxx>
wrote:

On Wed, 2022-07-27 at 10:19 +0200, Kumar Kartikeya Dwivedi
wrote:
On Wed, 27 Jul 2022 at 09:01, Kui-Feng Lee <kuifeng@xxxxxx>
wrote:

On Tue, 2022-07-26 at 14:13 +0200, Jiri Olsa wrote:
On Mon, Jul 25, 2022 at 10:17:11PM -0700, Kui-Feng Lee
wrote:
Allow creating an iterator that loops through resources
of
one
task/thread.

People could only create iterators to loop through all
resources of
files, vma, and tasks in the system, even though they
were
interested
in only the resources of a specific task or process.
Passing
the
additional parameters, people can now create an
iterator to
go
through all resources or only the resources of a task.

Signed-off-by: Kui-Feng Lee <kuifeng@xxxxxx>
---
  include/linux/bpf.h            |  4 ++
  include/uapi/linux/bpf.h       | 23 ++++++++++
  kernel/bpf/task_iter.c         | 81
+++++++++++++++++++++++++-
----
----
  tools/include/uapi/linux/bpf.h | 23 ++++++++++
  4 files changed, 109 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 11950029284f..c8d164404e20 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1718,6 +1718,10 @@ int bpf_obj_get_user(const char
__user
*pathname, int flags);

  struct bpf_iter_aux_info {
         struct bpf_map *map;
+       struct {
+               __u32   tid;

should be just u32 ?

Or, should change the following 'type' to __u8?

Would it be better to use a pidfd instead of a tid here?
Unset
pidfd
would mean going over all tasks, and any fd > 0 implies
attaching
to
a
specific task (as is the convention in BPF land). Most of the
new
UAPIs working on processes are using pidfds (to work with a
stable
handle instead of a reusable ID).
The iterator taking an fd also gives an opportunity to BPF
LSMs
to
attach permissions/policies to it (once we have a file local
storage
map) e.g. whether creating a task iterator for that specific
pidfd
instance (backed by the struct file) would be allowed or not.
You are using getpid in the selftest and keeping track of
last_tgid
in
the iterator, so I guess you don't even need to extend
pidfd_open
to
work on thread IDs right now for your use case (and fdtable
and
mm
are
shared for POSIX threads anyway, so for those two it won't
make a
difference).

There is one problem here. The current pidfd_open syscall
only supports thread-group leader, i.e., main thread, i.e.,
it won't support any non-main-thread tid's. Yes, thread-group
leader and other threads should share the same vma and files
in most of times, but it still possible different threads
in the same process may have different files which is why
in current task_iter.c we have:
                *tid = pid_nr_ns(pid, ns);
                task = get_pid_task(pid, PIDTYPE_PID);
                if (!task) {
                        ++*tid;
                        goto retry;
                } else if (skip_if_dup_files && 
!thread_group_leader(task) &&
                           task->files == task->group_leader->files) {
                        put_task_struct(task);
                        task = NULL;
                        ++*tid;
                        goto retry;
                }


Each thread (tid) will have some fields different from
thread-group leader (tgid), e.g., comm and most (if not all)
scheduling related fields.

So it would be good to support for each tid from the start
as it is not clear when pidfd_open will support non
thread-group leader.

If it worries wrap around, a reference to the task
can be held when tid passed to the kernel at link
create time. This is similar to pid is passed to
the kernel at pidfd_open syscall. But in practice,
I think taking the reference during read() should
also fine. The race always exist anyway.

Kumar, could you give more details about security
concerns? I am not sure about the tight relationship
between bpf_iter and security. bpf_iter just for
iterating kernel data structures.


What is your opinion?

Do you mean removed both tid and type, and replace them with a
pidfd?
We can do that in uapi, struct bpf_link_info.  But, the interal
types,
ex. bpf_iter_aux_info, still need to use tid or struct file to
avoid
getting file from the per-process fdtable.  Is that what you
mean?


Yes, just for the UAPI, it is similar to taking map_fd for map
iter.
In bpf_link_info we should report just the tid, just like map
iter
reports map_id.

It sounds good to me.

One thing I need a clarification. You mentioned that a fd > 0
implies
attaching to a specific task, however fd can be 0. So, it should be
fd
= 0. So, it forces the user to initialize the value of pidfd to -
1.
So, for convenience, we still need a field like 'type' to make it
easy
to create iterators without a filter.


Right, but in lots of BPF UAPI fields, fd 0 means fd is unset, so it
is fine to rely on that assumption. For e.g. even for map_fd,
bpf_map_elem iterator considers fd 0 to be unset. Then you don't need
the type field.

I just realize that pidfd may be meaningless for the bpf_link_info
returned by bpf_obj_get_info_by_fd() since the origin fd might be
closed already.  So, I will always set it a value of 0.