Re: [PATCH bpf-next v2 03/20] bpf: support bpf tracing/iter programs for BPF_LINK_CREATE

Alexei Starovoitov <ast@xxxxxx> · Wed, 6 May 2020 11:08:24 -0700

On 5/5/20 8:09 PM, Andrii Nakryiko wrote:
On Tue, May 5, 2020 at 5:54 PM Alexei Starovoitov <ast@xxxxxx> wrote:

On 5/5/20 5:14 PM, Yonghong Song wrote:

On 5/5/20 2:30 PM, Andrii Nakryiko wrote:
On Sun, May 3, 2020 at 11:26 PM Yonghong Song <yhs@xxxxxx> wrote:

Given a bpf program, the step to create an anonymous bpf iterator is:
    - create a bpf_iter_link, which combines bpf program and the target.
      In the future, there could be more information recorded in the
link.
      A link_fd will be returned to the user space.
    - create an anonymous bpf iterator with the given link_fd.

The bpf_iter_link can be pinned to bpffs mount file system to
create a file based bpf iterator as well.

The benefit to use of bpf_iter_link:
    - using bpf link simplifies design and implementation as bpf link
      is used for other tracing bpf programs.
    - for file based bpf iterator, bpf_iter_link provides a standard
      way to replace underlying bpf programs.
    - for both anonymous and free based iterators, bpf link query
      capability can be leveraged.

The patch added support of tracing/iter programs for BPF_LINK_CREATE.
A new link type BPF_LINK_TYPE_ITER is added to facilitate link
querying. Currently, only prog_id is needed, so there is no
additional in-kernel show_fdinfo() and fill_link_info() hook
is needed for BPF_LINK_TYPE_ITER link.

Signed-off-by: Yonghong Song <yhs@xxxxxx>
---

LGTM. See small nit about __GFP_NOWARN.

Acked-by: Andrii Nakryiko <andriin@xxxxxx>

   include/linux/bpf.h            |  1 +
   include/linux/bpf_types.h      |  1 +
   include/uapi/linux/bpf.h       |  1 +
   kernel/bpf/bpf_iter.c          | 62 ++++++++++++++++++++++++++++++++++
   kernel/bpf/syscall.c           | 14 ++++++++
   tools/include/uapi/linux/bpf.h |  1 +
   6 files changed, 80 insertions(+)

[...]

+int bpf_iter_link_attach(const union bpf_attr *attr, struct bpf_prog
*prog)
+{
+       struct bpf_link_primer link_primer;
+       struct bpf_iter_target_info *tinfo;
+       struct bpf_iter_link *link;
+       bool existed = false;
+       u32 prog_btf_id;
+       int err;
+
+       if (attr->link_create.target_fd || attr->link_create.flags)
+               return -EINVAL;
+
+       prog_btf_id = prog->aux->attach_btf_id;
+       mutex_lock(&targets_mutex);
+       list_for_each_entry(tinfo, &targets, list) {
+               if (tinfo->btf_id == prog_btf_id) {
+                       existed = true;
+                       break;
+               }
+       }
+       mutex_unlock(&targets_mutex);
+       if (!existed)
+               return -ENOENT;
+
+       link = kzalloc(sizeof(*link), GFP_USER | __GFP_NOWARN);

nit: all existing link implementation don't specify __GFP_NOWARN,
wonder if bpf_iter_link should be special?

Nothing special. Just feel __GFP_NOWARN is the right thing to do to
avoid pollute dmesg since -ENOMEM is returned to user space. But in
reality, unlike some key/value allocation where the size could be huge
and __GFP_NOWARN might be more useful, here, sizeof(*link) is fixed
and small, __GFP_NOWARN probably not that useful.

Will drop it.

actually all existing user space driven allocation have nowarn.

Can you define "user space driven"? I understand why for map, map key,
map value, program we want to do that, because it's way too easy for
user-space to specify huge sizes and allocation is proportional to
that size. But in this case links are fixed-sized objects, same as
struct file and struct inode. From BPF world, for instance, there is
struct bpf_prog_list, which is created when user is attaching BPF
program to cgroup, so it is user-space driven in similar sense. Yet we
allocate it without __GFP_NOWARN.

For tiny objects it doesn't really matter. If slab cannot allocate
another single page the system is in bad shape and warn is good
to have in most cases, but when it's user driven like here that
warn won't help kernel developers debug ooms. Most likely NIC driver
is spamming page alloc warn at this point.
In this particular case bpf_iter arguments will likely grow
and struct will grow too, but probably not the point of kmalloc_large,
so it's really fine which ever way.
Personally I would keep nowarn here.