Re: [PATCH bpf-next v2 1/4] bpf/crib: Introduce task_file open-coded iterator kfuncs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 6 Nov 2024 12:02:03 -0800

On Fri, Nov 1, 2024 at 1:22 PM Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
>
> On 2024/11/1 19:06, Andrii Nakryiko wrote:
> > On Tue, Oct 29, 2024 at 5:15 PM Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
> >>
> >> This patch adds the open-coded iterator style process file iterator
> >> kfuncs bpf_iter_task_file_{new,next,destroy} that iterates over all
> >> files opened by the specified process.
> >>
> >> In addition, this patch adds bpf_iter_task_file_get_fd() getter to get
> >> the file descriptor corresponding to the file in the current iteration.
> >>
> >> The reference to struct file acquired by the previous
> >> bpf_iter_task_file_next() is released in the next
> >> bpf_iter_task_file_next(), and the last reference is released in the
> >> last bpf_iter_task_file_next() that returns NULL.
> >>
> >> In the bpf_iter_task_file_destroy(), if the iterator does not iterate to
> >> the end, then the last struct file reference is released at this time.
> >>
> >> Signed-off-by: Juntong Deng <juntong.deng@xxxxxxxxxxx>
> >> ---
> >>   kernel/bpf/Makefile      |   1 +
> >>   kernel/bpf/crib/Makefile |   3 ++
> >>   kernel/bpf/crib/crib.c   |  29 +++++++++++
> >>   kernel/bpf/crib/files.c  | 105 +++++++++++++++++++++++++++++++++++++++
> >>   4 files changed, 138 insertions(+)
> >>   create mode 100644 kernel/bpf/crib/Makefile
> >>   create mode 100644 kernel/bpf/crib/crib.c
> >>   create mode 100644 kernel/bpf/crib/files.c
> >>
> >> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> >> index 105328f0b9c0..933d36264e5e 100644
> >> --- a/kernel/bpf/Makefile
> >> +++ b/kernel/bpf/Makefile
> >> @@ -53,3 +53,4 @@ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
> >>   obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o
> >>   obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o
> >>   obj-$(CONFIG_BPF_SYSCALL) += kmem_cache_iter.o
> >> +obj-$(CONFIG_BPF_SYSCALL) += crib/
> >> diff --git a/kernel/bpf/crib/Makefile b/kernel/bpf/crib/Makefile
> >> new file mode 100644
> >> index 000000000000..4e1bae1972dd
> >> --- /dev/null
> >> +++ b/kernel/bpf/crib/Makefile
> >> @@ -0,0 +1,3 @@
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +
> >> +obj-$(CONFIG_BPF_SYSCALL) += crib.o files.o
> >> diff --git a/kernel/bpf/crib/crib.c b/kernel/bpf/crib/crib.c
> >> new file mode 100644
> >> index 000000000000..e6536ee9a845
> >> --- /dev/null
> >> +++ b/kernel/bpf/crib/crib.c
> >> @@ -0,0 +1,29 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Checkpoint/Restore In eBPF (CRIB)
> >> + */
> >> +
> >> +#include <linux/bpf.h>
> >> +#include <linux/btf.h>
> >> +#include <linux/btf_ids.h>
> >> +
> >> +BTF_KFUNCS_START(bpf_crib_kfuncs)
> >> +
> >> +BTF_ID_FLAGS(func, bpf_iter_task_file_new, KF_ITER_NEW | KF_TRUSTED_ARGS)
> >> +BTF_ID_FLAGS(func, bpf_iter_task_file_next, KF_ITER_NEXT | KF_RET_NULL)
> >> +BTF_ID_FLAGS(func, bpf_iter_task_file_get_fd)
> >> +BTF_ID_FLAGS(func, bpf_iter_task_file_destroy, KF_ITER_DESTROY)
> >
> > This is in no way CRIB-specific, right? So I'd drop the CRIB reference
> > and move code next to task_file BPF iterator program type
> > implementation, this is a generic functionality.
> >
> > Even more so, given Namhyung's recent work on adding kmem_cache
> > iterator (both program type and open-coded iterator), it seems like it
> > should be possible to cut down on code duplication by using open-coded
> > iterator logic inside the BPF iterator program. Now that you are
> > adding task_file open-coded iterator, can you please check if it can
> > be reused. See kernel/bpf/task_iter.c (and I think that's where your
> > code should live as well, btw).
> >
> > pw-bot: cr
> >
>
> Thanks for your reply!
>
> Yes, I agree that it would be better to put the task_file open-coded
> iterator together with the traditional task_file iterator (in the
> same file).
>
> I will move it in the next patch series.
>
> >> +
> >> +BTF_KFUNCS_END(bpf_crib_kfuncs)
> >> +
> >> +static const struct btf_kfunc_id_set bpf_crib_kfunc_set = {
> >> +       .owner = THIS_MODULE,
> >> +       .set   = &bpf_crib_kfuncs,
> >> +};
> >> +
> >> +static int __init bpf_crib_init(void)
> >> +{
> >> +       return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_crib_kfunc_set);
> >> +}
> >> +
> >> +late_initcall(bpf_crib_init);
> >> diff --git a/kernel/bpf/crib/files.c b/kernel/bpf/crib/files.c
> >> new file mode 100644
> >> index 000000000000..ececf150303f
> >> --- /dev/null
> >> +++ b/kernel/bpf/crib/files.c
> >> @@ -0,0 +1,105 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +
> >> +#include <linux/btf.h>
> >> +#include <linux/file.h>
> >> +#include <linux/fdtable.h>
> >> +#include <linux/net.h>
> >> +
> >> +struct bpf_iter_task_file {
> >> +       __u64 __opaque[3];
> >> +} __aligned(8);
> >> +
> >> +struct bpf_iter_task_file_kern {
> >> +       struct task_struct *task;
> >> +       struct file *file;
> >> +       int fd;
> >> +} __aligned(8);
> >> +
> >> +__bpf_kfunc_start_defs();
> >> +
> >> +/**
> >> + * bpf_iter_task_file_new() - Initialize a new task file iterator for a task,
> >> + * used to iterate over all files opened by a specified task
> >> + *
> >> + * @it: the new bpf_iter_task_file to be created
> >> + * @task: a pointer pointing to a task to be iterated over
> >> + */
> >> +__bpf_kfunc int bpf_iter_task_file_new(struct bpf_iter_task_file *it,
> >> +               struct task_struct *task)
> >> +{
> >> +       struct bpf_iter_task_file_kern *kit = (void *)it;
> >> +
> >> +       BUILD_BUG_ON(sizeof(struct bpf_iter_task_file_kern) > sizeof(struct bpf_iter_task_file));
> >> +       BUILD_BUG_ON(__alignof__(struct bpf_iter_task_file_kern) !=
> >> +                    __alignof__(struct bpf_iter_task_file));
> >> +
> >> +       kit->task = task;
> >> +       kit->fd = -1;
> >> +       kit->file = NULL;
> >> +
> >> +       return 0;
> >> +}
> >> +
> >> +/**
> >> + * bpf_iter_task_file_next() - Get the next file in bpf_iter_task_file
> >> + *
> >> + * bpf_iter_task_file_next acquires a reference to the returned struct file.
> >> + *
> >> + * The reference to struct file acquired by the previous
> >> + * bpf_iter_task_file_next() is released in the next bpf_iter_task_file_next(),
> >> + * and the last reference is released in the last bpf_iter_task_file_next()
> >> + * that returns NULL.
> >> + *
> >> + * @it: the bpf_iter_task_file to be checked
> >> + *
> >> + * @returns a pointer to the struct file of the next file if further files
> >> + * are available, otherwise returns NULL
> >> + */
> >> +__bpf_kfunc struct file *bpf_iter_task_file_next(struct bpf_iter_task_file *it)
> >> +{
> >> +       struct bpf_iter_task_file_kern *kit = (void *)it;
> >> +
> >> +       if (kit->file)
> >> +               fput(kit->file);
> >> +
> >> +       kit->fd++;
> >> +
> >> +       rcu_read_lock();
> >> +       kit->file = task_lookup_next_fdget_rcu(kit->task, &kit->fd);
> >> +       rcu_read_unlock();
> >> +
> >> +       return kit->file;
> >> +}
> >> +
> >> +/**
> >> + * bpf_iter_task_file_get_fd() - Get the file descriptor corresponding to
> >> + * the file in the current iteration
> >> + *
> >> + * @it: the bpf_iter_task_file to be checked
> >> + *
> >> + * @returns the file descriptor
> >> + */
> >> +__bpf_kfunc int bpf_iter_task_file_get_fd(struct bpf_iter_task_file *it__iter)
> >> +{
> >> +       struct bpf_iter_task_file_kern *kit = (void *)it__iter;
> >> +
> >> +       return kit->fd;
> >> +}
> >> +
> >
> > I don't think we need this. It's probably better to return a pointer
> > to a small struct representing "item" returned from this iterator.
> > Something like
> >
> > struct bpf_iter_task_file_item {
> >      struct task_struct *task;
> >      struct file *file;
> >      int fd;
> > };
> >
> > You can then embed this struct into struct bpf_iter_task_file and
> > return a pointer to it on each next() call (avoiding memory
> > allocation)
> >
> >
> > (naming just for illustrative purposes, I spent 0 seconds thinking about it)
> >
>
> Yes, I agree that it is feasible.
>
> But there is a question here, should we expose the internal state
> structure of the iterator (If we want to embed) ?
>
> I guess that we need two versions of data structures struct bpf_iter_xxx
> and struct bpf_iter_xxx_kern is for the purpose of encapsulation?

Yes, that's what we do for iterator state structure, and you should do
that as well. bpf_iter_xxx one will be opaque (see other examples, we
literally add `u64 __opaque[N];` there).

But this bpf_iter_task_file_item will be sort of internal API that is
returned from bpf_iter_task_file_next(). So you'll have something like

struct bpf_iter_task_file {
    .... additional state ...
    struct bpf_iter_task_file_item item;
};

then you have

struct bpf_iter_task_file_item bpf_iter_task_file_next(struct
bpf_iter_task_file *it)
{
    struct bpf_iter_task_file_kern *kit = (void *)it;

    ...
    kit->item.task = <sometask>;
    kit->item.file = <file>; /* and so on */

    return &kit->item;
}

>
> With two versions of the data structure, users can only manipulate
> the iterator using the iterator kfuncs, avoiding users from directly
> accessing the internal state.
>
> After we decide to return struct bpf_iter_task_file_item, these members
> will not be able to change and users can directly access/change the
> internal state of the iterator.

Yes, you have to carefully set up bpf_iter_task_file_item, but you
could expand it in the future without breaking any old users, because
you only return it by pointer and with BPF CO-RE all the field shifts
will be correctly handled.

>
> >> +/**
> >> + * bpf_iter_task_file_destroy() - Destroy a bpf_iter_task_file
> >> + *
> >> + * If the iterator does not iterate to the end, then the last
> >> + * struct file reference is released at this time.
> >> + *
> >> + * @it: the bpf_iter_task_file to be destroyed
> >> + */
> >> +__bpf_kfunc void bpf_iter_task_file_destroy(struct bpf_iter_task_file *it)
> >> +{
> >> +       struct bpf_iter_task_file_kern *kit = (void *)it;
> >> +
> >> +       if (kit->file)
> >> +               fput(kit->file);
> >> +}
> >> +
> >> +__bpf_kfunc_end_defs();
> >> --
> >> 2.39.5
> >>
>