Re: [RFC PATCH bpf-next 2/4] bpf: Introduce process open coded iterator kfuncs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 12 Sep 2023 15:12:33 -0700

On Wed, Sep 6, 2023 at 10:18 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Sep 6, 2023 at 5:38 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> >
> > Hello, Alexei.
> >
> > On 2023/9/6 04:09, Alexei Starovoitov wrote:
> > > On Sun, Aug 27, 2023 at 12:21 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> > >>
> > >> This patch adds kfuncs bpf_iter_process_{new,next,destroy} which allow
> > >> creation and manipulation of struct bpf_iter_process in open-coded iterator
> > >> style. BPF programs can use these kfuncs directly or through the
> > >> bpf_for_each macro to iterate over all processes in the system.
> > >>
> > >> Signed-off-by: Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx>
> > >> ---
> > >>   include/uapi/linux/bpf.h       |  4 ++++
> > >>   kernel/bpf/helpers.c           |  3 +++
> > >>   kernel/bpf/task_iter.c         | 31 +++++++++++++++++++++++++++++++
> > >>   tools/include/uapi/linux/bpf.h |  4 ++++
> > >>   tools/lib/bpf/bpf_helpers.h    |  5 +++++
> > >>   5 files changed, 47 insertions(+)
> > >>
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index 2a6e9b99564b..cfbd527e3733 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -7199,4 +7199,8 @@ struct bpf_iter_css_task {
> > >>          __u64 __opaque[1];
> > >>   } __attribute__((aligned(8)));
> > >>
> > >> +struct bpf_iter_process {
> > >> +       __u64 __opaque[1];
> > >> +} __attribute__((aligned(8)));
> > >> +
> > >>   #endif /* _UAPI__LINUX_BPF_H__ */
> > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > >> index cf113ad24837..81a2005edc26 100644
> > >> --- a/kernel/bpf/helpers.c
> > >> +++ b/kernel/bpf/helpers.c
> > >> @@ -2458,6 +2458,9 @@ BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY)
> > >>   BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW)
> > >>   BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL)
> > >>   BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY)
> > >> +BTF_ID_FLAGS(func, bpf_iter_process_new, KF_ITER_NEW)
> > >> +BTF_ID_FLAGS(func, bpf_iter_process_next, KF_ITER_NEXT | KF_RET_NULL)
> > >> +BTF_ID_FLAGS(func, bpf_iter_process_destroy, KF_ITER_DESTROY)
> > >>   BTF_ID_FLAGS(func, bpf_dynptr_adjust)
> > >>   BTF_ID_FLAGS(func, bpf_dynptr_is_null)
> > >>   BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
> > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > >> index b1bdba40b684..a6717a76c1e0 100644
> > >> --- a/kernel/bpf/task_iter.c
> > >> +++ b/kernel/bpf/task_iter.c
> > >> @@ -862,6 +862,37 @@ __bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it)
> > >>          kfree(kit->css_it);
> > >>   }
> > >>
> > >> +struct bpf_iter_process_kern {
> > >> +       struct task_struct *tsk;
> > >> +} __attribute__((aligned(8)));
> > >> +
> > >> +__bpf_kfunc int bpf_iter_process_new(struct bpf_iter_process *it)
> > >> +{
> > >> +       struct bpf_iter_process_kern *kit = (void *)it;
> > >> +
> > >> +       BUILD_BUG_ON(sizeof(struct bpf_iter_process_kern) != sizeof(struct bpf_iter_process));
> > >> +       BUILD_BUG_ON(__alignof__(struct bpf_iter_process_kern) !=
> > >> +                                       __alignof__(struct bpf_iter_process));
> > >> +
> > >> +       rcu_read_lock();
> > >> +       kit->tsk = &init_task;
> > >> +       return 0;
> > >> +}
> > >> +
> > >> +__bpf_kfunc struct task_struct *bpf_iter_process_next(struct bpf_iter_process *it)
> > >> +{
> > >> +       struct bpf_iter_process_kern *kit = (void *)it;
> > >> +
> > >> +       kit->tsk = next_task(kit->tsk);
> > >> +
> > >> +       return kit->tsk == &init_task ? NULL : kit->tsk;
> > >> +}
> > >> +
> > >> +__bpf_kfunc void bpf_iter_process_destroy(struct bpf_iter_process *it)
> > >> +{
> > >> +       rcu_read_unlock();
> > >> +}
> > >
> > > This iter can be used in all ctx-s which is nice, but let's
> > > make the verifier enforce rcu_read_lock/unlock done by bpf prog
> > > instead of doing in the ctor/dtor of iter, since
> > > in sleepable progs the verifier won't recognize that body is RCU CS.
> > > We'd need to teach the verifier to allow bpf_iter_process_new()
> > > inside in_rcu_cs() and make sure there is no rcu_read_unlock
> > > while BPF_ITER_STATE_ACTIVE.
> > > bpf_iter_process_destroy() would become a nop.
> >
> > Thanks for your review!
> >
> > I think bpf_iter_process_{new, next, destroy} should be protected by
> > bpf_rcu_read_lock/unlock explicitly whether the prog is sleepable or
> > not, right?
>
> Correct. By explicit bpf_rcu_read_lock() in case of sleepable progs
> or just by using them in normal bpf progs that have implicit rcu_read_lock()
> done before calling into them.
>
> > I'm not very familiar with the BPF verifier, but I believe
> > there is still a risk in directly calling these kfuncs even if
> > in_rcu_cs() is true.
> >
> > Maybe what we actually need here is to have the BPF verifier check that
> > env->cur_state->active_rcu_lock is true when we want to call these kfuncs.
>
> active_rcu_lock means explicit bpf_rcu_read_lock.
> Currently we do allow bpf_rcu_read_lock in non-sleepable, but it's pointless.
>
> Technically we can extend the check:
>                 if (in_rbtree_lock_required_cb(env) && (rcu_lock || rcu_unlock)) {
>                         verbose(env, "Calling bpf_rcu_read_{lock,unlock} in unnecessary rbtree callback\n");
>                         return -EACCES;
>                 }
> to discourage their use in all non-sleepable, but it will break some progs.
>
> I think it's ok to check in_rcu_cs() to allow bpf_iter_process_*().
> If bpf prog adds explicit and unnecessary bpf_rcu_read_lock() around
> the iter ops it won't do any harm.
> Just need to make sure that rcu unlock logic:
>                 } else if (rcu_unlock) {
>                         bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
>                                 if (reg->type & MEM_RCU) {
>                                         reg->type &= ~(MEM_RCU | PTR_MAYBE_NULL);
>                                         reg->type |= PTR_UNTRUSTED;
>                                 }
>                         }));
> clears iter state that depends on rcu.
>
> I thought about changing mark_stack_slots_iter() to do
> st->type = PTR_TO_STACK | MEM_RCU;
> so that the above clearing logic kicks in,
> but it might be better to have something iter specific.
> is_iter_reg_valid_init() should probably be changed to
> make sure reg->type is not UNTRUSTED.
>
> Andrii,
> do you have better suggestions?

What if we just record in bpf_reg_state.iter whether the iterator needs
to be RCU protected? That's a single bit if we don't allow nested
rcu_read_lock()/rcu_read_unlock(); otherwise we'd need to remember the
RCU nesting level. Then, when validating the iter_next() and
iter_destroy() kfuncs, we check that we are still in an RCU-protected
region (with nesting, that would be iter->rcu_nest_level <=
cur_rcu_nest_level, if I understand correctly). If we are not, we
report a clear and helpful error message.
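
To make the idea concrete, here is a minimal user-space sketch of that
check. It is only a model, not verifier code: struct iter_state,
struct verifier_state, check_iter_new(), check_iter_use() and the
model_rcu_*() helpers are all hypothetical names, and the error strings
are made up. It assumes RCU nesting is tracked as a simple depth counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the iterator part of bpf_reg_state. */
struct iter_state {
	bool active;        /* iter_new() succeeded, iter_destroy() not yet seen */
	bool needs_rcu;     /* iterator may only be used inside an RCU CS */
	int rcu_nest_level; /* RCU nesting depth recorded at iter_new() time */
};

/* Hypothetical stand-in for the relevant part of the verifier state. */
struct verifier_state {
	int rcu_nest_level; /* current RCU nesting depth, 0 means "not in RCU CS" */
};

/* Model of the program taking an RCU read lock. */
void model_rcu_read_lock(struct verifier_state *vs)
{
	vs->rcu_nest_level++;
}

/* Model of the program dropping an RCU read lock. */
void model_rcu_read_unlock(struct verifier_state *vs)
{
	assert(vs->rcu_nest_level > 0);
	vs->rcu_nest_level--;
}

/*
 * Validate iter_new(). Returns NULL on success, or an error message.
 * An RCU-protected iterator must be created inside an RCU CS; the
 * current nesting depth is remembered in the iterator state.
 */
const char *check_iter_new(const struct verifier_state *vs,
			   struct iter_state *it, bool needs_rcu)
{
	if (needs_rcu && vs->rcu_nest_level == 0)
		return "RCU-protected iterator created outside of RCU read-side critical section";

	it->active = true;
	it->needs_rcu = needs_rcu;
	it->rcu_nest_level = vs->rcu_nest_level;
	return NULL;
}

/*
 * Validate iter_next()/iter_destroy(). Returns NULL on success, or an
 * error message. The iterator is still protected as long as
 * iter->rcu_nest_level <= cur_rcu_nest_level.
 */
const char *check_iter_use(const struct verifier_state *vs,
			   const struct iter_state *it)
{
	if (!it->active)
		return "iterator is not initialized";
	if (it->needs_rcu && it->rcu_nest_level > vs->rcu_nest_level)
		return "RCU-protected iterator used after leaving its RCU read-side critical section";
	return NULL;
}
```

In this model, creating the iterator at depth 1 and then calling
check_iter_use() after dropping back to depth 0 is rejected, while
taking and releasing an extra nested lock in between is accepted. The
depth comparison alone would not notice an unlock followed by a re-lock
back to the same depth, so that case would need separate handling.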

That seems straightforward enough, but am I missing anything subtle?