Re: [PATCH v3 bpf-next 2/3] bpf: Introduce task_vma open-coded iterator kfuncs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 23 Aug 2023 10:14:30 -0700

On Wed, Aug 23, 2023 at 8:03 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Aug 23, 2023 at 12:26 AM David Marchevsky
> <david.marchevsky@xxxxxxxxx> wrote:
> >
> > On 8/22/23 7:52 PM, Andrii Nakryiko wrote:
> > > On Mon, Aug 21, 2023 at 10:06 PM Dave Marchevsky <davemarchevsky@xxxxxx> wrote:
> > >>
> > >> This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow
> > >> creation and manipulation of struct bpf_iter_task_vma in open-coded
> > >> iterator style. BPF programs can use these kfuncs directly or through
> > >> bpf_for_each macro for natural-looking iteration of all task vmas.
> > >>
> > >> The implementation borrows heavily from bpf_find_vma helper's locking -
> > >> differing only in that it holds the mmap_read lock for all iterations
> > >> while the helper only executes its provided callback on a maximum of 1
> > >> vma. Aside from locking, struct vma_iterator and vma_next do all the
> > >> heavy lifting.
> > >>
> > >> The newly-added struct bpf_iter_task_vma has a name collision with a
> > >> selftest for the seq_file task_vma iter's bpf skel, so the selftests/bpf/progs
> > >> file is renamed in order to avoid the collision.
> > >>
> > >> A pointer to an inner data struct, struct bpf_iter_task_vma_kern_data, is the
> > >> only field in struct bpf_iter_task_vma. This is because the inner data
> > >> struct contains a struct vma_iterator (not ptr), whose size is likely to
> > >> change under us. If bpf_iter_task_vma_kern contained vma_iterator directly
> > >> such a change would require change in opaque bpf_iter_task_vma struct's
> > >> size. So better to allocate vma_iterator using BPF allocator, and since
> > >> that alloc must already succeed, might as well allocate all iter fields,
> > >> thereby freezing struct bpf_iter_task_vma size.
> > >>
> > >> Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx>
> > >> Cc: Nathan Slingerland <slinger@xxxxxxxx>
> > >> ---
> > >>  include/uapi/linux/bpf.h                      |  4 +
> > >>  kernel/bpf/helpers.c                          |  3 +
> > >>  kernel/bpf/task_iter.c                        | 84 +++++++++++++++++++
> > >>  tools/include/uapi/linux/bpf.h                |  4 +
> > >>  tools/lib/bpf/bpf_helpers.h                   |  8 ++
> > >>  .../selftests/bpf/prog_tests/bpf_iter.c       | 26 +++---
> > >>  ...f_iter_task_vma.c => bpf_iter_task_vmas.c} |  0
> > >>  7 files changed, 116 insertions(+), 13 deletions(-)
> > >>  rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%)
> > >>
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index 8790b3962e4b..49fc1989a548 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -7311,4 +7311,8 @@ struct bpf_iter_num {
> > >>         __u64 __opaque[1];
> > >>  } __attribute__((aligned(8)));
> > >>
> > >> +struct bpf_iter_task_vma {
> > >> +       __u64 __opaque[1]; /* See bpf_iter_num comment above */
> > >> +} __attribute__((aligned(8)));
> > >> +
> > >>  #endif /* _UAPI__LINUX_BPF_H__ */
> > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > >> index eb91cae0612a..7a06dea749f1 100644
> > >> --- a/kernel/bpf/helpers.c
> > >> +++ b/kernel/bpf/helpers.c
> > >> @@ -2482,6 +2482,9 @@ BTF_ID_FLAGS(func, bpf_dynptr_slice_rdwr, KF_RET_NULL)
> > >>  BTF_ID_FLAGS(func, bpf_iter_num_new, KF_ITER_NEW)
> > >>  BTF_ID_FLAGS(func, bpf_iter_num_next, KF_ITER_NEXT | KF_RET_NULL)
> > >>  BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY)
> > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW)
> > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL)
> > >> +BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY)
> > >>  BTF_ID_FLAGS(func, bpf_dynptr_adjust)
> > >>  BTF_ID_FLAGS(func, bpf_dynptr_is_null)
> > >>  BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
> > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > >> index c4ab9d6cdbe9..51c2dce435c1 100644
> > >> --- a/kernel/bpf/task_iter.c
> > >> +++ b/kernel/bpf/task_iter.c
> > >> @@ -7,7 +7,9 @@
> > >>  #include <linux/fs.h>
> > >>  #include <linux/fdtable.h>
> > >>  #include <linux/filter.h>
> > >> +#include <linux/bpf_mem_alloc.h>
> > >>  #include <linux/btf_ids.h>
> > >> +#include <linux/mm_types.h>
> > >>  #include "mmap_unlock_work.h"
> > >>
> > >>  static const char * const iter_task_type_names[] = {
> > >> @@ -823,6 +825,88 @@ const struct bpf_func_proto bpf_find_vma_proto = {
> > >>         .arg5_type      = ARG_ANYTHING,
> > >>  };
> > >>
> > >> +struct bpf_iter_task_vma_kern_data {
> > >> +       struct task_struct *task;
> > >> +       struct mm_struct *mm;
> > >> +       struct mmap_unlock_irq_work *work;
> > >> +       struct vma_iterator vmi;
> > >> +};
> > >> +
> > >> +/* Non-opaque version of uapi bpf_iter_task_vma */
> > >> +struct bpf_iter_task_vma_kern {
> > >> +       struct bpf_iter_task_vma_kern_data *data;
> > >> +} __attribute__((aligned(8)));
> > >> +
> > >
> > > it's a bit worrying that we'll rely on memory allocation inside NMI to
> > > be able to use this. I'm missing previous email discussion (I declared
> > > email bankruptcy after long vacation), but I suppose the option to fix
> > > bpf_iter_task_vma (to 88 bytes: 64 for vma_iterator + 24 for extra
> > > pointers), or even to 96 to have a bit of headroom in case we need a
> > > bit more space was rejected? It seems unlikely that vma_iterator will
> > > have to grow, but if it does, it has 5 bytes of padding right now for
> > > various flags, plus we can have extra 8 bytes reserved just in case.
> > >
> > > I know it's a big struct and will take a big chunk of the BPF stack,
> > > but I'm a bit worried about both the performance implication of mem
> > > alloc under NMI, and allocation failing.
> > >
> > > Maybe the worry is overblown, but I thought I'll bring it up anyways.
> > >
> >
> > Few tangential trains of thought here, separated by multiple newlines
> > for easier reading.
> >
> >
> > IIUC the any-context BPF allocator will not actually allocate memory in NMI
> > context, instead relying on its existing pre-filled caches.
> >
> > Alexei's patch adding the allocator says ([0]):
> >
> >   The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
> >
> > So sounds bpf_mem_alloc in a kfunc called by a BPF program is NMI-safe.
> >
> >
> > That's not to say that I'm happy about adding a fallible bpf_mem_alloc call here
> > before the kfunc can do anything useful. But it seems like the best way to
> > guarantee that we never see a mailing list message like:
> >
> >   Hello, I just added a field to 'struct ma_state' in my subsystem and it seems
> >   I've triggered a BUILD_BUG_ON in this far-away BPF subsystem. It looks like
> >   you're making stability guarantees based on the size of my internal struct.
> >   What the hell?
> >
> > Sure, after I remove the kfuncs and struct bpf_iter_task_vma fwd decl from
> > bpf_helpers.h - per your other comment below - we can do the whole "kfuncs
> > aren't uapi and this struct bpf_iter_task_vma is coming from vmlinux.h,
> > not some stable header" spiel and convince this hypothetical person. Not having
> > to do the spiel here reinforces the more general "Modern BPF exposes
> > functionality w/ kfuncs and kptrs, which are inherently _unstable_" messaging
> > more effectively than having to explain.
> >
> >
> > If we go back to putting struct vma_iterator on the BPF stack, I think we
> > definitely want to keep the BUILD_BUG_ON. If it were removed and vma_iterator
> > size changes, that would affect portability of BPF programs that assume the old
> > size of bpf_iter_task_vma, no? Which bpf_for_each is doing since it puts
> > bpf_iter_task_vma on the stack.
> >
> > Is there some CO-RE technique that would handle above scenario portably? I
> > can't think of anything straightforward. Maybe if BPF prog BTF only had
> > a fwd decl for bpf_iter_task_vma, and size thus had to be taken from
> > vmlinux BTF. But that would fail to compile since it the struct goes
> > on the stack. Maybe use some placeholder size for compilation and use
> > BTF tag to tell libbpf to patch insns w/ vmlinux's size for this struct?
> >
> >
> > Re: padding bytes, seems worse to me than not using them. Have to make
> > assumptions about far-away struct, specifically vma_iterator
> > which landed quite recently as part of maple tree series. The assumptions
> > don't prevent my hypothetical mailing list confusion from happening, increases
> > the confusion if it does happen ("I added a small field recently, why didn't
> > this break then? If it's explicitly and intentionally unstable, why add
> > padding bytes?")
> >
> >   [0]: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@xxxxxxxxx
>
> +1
> imo struct bpf_iter_task_vma_kern is a bit too big to put on bpf prog stack.
> bpf_mem_alloc short term is a lesser evil imo.

Yep, a bit big, though still reasonable to put on the stack. As I
mentioned above, my worry is the unreliability of being able to
instantiate task_vma iterator in NMI context (which is probably where
this will be used often). We'll see if this actually the problem in
practice once this is used and deployed fleet-wide. If it is, we might
want to revisit it by introducing v2.

> Long term we need to think how to extend bpf ISA with alloca.

To be frank, I'm not following how alloca is relevant here. We don't
have anything dynamically sized on the stack.

Unless you envision protocol where we have a separate function to get
size of iter struct, then alloca enough space, then pass that to
bpf_iter_xxx_new()? Not sure whether this is statically verifiable,
but given it's long-term, we can put it on backburner for now.

> It's time to figure out how to grow the stack.

We effectively already grow the stack based on exact prog's usage.
It's more of "how to support a bigger stack" during verification with
reasonable increase in memory and cpu usage.

Anyways, as I said, I suspected the answer will be what it is, but
documenting the reasons and what alternatives were considered is still
worthwhile, IMO. Thanks for the discussion.