On Mon, Mar 11, 2024 at 3:01 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Thu, Mar 7, 2024 at 5:08 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Introduce bpf_arena, which is a sparse shared memory region between the bpf
> > program and user space.
> >
> > Use cases:
> > 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
> > anonymous region, like memcached or any key/value storage. The bpf
> > program implements an in-kernel accelerator. XDP prog can search for
> > a key in bpf_arena and return a value without going to user space.
> > 2. The bpf program builds arbitrary data structures in bpf_arena (hash
> > tables, rb-trees, sparse arrays), while user space consumes it.
> > 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
> > The user space may mmap it, but bpf program will not convert pointers
> > to user base at run-time to improve bpf program speed.
> >
> > Initially, the kernel vm_area and user vma are not populated. User space
> > can fault in pages within the range. While servicing a page fault,
> > bpf_arena logic will insert a new page into the kernel and user vmas. The
> > bpf program can allocate pages from that region via
> > bpf_arena_alloc_pages(). This kernel function will insert pages into the
> > kernel vm_area. The subsequent fault-in from user space will populate that
> > page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
> > can be used to prevent fault-in from user space. In such a case, if a page
> > is not allocated by the bpf program and not present in the kernel vm_area,
> > the user process will segfault. This is useful for use cases 2 and 3 above.
> >
> > bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> > either at a specific address within the arena or allocates a range with the
> > maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
> > pages and removes the range from the kernel vm_area and from user process
> > vmas.
> >
> > bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
> > bpf program is more important than ease of sharing with user space. This is
> > use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
> > It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
> > instruction as a 32-bit move wX = wY, which will improve bpf prog
> > performance. Otherwise, bpf_arena_cast_user is translated by JIT to
> > conditionally add the upper 32 bits of user vm_start (if the pointer is not
> > NULL) to arena pointers before they are stored into memory. This way, user
> > space sees them as valid 64-bit pointers.
> >
> > Diff https://github.com/llvm/llvm-project/pull/84410 enables the LLVM BPF
> > backend to generate the bpf_addr_space_cast() instruction to cast pointers
> > between address_space(1), which is reserved for bpf_arena pointers, and the
> > default address space zero. All arena pointers in a bpf program written in
> > C language are tagged as __attribute__((address_space(1))). Hence, clang
> > provides helpful diagnostics when pointers cross address spaces. Libbpf and
> > the kernel support only address_space == 1. All other address space
> > identifiers are reserved.
> >
> > rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
> > verifier that rX->type = PTR_TO_ARENA. Any further operations on
> > PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
> > mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
> > them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
> > to copy_from_kernel_nofault() except that no address checks are necessary.
> > The address is guaranteed to be in the 4GB range. If the page is not
> > present, the destination register is zeroed on read, and the operation is
> > ignored on write.
> >
> > rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
> > unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
> > verifier converts such cast instructions to mov32. Otherwise, JIT will emit
> > native code equivalent to:
> >   rX = (u32)rY;
> >   if (rY)
> >     rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
> >
> > After such conversion, the pointer becomes a valid user pointer within
> > bpf_arena range. The user process can access data structures created in
> > bpf_arena without any additional computations. For example, a linked list
> > built by a bpf program can be walked natively by user space.
> >
> > Reviewed-by: Barret Rhoden <brho@xxxxxxxxxx>
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/bpf.h            |   7 +-
> >  include/linux/bpf_types.h      |   1 +
> >  include/uapi/linux/bpf.h       |  10 +
> >  kernel/bpf/Makefile            |   3 +
> >  kernel/bpf/arena.c             | 558 +++++++++++++++++++++++++++++++++
> >  kernel/bpf/core.c              |  11 +
> >  kernel/bpf/syscall.c           |  36 +++
> >  kernel/bpf/verifier.c          |   1 +
> >  tools/include/uapi/linux/bpf.h |  10 +
> >  9 files changed, 635 insertions(+), 2 deletions(-)
> >  create mode 100644 kernel/bpf/arena.c
> >
>
> [...]
>
> >
> >  struct bpf_offload_dev;
> > @@ -2215,6 +2216,8 @@ int generic_map_delete_batch(struct bpf_map *map,
> >  struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
> >  struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
> >
> > +int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
>
> nit: you use more meaningful node_id in arena_alloc_pages(), here
> "nid" was a big mystery when looking at just function definition

nid is a standard name in mm/.

git grep -w nid mm/|wc -l
1064
git grep -w node_id mm/|wc -l
77

Also see slab_nid(), folio_nid().

> > +                       unsigned long nr_pages, struct page **page_array);
> >  #ifdef CONFIG_MEMCG_KMEM
> >  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
> >                             int node);
>
> [...]
>
> > +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> > +{
> > +       return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
> > +}
> > +
> > +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> > +{
> > +       return arena ? arena->user_vm_start : 0;
> > +}
> > +
>
> is it anticipated that these helpers can be called with NULL? I might
> see this later in the patch set, but if not, these NULL checks would
> be best removed to not create wrong expectations.

Looks like you figured it out.
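To make the address_space(1) convention from the commit log concrete for
anyone reading along, here is a minimal BPF-side sketch. It is illustrative
only: the __arena macro, the map definition and the section/program names
are shorthand for this sketch, not mandated by this patch; the
address_space(1) tag is the only part clang and libbpf key off.

/* Minimal sketch, not part of this patch. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define __arena __attribute__((address_space(1)))

struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 1000);      /* arena size in pages (assumed semantics) */
} arena SEC(".maps");

/* The pointee lives in the arena; the pointer variable itself is ordinary
 * global storage. Loads/stores through it become PTR_TO_ARENA accesses
 * (PROBE_MEM32 under the hood), and clang complains if it is mixed with
 * address-space-0 pointers without an explicit cast.
 */
int __arena *shared_counter;

SEC("syscall")
int bump(void *ctx)
{
        if (shared_counter)
                (*shared_counter)++;
        return 0;
}

char _license[] SEC("license") = "GPL";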
> > +static long arena_map_peek_elem(struct bpf_map *map, void *value)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static long arena_map_pop_elem(struct bpf_map *map, void *value)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static long arena_map_delete_elem(struct bpf_map *map, void *value)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
>
> This is a separate topic, but I'll just mention it here. It was always
> confusing to me why we don't just treat all these callbacks as
> optional and return -EOPNOTSUPP in generic map code. Unless I miss
> something subtle, we should do a round of clean ups and remove dozens
> of unnecessary single line callbacks like these throughout the entire
> BPF kernel code.

Yeah, could be a generic follow-up. Agreed.

> > +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> > +{
> > +       return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> > +}
> > +
>
> [...]
>
> > +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> > +{
> > +       struct bpf_map *map = vmf->vma->vm_file->private_data;
> > +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> > +       struct page *page;
> > +       long kbase, kaddr;
> > +       int ret;
> > +
> > +       kbase = bpf_arena_get_kern_vm_start(arena);
> > +       kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> > +
> > +       guard(mutex)(&arena->lock);
> > +       page = vmalloc_to_page((void *)kaddr);
> > +       if (page)
> > +               /* already have a page vmap-ed */
> > +               goto out;
> > +
> > +       if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> > +               /* User space requested to segfault when page is not allocated by bpf prog */
> > +               return VM_FAULT_SIGSEGV;
> > +
> > +       ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> > +       if (ret)
> > +               return VM_FAULT_SIGSEGV;
> > +
> > +       /* Account into memcg of the process that created bpf_arena */
> > +       ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);
>
> any specific reason to not take into account map->numa_node here?

There are several reasons:
1. To avoid creating the expectation that map->numa_node is meaningful for
actual run-time allocations. arena_alloc_pages() takes an explicit nid, and
it would conflict if map->numa_node was also used.
2. In this case it's user space faulting in a page, so it's best to let
NUMA_NO_NODE pick the right one.

> > +       if (ret) {
> > +               mtree_erase(&arena->mt, vmf->pgoff);
> > +               return VM_FAULT_SIGSEGV;
> > +       }
> > +
> > +       ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
> > +       if (ret) {
> > +               mtree_erase(&arena->mt, vmf->pgoff);
> > +               __free_page(page);
> > +               return VM_FAULT_SIGSEGV;
> > +       }
> > +out:
> > +       page_ref_add(page, 1);
> > +       vmf->page = page;
> > +       return 0;
> > +}
> > +
>
> [...]
>
> > +/*
> > + * Allocate pages and vmap them into kernel vmalloc area.
> > + * Later the pages will be mmaped into user space vma.
> > + */
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> > +{
> > +       /* user_vm_end/start are fixed before bpf prog runs */
> > +       long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> > +       u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> > +       struct page **pages;
> > +       long pgoff = 0;
> > +       u32 uaddr32;
> > +       int ret, i;
> > +
> > +       if (page_cnt > page_cnt_max)
> > +               return 0;
> > +
> > +       if (uaddr) {
> > +               if (uaddr & ~PAGE_MASK)
> > +                       return 0;
> > +               pgoff = compute_pgoff(arena, uaddr);
> > +               if (pgoff + page_cnt > page_cnt_max)
>
> As I mentioned offline, is this guaranteed to not overflow? it's not
> obvious because at least according to all the types (longs), uaddr can
> be arbitrary, so pgoff can be quite large, etc. Might be worthwhile
> rewriting as `pgoff > page_cnt_max - page_cnt` or something, just to
> make it clear in code it has no chance of overflowing.

It cannot overflow. All three variables (page_cnt, pgoff, page_cnt_max) are
within the u32 range. Look at how they're computed.

> > +                       /* requested address will be outside of user VMA */
> > +                       return 0;
> > +       }
> > +
> > +       /* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
> > +       pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> > +       if (!pages)
> > +               return 0;
> > +
> > +       guard(mutex)(&arena->lock);
> > +
> > +       if (uaddr)
> > +               ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt - 1,
> > +                                        MT_ENTRY, GFP_KERNEL);
> > +       else
> > +               ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
> > +                                       page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);
>
> mtree_alloc_range() is lacking documentation, unfortunately, so it's
> not clear to me whether max should be just `page_cnt_max - 1` as you
> have or `page_cnt_max - page_cnt`. There is a "Test a single entry" in
> lib/test_maple_tree.c where min == max and size == 4096 which is
> expected to work, so I have a feeling that the correct max should be
> up to the maximum possible beginning of range, but I might be
> mistaken. Can you please double check?

Not only did I double check this, the selftest in patch 12 covers this
boundary condition :)

>
> > +       if (ret)
> > +               goto out_free_pages;
> > +
> > +       ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
> > +                                 node_id, page_cnt, pages);
> > +       if (ret)
> > +               goto out;
> > +
> > +       uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> > +       /* Earlier checks make sure that uaddr32 + page_cnt * PAGE_SIZE will not overflow 32-bit */
>
> we checked that `uaddr32 + page_cnt * PAGE_SIZE - 1` won't overflow,
> full page_cnt * PAGE_SIZE can actually overflow, so comment is a bit
> imprecise. But it's not really clear why it matters here, tbh.
> kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE can actually update
> upper 32-bits of the kernel-side memory address, is that a problem?

There is a giant thread in the v2 series between myself and Barret that
explains at length why we decided to prohibit overflow within the lower
32 bits. The short tldr: technically we can allow lower 32-bit overflow,
but the code gets very complex.
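Since the fault-in path keeps coming up, here is a rough sketch of the
user-space side that ends up in arena_vm_fault(). It is illustrative only:
mmap-ing the map fd at offset 0, the helper name and the arena size are
assumptions for the sketch, not something this patch alone guarantees.

/* Hedged user-space sketch; not part of this patch. */
#include <stddef.h>
#include <sys/mman.h>
#include <bpf/libbpf.h>

static void *map_arena(struct bpf_map *map, size_t arena_sz)
{
        void *base = mmap(NULL, arena_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED, bpf_map__fd(map), 0);

        if (base == MAP_FAILED)
                return NULL;

        /* Touching a page goes through arena_vm_fault(): a page already
         * vmap-ed by the bpf prog is reused; otherwise a new page is
         * allocated and inserted into the maple tree, unless the arena was
         * created with BPF_F_SEGV_ON_FAULT, in which case this access
         * segfaults instead.
         */
        (void)((volatile char *)base)[0];
        return base;
}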
>
> > +       ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
> > +                               kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
> > +       if (ret) {
> > +               for (i = 0; i < page_cnt; i++)
> > +                       __free_page(pages[i]);
> > +               goto out;
> > +       }
> > +       kvfree(pages);
> > +       return clear_lo32(arena->user_vm_start) + uaddr32;
> > +out:
> > +       mtree_erase(&arena->mt, pgoff);
> > +out_free_pages:
> > +       kvfree(pages);
> > +       return 0;
> > +}
> > +
> > +/*
> > + * If page is present in vmalloc area, unmap it from vmalloc area,
> > + * unmap it from all user space vma-s,
> > + * and free it.
> > + */
> > +static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> > +{
> > +       struct vma_list *vml;
> > +
> > +       list_for_each_entry(vml, &arena->vma_list, head)
> > +               zap_page_range_single(vml->vma, uaddr,
> > +                                     PAGE_SIZE * page_cnt, NULL);
> > +}
> > +
> > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> > +{
> > +       u64 full_uaddr, uaddr_end;
> > +       long kaddr, pgoff, i;
> > +       struct page *page;
> > +
> > +       /* only aligned lower 32-bit are relevant */
> > +       uaddr = (u32)uaddr;
> > +       uaddr &= PAGE_MASK;
> > +       full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
> > +       uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
> > +       if (full_uaddr >= uaddr_end)
> > +               return;
> > +
> > +       page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
> > +
> > +       guard(mutex)(&arena->lock);
> > +
> > +       pgoff = compute_pgoff(arena, uaddr);
> > +       /* clear range */
> > +       mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt - 1, NULL, GFP_KERNEL);
> > +
> > +       if (page_cnt > 1)
> > +               /* bulk zap if multiple pages being freed */
> > +               zap_pages(arena, full_uaddr, page_cnt);
> > +
> > +       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> > +       for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> > +               page = vmalloc_to_page((void *)kaddr);
> > +               if (!page)
> > +                       continue;
> > +               if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
> > +                       zap_pages(arena, full_uaddr, 1);
>
> The way you split these zap_pages for page_cnt == 1 and page_cnt > 1
> is quite confusing. Why can't you just unconditionally zap_pages()
> regardless of page_cnt before this loop? And why for page_cnt == 1 we
> have `page_mapped(page)` check, but it's ok to not check this for
> page_cnt>1 case?
>
> This asymmetric handling is confusing and suggests something more is
> going on here. Or am I overthinking it?

It's an important optimization for the common case of page_cnt == 1. If the
page wasn't mapped into some user vma, there is no need to call zap_pages(),
which is slow. But when page_cnt is big, it's much faster to do the batched
zap, which is what this code does.

For page_cnt == 2 or another small number there is no good optimization to
do other than checking whether all pages in the range are not page_mapped()
and omitting zap_pages(). I don't think it's worth doing such an optimization
at this point, since page_cnt == 1 is likely the most common case. If that
changes, it can be optimized later.

> > +               vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
> > +               __free_page(page);
>
> can something else in the kernel somehow get a refcnt on this page?
> I.e., is it ok to unconditionally free page here instead of some sort
> of put_page() instead?

Hmm, __free_page() does the refcnt; it's not an unconditional free.
put_page() is for the case when a folio is present. Here all of them are
privately managed pages: single-page individual allocations, so the normal
refcnt scheme applies, which is exactly what __free_page() does.

Thanks for the review. I believe I answered all the questions. Looks like
the only follow-up is to generalize all the 'return -EOPNOTSUPP' callbacks
across all map types. I'll add it to my todo, unless somebody has more free
cycles.
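P.S. For anyone skimming the thread, a hedged sketch of how a bpf program
would drive the alloc/free paths discussed above. The kfunc prototypes and
the __arena/arena names are assumptions written down for illustration; the
selftests in this series are the authoritative usage.

/* Sketch only; the prototypes below are assumed, not copied from the final headers. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define __arena __attribute__((address_space(1)))

struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 100);       /* arena size in pages (assumed semantics) */
} arena SEC(".maps");

/* Assumed kfunc signatures for this sketch. */
void __arena *bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt,
                                    int node_id, __u64 flags) __ksym;
void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym;

SEC("syscall")
int alloc_free_demo(void *ctx)
{
        /* NULL addr: let the maple tree pick a free 2-page range; -1 == NUMA_NO_NODE */
        void __arena *p = bpf_arena_alloc_pages(&arena, NULL, 2, -1, 0);

        if (!p)
                return 1;
        /* ... build data structures in the arena here ... */
        bpf_arena_free_pages(&arena, p, 2);
        return 0;
}

char _license[] SEC("license") = "GPL";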