Re: [PATCH v3 bpf-next 01/14] bpf: Introduce bpf_arena.

On Thu, Mar 7, 2024 at 5:08 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
>    anonymous region, like memcached or any key/value storage. The bpf
>    program implements an in-kernel accelerator. XDP prog can search for
>    a key in bpf_arena and return a value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash
>    tables, rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>    User space may mmap it, but the bpf program will not convert pointers
>    to the user base at run time, in order to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space
> can fault in pages within the range. While servicing a page fault,
> bpf_arena logic will insert a new page into the kernel and user vmas. The
> bpf program can allocate pages from that region via
> bpf_arena_alloc_pages(). This kernel function will insert pages into the
> kernel vm_area. The subsequent fault-in from user space will populate that
> page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
> can be used to prevent fault-in from user space. In such a case, if a page
> is not allocated by the bpf program and not present in the kernel vm_area,
> the user process will segfault. This is useful for use cases 2 and 3 above.
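
Aside, for anyone following along: on the bpf program side the arena is
defined like a regular map. A minimal sketch in selftest style (the
__uint()/SEC() macros come from bpf_helpers.h, not from this patch):

        struct {
                __uint(type, BPF_MAP_TYPE_ARENA);
                __uint(map_flags, BPF_F_MMAPABLE); /* optionally | BPF_F_SEGV_ON_FAULT */
                __uint(max_entries, 100);          /* arena size in pages */
        } arena SEC(".maps");
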
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
> pages and removes the range from the kernel vm_area and from user process
> vmas.
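
From the bpf program, allocating and freeing then looks roughly like the
sketch below, assuming the kfunc signatures introduced by this patch
("__arena" is shorthand for the address_space(1) attribute discussed
further down, and -1 stands for NUMA_NO_NODE):

        void __arena *bpf_arena_alloc_pages(void *map, void __arena *addr,
                                            __u32 page_cnt, int node_id,
                                            __u64 flags) __ksym;
        void bpf_arena_free_pages(void *map, void __arena *ptr,
                                  __u32 page_cnt) __ksym;

        /* allocate one zeroed page anywhere in the arena, then free it */
        void __arena *page = bpf_arena_alloc_pages(&arena, NULL, 1, -1, 0);
        if (page)
                bpf_arena_free_pages(&arena, page, 1);
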
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB when the speed of
> the bpf program is more important than ease of sharing with user space. This
> is use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
> It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
> instruction as a 32-bit move wX = wY, which will improve bpf prog
> performance. Otherwise, bpf_arena_cast_user is translated by JIT to
> conditionally add the upper 32 bits of user vm_start (if the pointer is not
> NULL) to arena pointers before they are stored into memory. This way, user
> space sees them as valid 64-bit pointers.
>
> The diff at https://github.com/llvm/llvm-project/pull/84410 enables the LLVM
> BPF backend to generate the bpf_addr_space_cast() instruction to cast
> pointers between address_space(1), which is reserved for bpf_arena pointers,
> and the default address space zero. All arena pointers in a bpf program
> written in C are tagged as __attribute__((address_space(1))). Hence, clang
> provides helpful diagnostics when pointers cross address spaces. Libbpf and
> the kernel support only address_space == 1. All other address space
> identifiers are reserved.
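
In bpf program C code the attribute is typically hidden behind a macro; a
minimal sketch (the __arena name is a convention used e.g. in the selftests,
not something this patch mandates):

        /* tag pointers that live in the arena so clang inserts
         * bpf_addr_space_cast() whenever they cross address spaces */
        #define __arena __attribute__((address_space(1)))

        struct node {
                struct node __arena *next;  /* arena-to-arena pointer */
                __u64 value;
        };
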
>
> rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
> verifier that rX->type = PTR_TO_ARENA. Any further operations on a
> PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
> mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
> them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
> to copy_from_kernel_nofault() except that no address checks are necessary.
> The address is guaranteed to be in the 4GB range. If the page is not
> present, the destination register is zeroed on read, and the operation is
> ignored on write.
>
> rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
> unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
> verifier converts such cast instructions to mov32. Otherwise, JIT will emit
> native code equivalent to:
> rX = (u32)rY;
> if (rY)
>   rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list
> built by a bpf program can be walked natively by user space.
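
For completeness, the user-space side of that can be as simple as the sketch
below; this is not code from the patch, skel/nr_pages/first_node() are
placeholders, and error handling is omitted:

        /* mmap the arena and walk a list the bpf program built; no pointer
         * fixups are needed because the JIT already stored full 64-bit user
         * addresses (BPF_F_NO_USER_CONV not set) */
        void *base = mmap(NULL, nr_pages * getpagesize(),
                          PROT_READ | PROT_WRITE, MAP_SHARED,
                          bpf_map__fd(skel->maps.arena), 0);
        for (struct node *n = first_node(base); n; n = n->next)
                printf("%llu\n", (unsigned long long)n->value);
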
>
> Reviewed-by: Barret Rhoden <brho@xxxxxxxxxx>
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> ---
>  include/linux/bpf.h            |   7 +-
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |  10 +
>  kernel/bpf/Makefile            |   3 +
>  kernel/bpf/arena.c             | 558 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c              |  11 +
>  kernel/bpf/syscall.c           |  36 +++
>  kernel/bpf/verifier.c          |   1 +
>  tools/include/uapi/linux/bpf.h |  10 +
>  9 files changed, 635 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>

[...]

>
>  struct bpf_offload_dev;
> @@ -2215,6 +2216,8 @@ int  generic_map_delete_batch(struct bpf_map *map,
>  struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
>  struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
>
> +int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,

nit: you use the more meaningful node_id in arena_alloc_pages(); here
"nid" was a big mystery when looking at just this declaration.

> +                       unsigned long nr_pages, struct page **page_array);
>  #ifdef CONFIG_MEMCG_KMEM
>  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
>                            int node);

[...]

> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> +{
> +       return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
> +}
> +
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> +{
> +       return arena ? arena->user_vm_start : 0;
> +}
> +

Is it anticipated that these helpers can be called with a NULL arena? I
might see this later in the patch set, but if not, these NULL checks would
be best removed so they don't create wrong expectations.
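
I.e., if a NULL arena is never expected, these could simply be:

        u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
        {
                return (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2;
        }

        u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
        {
                return arena->user_vm_start;
        }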

> +static long arena_map_peek_elem(struct bpf_map *map, void *value)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_pop_elem(struct bpf_map *map, void *value)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static long arena_map_delete_elem(struct bpf_map *map, void *value)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +       return -EOPNOTSUPP;
> +}
> +

This is a separate topic, but I'll just mention it here. It was always
confusing to me why we don't just treat all these callbacks as optional
and return -EOPNOTSUPP from generic map code when one is NULL. Unless I'm
missing something subtle, we should do a round of cleanups and remove
dozens of unnecessary single-line callbacks like these throughout the
entire BPF kernel code.
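
Roughly what I have in mind, as a sketch of one such wrapper in generic map
code (the wrapper name is made up, and this is in no way required for this
patch):

        /* if a map leaves the callback NULL, generic code answers for it */
        static long bpf_map_push_value(struct bpf_map *map, void *value, u64 flags)
        {
                if (!map->ops->map_push_elem)
                        return -EOPNOTSUPP;
                return map->ops->map_push_elem(map, value, flags);
        }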

> +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> +{
> +       return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> +}
> +

[...]

> +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> +{
> +       struct bpf_map *map = vmf->vma->vm_file->private_data;
> +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +       struct page *page;
> +       long kbase, kaddr;
> +       int ret;
> +
> +       kbase = bpf_arena_get_kern_vm_start(arena);
> +       kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> +
> +       guard(mutex)(&arena->lock);
> +       page = vmalloc_to_page((void *)kaddr);
> +       if (page)
> +               /* already have a page vmap-ed */
> +               goto out;
> +
> +       if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> +               /* User space requested to segfault when page is not allocated by bpf prog */
> +               return VM_FAULT_SIGSEGV;
> +
> +       ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> +       if (ret)
> +               return VM_FAULT_SIGSEGV;
> +
> +       /* Account into memcg of the process that created bpf_arena */
> +       ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);

any specific reason to not take into account map->numa_node here?
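
I.e., something like the following, assuming the intent is to honor the
NUMA hint recorded at map creation time (if I'm not mistaken, map->numa_node
is populated from the map creation attrs):

        ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO,
                                  map->numa_node, 1, &page);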

> +       if (ret) {
> +               mtree_erase(&arena->mt, vmf->pgoff);
> +               return VM_FAULT_SIGSEGV;
> +       }
> +
> +       ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
> +       if (ret) {
> +               mtree_erase(&arena->mt, vmf->pgoff);
> +               __free_page(page);
> +               return VM_FAULT_SIGSEGV;
> +       }
> +out:
> +       page_ref_add(page, 1);
> +       vmf->page = page;
> +       return 0;
> +}
> +

[...]

> +/*
> + * Allocate pages and vmap them into kernel vmalloc area.
> + * Later the pages will be mmaped into user space vma.
> + */
> +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> +{
> +       /* user_vm_end/start are fixed before bpf prog runs */
> +       long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> +       u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> +       struct page **pages;
> +       long pgoff = 0;
> +       u32 uaddr32;
> +       int ret, i;
> +
> +       if (page_cnt > page_cnt_max)
> +               return 0;
> +
> +       if (uaddr) {
> +               if (uaddr & ~PAGE_MASK)
> +                       return 0;
> +               pgoff = compute_pgoff(arena, uaddr);
> +               if (pgoff + page_cnt > page_cnt_max)

As I mentioned offline, is this guaranteed not to overflow? It's not
obvious, because at least according to the types (longs), uaddr can be
arbitrary, so pgoff can be quite large, etc. It might be worthwhile
rewriting it as `pgoff > page_cnt_max - page_cnt` or something similar,
just to make it clear in the code that it has no chance of overflowing.
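
I.e.:

        if (pgoff > page_cnt_max - page_cnt)
                /* requested address will be outside of user VMA */
                return 0;

(page_cnt <= page_cnt_max is already guaranteed by the earlier check, so
the subtraction can't go negative.)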

> +                       /* requested address will be outside of user VMA */
> +                       return 0;
> +       }
> +
> +       /* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
> +       pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> +       if (!pages)
> +               return 0;
> +
> +       guard(mutex)(&arena->lock);
> +
> +       if (uaddr)
> +               ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt - 1,
> +                                        MT_ENTRY, GFP_KERNEL);
> +       else
> +               ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
> +                                       page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);

mtree_alloc_range() is lacking documentation, unfortunately, so it's not
clear to me whether max should be `page_cnt_max - 1` as you have, or
`page_cnt_max - page_cnt`. There is a "Test a single entry" case in
lib/test_maple_tree.c where min == max and size == 4096 and which is
expected to succeed, so I have a feeling that the correct max is the
maximum possible beginning of the range, but I might be mistaken. Can
you please double check?
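
I.e., if my reading is right, the call would become:

        ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY, page_cnt,
                                0, page_cnt_max - page_cnt, GFP_KERNEL);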

> +       if (ret)
> +               goto out_free_pages;
> +
> +       ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
> +                                 node_id, page_cnt, pages);
> +       if (ret)
> +               goto out;
> +
> +       uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> +       /* Earlier checks make sure that uaddr32 + page_cnt * PAGE_SIZE will not overflow 32-bit */

We checked that `uaddr32 + page_cnt * PAGE_SIZE - 1` won't overflow, but
the full `uaddr32 + page_cnt * PAGE_SIZE` can, so the comment is a bit
imprecise. It's also not really clear why it matters here, tbh:
kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE can carry into the upper
32 bits of the kernel-side address, is that a problem?

> +       ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
> +                               kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
> +       if (ret) {
> +               for (i = 0; i < page_cnt; i++)
> +                       __free_page(pages[i]);
> +               goto out;
> +       }
> +       kvfree(pages);
> +       return clear_lo32(arena->user_vm_start) + uaddr32;
> +out:
> +       mtree_erase(&arena->mt, pgoff);
> +out_free_pages:
> +       kvfree(pages);
> +       return 0;
> +}
> +
> +/*
> + * If page is present in vmalloc area, unmap it from vmalloc area,
> + * unmap it from all user space vma-s,
> + * and free it.
> + */
> +static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +       struct vma_list *vml;
> +
> +       list_for_each_entry(vml, &arena->vma_list, head)
> +               zap_page_range_single(vml->vma, uaddr,
> +                                     PAGE_SIZE * page_cnt, NULL);
> +}
> +
> +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> +{
> +       u64 full_uaddr, uaddr_end;
> +       long kaddr, pgoff, i;
> +       struct page *page;
> +
> +       /* only aligned lower 32-bit are relevant */
> +       uaddr = (u32)uaddr;
> +       uaddr &= PAGE_MASK;
> +       full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
> +       uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
> +       if (full_uaddr >= uaddr_end)
> +               return;
> +
> +       page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
> +
> +       guard(mutex)(&arena->lock);
> +
> +       pgoff = compute_pgoff(arena, uaddr);
> +       /* clear range */
> +       mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt - 1, NULL, GFP_KERNEL);
> +
> +       if (page_cnt > 1)
> +               /* bulk zap if multiple pages being freed */
> +               zap_pages(arena, full_uaddr, page_cnt);
> +
> +       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> +       for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> +               page = vmalloc_to_page((void *)kaddr);
> +               if (!page)
> +                       continue;
> +               if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
> +                       zap_pages(arena, full_uaddr, 1);

The way zap_pages() is split between the page_cnt == 1 and page_cnt > 1
cases is quite confusing. Why can't you just call zap_pages()
unconditionally before this loop, regardless of page_cnt? And why does
the page_cnt == 1 case have a `page_mapped(page)` check, while it's ok to
skip that check for the page_cnt > 1 case?

This asymmetric handling is confusing and suggests something more is
going on here. Or am I overthinking it?
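
Naively, I would have expected something along these lines (an untested
sketch, ignoring whatever subtlety motivated the split):

        /* zap the whole range from all user vmas up front, then unmap and
         * free whatever is actually present in the kernel vm_area */
        zap_pages(arena, full_uaddr, page_cnt);

        kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
        for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE) {
                page = vmalloc_to_page((void *)kaddr);
                if (!page)
                        continue;
                vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
                __free_page(page);
        }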

> +               vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
> +               __free_page(page);

Can something else in the kernel somehow get a refcount on this page?
I.e., is it ok to unconditionally free the page here rather than use
some sort of put_page()?

> +       }
> +}
> +
> +__bpf_kfunc_start_defs();
> +

[...]




