On Thu, Feb 8, 2024 at 8:06 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
> region, like memcached or any key/value storage. The bpf program implements an
> in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
> value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
> rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
> The user space may mmap it, but bpf program will not convert pointers
> to user base at run-time to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of bpf
> program is more important than ease of sharing with user space. This is use
> case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended. It will
> tell the verifier to treat the rX = bpf_arena_cast_user(rY) instruction as a
> 32-bit move wX = wY, which will improve bpf prog performance. Otherwise,
> bpf_arena_cast_user is translated by JIT to conditionally add the upper 32 bits
> of user vm_start (if the pointer is not NULL) to arena pointers before they are
> stored into memory. This way, user space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses.
> The behavior is similar to copy_from_kernel_nofault() except that no address
> checks are necessary. The address is guaranteed to be in the 4GB range. If
> the page is not present, the destination register is zeroed on read, and the
> operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
>   rX = (u32)rY;
>   if (rY)
>     rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space.
>
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> ---
>  include/linux/bpf.h            |   5 +-
>  include/linux/bpf_types.h      |   1 +
>  include/uapi/linux/bpf.h       |   7 +
>  kernel/bpf/Makefile            |   3 +
>  kernel/bpf/arena.c             | 557 +++++++++++++++++++++++++++++++++
>  kernel/bpf/core.c              |  11 +
>  kernel/bpf/syscall.c           |   3 +
>  kernel/bpf/verifier.c          |   1 +
>  tools/include/uapi/linux/bpf.h |   7 +
>  9 files changed, 593 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 8b0dcb66eb33..de557c6c42e0 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -37,6 +37,7 @@ struct perf_event;
>  struct bpf_prog;
>  struct bpf_prog_aux;
>  struct bpf_map;
> +struct bpf_arena;
>  struct sock;
>  struct seq_file;
>  struct btf;
> @@ -534,8 +535,8 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>  			struct bpf_spin_lock *spin_lock);
>  void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
>  		      struct bpf_spin_lock *spin_lock);
> -
> -
> +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena);
> +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena);
>  int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
>
>  struct bpf_offload_dev;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 94baced5a1ad..9f2a6b83b49e 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -132,6 +132,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
>
>  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
>  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d96708380e52..f6648851eae6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -983,6 +983,7 @@ enum bpf_map_type {
>  	BPF_MAP_TYPE_BLOOM_FILTER,
>  	BPF_MAP_TYPE_USER_RINGBUF,
>  	BPF_MAP_TYPE_CGRP_STORAGE,
> +	BPF_MAP_TYPE_ARENA,
>  	__MAX_BPF_MAP_TYPE
>  };
>
> @@ -1370,6 +1371,12 @@ enum {
>
>  /* BPF token FD is passed in a corresponding command's token_fd field */
>  	BPF_F_TOKEN_FD = (1U << 16),
> +
> +/* When user space page faults in bpf_arena send SIGSEGV instead of inserting new page */
> +	BPF_F_SEGV_ON_FAULT = (1U << 17),
> +
> +/* Do not translate kernel bpf_arena pointers to user pointers */
> +	BPF_F_NO_USER_CONV = (1U << 18),
>  };
>
>  /* Flags for BPF_PROG_QUERY. */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 4ce95acfcaa7..368c5d86b5b7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -15,6 +15,9 @@ obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
>  obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
>  obj-$(CONFIG_BPF_JIT) += trampoline.o
>  obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
> +ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
> +obj-$(CONFIG_BPF_SYSCALL) += arena.o
> +endif
>  obj-$(CONFIG_BPF_JIT) += dispatcher.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> new file mode 100644
> index 000000000000..5c1014471740
> --- /dev/null
> +++ b/kernel/bpf/arena.c
> @@ -0,0 +1,557 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/err.h>
> +#include <linux/btf_ids.h>
> +#include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> +
> +/*
> + * bpf_arena is a sparsely populated shared memory region between bpf program and
> + * user space process.
> + *
> + * For example on x86-64 the values could be:
> + * user_vm_start 7f7d26200000     // picked by mmap()
> + * kern_vm_start ffffc90001e69000 // picked by get_vm_area()
> + * For user space all pointers within the arena are normal 8-byte addresses.
> + * In this example 7f7d26200000 is the address of the first page (pgoff=0).
> + * The bpf program will access it as: kern_vm_start + lower_32bit_of_user_ptr
> + * (u32)7f7d26200000 -> 26200000
> + * hence
> + * ffffc90001e69000 + 26200000 == ffffc90028069000 is "pgoff=0" within 4Gb
> + * kernel memory region.
> + *
> + * BPF JITs generate the following code to access arena:
> + *   mov eax, eax  // eax has lower 32-bit of user pointer
> + *   mov word ptr [rax + r12 + off], bx
> + * where r12 == kern_vm_start and off is s16.
> + * Hence allocate 4Gb + GUARD_SZ/2 on each side.
> + *
> + * Initially kernel vm_area and user vma are not populated.
> + * User space can fault-in any address which will insert the page
> + * into kernel and user vma.
> + * bpf program can allocate a page via bpf_arena_alloc_pages() kfunc
> + * which will insert it into kernel vm_area.
> + * The later fault-in from user space will populate that page into user vma.
> + */
> +
> +/* number of bytes addressable by LDX/STX insn with 16-bit 'off' field */
> +#define GUARD_SZ (1ull << sizeof(((struct bpf_insn *)0)->off) * 8)
> +#define KERN_VM_SZ ((1ull << 32) + GUARD_SZ)

I feel like we need another named constant for those 4GB limits here,
something like:

#define MAX_ARENA_SZ (1ull << 32)
#define KERN_VM_SZ (MAX_ARENA_SZ + GUARD_SZ)

see below why

> +
> +struct bpf_arena {
> +	struct bpf_map map;
> +	u64 user_vm_start;
> +	u64 user_vm_end;
> +	struct vm_struct *kern_vm;
> +	struct maple_tree mt;
> +	struct list_head vma_list;
> +	struct mutex lock;
> +};
> +

[...]
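As a side note, mostly to double-check my own reading of the comment above:
the kern/user address relationship seems to boil down to the following
(stand-alone user-space sketch using the example values from the comment,
obviously not part of the patch):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* example values from the arena.c comment */
	uint64_t user_vm_start = 0x7f7d26200000ull;     /* picked by mmap() */
	uint64_t kern_vm_start = 0xffffc90001e69000ull; /* picked by get_vm_area() */

	/* a user pointer somewhere inside the arena (first page here) */
	uint64_t user_ptr = user_vm_start;

	/* what the JIT effectively computes: kern_vm_start + lower 32 bits */
	uint64_t kern_addr = kern_vm_start + (uint32_t)user_ptr;

	printf("%llx\n", (unsigned long long)kern_addr);
	/* prints ffffc90028069000, matching the "pgoff=0" example above */
	return 0;
}

which also makes it obvious why the user range must not cross a 32-bit
boundary: only the lower 32 bits of the user pointer survive the translation.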
> +static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> +{
> +	struct vm_struct *kern_vm;
> +	int numa_node = bpf_map_attr_numa_node(attr);
> +	struct bpf_arena *arena;
> +	u64 vm_range;
> +	int err = -ENOMEM;
> +
> +	if (attr->key_size || attr->value_size || attr->max_entries == 0 ||
> +	    /* BPF_F_MMAPABLE must be set */
> +	    !(attr->map_flags & BPF_F_MMAPABLE) ||
> +	    /* No unsupported flags present */
> +	    (attr->map_flags & ~(BPF_F_SEGV_ON_FAULT | BPF_F_MMAPABLE | BPF_F_NO_USER_CONV)))
> +		return ERR_PTR(-EINVAL);
> +
> +	if (attr->map_extra & ~PAGE_MASK)
> +		/* If non-zero the map_extra is an expected user VMA start address */
> +		return ERR_PTR(-EINVAL);
> +
> +	vm_range = (u64)attr->max_entries * PAGE_SIZE;
> +	if (vm_range > (1ull << 32))

here we can then use MAX_ARENA_SZ

> +		return ERR_PTR(-E2BIG);
> +
> +	if ((attr->map_extra >> 32) != ((attr->map_extra + vm_range - 1) >> 32))
> +		/* user vma must not cross 32-bit boundary */
> +		return ERR_PTR(-ERANGE);
> +
> +	kern_vm = get_vm_area(KERN_VM_SZ, VM_MAP | VM_USERMAP);
> +	if (!kern_vm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	arena = bpf_map_area_alloc(sizeof(*arena), numa_node);
> +	if (!arena)
> +		goto err;
> +
> +	arena->kern_vm = kern_vm;
> +	arena->user_vm_start = attr->map_extra;
> +	if (arena->user_vm_start)
> +		arena->user_vm_end = arena->user_vm_start + vm_range;
> +
> +	INIT_LIST_HEAD(&arena->vma_list);
> +	bpf_map_init_from_attr(&arena->map, attr);
> +	mt_init_flags(&arena->mt, MT_FLAGS_ALLOC_RANGE);
> +	mutex_init(&arena->lock);
> +
> +	return &arena->map;
> +err:
> +	free_vm_area(kern_vm);
> +	return ERR_PTR(err);
> +}
> +
> +static int for_each_pte(pte_t *ptep, unsigned long addr, void *data)
> +{
> +	struct page *page;
> +	pte_t pte;
> +
> +	pte = ptep_get(ptep);
> +	if (!pte_present(pte))
> +		return 0;
> +	page = pte_page(pte);
> +	/*
> +	 * We do not update pte here:
> +	 * 1. Nobody should be accessing bpf_arena's range outside of a kernel bug
> +	 * 2. TLB flushing is batched or deferred. Even if we clear pte,
> +	 *    the TLB entries can stick around and continue to permit access to
> +	 *    the freed page. So it all relies on 1.
> +	 */
> +	__free_page(page);
> +	return 0;
> +}
> +
> +static void arena_map_free(struct bpf_map *map)
> +{
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	/*
> +	 * Check that user vma-s are not around when bpf map is freed.
> +	 * mmap() holds vm_file which holds bpf_map refcnt.
> +	 * munmap() must have happened on vma followed by arena_vm_close()
> +	 * which would clear arena->vma_list.
> +	 */
> +	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
> +		return;
> +
> +	/*
> +	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
> +	 * It unmaps everything from vmalloc area and clears pgtables.
> +	 * Call apply_to_existing_page_range() first to find populated ptes and
> +	 * free those pages.
> +	 */
> +	apply_to_existing_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +				     KERN_VM_SZ - GUARD_SZ / 2, for_each_pte, NULL);

I'm still reading the rest (so it might become obvious), but this
KERN_VM_SZ - GUARD_SZ / 2 is a bit surprising. I understand that
kern_vm_start is shifted by GUARD_SZ/2, but is the intent here to actually
go beyond the maximum 4GB by GUARD_SZ/2, or was the intent to unmap 4GB
(MAX_ARENA_SZ)?

> +	free_vm_area(arena->kern_vm);
> +	mtree_destroy(&arena->mt);
> +	bpf_map_area_free(arena);
> +}
> +

[...]
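Not a review comment per se, but to make sure I'm reading the validation in
arena_map_alloc() correctly, here is roughly what I'd expect the user-space
setup to look like (sketch only; BPF_MAP_TYPE_ARENA and the two new flags come
from this patch set, the map_extra value is an arbitrary illustration, and the
libbpf opts usage is from memory, so take it with a grain of salt):

#include <bpf/bpf.h>	/* bpf_map_create(), LIBBPF_OPTS */

static int create_arena(void)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts,
		/* BPF_F_MMAPABLE is mandatory, the other new flags are optional */
		.map_flags = BPF_F_MMAPABLE | BPF_F_SEGV_ON_FAULT,
		/*
		 * Optional fixed user VMA start address. Must be page aligned
		 * and the whole range must not cross a 32-bit boundary,
		 * matching the -ERANGE check above.
		 */
		.map_extra = 0x100000000ull,
	);

	/*
	 * key_size and value_size must be 0, max_entries is the arena size
	 * in pages (1024 pages = 4MB here, well under the 4GB limit).
	 */
	return bpf_map_create(BPF_MAP_TYPE_ARENA, "arena", 0, 0, 1024, &opts);
}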
> +static unsigned long arena_get_unmapped_area(struct file *filp, unsigned long addr,
> +					     unsigned long len, unsigned long pgoff,
> +					     unsigned long flags)
> +{
> +	struct bpf_map *map = filp->private_data;
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +	long ret;
> +
> +	if (pgoff)
> +		return -EINVAL;
> +	if (len > (1ull << 32))

MAX_ARENA_SZ?

> +		return -E2BIG;
> +
> +	/* if user_vm_start was specified at arena creation time */
> +	if (arena->user_vm_start) {
> +		if (len > arena->user_vm_end - arena->user_vm_start)
> +			return -E2BIG;
> +		if (len != arena->user_vm_end - arena->user_vm_start)
> +			return -EINVAL;
> +		if (addr != arena->user_vm_start)
> +			return -EINVAL;
> +	}
> +
> +	ret = current->mm->get_unmapped_area(filp, addr, len * 2, 0, flags);
> +	if (IS_ERR_VALUE(ret))
> +		return 0;

Can you leave a comment explaining why we are swallowing errors here, if this
is intentional?

> +	if ((ret >> 32) == ((ret + len - 1) >> 32))
> +		return ret;
> +	if (WARN_ON_ONCE(arena->user_vm_start))
> +		/* checks at map creation time should prevent this */
> +		return -EFAULT;
> +	return round_up(ret, 1ull << 32);

this is still probably MAX_ARENA_SZ, no?

> +}
> +
> +static int arena_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
> +{
> +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> +
> +	guard(mutex)(&arena->lock);
> +	if (arena->user_vm_start && arena->user_vm_start != vma->vm_start)
> +		/*
> +		 * If map_extra was not specified at arena creation time then
> +		 * 1st user process can do mmap(NULL, ...) to pick user_vm_start
> +		 * 2nd user process must pass the same addr to mmap(addr, MAP_FIXED..);
> +		 * or
> +		 * specify addr in map_extra and
> +		 * use the same addr later with mmap(addr, MAP_FIXED..);
> +		 */
> +		return -EBUSY;
> +
> +	if (arena->user_vm_end && arena->user_vm_end != vma->vm_end)
> +		/* all user processes must have the same size of mmap-ed region */
> +		return -EBUSY;
> +
> +	/* Earlier checks should prevent this */
> +	if (WARN_ON_ONCE(vma->vm_end - vma->vm_start > (1ull << 32) || vma->vm_pgoff))

MAX_ARENA_SZ?

> +		return -EFAULT;
> +
> +	if (remember_vma(arena, vma))
> +		return -ENOMEM;
> +
> +	arena->user_vm_start = vma->vm_start;
> +	arena->user_vm_end = vma->vm_end;
> +	/*
> +	 * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> +	 * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> +	 * potential change of user_vm_start.
> +	 */
> +	vm_flags_set(vma, VM_DONTEXPAND);
> +	vma->vm_ops = &arena_vm_ops;
> +	return 0;
> +}
> +

[...]
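And to complete my mental model of the mmap() contract described in the
comment in arena_map_mmap() above, this is how I'd expect user space to
consume it (again just a sketch under my reading of the patch; assumes the
map was created without map_extra and that arena_fd was shared between the
processes somehow):

#include <stdint.h>
#include <sys/mman.h>

/* First process: let the kernel pick user_vm_start. */
void *arena_mmap_first(int arena_fd, size_t size)
{
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    arena_fd, 0);
}

/*
 * Every other process: must map the same size at the same address,
 * otherwise arena_map_mmap() returns -EBUSY.
 */
void *arena_mmap_other(int arena_fd, void *user_vm_start, size_t size)
{
	return mmap(user_vm_start, size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, arena_fd, 0);
}

pgoff has to stay 0 in both cases, per the checks in
arena_get_unmapped_area() and arena_map_mmap().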