Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes:

> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> bpf programs have multiple options to communicate with user space:
> - Various ring buffers (perf, ftrace, bpf): The data is streamed
>   unidirectionally from bpf to user space.
> - Hash map: The bpf program populates elements, and user space consumes them
>   via bpf syscall.
> - mmap()-ed array map: Libbpf creates an array map that is directly accessed by
>   the bpf program and mmap-ed to user space. It's the fastest way. Its
>   disadvantage is that memory for the whole array is reserved at the start.
>
> These patches introduce bpf_arena, which is a sparse shared memory region
> between the bpf program and user space.

This will need to be documented, probably in a new file at
Documentation/bpf/map_arena.rst since it's cosplaying as a BPF map.

Why is it a map, when it doesn't have map semantics as evidenced by the
-EOPNOTSUPP map accessors? Is it the only way you can reuse the kernel /
userspace plumbing?

> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage. The bpf program implements an
>    in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>    value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space occasionally consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view. It is
>    not shared with user space.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area.
> The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB. The memory is not
> shared with user space. This is use case 3. In such a case, the
> BPF_F_NO_USER_CONV flag is recommended. It will tell the verifier to treat the

I can see _what_ this flag does but it's not clear what the consequences
of this flag are. Perhaps it would be better named BPF_F_NO_USER_ACCESS?

> rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will
> improve bpf prog performance. Otherwise, bpf_arena_cast_user is translated by
> JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is
> not NULL) to arena pointers before they are stored into memory. This way, user
> space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1.
> All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32. Otherwise, JIT will emit native code
> equivalent to:
> rX = (u32)rY;
> if (rX)
>   rX |= arena->user_vm_start & ~(u64)~0U;
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space. The last two patches
> demonstrate how algorithms in the C language can be compiled as a bpf program
> and as native code.
>
> Followup patches are planned:
> . selftests in asm
> . support arena variables in global data. Example:
>   void __arena * ptr; // works
>   int __arena var; // supported by llvm, but not by kernel and libbpf yet
> . support bpf_spin_lock in arena
>   bpf programs running on different CPUs can synchronize access to the arena via
>   existing bpf_spin_lock mechanisms (spin_locks in bpf_array or in bpf hash map).
>   It will be more convenient to allow spin_locks inside the arena too.
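
To make the cast_user conversion quoted above concrete, here is a minimal
userspace sketch in plain C. The function names and any addresses you feed it
are made up for illustration and are not taken from the patches; only the
arithmetic follows the cover letter, with user_vm_start standing in for
arena->user_vm_start.

```c
#include <stdint.h>

/*
 * Hypothetical model of the cast_user conversion described in the cover
 * letter: keep the low 32 bits of the arena pointer and, unless the
 * result is NULL, OR in the upper 32 bits of the user mapping's start.
 */
static uint64_t arena_cast_user(uint64_t ry, uint64_t user_vm_start)
{
	uint64_t rx = (uint32_t)ry;	/* rX = (u32)rY */

	if (rx)				/* a NULL pointer stays NULL */
		rx |= user_vm_start & ~(uint64_t)~0U;
	return rx;
}

/*
 * Hypothetical model of the BPF_F_NO_USER_CONV case, which the cover
 * letter describes as a plain 32-bit move (wX = wY).
 */
static uint64_t arena_cast_user_noconv(uint64_t ry)
{
	return (uint32_t)ry;
}
```

If this reading is right, a kernel-side value whose low 32 bits are 0x1000
turns into user_vm_start's upper half plus 0x1000 in the first variant, and
stays a bare 32-bit offset in the second, which would not be a dereferenceable
user pointer.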
>
> Patch set overview:
> - patch 1,2: minor verifier enhancements to enable bpf_arena kfuncs
> - patch 3: export vmap_pages_range() to be used outside of mm directory
> - patch 4: main patch that introduces bpf_arena map type. See commit log
> - patch 6: probe_mem32 support in x86 JIT
> - patch 7: bpf_cast_user support in x86 JIT
> - patch 8: main verifier patch to support bpf_arena
> - patch 9: __arg_arena to tag arena pointers in bpf global functions
> - patch 11: libbpf support for arena
> - patch 12: __ulong() macro to pass 64-bit constants in BTF
> - patch 13: export PAGE_SIZE constant into vmlinux BTF to be used from bpf programs
> - patch 14: bpf_arena_cast instruction as inline asm for setups with old LLVM
> - patch 15,16: testcases in C
>
> Alexei Starovoitov (16):
>   bpf: Allow kfuncs return 'void *'
>   bpf: Recognize '__map' suffix in kfunc arguments
>   mm: Expose vmap_pages_range() to the rest of the kernel.
>   bpf: Introduce bpf_arena.
>   bpf: Disasm support for cast_kern/user instructions.
>   bpf: Add x86-64 JIT support for PROBE_MEM32 pseudo instructions.
>   bpf: Add x86-64 JIT support for bpf_cast_user instruction.
>   bpf: Recognize cast_kern/user instructions in the verifier.
>   bpf: Recognize btf_decl_tag("arg:arena") as PTR_TO_ARENA.
>   libbpf: Add __arg_arena to bpf_helpers.h
>   libbpf: Add support for bpf_arena.
>   libbpf: Allow specifying 64-bit integers in map BTF.
>   bpf: Tell bpf programs kernel's PAGE_SIZE
>   bpf: Add helper macro bpf_arena_cast()
>   selftests/bpf: Add bpf_arena_list test.
>   selftests/bpf: Add bpf_arena_htab test.
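
A rough userspace model of the PROBE_MEM32 semantics described in the cover
letter may help when reading the JIT patch: a read from a page that is not
present yields zero, and a write to such a page is dropped. Everything named
sim_* below is hypothetical, the 16-page buffer is only a stand-in for the 4GB
range, and the explicit presence check is purely a modelling device, not a
claim about how the JIT implements it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_NR_PAGES  16u	/* tiny stand-in for the 4GB arena range */

/* Hypothetical arena model: backing bytes plus a per-page presence flag. */
struct sim_arena {
	uint8_t mem[SIM_NR_PAGES * SIM_PAGE_SIZE];
	bool present[SIM_NR_PAGES];
};

/* True if every page touched by an 8-byte access at addr is present. */
static bool sim_present(const struct sim_arena *a, uint32_t addr)
{
	uint64_t first = addr / SIM_PAGE_SIZE;
	uint64_t last = ((uint64_t)addr + sizeof(uint64_t) - 1) / SIM_PAGE_SIZE;

	return last < SIM_NR_PAGES && a->present[first] && a->present[last];
}

/* Load: the destination is zeroed when the page is not present. */
static uint64_t sim_load64(const struct sim_arena *a, uint32_t addr)
{
	uint64_t val = 0;

	if (sim_present(a, addr))
		memcpy(&val, &a->mem[addr], sizeof(val));
	return val;
}

/* Store: silently ignored when the page is not present. */
static void sim_store64(struct sim_arena *a, uint32_t addr, uint64_t val)
{
	if (sim_present(a, addr))
		memcpy(&a->mem[addr], &val, sizeof(val));
}
```

In this model a store to a missing page leaves no trace: marking the page
present afterwards and reading the same offset still returns zero.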
>
>  arch/x86/net/bpf_jit_comp.c                   | 222 +++++++-
>  include/linux/bpf.h                           |   8 +-
>  include/linux/bpf_types.h                     |   1 +
>  include/linux/bpf_verifier.h                  |   1 +
>  include/linux/filter.h                        |   4 +
>  include/linux/vmalloc.h                       |   2 +
>  include/uapi/linux/bpf.h                      |  12 +
>  kernel/bpf/Makefile                           |   3 +
>  kernel/bpf/arena.c                            | 518 ++++++++++++++++++
>  kernel/bpf/btf.c                              |  19 +-
>  kernel/bpf/core.c                             |  23 +-
>  kernel/bpf/disasm.c                           |  11 +
>  kernel/bpf/log.c                              |   3 +
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 127 ++++-
>  mm/vmalloc.c                                  |   4 +-
>  tools/include/uapi/linux/bpf.h                |  12 +
>  tools/lib/bpf/bpf_helpers.h                   |   2 +
>  tools/lib/bpf/libbpf.c                        |  62 ++-
>  tools/lib/bpf/libbpf_probes.c                 |   6 +
>  tools/testing/selftests/bpf/DENYLIST.aarch64  |   1 +
>  tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
>  tools/testing/selftests/bpf/bpf_arena_alloc.h |  58 ++
>  .../testing/selftests/bpf/bpf_arena_common.h  |  70 +++
>  tools/testing/selftests/bpf/bpf_arena_htab.h  | 100 ++++
>  tools/testing/selftests/bpf/bpf_arena_list.h  |  95 ++++
>  .../testing/selftests/bpf/bpf_experimental.h  |  41 ++
>  .../selftests/bpf/prog_tests/arena_htab.c     |  88 +++
>  .../selftests/bpf/prog_tests/arena_list.c     |  65 +++
>  .../testing/selftests/bpf/progs/arena_htab.c  |  48 ++
>  .../selftests/bpf/progs/arena_htab_asm.c      |   5 +
>  .../testing/selftests/bpf/progs/arena_list.c  |  75 +++
>  32 files changed, 1669 insertions(+), 21 deletions(-)
>  create mode 100644 kernel/bpf/arena.c
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_alloc.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_common.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_htab.h
>  create mode 100644 tools/testing/selftests/bpf/bpf_arena_list.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_htab.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/arena_list.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_htab.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_htab_asm.c
>  create mode 100644 tools/testing/selftests/bpf/progs/arena_list.c