On Fri, 9 Feb 2024 at 05:06, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> Introduce bpf_arena, which is a sparse shared memory region between the bpf
> program and user space.
>
> Use cases:
> 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed anonymous
>    region, like memcached or any key/value storage. The bpf program implements an
>    in-kernel accelerator. XDP prog can search for a key in bpf_arena and return a
>    value without going to user space.
> 2. The bpf program builds arbitrary data structures in bpf_arena (hash tables,
>    rb-trees, sparse arrays), while user space consumes it.
> 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
>    The user space may mmap it, but bpf program will not convert pointers
>    to user base at run-time to improve bpf program speed.
>
> Initially, the kernel vm_area and user vma are not populated. User space can
> fault in pages within the range. While servicing a page fault, bpf_arena logic
> will insert a new page into the kernel and user vmas. The bpf program can
> allocate pages from that region via bpf_arena_alloc_pages(). This kernel
> function will insert pages into the kernel vm_area. The subsequent fault-in
> from user space will populate that page into the user vma. The
> BPF_F_SEGV_ON_FAULT flag at arena creation time can be used to prevent fault-in
> from user space. In such a case, if a page is not allocated by the bpf program
> and not present in the kernel vm_area, the user process will segfault. This is
> useful for use cases 2 and 3 above.
>
> bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> either at a specific address within the arena or allocates a range with the
> maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees pages
> and removes the range from the kernel vm_area and from user process vmas.
>
> bpf_arena can be used as a bpf program "heap" of up to 4GB.
> The speed of bpf program is more important than ease of sharing with user
> space. This is use case 3. In such a case, the BPF_F_NO_USER_CONV flag is
> recommended. It will tell the verifier to treat the
> rX = bpf_arena_cast_user(rY) instruction as a 32-bit move wX = wY, which will
> improve bpf prog performance. Otherwise, bpf_arena_cast_user is translated by
> JIT to conditionally add the upper 32 bits of user vm_start (if the pointer is
> not NULL) to arena pointers before they are stored into memory. This way, user
> space sees them as valid 64-bit pointers.
>
> Diff https://github.com/llvm/llvm-project/pull/79902 taught LLVM BPF backend to
> generate the bpf_cast_kern() instruction before dereference of the arena
> pointer and the bpf_cast_user() instruction when the arena pointer is formed.
> In a typical bpf program there will be very few bpf_cast_user().
>
> From LLVM's point of view, arena pointers are tagged as
> __attribute__((address_space(1))). Hence, clang provides helpful diagnostics
> when pointers cross address space. Libbpf and the kernel support only
> address_space == 1. All other address space identifiers are reserved.
>
> rX = bpf_cast_kern(rY, addr_space) tells the verifier that
> rX->type = PTR_TO_ARENA. Any further operations on PTR_TO_ARENA register have
> to be in the 32-bit domain. The verifier will mark load/store through
> PTR_TO_ARENA with PROBE_MEM32. JIT will generate them as
> kern_vm_start + 32bit_addr memory accesses. The behavior is similar to
> copy_from_kernel_nofault() except that no address checks are necessary. The
> address is guaranteed to be in the 4GB range. If the page is not present, the
> destination register is zeroed on read, and the operation is ignored on write.
>
> rX = bpf_cast_user(rY, addr_space) tells the verifier that
> rX->type = unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then
> the verifier converts cast_user to mov32.
> Otherwise, JIT will emit native code equivalent to:
>   rX = (u32)rY;
>   if (rY)
>     rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
>
> After such conversion, the pointer becomes a valid user pointer within
> bpf_arena range. The user process can access data structures created in
> bpf_arena without any additional computations. For example, a linked list built
> by a bpf program can be walked natively by user space.
>
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> ---

A few questions on the patch.

1. Is the expectation that user space would use syscall progs to
manipulate mappings in the arena?

2. I may have missed it, but which memcg are the allocations accounted
against? Will it be the process that created the map? When exploring
bpf_map_alloc_pages, I could not figure out whether the obj_cgroup was
being looked up anywhere. I think it would be nice if the allocations
were accounted against the caller of bpf_map_alloc_pages, since the
arena map can potentially be shared across multiple processes. Tying
the accounting to the bpf_map at bpf_map_alloc time may be too coarse
for arena maps.

3. A bit tangential, but what would the path to huge page mappings
look like (mostly from an interface standpoint)? I gather we could use
the flags argument on the kernel side, and if 1 above is true, user
space would request it from inside a syscall program and then trigger
a page fault? I ask because experience with use case 1 in the commit
log suggests it is desirable to have such memory backed by huge pages
to reduce TLB misses.

> [...]
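
For reference while discussing the above, the address arithmetic described
in the quoted commit log (the cast_user conversion and the
kern_vm_start + 32bit_addr access) can be sketched in plain user-space C.
This is only an illustration under assumptions: the struct and the *_sketch
function names are hypothetical and are not kernel code; only
clear_lo32_bits, user_vm_start and kern_vm_start are names taken from the
commit log.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two arena bases named in the commit log. */
struct arena_sketch {
	uint64_t user_vm_start; /* start of the user-space mapping */
	uint64_t kern_vm_start; /* start of the kernel vm_area */
};

/* clear_lo32_bits() as used in the commit log: keep only the upper 32 bits. */
static uint64_t clear_lo32_bits(uint64_t v)
{
	return v & ~(uint64_t)0xffffffff;
}

/*
 * Sketch of rX = bpf_cast_user(rY) without BPF_F_NO_USER_CONV:
 * truncate to 32 bits, then, if the pointer is not NULL, replace the
 * upper 32 bits with those of user_vm_start.
 */
static uint64_t cast_user_sketch(const struct arena_sketch *a, uint64_t ry)
{
	uint64_t rx = (uint32_t)ry;

	if (ry)
		rx |= clear_lo32_bits(a->user_vm_start);
	return rx;
}

/*
 * Sketch of the address formed for a PROBE_MEM32 load/store:
 * kern_vm_start plus the 32-bit arena offset.
 */
static uint64_t kern_addr_sketch(const struct arena_sketch *a, uint64_t ry)
{
	return a->kern_vm_start + (uint32_t)ry;
}
```

With BPF_F_NO_USER_CONV set, the commit log says the cast degenerates to the
32-bit move alone, i.e. only the first line of cast_user_sketch() would apply.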