On Mon, Mar 11, 2024 at 3:01 PM Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Thu, Mar 7, 2024 at 5:08 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > Introduce bpf_arena, which is a sparse shared memory region between the bpf
> > program and user space.
> >
> > Use cases:
> > 1. User space mmap-s bpf_arena and uses it as a traditional mmap-ed
> > anonymous region, like memcached or any key/value storage. The bpf
> > program implements an in-kernel accelerator. XDP prog can search for
> > a key in bpf_arena and return a value without going to user space.
> > 2. The bpf program builds arbitrary data structures in bpf_arena (hash
> > tables, rb-trees, sparse arrays), while user space consumes it.
> > 3. bpf_arena is a "heap" of memory from the bpf program's point of view.
> > The user space may mmap it, but bpf program will not convert pointers
> > to user base at run-time to improve bpf program speed.
> >
> > Initially, the kernel vm_area and user vma are not populated. User space
> > can fault in pages within the range. While servicing a page fault,
> > bpf_arena logic will insert a new page into the kernel and user vmas. The
> > bpf program can allocate pages from that region via
> > bpf_arena_alloc_pages(). This kernel function will insert pages into the
> > kernel vm_area. The subsequent fault-in from user space will populate that
> > page into the user vma. The BPF_F_SEGV_ON_FAULT flag at arena creation time
> > can be used to prevent fault-in from user space. In such a case, if a page
> > is not allocated by the bpf program and not present in the kernel vm_area,
> > the user process will segfault. This is useful for use cases 2 and 3 above.
> >
> > bpf_arena_alloc_pages() is similar to user space mmap(). It allocates pages
> > either at a specific address within the arena or allocates a range with the
> > maple tree. bpf_arena_free_pages() is analogous to munmap(), which frees
> > pages and removes the range from the kernel vm_area and from user process
> > vmas.
> >
> > bpf_arena can be used as a bpf program "heap" of up to 4GB. The speed of
> > bpf program is more important than ease of sharing with user space. This is
> > use case 3. In such a case, the BPF_F_NO_USER_CONV flag is recommended.
> > It will tell the verifier to treat the rX = bpf_arena_cast_user(rY)
> > instruction as a 32-bit move wX = wY, which will improve bpf prog
> > performance. Otherwise, bpf_arena_cast_user is translated by JIT to
> > conditionally add the upper 32 bits of user vm_start (if the pointer is not
> > NULL) to arena pointers before they are stored into memory. This way, user
> > space sees them as valid 64-bit pointers.
> >
> > Diff https://github.com/llvm/llvm-project/pull/84410 enables the LLVM BPF
> > backend to generate the bpf_addr_space_cast() instruction to cast pointers
> > between address_space(1), which is reserved for bpf_arena pointers, and the
> > default address space zero. All arena pointers in a bpf program written in
> > C language are tagged as __attribute__((address_space(1))). Hence, clang
> > provides helpful diagnostics when pointers cross address spaces. Libbpf and
> > the kernel support only address_space == 1. All other address space
> > identifiers are reserved.
> >
> > rX = bpf_addr_space_cast(rY, /* dst_as */ 1, /* src_as */ 0) tells the
> > verifier that rX->type = PTR_TO_ARENA. Any further operations on
> > PTR_TO_ARENA register have to be in the 32-bit domain. The verifier will
> > mark load/store through PTR_TO_ARENA with PROBE_MEM32. JIT will generate
> > them as kern_vm_start + 32bit_addr memory accesses. The behavior is similar
> > to copy_from_kernel_nofault() except that no address checks are necessary.
> > The address is guaranteed to be in the 4GB range. If the page is not
> > present, the destination register is zeroed on read, and the operation is
> > ignored on write.
> >
> > rX = bpf_addr_space_cast(rY, 0, 1) tells the verifier that rX->type =
> > unknown scalar. If arena->map_flags has BPF_F_NO_USER_CONV set, then the
> > verifier converts such cast instructions to mov32. Otherwise, JIT will emit
> > native code equivalent to:
> >   rX = (u32)rY;
> >   if (rY)
> >     rX |= clear_lo32_bits(arena->user_vm_start); /* replace hi32 bits in rX */
> >
> > After such conversion, the pointer becomes a valid user pointer within
> > bpf_arena range. The user process can access data structures created in
> > bpf_arena without any additional computations. For example, a linked list
> > built by a bpf program can be walked natively by user space.
> >
> > Reviewed-by: Barret Rhoden <brho@xxxxxxxxxx>
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/bpf.h            |   7 +-
> >  include/linux/bpf_types.h      |   1 +
> >  include/uapi/linux/bpf.h       |  10 +
> >  kernel/bpf/Makefile            |   3 +
> >  kernel/bpf/arena.c             | 558 +++++++++++++++++++++++++++++++++
> >  kernel/bpf/core.c              |  11 +
> >  kernel/bpf/syscall.c           |  36 +++
> >  kernel/bpf/verifier.c          |   1 +
> >  tools/include/uapi/linux/bpf.h |  10 +
> >  9 files changed, 635 insertions(+), 2 deletions(-)
> >  create mode 100644 kernel/bpf/arena.c
> >
>
> [...]
>
> >
> >  struct bpf_offload_dev;
> > @@ -2215,6 +2216,8 @@ int generic_map_delete_batch(struct bpf_map *map,
> >  struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
> >  struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
> >
> > +int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
>
> nit: you use more meaningful node_id in arena_alloc_pages(), here
> "nid" was a big mystery when looking at just function definition

nid is a standard name in mm/.

git grep -w nid mm/|wc -l
1064
git grep -w node_id mm/|wc -l
77

Also see slab_nid(), folio_nid().

> > +                       unsigned long nr_pages, struct page **page_array);
> >  #ifdef CONFIG_MEMCG_KMEM
> >  void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
> >                             int node);
>
> [...]
>
> > +u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
> > +{
> > +       return arena ? (u64) (long) arena->kern_vm->addr + GUARD_SZ / 2 : 0;
> > +}
> > +
> > +u64 bpf_arena_get_user_vm_start(struct bpf_arena *arena)
> > +{
> > +       return arena ? arena->user_vm_start : 0;
> > +}
> > +
>
> is it anticipated that these helpers can be called with NULL? I might
> see this later in the patch set, but if not, these NULL checks would
> be best removed to not create wrong expectations.

Looks like you figured it out.
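To make the address_space(1) convention from the commit log concrete for
anyone reading along, here is a minimal BPF-side sketch. It is illustrative
only: the __arena macro, the map definition and the section/program names
are shorthand for this sketch, not mandated by this patch; the
address_space(1) tag is the only part clang and libbpf key off.

/* Minimal sketch, not part of this patch. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define __arena __attribute__((address_space(1)))

struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 1000);      /* arena size in pages (assumed semantics) */
} arena SEC(".maps");

/* The pointee lives in the arena; the pointer variable itself is ordinary
 * global storage. Loads/stores through it become PTR_TO_ARENA accesses
 * (PROBE_MEM32 under the hood), and clang complains if it is mixed with
 * address-space-0 pointers without an explicit cast.
 */
int __arena *shared_counter;

SEC("syscall")
int bump(void *ctx)
{
        if (shared_counter)
                (*shared_counter)++;
        return 0;
}

char _license[] SEC("license") = "GPL";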
> > +static long arena_map_peek_elem(struct bpf_map *map, void *value)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static long arena_map_push_elem(struct bpf_map *map, void *value, u64 flags)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static long arena_map_pop_elem(struct bpf_map *map, void *value)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static long arena_map_delete_elem(struct bpf_map *map, void *value)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
> > +static int arena_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> > +{
> > +       return -EOPNOTSUPP;
> > +}
> > +
>
> This is a separate topic, but I'll just mention it here. It was always
> confusing to me why we don't just treat all these callbacks as
> optional and return -EOPNOTSUPP in generic map code. Unless I miss
> something subtle, we should do a round of clean ups and remove dozens
> of unnecessary single line callbacks like these throughout the entire
> BPF kernel code.

Yeah, could be a generic follow-up. Agreed.

> > +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> > +{
> > +       return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> > +}
> > +
>
> [...]
>
> > +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> > +{
> > +       struct bpf_map *map = vmf->vma->vm_file->private_data;
> > +       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> > +       struct page *page;
> > +       long kbase, kaddr;
> > +       int ret;
> > +
> > +       kbase = bpf_arena_get_kern_vm_start(arena);
> > +       kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> > +
> > +       guard(mutex)(&arena->lock);
> > +       page = vmalloc_to_page((void *)kaddr);
> > +       if (page)
> > +               /* already have a page vmap-ed */
> > +               goto out;
> > +
> > +       if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> > +               /* User space requested to segfault when page is not allocated by bpf prog */
> > +               return VM_FAULT_SIGSEGV;
> > +
> > +       ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> > +       if (ret)
> > +               return VM_FAULT_SIGSEGV;
> > +
> > +       /* Account into memcg of the process that created bpf_arena */
> > +       ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);
>
> any specific reason to not take into account map->numa_node here?

There are several reasons:
1. To avoid creating the expectation that map->numa_node is meaningful for
actual run-time allocations. arena_alloc_pages() takes an explicit nid, and
it would conflict if map->numa_node was also used.
2. In this case it's user space faulting in a page, so it's best to let
NUMA_NO_NODE pick the right one.

> > +       if (ret) {
> > +               mtree_erase(&arena->mt, vmf->pgoff);
> > +               return VM_FAULT_SIGSEGV;
> > +       }
> > +
> > +       ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
> > +       if (ret) {
> > +               mtree_erase(&arena->mt, vmf->pgoff);
> > +               __free_page(page);
> > +               return VM_FAULT_SIGSEGV;
> > +       }
> > +out:
> > +       page_ref_add(page, 1);
> > +       vmf->page = page;
> > +       return 0;
> > +}
> > +
>
> [...]
>
> > +/*
> > + * Allocate pages and vmap them into kernel vmalloc area.
> > + * Later the pages will be mmaped into user space vma.
> > + */
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
> > +{
> > +       /* user_vm_end/start are fixed before bpf prog runs */
> > +       long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> > +       u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> > +       struct page **pages;
> > +       long pgoff = 0;
> > +       u32 uaddr32;
> > +       int ret, i;
> > +
> > +       if (page_cnt > page_cnt_max)
> > +               return 0;
> > +
> > +       if (uaddr) {
> > +               if (uaddr & ~PAGE_MASK)
> > +                       return 0;
> > +               pgoff = compute_pgoff(arena, uaddr);
> > +               if (pgoff + page_cnt > page_cnt_max)
>
> As I mentioned offline, is this guaranteed to not overflow? it's not
> obvious because at least according to all the types (longs), uaddr can
> be arbitrary, so pgoff can be quite large, etc. Might be worthwhile
> rewriting as `pgoff > page_cnt_max - page_cnt` or something, just to
> make it clear in code it has no chance of overflowing.

It cannot overflow. All three variables (page_cnt, pgoff, page_cnt_max) are
within the u32 range. Look at how they're computed.

> > +                       /* requested address will be outside of user VMA */
> > +                       return 0;
> > +       }
> > +
> > +       /* zeroing is needed, since alloc_pages_bulk_array() only fills in non-zero entries */
> > +       pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> > +       if (!pages)
> > +               return 0;
> > +
> > +       guard(mutex)(&arena->lock);
> > +
> > +       if (uaddr)
> > +               ret = mtree_insert_range(&arena->mt, pgoff, pgoff + page_cnt - 1,
> > +                                        MT_ENTRY, GFP_KERNEL);
> > +       else
> > +               ret = mtree_alloc_range(&arena->mt, &pgoff, MT_ENTRY,
> > +                                       page_cnt, 0, page_cnt_max - 1, GFP_KERNEL);
>
> mtree_alloc_range() is lacking documentation, unfortunately, so it's
> not clear to me whether max should be just `page_cnt_max - 1` as you
> have or `page_cnt_max - page_cnt`. There is a "Test a single entry" in
> lib/test_maple_tree.c where min == max and size == 4096 which is
> expected to work, so I have a feeling that the correct max should be
> up to the maximum possible beginning of range, but I might be
> mistaken. Can you please double check?

Not only did I double check this, the selftest in patch 12 covers this
boundary condition :)

>
> > +       if (ret)
> > +               goto out_free_pages;
> > +
> > +       ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
> > +                                 node_id, page_cnt, pages);
> > +       if (ret)
> > +               goto out;
> > +
> > +       uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> > +       /* Earlier checks make sure that uaddr32 + page_cnt * PAGE_SIZE will not overflow 32-bit */
>
> we checked that `uaddr32 + page_cnt * PAGE_SIZE - 1` won't overflow,
> full page_cnt * PAGE_SIZE can actually overflow, so comment is a bit
> imprecise. But it's not really clear why it matters here, tbh.
> kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE can actually update
> upper 32-bits of the kernel-side memory address, is that a problem?

There is a giant thread in the v2 series between myself and Barret that
explains at length why we decided to prohibit overflow within the lower
32 bits. The short tldr: technically we can allow lower 32-bit overflow,
but the code gets very complex.
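Since the fault-in path keeps coming up, here is a rough sketch of the
user-space side that ends up in arena_vm_fault(). It is illustrative only:
mmap-ing the map fd at offset 0, the helper name and the arena size are
assumptions for the sketch, not something this patch alone guarantees.

/* Hedged user-space sketch; not part of this patch. */
#include <stddef.h>
#include <sys/mman.h>
#include <bpf/libbpf.h>

static void *map_arena(struct bpf_map *map, size_t arena_sz)
{
        void *base = mmap(NULL, arena_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED, bpf_map__fd(map), 0);

        if (base == MAP_FAILED)
                return NULL;

        /* Touching a page goes through arena_vm_fault(): a page already
         * vmap-ed by the bpf prog is reused; otherwise a new page is
         * allocated and inserted into the maple tree, unless the arena was
         * created with BPF_F_SEGV_ON_FAULT, in which case this access
         * segfaults instead.
         */
        (void)((volatile char *)base)[0];
        return base;
}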
>
> > +       ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
> > +                               kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
> > +       if (ret) {
> > +               for (i = 0; i < page_cnt; i++)
> > +                       __free_page(pages[i]);
> > +               goto out;
> > +       }
> > +       kvfree(pages);
> > +       return clear_lo32(arena->user_vm_start) + uaddr32;
> > +out:
> > +       mtree_erase(&arena->mt, pgoff);
> > +out_free_pages:
> > +       kvfree(pages);
> > +       return 0;
> > +}
> > +
> > +/*
> > + * If page is present in vmalloc area, unmap it from vmalloc area,
> > + * unmap it from all user space vma-s,
> > + * and free it.
> > + */
> > +static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> > +{
> > +       struct vma_list *vml;
> > +
> > +       list_for_each_entry(vml, &arena->vma_list, head)
> > +               zap_page_range_single(vml->vma, uaddr,
> > +                                     PAGE_SIZE * page_cnt, NULL);
> > +}
> > +
> > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
> > +{
> > +       u64 full_uaddr, uaddr_end;
> > +       long kaddr, pgoff, i;
> > +       struct page *page;
> > +
> > +       /* only aligned lower 32-bit are relevant */
> > +       uaddr = (u32)uaddr;
> > +       uaddr &= PAGE_MASK;
> > +       full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
> > +       uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
> > +       if (full_uaddr >= uaddr_end)
> > +               return;
> > +
> > +       page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
> > +
> > +       guard(mutex)(&arena->lock);
> > +
> > +       pgoff = compute_pgoff(arena, uaddr);
> > +       /* clear range */
> > +       mtree_store_range(&arena->mt, pgoff, pgoff + page_cnt - 1, NULL, GFP_KERNEL);
> > +
> > +       if (page_cnt > 1)
> > +               /* bulk zap if multiple pages being freed */
> > +               zap_pages(arena, full_uaddr, page_cnt);
> > +
> > +       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> > +       for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> > +               page = vmalloc_to_page((void *)kaddr);
> > +               if (!page)
> > +                       continue;
> > +               if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
> > +                       zap_pages(arena, full_uaddr, 1);
>
> The way you split these zap_pages for page_cnt == 1 and page_cnt > 1
> is quite confusing. Why can't you just unconditionally zap_pages()
> regardless of page_cnt before this loop? And why for page_cnt == 1 we
> have `page_mapped(page)` check, but it's ok to not check this for
> page_cnt>1 case?
>
> This asymmetric handling is confusing and suggests something more is
> going on here. Or am I overthinking it?

It's an important optimization for the common case of page_cnt == 1. If the
page wasn't mapped into some user vma, there is no need to call zap_pages(),
which is slow. But when page_cnt is big, it's much faster to do the batched
zap, which is what this code does.

For page_cnt == 2 or another small number there is no good optimization to
do other than checking whether all pages in the range are not page_mapped()
and omitting zap_pages(). I don't think it's worth doing such an optimization
at this point, since page_cnt == 1 is likely the most common case. If that
changes, it can be optimized later.

> > +               vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
> > +               __free_page(page);
>
> can something else in the kernel somehow get a refcnt on this page?
> I.e., is it ok to unconditionally free page here instead of some sort
> of put_page() instead?

Hmm, __free_page() does the refcnt; it's not an unconditional free.
put_page() is for the case when a folio is present. Here all of them are
privately managed pages: single-page individual allocations, so the normal
refcnt scheme applies, which is exactly what __free_page() does.

Thanks for the review. I believe I answered all the questions. Looks like
the only follow-up is to generalize all the 'return -EOPNOTSUPP' callbacks
across all map types. I'll add it to my todo, unless somebody has more free
cycles.
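P.S. For anyone skimming the thread, a hedged sketch of how a bpf program
would drive the alloc/free paths discussed above. The kfunc prototypes and
the __arena/arena names are assumptions written down for illustration; the
selftests in this series are the authoritative usage.

/* Sketch only; the prototypes below are assumed, not copied from the final headers. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define __arena __attribute__((address_space(1)))

struct {
        __uint(type, BPF_MAP_TYPE_ARENA);
        __uint(map_flags, BPF_F_MMAPABLE);
        __uint(max_entries, 100);       /* arena size in pages (assumed semantics) */
} arena SEC(".maps");

/* Assumed kfunc signatures for this sketch. */
void __arena *bpf_arena_alloc_pages(void *map, void __arena *addr, __u32 page_cnt,
                                    int node_id, __u64 flags) __ksym;
void bpf_arena_free_pages(void *map, void __arena *ptr, __u32 page_cnt) __ksym;

SEC("syscall")
int alloc_free_demo(void *ctx)
{
        /* NULL addr: let the maple tree pick a free 2-page range; -1 == NUMA_NO_NODE */
        void __arena *p = bpf_arena_alloc_pages(&arena, NULL, 2, -1, 0);

        if (!p)
                return 1;
        /* ... build data structures in the arena here ... */
        bpf_arena_free_pages(&arena, p, 2);
        return 0;
}

char _license[] SEC("license") = "GPL";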