On Wed, Feb 7, 2024 at 10:40 AM Barret Rhoden <brho@xxxxxxxxxx> wrote:
>
> On 2/6/24 17:04, Alexei Starovoitov wrote:
> > +
> > +static long compute_pgoff(struct bpf_arena *arena, long uaddr)
> > +{
> > +	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
> > +}
> > +
> > +#define MT_ENTRY ((void *)&arena_map_ops) /* unused. has to be valid pointer */
> > +
> > +/*
> > + * Reserve a "zero page", so that bpf prog and user space never see
> > + * a pointer to arena with lower 32 bits being zero.
> > + * bpf_cast_user() promotes it to full 64-bit NULL.
> > + */
> > +static int reserve_zero_page(struct bpf_arena *arena)
> > +{
> > +	long pgoff = compute_pgoff(arena, 0);
> > +
> > +	return mtree_insert(&arena->mt, pgoff, MT_ENTRY, GFP_KERNEL);
> > +}
> > +
>
> this is pretty tricky, and i think i didn't understand it at first.
>
> you're punching a hole in the arena, such that BPF won't allocate it via
> arena_alloc_pages(). thus BPF won't 'produce' an object with a pointer
> ending in 0x00000000.
>
> depending on where userspace mmaps the arena, that hole may or may not
> be the first page in the array. if userspace mmaps it to a 4GB aligned
> virtual address, it'll be page 0. but it could be at some arbitrary
> offset within the 4GB arena.
>
> that arbitrariness makes it harder for a BPF program to do its own
> allocations within the arena. i'm planning on carving up the 4GB arena
> for my own purposes, managed by BPF, with the expectation that i'll be
> able to allocate any 'virtual address' within the arena. but there's a
> magic page that won't be usable.
>
> i can certainly live with this. just mmap userspace to a 4GB aligned
> address + PGSIZE, so that the last page in the arena is page 0. but
> it's a little weird.

Agree. I made the same conclusion while adding global variables
to the arena.
From the compiler point of view all such global vars start at offset
zero and there is no way to just "move them up by a page".
For example, in C code it will look like:

int __arena var1;
int __arena var2;

&var1 == user_vm_start
&var2 == user_vm_start + 4

If __ulong(map_extra, ...) was used, or mmap(addr, MAP_FIXED) was
4GB aligned, the lower 32 bits of the &var1 address will be zero,
and there is not much we can do about it.
We can tell LLVM to emit extra 8-byte padding in the arena section,
but it will be a useless pad if the arena is not aligned to 4GB.

Anyway, in v2 I will remove this reserve_zero_page() logic.
It's causing more harm than good.

> though i think we'll have more serious issues if anyone accidentally
> tries to use that zero page. BPF would get an EEXIST if they try to
> allocate it directly, but then page fault and die if they touched it,
> since there's no page. i can live with that, if we force it to be the
> last page in the arena.
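Right. For reference, forcing it to be the last page from user space
would look roughly like this (untested sketch; error handling and
unmapping the spare reservation are omitted, arena_map_fd/arena_sz are
placeholders, 4k pages assumed):

#include <stdint.h>
#include <sys/mman.h>

static void *mmap_arena_last_page_zero(int arena_map_fd, size_t arena_sz)
{
	/* over-reserve so a 4GB-aligned address exists inside */
	char *base = mmap(NULL, 3ULL << 32, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t aligned = ((uintptr_t)base + (1ULL << 32) - 1) &
			    ~(((uintptr_t)1 << 32) - 1);

	/* start one page above the 4GB boundary, so the page whose
	 * lower 32 bits are zero is the last page of the arena
	 */
	return mmap((void *)(aligned + 4096), arena_sz,
		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
		    arena_map_fd, 0);
}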
>
> however, i think you need to add something to the fault handler (below)
> in case userspace touches that page:
>
> [snip]
> > +static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> > +{
> > +	struct bpf_map *map = vmf->vma->vm_file->private_data;
> > +	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> > +	struct page *page;
> > +	long kbase, kaddr;
> > +	int ret;
> > +
> > +	kbase = bpf_arena_get_kern_vm_start(arena);
> > +	kaddr = kbase + (u32)(vmf->address & PAGE_MASK);
> > +
> > +	guard(mutex)(&arena->lock);
> > +	page = vmalloc_to_page((void *)kaddr);
> > +	if (page)
> > +		/* already have a page vmap-ed */
> > +		goto out;
> > +
> > +	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
> > +		/* User space requested to segfault when page is not allocated by bpf prog */
> > +		return VM_FAULT_SIGSEGV;
> > +
> > +	ret = mtree_insert(&arena->mt, vmf->pgoff, MT_ENTRY, GFP_KERNEL);
> > +	if (ret == -EEXIST)
> > +		return VM_FAULT_RETRY;
>
> say this was the zero page. vmalloc_to_page() failed, so we tried to
> insert. we get EEXIST, since the slot is reserved. we retry, since we
> were expecting the case where "no page, yet slot reserved" meant that
> BPF was in the middle of filling this page.

Yes. Great catch! I hit that too while playing with global vars.

> though i think you can fix this by just treating this as a SIGSEGV
> instead of RETRY.

Agree.

> when i made the original suggestion of making this a
> retry (in an email off list), that was before you had the arena mutex.
> now that you have the mutex, you shouldn't have the scenario where two
> threads are concurrently trying to fill a page. i.e. mtree_insert +
> page_alloc + vmap are all atomic w.r.t. the mutex.

yes. mutex part makes sense.

> > +
> > +	if (!arena->user_vm_start) {
> > +		arena->user_vm_start = vma->vm_start;
> > +		err = reserve_zero_page(arena);
> > +		if (err)
> > +			return err;
> > +	}
> > +	arena->user_vm_end = vma->vm_end;
> > +	/*
> > +	 * bpf_map_mmap() checks that it's being mmaped as VM_SHARED and
> > +	 * clears VM_MAYEXEC. Set VM_DONTEXPAND as well to avoid
> > +	 * potential change of user_vm_start.
> > +	 */
> > +	vm_flags_set(vma, VM_DONTEXPAND);
> > +	vma->vm_ops = &arena_vm_ops;
> > +	return 0;
> > +}
>
> i think this whole function needs to be protected by the mutex, or at
> least all the stuff related to user_vm_{start,end}. if you have two
> threads mmapping the region for the first time, you'll race on the
> values of user_vm_*.

yes. will add a mutex guard.

>
> [snip]
>
> > +/*
> > + * Allocate pages and vmap them into kernel vmalloc area.
> > + * Later the pages will be mmaped into user space vma.
> > + */
> > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
>
> instead of uaddr, can you change this to take an address relative to the
> arena ("arena virtual address"?)? the caller of this is in BPF, and
> they don't easily know the user virtual address. maybe even just pgoff
> directly.

I thought about it, but it doesn't quite make sense.
The bpf prog only sees user addresses. All loads/stores return them.
If it bpf_printk-s an address, it will be a user address.
bpf_arena_alloc_pages() also returns a user address.
Kernel addresses are not seen by the bpf prog at all.
kern_vm_base is completely hidden. Only at JIT time is it added
to pointers.
So passing uaddr to arena_alloc_pages() matches mmap style:

uaddr = bpf_arena_alloc_pages(... uaddr ...)
uaddr = mmap(uaddr, ... MAP_FIXED)

Passing pgoff would be weird.
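From the prog's pov it would look roughly like this (a sketch; the
kfunc takes uaddr/page_cnt/node_id per this version of the patch set,
the map definition and __arena are per the arena patches, 4k pages and
the "syscall" section are illustrative):

struct {
	__uint(type, BPF_MAP_TYPE_ARENA);
	__uint(map_flags, BPF_F_MMAPABLE);
	__uint(max_entries, 100); /* max number of arena pages */
} arena SEC(".maps");

SEC("syscall")
int alloc_example(void *ctx)
{
	void __arena *p, *q;

	/* uaddr == 0: let the kernel pick a spot, like mmap(NULL, ...) */
	p = bpf_arena_alloc_pages(&arena, 0, 2, -1 /* any numa node */);
	if (!p)
		return 1;

	/* fixed uaddr right after it, like mmap(uaddr, ... MAP_FIXED) */
	q = bpf_arena_alloc_pages(&arena, p + 2 * 4096, 1, -1);
	return q ? 0 : 1;
}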
Also note that there is no extra flag for bpf_arena_alloc_pages().
uaddr == full 64 bits of zeros is not a valid addr to use.

> additionally, you won't need to call compute_pgoff(). as it is now, i'm
> not sure what would happen if BPF did an arena_alloc with a uaddr and
> user_vm_start wasn't set yet.

That's impossible. A bpf prog won't load if the arena map was not
created either with a fixed map_extra == user_vm_start, or created
with map_extra == 0 and then mmapped.
Only then can a bpf prog that uses that arena be loaded.

> > +{
> > +	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>
> any time you compute_pgoff() or look at user_vm_{start,end}, maybe
> either hold the mutex, or only do it from mmap faults (where we know
> user_vm_start is already set). o/w there might be subtle races where
> some other thread is mmapping the arena for the first time.

That's unnecessary, since user_vm_start is fixed for the lifetime of
the bpf prog.
But you spotted a bug. I need to set user_vm_end in arena_map_alloc()
when map_extra is specified. Otherwise, after arena creation and
before mmap, the bpf prog will see a different page_cnt_max.
user_vm_start will be the same, of course.

> > +
> > +	kaddr = kern_vm_start + (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
>
> adding user_vm_start here is pretty subtle.
>
> so far i've been thinking that the mtree is the "address space" of the
> arena, in units of pages instead of bytes. and pgoff is an address
> within the arena. so mtree slot 0 is the 0th page of the 4GB region.
> and that "arena address space" is mapped at a kernel virtual address and
> a user virtual address (the same for all processes).
>
> to convert user addresses (uaddr et al.) to arena addresses, we subtract
> user_vm_start, which makes sense. that's what compute_pgoff() does.
>
> i was expecting kaddr = kern_vm_start + pgoff * PGSIZE, essentially
> converting from arena address space to kernel virtual address.
>
> instead, by adding user_vm_start and casting to u32, you're converting
> or shifting arena addresses to *another* arena address (user address,
> truncated to 4GB to keep it in the arena), and that is the one that the
> kernel will use.
>
> is that correct?

Pretty much. The kernel and user have to see the lower 32 bits exactly
the same.
From the user pov allocation starts at pgoff=0, which is the first
page after user_vm_start. That is normal mmap behavior and the
in-kernel vma convention. Hence arena_alloc_pages() does the same:
pgoff=0 is the first page from the user vma pov.
The math to compute the kernel address gets complicated, because two
bases need to be added. The kernel and user bases are different 64-bit
bases and may not be aligned to 4GB (see the loop sketch at the end of
this email).

> my one concern is that there's some subtle wrap-around going on, and due
> to the shifting, kaddr can be very close to the end of the arena and
> page_cnt can be big enough to go outside the 4GB range. we'd want it to

page_cnt cannot go outside of the 4GB range.
page_cnt is the number of pages in
arena->user_vm_end - arena->user_vm_start, and during mmap we check
that it's <= 4GB.

> wrap around. e.g.
>
> user_start_va = 0x1,fffff000
> user_end_va = 0x2,fffff000
> page_cnt_max = 0x100000 or whatever. full 4GB range.
>
> say we want to alloc at pgoff=0 (uaddr 0x1,fffff000), page_cnt = X. you
> can get this pgoff either by doing mtree_insert_range or
> mtree_alloc_range on an arena with no allocations.
>
> kaddr = kern_vm_start + 0xfffff000
>
> the size of the vm area is 4GB + guard stuff, and we're right up against
> the end of it.
>
> if page_cnt > the guard size, vmap_pages_range() would be called on
> something outside the vm area we reserved, which seems bad. and even if
> it wasn't, what we want is for later page maps to start at the beginning
> of kern_vm_start.
>
> the fix might be to just only map a page at a time - maybe a loop. or
> detect when we're close to the edge and break it into two vmaps. i feel
> like the loop would be easier to understand, but maybe less efficient.

Oops. You're correct. Great catch.
In earlier versions I had it as a loop, but then I decided that doing
all mapping ops one page at a time is not efficient. Oh well. Will fix.
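The fix will probably look something like this (untested sketch: vmap
one page per iteration, so the u32 math wraps the kernel offset back
into the 4GB region instead of running past its end; unwinding on
error is only hinted at):

	for (i = 0; i < page_cnt; i++, pgoff++) {
		/* the u32 cast keeps the offset inside the 4GB region;
		 * when user_vm_start + pgoff * PAGE_SIZE crosses a 4GB
		 * boundary it wraps to the start of the kernel vm area
		 */
		u32 uoff = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
		long kaddr = kern_vm_start + uoff;

		ret = vmap_pages_range(kaddr, kaddr + PAGE_SIZE, PAGE_KERNEL,
				       &pages[i], PAGE_SHIFT);
		if (ret)
			goto undo; /* unmap/free previously mapped pages */
	}

Thanks a lot for the review!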