On Wed, Feb 21, 2024 at 11:05 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Feb 20, 2024 at 9:52 PM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Feb 20, 2024 at 11:26:13AM -0800, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> > >
> > > The vmap() API is used to map a set of pages into contiguous kernel virtual space.
> > >
> > > BPF would like to extend the vmap API to implement a lazily-populated
> > > contiguous kernel virtual space whose size and start address are fixed early.
> > >
> > > The vmap API has functions to request and release areas of kernel address space:
> > > get_vm_area() and free_vm_area().
> >
> > As said before I really hate growing more get_vm_area and
> > free_vm_area users outside the core vmalloc code.  We have a few of those,
> > mostly due to ioremap (which is being consolidated) and executable code
> > allocation (for which there have been various attempts at consolidation,
> > and hopefully one finally succeeds..).  So let's take a step back and
> > think how we can do this without it.
>
> There are also the xen grant tables, which grab a range with get_vm_area()
> but manage it on their own. That is not an ioremap case.
> It looks to me like the vmalloc address range already holds different
> kinds of areas: vmalloc, vmap, ioremap, xen.
>
> Maybe we can do:
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 7d112cc5f2a3..633c7b643daa 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -28,6 +28,7 @@ struct iov_iter;	/* in uio.h */
>  #define VM_MAP_PUT_PAGES	0x00000200	/* put pages and free array in vfree */
>  #define VM_ALLOW_HUGE_VMAP	0x00000400	/* Allow for huge pages on archs with HAVE_ARCH_HUGE_VMALLOC */
> +#define VM_BPF			0x00000800	/* bpf_arena pages */
>
> +static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
> +{
> +	return get_vm_area(size, VM_BPF);
> +}
>
> and enforce that flag in vm_area_[un]map_pages() ?
>
> vmallocinfo can display it or skip it.
> Things like find_vm_area() can do something different with such an area
> (if that was the concern).
>
> > For the dynamically growing part do you need a special allocator or
> > can we just go straight to the page allocator and implement this
> > in common code?
>
> It's a somewhat special allocator: it uses a maple tree to manage the
> range within the 4G region and
> alloc_pages_node(GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT)
> to grab pages, with the extra dance:
> memcg = bpf_map_get_memcg(map);
> old_memcg = set_active_memcg(memcg);
> to make sure memcg accounting is done the common way for all bpf maps.
>
> The tricky bpf-specific part is the computation of pgoff,
> since this is a memory region shared between user space and the bpf prog.
> The lower 32 bits of the pointer have to be the same for user space and bpf.
>
> Not much changed in the patch since the earlier thread.
> Either find it in your email or here:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=arena&id=364c9b5d233d775728ec2bf3b4168fa6909e58d1
>
> Are you suggesting an API like:
>
> struct vm_struct *area = get_sparse_vm_area(size);
> vm_area_alloc_pages(struct vm_struct *area, ulong addr, int page_cnt, int numa_id);
>
> where vm_area_alloc_pages() would allocate the pages and vmap_pages_range()
> them, with all of the code in mm/vmalloc.c ?
>
> I can give it a shot.
>
> The ugly part is that bpf_map_get_memcg() would need to be passed in somehow.
>
> Another bpf-specific bit is the guard pages before and after the 4G range;
> such a vm_area_alloc_pages() would need to skip them.

I've looked at this approach more.
The somewhat generic-ish API for mm/vmalloc.c may look like:

struct vm_sparse_struct *area;

area = get_sparse_vm_area(vm_area_size, guard_size, pgoff_offset,
                          max_pages, memcg, ...);

vm_area_size is what get_vm_area() will reserve out of the kernel vmalloc
region. For the bpf_arena case it will be 4GB+64K.

guard_size is the size of the guard area: 64K for bpf_arena.

pgoff_offset is the offset, after the guard area, at which pages may start
being allocated. For any normal vma, pgoff==0 is the first page after
vma->vm_start. bpf_arena is a sparse region shared between bpf and user
space, and it needs to keep the lower 32 bits of the address that user
space received from mmap(), so that the first allocated page with pgoff=0
is the first page of the _user_ vma->vm_start. Hence, for the kernel
vmalloc range, the page allocator needs that pgoff_offset.

max_pages is easy. It's the maximum number of pages that this
sparse_vm_area is allowed to allocate, and it's also driven by user space.
When the user does mmap(NULL, bpf_arena_size, ..., bpf_arena_map_fd) it
gets an address; that address determines pgoff_offset, and arena_size
determines max_pages. That arena_size can be 1 page or 1000 pages, always
less than 4GB, but vm_area_size will be 4GB+64K regardless.

vm_area_alloc_pages(struct vm_sparse_struct *area, ulong addr,
                    int page_cnt, int numa_id);

is semantically similar to user space's mmap(). If addr == 0 the kernel
finds a free range after pgoff_offset, allocates page_cnt pages from
there, and vmaps them into the kernel's vm_sparse_struct area. If addr is
specified, it has to be >= pgoff_offset and page_cnt <= max_pages. All
pages are accounted to the memcg specified at vm_sparse_struct creation
time, and a maple tree tracks all of these range allocations within the
vm_sparse_struct.

So far it looks like the bigger half of kernel/bpf/arena.c would migrate
to mm/vmalloc.c and would be very bpf-specific, so I don't particularly
like this direction. It feels like a burden for both mm and bpf folks.

btw LWN just posted a nice article describing the motivation:
https://lwn.net/Articles/961941/

So far, doing:

+#define VM_BPF	0x00000800	/* bpf_arena pages */

or VM_SPARSE ?

+static inline struct vm_struct *get_bpf_vm_area(unsigned long size)
+{
+	return get_vm_area(size, VM_BPF);
+}

and enforcing that flag where appropriate in mm/vmalloc.c is the easiest
for everyone.

We probably should also add
#define VM_XEN	0x00001000
and use it in the xen use cases to differentiate vmalloc vs vmap vs
ioremap vs bpf vs xen users.

Please share your opinions.
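For concreteness, the flag-based direction would look roughly like the
sketch below. It is untested and purely illustrative: the flag values,
the VM_SPARSE/VM_XEN names, and the vm_area_map_pages() signature and
checks are all up for discussion, not an existing API.

/* include/linux/vmalloc.h -- sketch only */
#define VM_SPARSE		0x00000800	/* sparse area, pages mapped explicitly later */
#define VM_XEN			0x00001000	/* xen grant-table style use of the vmalloc range */

static inline struct vm_struct *get_sparse_vm_area(unsigned long size)
{
	return get_vm_area(size, VM_SPARSE);
}

/* mm/vmalloc.c -- refuse to map pages into anything not created as sparse */
int vm_area_map_pages(struct vm_struct *area, unsigned long start,
		      unsigned long end, struct page **pages)
{
	if (WARN_ON_ONCE(!(area->flags & VM_SPARSE)))
		return -EINVAL;
	if (start < (unsigned long)area->addr ||
	    end > (unsigned long)area->addr + get_vm_area_size(area))
		return -ERANGE;
	return vmap_pages_range(start, end, PAGE_KERNEL, pages, PAGE_SHIFT);
}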
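And since the pgoff_offset / lower-32-bit requirement above is easy to
misread, here it is in code form. The helper names are made up for this
email and the real arena code has more checks; it also assumes the
kernel-side base is 4GB-aligned.

/*
 * user_vm_start is the address user space got back from
 * mmap(NULL, bpf_arena_size, ..., bpf_arena_map_fd).
 * kern_vm_start is the base of the 4GB+guard region reserved on the
 * kernel side (assumed 4GB-aligned here).  The invariant is that a page
 * visible to user space at address U is mapped in the kernel at an
 * address with the same lower 32 bits, so a 32-bit arena pointer means
 * the same thing to the bpf prog and to user space.
 */
static unsigned long arena_kern_addr(unsigned long kern_vm_start,
				     unsigned long user_addr)
{
	return kern_vm_start + (user_addr & 0xffffffffUL);
}

/*
 * Hence the allocator cannot start handing out pages at pgoff 0 of the
 * kernel area; it has to start at:
 */
static unsigned long arena_pgoff_offset(unsigned long user_vm_start)
{
	return (user_vm_start & 0xffffffffUL) >> PAGE_SHIFT;
}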