On Thu, Feb 22, 2024 at 3:25 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > I can give it a shot.
> >
> > The ugly part is bpf_map_get_memcg() would need to be passed in somehow.
> >
> > Another bpf specific bit is the guard pages before and after 4G range
> > and such vm_area_alloc_pages() would need to skip them.
>
> I've looked at this approach more.
> The somewhat generic-ish api for mm/vmalloc.c may look like:
> struct vm_sparse_struct *area;
>
> area = get_sparse_vm_area(vm_area_size, guard_size,
>                           pgoff_offset, max_pages, memcg, ...);
>
> vm_area_size is what get_vm_area() will reserve out of the kernel
> vmalloc region. For bpf_arena case it will be 4gb+64k.
> guard_size is the size of the guard area. 64k for bpf_arena.
> pgoff_offset is the offset where pages would need to start allocating
> after the guard area.
> For any normal vma the pgoff==0 is the first page after vma->vm_start.
> bpf_arena is bpf/user shared sparse region and it needs to keep lower 32-bit
> from the address that user space received from mmap().
> So that the first allocated page with pgoff=0 will be the first
> page for _user_ vma->vm_start.
> Hence for kernel vmalloc range the page allocator needs that
> pgoff_offset.
> max_pages is easy. It's the max number of pages that
> this sparse_vm_area is allowed to allocate.
> It's also driven by user space. When user does
> mmap(NULL, bpf_arena_size, ..., bpf_arena_map_fd)
> it gets an address and that address determines pgoff_offset
> and arena_size determines the max_pages.
> That arena_size can be 1 page or 1000 pages. Always less than 4Gb.
> But vm_area_size will be 4gb+64k regardless.
>
> vm_area_alloc_pages(struct vm_sparse_struct *area, ulong addr,
>                     int page_cnt, int numa_id);
> is semantically similar to user's mmap().
> If addr == 0 the kernel will find a free range after pgoff_offset
> and will allocate page_cnt pages from there and vmap to
> kernel's vm_sparse_struct area.
> If addr is specified it would have to be >= pgoff_offset
> and page_cnt <= max_pages.
> All pages are accounted into memcg specified at vm_sparse_struct
> creation time.
> And it will use maple tree to track all these range allocation
> within vm_sparse_struct.
>
> So far it looks like the bigger half of kernel/bpf/arena.c
> will migrate to mm/vmalloc.c and will be very bpf specific.
>
> So I don't particularly like this direction. Feels like a burden
> for mm and bpf folks.
>
> btw LWN just posted a nice article describing the motivation
> https://lwn.net/Articles/961941/
>
> So far doing:
>
> +#define VM_BPF 0x00000800 /* bpf_arena pages */
> or VM_SPARSE ?
>
> and enforcing that flag where appropriate in mm/vmalloc.c
> is the easiest for everyone.
> We probably should add
> #define VM_XEN 0x00001000
> and use it in xen use cases to differentiate
> vmalloc vs vmap vs ioremap vs bpf vs xen users.

Here is what I had in mind:
https://lore.kernel.org/bpf/20240223235728.13981-1-alexei.starovoitov@xxxxxxxxx/
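
For readers following the quoted proposal, here is a rough userspace C sketch of how the get_sparse_vm_area() parameters might be derived from a user mmap() and how the stated vm_area_alloc_pages() constraints could be checked. It is only an illustration under assumptions, not kernel code and not what the linked patch does: every sketch_* name is hypothetical, 4 KiB pages are assumed, pgoff_offset is assumed to be the lower 32 bits of the user address expressed in pages, and addr is treated as a page offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; none of this is an existing kernel API. */
#define SKETCH_PAGE_SHIFT   12                  /* assumption: 4 KiB pages */
#define SKETCH_GUARD_SIZE   (64ULL << 10)       /* 64k guard, per the proposal */
#define SKETCH_VM_AREA_SIZE ((1ULL << 32) + SKETCH_GUARD_SIZE) /* 4gb+64k */

struct sketch_sparse_params {
	uint64_t vm_area_size;  /* bytes reserved in the vmalloc region */
	uint64_t guard_size;    /* bytes of guard area */
	uint64_t pgoff_offset;  /* pages; assumed: lower 32 bits of user address */
	uint64_t max_pages;     /* pages the area may allocate */
};

/* Derive the parameters from the address and size of the user's mmap(). */
static struct sketch_sparse_params
sketch_params(uint64_t user_vm_start, uint64_t arena_size)
{
	struct sketch_sparse_params p;

	/* The proposal states the arena is always smaller than 4Gb. */
	assert(arena_size > 0 && arena_size < (1ULL << 32));

	p.vm_area_size = SKETCH_VM_AREA_SIZE;   /* fixed, regardless of arena_size */
	p.guard_size = SKETCH_GUARD_SIZE;
	p.pgoff_offset = (user_vm_start & 0xffffffffULL) >> SKETCH_PAGE_SHIFT;
	p.max_pages = arena_size >> SKETCH_PAGE_SHIFT;
	return p;
}

/* Check the two constraints quoted for vm_area_alloc_pages(). */
static bool sketch_request_ok(const struct sketch_sparse_params *p,
			      uint64_t addr, uint64_t page_cnt)
{
	if (page_cnt == 0 || page_cnt > p->max_pages)
		return false;
	/* addr == 0 means the allocator picks a free range itself. */
	return addr == 0 || addr >= p->pgoff_offset;
}
```

With a user address of 0x7f1234560000 and a 1000-page arena, this yields pgoff_offset 0x34560 and max_pages 1000, while vm_area_size stays at 4gb+64k. The maple tree tracking and memcg accounting from the proposal are deliberately left out.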