On Mon, Mar 4, 2024 at 10:05 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> From: Alexei Starovoitov <ast@xxxxxxxxxx>
>
> vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
> virtual space.
>
> get_vm_area() with appropriate flag is used to request an area of kernel
> address range. It's used for vmalloc, vmap, ioremap, xen use cases.
> - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag.
> - the areas created by vmap() function should be tagged with VM_MAP.
> - ioremap areas are tagged with VM_IOREMAP.
>
> BPF would like to extend the vmap API to implement a lazily-populated
> sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag
> and vm_area_map_pages(area, start_addr, count, pages) API to map a set
> of pages within a given area.
> It has the same sanity checks as vmap() does.
> It also checks that get_vm_area() was created with VM_SPARSE flag
> which identifies such areas in /proc/vmallocinfo
> and returns zero pages on read through /proc/kcore.
>
> The next commits will introduce bpf_arena which is a sparsely populated
> shared memory region between bpf program and user space process.
> It will map privately-managed pages into a sparse vm area with the
> following steps:
>
>   // request virtual memory region during bpf prog verification
>   area = get_vm_area(area_size, VM_SPARSE);
>
>   // on demand
>   vm_area_map_pages(area, kaddr, kend, pages);
>   vm_area_unmap_pages(area, kaddr, kend);
>
>   // after bpf program is detached and unloaded
>   free_vm_area(area);
>
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> ---
>  include/linux/vmalloc.h |  5 ++++
>  mm/vmalloc.c            | 59 +++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 62 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..0f72c85a377b 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -35,6 +35,7 @@ struct iov_iter;	/* in uio.h */
>  #else
>  #define VM_DEFER_KMEMLEAK	0
>  #endif
> +#define VM_SPARSE		0x00001000	/* sparse vm_area. not all pages are present. */
>
>  /* bits [20..32] reserved for arch specific ioremap internals */
>
> @@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
>  }
>
>  #ifdef CONFIG_MMU
> +int vm_area_map_pages(struct vm_struct *area, unsigned long start,
> +		      unsigned long end, struct page **pages);
> +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
> +			 unsigned long end);
>  void vunmap_range(unsigned long addr, unsigned long end);
>  static inline void set_vm_flush_reset_perms(void *addr)
>  {
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f42f98a127d5..e5b8c70950bc 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
>  	return err;
>  }
>
> +static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
> +				unsigned long end)
> +{
> +	might_sleep();

This interface, and VM_SPARSE in general, would be useful for
dynamically grown kernel stacks [1]. However, the might_sleep() here
would be a problem.
We would need to be able to call vm_area_map_pages() from
interrupt-disabled context, so it must not sleep. The caller would need
to guarantee that the page tables are pre-allocated before the mapping.

Pasha

[1] https://lore.kernel.org/all/CA+CK2bBYt9RAVqASB2eLyRQxYT5aiL0fGhUu3TumQCyJCNTWvw@xxxxxxxxxxxxxx
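To make the constraint concrete, here is a rough sketch (not part of the
patch) of how a caller could split the work into a sleeping setup phase
and an atomic mapping phase. apply_to_page_range() and pte_fn_t are
existing kernel API; prealloc_page_tables(), noop_pte_fn() and the
*_atomic() variant mentioned in the comment are hypothetical names:

	/* Phase 1: at area-creation time, sleeping allowed.
	 * apply_to_page_range() allocates any missing p4d/pud/pmd/pte
	 * levels as it walks the range; a no-op callback leaves the leaf
	 * entries empty but the intermediate tables in place.
	 */
	static int noop_pte_fn(pte_t *pte, unsigned long addr, void *data)
	{
		return 0;	/* populate tables only, set no PTEs */
	}

	static int prealloc_page_tables(unsigned long start, unsigned long end)
	{
		might_sleep();
		return apply_to_page_range(&init_mm, start, end - start,
					   noop_pte_fn, NULL);
	}

	/* Phase 2: on the fault path, possibly with IRQs disabled.
	 * Only leaf PTEs are written into the pre-allocated tables, so
	 * no allocation and no sleeping is needed. vm_area_map_pages()
	 * as posted cannot be used here because of its might_sleep();
	 * a hypothetical vm_area_map_pages_atomic() variant would have
	 * to drop that check and rely on phase 1 having run.
	 */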