Re: [PATCH v4 bpf-next 2/2] mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().

On Wed, Mar 6, 2024 at 1:04 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
>
> On Mon, Mar 4, 2024 at 10:05 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > From: Alexei Starovoitov <ast@xxxxxxxxxx>
> >
> > vmap/vmalloc APIs are used to map a set of pages into contiguous kernel
> > virtual space.
> >
> > get_vm_area() with an appropriate flag is used to request an area of the
> > kernel address range. It's used for the vmalloc, vmap, ioremap, and xen use cases.
> > - the vmalloc use case dominates the usage. Such vm areas have the VM_ALLOC flag.
> > - the areas created by the vmap() function should be tagged with VM_MAP.
> > - ioremap areas are tagged with VM_IOREMAP.
> >
> > BPF would like to extend the vmap API to implement a lazily-populated,
> > sparse, yet contiguous kernel virtual space. Introduce a VM_SPARSE flag
> > and a vm_area_map_pages(area, start_addr, end_addr, pages) API to map a set
> > of pages within a given area.
> > It performs the same sanity checks as vmap() does.
> > It also checks that the area was created by get_vm_area() with the VM_SPARSE
> > flag, which identifies such areas in /proc/vmallocinfo
> > and makes them read back as zero pages through /proc/kcore.
> >
> > The next commits will introduce bpf_arena, which is a sparsely populated
> > shared memory region between a bpf program and a user space process. It will
> > map privately-managed pages into a sparse vm area with the following steps:
> >
> >   // request virtual memory region during bpf prog verification
> >   area = get_vm_area(area_size, VM_SPARSE);
> >
> >   // on demand
> >   vm_area_map_pages(area, kaddr, kend, pages);
> >   vm_area_unmap_pages(area, kaddr, kend);
> >
> >   // after bpf program is detached and unloaded
> >   free_vm_area(area);
> >
> > Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> > ---
> >  include/linux/vmalloc.h |  5 ++++
> >  mm/vmalloc.c            | 59 +++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 62 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index c720be70c8dd..0f72c85a377b 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -35,6 +35,7 @@ struct iov_iter;              /* in uio.h */
> >  #else
> >  #define VM_DEFER_KMEMLEAK      0
> >  #endif
> > +#define VM_SPARSE              0x00001000      /* sparse vm_area. not all pages are present. */
> >
> >  /* bits [20..32] reserved for arch specific ioremap internals */
> >
> > @@ -232,6 +233,10 @@ static inline bool is_vm_area_hugepages(const void *addr)
> >  }
> >
> >  #ifdef CONFIG_MMU
> > +int vm_area_map_pages(struct vm_struct *area, unsigned long start,
> > +                     unsigned long end, struct page **pages);
> > +void vm_area_unmap_pages(struct vm_struct *area, unsigned long start,
> > +                        unsigned long end);
> >  void vunmap_range(unsigned long addr, unsigned long end);
> >  static inline void set_vm_flush_reset_perms(void *addr)
> >  {
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index f42f98a127d5..e5b8c70950bc 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -648,6 +648,58 @@ static int vmap_pages_range(unsigned long addr, unsigned long end,
> >         return err;
> >  }
> >
> > +static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
> > +                               unsigned long end)
> > +{
> > +       might_sleep();
>
> This interface, and VM_SPARSE in general, would be useful for
> dynamically grown kernel stacks [1]. However, the might_sleep() here
> would be a problem. We would need to be able to call
> vm_area_map_pages() from interrupt-disabled context, so it must not
> sleep. The caller would need to guarantee that the page tables are
> pre-allocated before the mapping.

Sounds like we'd need to differentiate two kinds of sparse regions:
one that is really sparse, where page tables are not populated (the bpf use case),
and another where only the pte level might be empty.
Only the latter will be usable for such auto-grow stacks.

Months back I played with this idea:
https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?&id=ce63949a879f2f26c1c1834303e6dfbfb79d1fbd
which would
"Make vmap_pages_range() allocate page tables down to the last (PTE) level."
Essentially, pass NULL instead of 'pages' into vmap_pages_range()
and it will populate all levels except the last.
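
As a rough sketch (assuming that experimental branch, where
vmap_pages_range() is callable by outside code and accepts pages == NULL
with the same signature as today's static helper in mm/vmalloc.c):

  struct vm_struct *area;
  int err;

  /* reserve the VA range and pre-populate pgd..pmd; no ptes installed */
  area = get_vm_area(area_size, VM_SPARSE);
  err = vmap_pages_range((unsigned long)area->addr,
                         (unsigned long)area->addr + area_size,
                         PAGE_KERNEL, /* pages */ NULL, PAGE_SHIFT);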
Then the page fault handler can service a fault in the auto-growing stack
area if it has a page stashed in some per-cpu free list.
I suspect this is something you might need for a
"16k stack that is populated on fault":
a free list of 3 pages per cpu,
and set_pte_at() in the pf handler.
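
To make that concrete, here is a minimal sketch (all names and numbers
below are hypothetical, and it assumes the pgd..pmd levels were
pre-populated as above, so the fault handler only installs a pte from
the per-cpu stash):

  #include <linux/mm.h>
  #include <linux/percpu.h>
  #include <linux/pgtable.h>

  /* 16k stack: first page mapped eagerly, up to 3 more on fault */
  #define STACK_REFILL_PAGES 3

  static DEFINE_PER_CPU(struct page *, stack_refill[STACK_REFILL_PAGES]);

  /* hypothetical handler, called with interrupts disabled */
  static int handle_stack_fault(unsigned long addr)
  {
          pgd_t *pgd = pgd_offset_k(addr);
          p4d_t *p4d = p4d_offset(pgd, addr);
          pud_t *pud = pud_offset(p4d, addr);
          pmd_t *pmd = pmd_offset(pud, addr);
          pte_t *pte = pte_offset_kernel(pmd, addr);
          struct page *page = NULL;
          int i;

          /* take a pre-allocated page; no allocation, no sleeping here */
          for (i = 0; i < STACK_REFILL_PAGES && !page; i++) {
                  page = this_cpu_read(stack_refill[i]);
                  if (page)
                          this_cpu_write(stack_refill[i], NULL);
          }
          if (!page)
                  return -ENOMEM; /* stash empty; fatal in a real version */

          /* pgd..pmd exist already, so only the last level is touched */
          set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
          return 0;
  }

The stash would then be refilled from a context that is allowed to
sleep, e.g. when the task returns to a non-atomic section.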




