Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Hi,

On Tue, Apr 19, 2022 at 05:36:45AM +0000, Song Liu wrote:
> Hi Mike, Luis, and Rick,
> 
> Thanks for sharing your work and findings in the space. I didn't 
> realize we were looking at the same set of problems. 
> 
> > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote:
> > 
> > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
> >>> There are use-cases that require 4K pages with non-default
> >>> permissions in the direct map and the pages not necessarily should
> >>> be executable. There were several suggestions to implement caches
> >>> of 4K pages backed by 2M pages.
> >> 
> >> Even if we just focus on the executable side of the story... there
> >> may be users who can share this too.
> >> 
> >> I've gone down memory lane now at least down to year 2005 in kprobes
> >> to see why the heck module_alloc() was used. At first glance there
> >> are some old comments about being within the 2 GiB text kernel
> >> range... But some old tribal knowledge is still lost. The real hints
> >> come from kprobe work since commit 9ec4b1f356b3 ("[PATCH] kprobes:
> >> fix single-step out of line - take2"), so that the "For the
> >> %rip-relative displacement fixups to be doable"... but this got me
> >> wondering, would other users who *do* want similar functionality
> >> benefit from a cache. If the space is limited then using a cache
> >> makes sense. Especially if architectures tend to require hacks for
> >> some of this to all work.
> > 
> > Yea, that was my understanding. X86 modules have to be linked within
> > 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> > to be within 2GB of the kernel text.
> > 
> > 
> > I think of two types of caches we could have: caches of unmapped pages
> > on the direct map and caches of virtual memory mappings. Caches of
> > pages on the direct map reduce breakage of the large pages (and is
> > somewhat x86 specific problem). Caches of virtual memory mappings
> > reduce shootdowns, and are also required to share huge pages. I'll plug
> > my old RFC, where I tried to work towards enabling both:
> > 
> > https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@xxxxxxxxx/
> > 
> > Since then Mike has taken the direct map cache piece a lot further.
> 
> This is really interesting work. With this landed, we won't need 
> the bpf_prog_pack work at all (I think). OTOH, this looks like a 
> long term project, as some of the work in bpf_prog_pack took quite 
> some time to discuss/debate, and that was just a subset of the 
> whole thing. 

I'd say that bpf_prog_pack was a cure for symptoms and this project tries
to address a more general problem.
But you are right, it'll take some time and won't land in 5.19.
 
> I really like the two types of cache concept. But there are some 
> details I cannot figure out about them:

After some discussions we decided to try moving the caching of large pages
to the page allocator and see if the second cache will be needed at all.
But I got distracted after posting the RFC and that work hasn't made any
real progress since then.
 
> 1. Is "caches of unmapped pages on direct map" (cache #1) 
>    sufficient to fix all direct map fragmentation? IIUC, pages in
>    the cache may still be used by other allocation (with some 
>    memory pressure). If the system runs for long enough, there 
>    may be a lot of direct map fragmentation. Is this right?

If the system runs long enough, it may run out of high-order free pages
regardless of the way the caches are implemented. Then we either fail the
allocation because it is impossible to refill the cache with large pages or
fall back to 4k pages and fragment the direct map.
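
To make the two outcomes concrete, here is a minimal user-space sketch of
that refill-or-fall-back decision. It is only a model, not kernel code:
struct perm_cache, cache_alloc_small() and all field names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMALL_PER_LARGE 512	/* 4K pages in one 2M page */

/* Hypothetical model of a cache of 4K pages backed by 2M pages. */
struct perm_cache {
	size_t free_small;	/* 4K pages currently in the cache */
	size_t large_available;	/* free 2M pages the system can still supply */
	size_t fallback_4k;	/* stray 4K pages used; each may split a 2M mapping */
	bool allow_fallback;	/* policy: fall back to 4K pages or fail */
};

/* Returns true if a 4K page was handed out, false if the allocation failed. */
static bool cache_alloc_small(struct perm_cache *c)
{
	assert(c != NULL);

	if (c->free_small == 0 && c->large_available > 0) {
		/* Refill the cache from one high-order page. */
		c->large_available--;
		c->free_small = SMALL_PER_LARGE;
	}
	if (c->free_small > 0) {
		c->free_small--;
		return true;
	}
	if (!c->allow_fallback)
		return false;	/* no large pages left: fail the allocation */

	c->fallback_4k++;	/* no large pages left: use a stray 4K page */
	return true;
}
```

With one large page available, the first 512 calls succeed without touching
fallback_4k; the 513th either fails or increments fallback_4k, depending on
allow_fallback.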

I don't see how we can avoid direct map fragmentation entirely and still be
able to allocate memory for users of set_memory APIs.

> 2. If we have "cache of virtual memory mappings" (cache #2), do we
>    still need cache #1? I know cache #2 alone may waste some 
>    memory, but I still think 2MB within noise for modern systems. 

I presume that by cache #1 you mean the cache in the page allocator. In
that case cache #2 is probably not needed at all, because the cache at page
allocator level will be used by vmalloc() and friends to provide what Rick
called "permissioned allocations".
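
As a rough illustration of why it matters where those pages come from: on
x86, changing permissions of a 4K page splits the 2M direct map mapping that
contains it. The sketch below is a user-space model with a hypothetical
helper, count_split_2m(); it just counts how many distinct 2M mappings a
sorted list of page frame numbers touches.

```c
#include <assert.h>
#include <stddef.h>

#define PFNS_PER_2M 512UL	/* 4K page frames per 2M mapping */

/*
 * Count how many distinct 2M mappings contain at least one of the given
 * page frame numbers. pfns[] must be sorted in ascending order.
 */
static size_t count_split_2m(const unsigned long *pfns, size_t n)
{
	size_t count = 0;
	unsigned long last = 0;

	assert(pfns != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		unsigned long region = pfns[i] / PFNS_PER_2M;

		/* Sorted input: a new region differs from the previous one. */
		if (i == 0 || region != last) {
			count++;
			last = region;
		}
	}
	return count;
}
```

Four pages scattered over four different 2M regions give a count of 4, while
four pages taken from the same 2M region give a count of 1.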

> Thanks,
> Song

-- 
Sincerely yours,
Mike.



