Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs

Song Liu <song@xxxxxxxxxx> · Wed, 9 Nov 2022 09:43:50 -0800

On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
[...]

> > >
> > > The proposed execmem_alloc() looks to me very much tailored for x86 to be
> > > used as a replacement for module_alloc(). Some architectures have
> > > module_alloc() that is quite different from the default or x86 version, so
> > > I'd expect at least some explanation how modules etc can use execmem_ APIs
> > > without breaking !x86 architectures.
> >
> > I think this is fair, but I think we should ask ourselves - how
> > much should we do in one step?
>
> I think that at least we need evidence that execmem_alloc() etc can be
> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> for him at all, so having a core MM API for code allocation that only works
> with BPF on x86 seems not right to me.

While using execmem_alloc() et al. in module support is difficult, folks are
making progress with it. For example, the prototype would have been more
difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
(introduced by Christophe).

We also have other users that we can onboard soon: BPF trampoline on
x86_64, BPF JIT and trampoline on arm64, and maybe also on powerpc and
s390.

>
> > For non-text_poke() architectures, the way you can make it work is have
> > the API look like:
> > execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> > execmem_write()  <- Loads the mapping, doesn't work after finish()
> > execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >
> > So for text_poke():
> > execmem_alloc()  <- reserves the mapping
> > execmem_write()  <- text_pokes() to the mapping
> > execmem_finish() <- does nothing
> >
> > And non-text_poke():
> > execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> > execmem_write()  <- Writes normally to it
> > execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >
> > Non-text_poke() only gets the benefits of centralized logic, but the
> > interface works for both. This is pretty much what the perm_alloc() RFC
> > did to make it work with other arch's and modules. But to fit with the
> > existing modules code (which is actually spread all over) and also
> > handle RO sections, it also needed some additional bells and whistles.
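
The three-step interface quoted above can be modeled in user space as a
rough sketch. Everything here is an assumption made for illustration: the
struct layout, the function signatures, the text_poke flag, and
execmem_free() are hypothetical and are not the API proposed in this series.
The sketch also marks the region live in execmem_finish() for both cases so
that the "no writes after finish()" rule is enforced uniformly, which goes
slightly beyond the "does nothing" description for text_poke().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space model; not the kernel's real types or signatures. */
struct execmem {
	unsigned char *buf;
	size_t size;
	bool text_poke;   /* models an arch that has text_poke() */
	bool executable;  /* models the mapping being RO+X */
	bool live;        /* set by execmem_finish() */
};

/* Reserve a region. With text_poke the mapping is modeled as RO+X from the start. */
struct execmem *execmem_alloc(size_t size, bool text_poke)
{
	struct execmem *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	m->buf = calloc(1, size);
	if (!m->buf) {
		free(m);
		return NULL;
	}
	m->size = size;
	m->text_poke = text_poke;
	m->executable = text_poke;
	return m;
}

/* Load bytes into the region; rejected once the region has been finished. */
int execmem_write(struct execmem *m, size_t off, const void *src, size_t len)
{
	if (m->live)
		return -1;
	if (off > m->size || len > m->size - off)
		return -1;
	/* A real text_poke arch would poke here; a plain memcpy stands in for both. */
	memcpy(m->buf + off, src, len);
	return 0;
}

/* Make the region live; non-text_poke flips permissions at this point. */
void execmem_finish(struct execmem *m)
{
	if (!m->text_poke)
		m->executable = true; /* stands in for set_memory_ro()/set_memory_x() */
	m->live = true;
}

/* Hypothetical teardown helper for the model. */
void execmem_free(struct execmem *m)
{
	free(m->buf);
	free(m);
}
```

In this model the caller sees the same alloc/write/finish sequence either
way; only the point at which the region becomes executable differs.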
>
> I'm less concerned about non-text_poke() part, but rather about
> restrictions where code and data can live on different architectures and
> whether these restrictions won't lead to inability to use the centralized
> logic on, say, arm64 and powerpc.
>
> For instance, if we use execmem_alloc() for modules, it means that data
> sections should be allocated separately with plain vmalloc(). Will this
> work universally? Or will this require special care with additional
> complexity in the modules code?
>
> > So the question I'm trying to ask is, how much should we target for the
> > next step? I first thought that this functionality was so intertwined,
> > it would be too hard to do iteratively. So if we want to try
> > iteratively, I'm ok if it doesn't solve everything.
>
> With execmem_alloc() as the first step I'm failing to see the large
> picture. If we want to use it for modules, how will we allocate RO data?
> With a similar rodata_alloc() that uses yet another tree in vmalloc?
> How can the caching of large pages in vmalloc be made useful for use cases
> like secretmem and PKS?

If RO data causes problems with direct map fragmentation, we can use
similar logic. I think we will need another tree in vmalloc for this case.
Since the logic will be mostly identical, I personally don't think adding
another tree is a big overhead.

Thanks,
Song