On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy <christophe.leroy@xxxxxxxxxx> wrote:
>
> + linuxppc-dev list as we start mentioning powerpc.
>
> Le 09/11/2022 à 18:43, Song Liu a écrit :
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> >
> > [...]
> >
> >>>> The proposed execmem_alloc() looks to me very much tailored for x86
> >>>> to be used as a replacement for module_alloc(). Some architectures
> >>>> have module_alloc() that is quite different from the default or x86
> >>>> version, so I'd expect at least some explanation how modules etc. can
> >>>> use execmem_ APIs without breaking !x86 architectures.
> >>>
> >>> I think this is fair, but I think we should ask ourselves - how much
> >>> should we do in one step?
> >>
> >> I think that at least we need evidence that execmem_alloc() etc. can
> >> actually be used by modules/ftrace/kprobes. Luis said that RFC v2
> >> didn't work for him at all, so having a core MM API for code allocation
> >> that only works with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et al. in module support is difficult, folks
> > are making progress with it. For example, the prototype would have been
> > more difficult before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
>
> By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> was completely different: on powerpc book3s/32, no-exec flagging is per
> 256 Mbyte segment, so in order to provide STRICT_MODULES_RWX it was
> necessary to put data outside of the segment that holds module text, in
> order to be able to flag RW data as no-exec.

Yeah, I only noticed the actual motivation of this work earlier today. :)

> But I'm happy if it can also serve other purposes.
>
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> > s390.
> >
> >>> For non-text_poke() architectures, the way you can make it work is to
> >>> have the API look like:
> >>> execmem_alloc()  <- Does the allocation, but not necessarily usable yet
> >>> execmem_write()  <- Loads the mapping, doesn't work after finish()
> >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >>>
> >>> So for text_poke():
> >>> execmem_alloc()  <- reserves the mapping
> >>> execmem_write()  <- text_poke()s to the mapping
> >>> execmem_finish() <- does nothing
> >>>
> >>> And non-text_poke():
> >>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> >>> execmem_write()  <- Writes normally to it
> >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >>>
> >>> Non-text_poke() only gets the benefits of centralized logic, but the
> >>> interface works for both. This is pretty much what the perm_alloc() RFC
> >>> did to make it work with other arches and modules. But to fit with the
> >>> existing modules code (which is actually spread all over) and also
> >>> handle RO sections, it also needed some additional bells and whistles.
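
To make the non-text_poke() fallback above concrete, a minimal sketch could
look like the following. This is illustrative only, not code from the series:
it assumes the three-call interface quoted above and fills in the bodies with
the generic vmalloc()/set_memory_*() kernel primitives; error handling is
omitted.

#include <linux/vmalloc.h>
#include <linux/set_memory.h>
#include <linux/string.h>

/* Plain RW vmalloc allocation; nothing is executable yet. */
void *execmem_alloc(size_t size)
{
	return vmalloc(size);
}

/* The mapping is still RW at this point, so a normal copy is enough. */
void execmem_write(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Flip the mapping to RO+X once all writes are done. */
void execmem_finish(void *addr, int npages)
{
	set_vm_flush_reset_perms(addr);	/* reset perms on later vfree() */
	set_memory_ro((unsigned long)addr, npages);
	set_memory_x((unsigned long)addr, npages);
}
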
> >> I'm less concerned about the non-text_poke() part, but rather about
> >> restrictions on where code and data can live on different architectures
> >> and whether these restrictions won't lead to an inability to use the
> >> centralized logic on, say, arm64 and powerpc.
>
> Until recently, powerpc CPUs didn't implement PC-relative data access.
> Only very recent powerpc CPUs (power10 only?) have the capability to do
> PC-relative accesses, but the kernel doesn't use it yet. So there's no
> constraint on the distance between text and data. What matters is the
> distance between core kernel text and module text, to avoid trampolines.

Ah, this is great. I guess this means powerpc can benefit from this work
with much less effort than x86_64.

> >> For instance, if we use execmem_alloc() for modules, it means that data
> >> sections should be allocated separately with plain vmalloc(). Will this
> >> work universally? Or will this require special care and additional
> >> complexity in the modules code?
> >>
> >>> So the question I'm trying to ask is, how much should we target for
> >>> the next step? I first thought that this functionality was so
> >>> intertwined, it would be too hard to do iteratively. So if we want to
> >>> try iteratively, I'm ok if it doesn't solve everything.
> >>
> >> With execmem_alloc() as the first step I'm failing to see the larger
> >> picture. If we want to use it for modules, how will we allocate RO data?
> >> With a similar rodata_alloc() that uses yet another tree in vmalloc?
> >> How can the caching of large pages in vmalloc be made useful for use
> >> cases like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this
> > case. Since the logic will be mostly identical, I personally don't think
> > adding another tree is a big overhead.
>
> On powerpc, kernel core RAM is not mapped by pages but is mapped by
> blocks. There are only two blocks: one ROX block which contains both
> text and rodata, and one RW block that contains everything else. Maybe
> the same can be done for modules. What matters is to be sure you never
> have WX memory. Having ROX rodata is not an issue.

Got it. Thanks!

Song