> On Jul 13, 2022, at 3:20 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jul 13, 2022 at 12:18:44AM -0700, Song Liu wrote:
>> Dynamically allocated kernel texts, such as module texts, bpf programs,
>> and ftrace trampolines, are used in more and more scenarios. Currently,
>> these users allocate memory with module_alloc, fill the memory with text,
>> and then use set_memory_[ro|x] to protect the memory.
>>
>> This approach has two issues:
>> 1) each of these users occupies one or more RO+X pages, and thus one or
>>    more entries in the page table and the iTLB;
>> 2) frequent allocate/free of RO+X pages causes fragmentation of the
>>    kernel direct map [1].
>>
>> BPF prog pack [2] addresses this from the BPF side. Now, make the same
>> logic available to other users of dynamic kernel text.
>>
>> The new API is:
>>
>>   void *vmalloc_exec(size_t size);
>>   void vfree_exec(void *addr, size_t size);
>>
>> vmalloc_exec has different handling for small and big allocations
>> (> PMD_SIZE * num_possible_nodes). Bigger allocations get a dedicated
>> vmalloc allocation, while small allocations share a vmalloc_exec_pack
>> with other allocations.
>>
>> Once allocated, the vmalloc_exec_pack is filled with invalid instructions
>
> *sigh*, again, INT3 is a *VALID* instruction.

I am fully aware "invalid" or "illegal" is not accurate, but I am not sure
what to use. Shall we call them "safe" instructions?

>> and protected with RO+X. Some text_poke feature is required to make
>> changes to the vmalloc_exec_pack. Therefore, vmalloc_exec requires changes
>> from the arch (to provide text_poke family APIs), and the user (to use
>> text poke APIs to make any changes to the memory).
>
> I hate the naming; this isn't just vmalloc, this is a whole different
> allocator built on top of things.
>
> I'm also not convinced this is the right way to go about doing this;
> much of the design here is because of how the module range is mixing
> text and data and working around that.

Hmm... I am not sure mixed data/text is the only problem here.

> So how about instead we separate them? Then much of the problem goes
> away, you don't need to track these 2M chunks at all.

If we manage the memory at a granularity smaller than 2MiB, whether 4kB or
smaller, we still need some way to track which parts are in use, no? I mean
the bitmap.

> Start by adding VM_TOPDOWN_VMAP, which instead of returning the lowest
> (leftmost) vmap_area that fits, picks the highest (rightmost).
>
> Then add module_alloc_data() that uses VM_TOPDOWN_VMAP and make
> ARCH_WANTS_MODULE_DATA_IN_VMALLOC use that instead of vmalloc (with a
> weak function doing the vmalloc).
>
> This gets you: bottom of module range is RO+X only, top is shattered
> between different !X types.
>
> Then track the boundary between X and !X and ensure module_alloc_data()
> and module_alloc() never cross over and stay strictly separated.
>
> Then change all module_alloc() users to expect RO+X memory, instead of
> RW.
>
> Then make sure any extension of the X range is 2M aligned.
>
> And presto, *everybody* always uses 2M TLB for text, modules, bpf,
> ftrace, the lot, and nobody is tracking chunks.
>
> Maybe migration can be eased by instead providing module_alloc_text()
> and ARCH_WANTS_MODULE_ALLOC_TEXT.
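If I am reading this right, a module_alloc_text() user would then look
roughly like the sketch below. This is only my guess at the interface
(module_alloc_text() is just the name you suggested above; the signature
and NULL-on-failure are made up), the point being that callers never see
the memory as RW and do all writes through the text_poke() family
(text_poke_copy() on x86):

#include <linux/moduleloader.h>
#include <asm/text-patching.h>

/*
 * Sketch only: module_alloc_text() does not exist today; assume it
 * returns memory that is already RO+X and backed by 2M pages.
 */
static void *install_text(const void *insns, size_t len)
{
        void *dst;

        dst = module_alloc_text(len);
        if (!dst)
                return NULL;

        /* No plain memcpy(); the mapping is never writable from here. */
        text_poke_copy(dst, insns, len);

        return dst;
}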
If we have the text/data separation, can we just put text after _etext?
Right now, we allocate huge pages for the range _stext to
round_down(_etext, 2MB), and 4kB pages for round_down(_etext, 2MB) to
round_up(_etext, 4kB). To make this more efficient, we can allocate huge
pages for _stext to round_up(_etext, 2MB), and use _etext to
round_up(_etext, 2MB) as the first pool of memory for module_alloc_text().
Once we have used all the memory there, we allocate more huge pages after
round_up(_etext, 2MB).

I am not sure how to make all of this work yet, but I guess it is similar
to the idea you are describing here? However, we will still need some
bitmap to track the usage of these memory pools, right? A rough (and
completely untested) sketch of the bookkeeping I have in mind is appended
below.

Thanks,
Song
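Here is the sketch mentioned above. It is completely untested; every name,
the chunk size, and the locking are made up, and it is only meant to show
the per-pool bitmap bookkeeping (pool setup and growing with new 2MB pages
are omitted):

#include <linux/bitmap.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

/* Arbitrary allocation granularity, only for illustration. */
#define TEXT_POOL_CHUNK 64

/*
 * One pool would cover [_etext, round_up(_etext, 2MB)) at first; later
 * pools would cover whole 2MB pages allocated above round_up(_etext, 2MB).
 */
struct text_pool {
        void            *base;
        size_t          size;
        unsigned long   *used;          /* one bit per TEXT_POOL_CHUNK */
        spinlock_t      lock;
};

static void *text_pool_alloc(struct text_pool *pool, size_t len)
{
        unsigned long chunks = pool->size / TEXT_POOL_CHUNK;
        unsigned long nbits = DIV_ROUND_UP(len, TEXT_POOL_CHUNK);
        unsigned long start;
        void *addr = NULL;

        spin_lock(&pool->lock);
        start = bitmap_find_next_zero_area(pool->used, chunks, 0, nbits, 0);
        if (start < chunks) {
                bitmap_set(pool->used, start, nbits);
                addr = pool->base + start * TEXT_POOL_CHUNK;
        }
        spin_unlock(&pool->lock);

        /* NULL means this pool is full and we need a new 2MB page. */
        return addr;
}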