Re: [PATCH v6 bpf-next 0/5] bpf_prog_pack followup

Song Liu <songliubraving@xxxxxx> · Tue, 12 Jul 2022 23:12:22 +0000

> On Jul 12, 2022, at 12:04 PM, Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> 
> On Tue, Jul 12, 2022 at 05:49:32AM +0000, Song Liu wrote:
>>> On Jul 11, 2022, at 9:18 PM, Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>> 
>>> I believe you are mentioning requiring text_poke() because the way
>>> eBPF code uses the module_alloc() is different. Correct me if I'm
>>> wrong, but from what I gather is you use the text_poke_copy() as the data
>>> is already RO+X, contrary module_alloc() use cases. You do this since your
>>> bpf_prog_pack_alloc() calls set_memory_ro() and set_memory_x() after
>>> module_alloc() and before you can use this memory. This is a different type
>>> of allocator. And, again please correct me if I'm wrong but now you want to
>>> share *one* 2 MiB huge-page for multiple BPF programs to help with the
>>> impact of TLB misses.
>> 
>> Yes, sharing 1x 2MiB huge page is the main reason to require text_poke. 
>> OTOH, 2MiB huge pages without sharing is not really useful. Both kprobe
>> and ftrace only uses a fraction of a 4kB page. Most BPF programs and 
>> modules cannot use 2MiB either. Therefore, vmalloc_rw_exec() doesn't add
>> much value on top of current module_alloc(). 
> 
> Thanks for the clarification.
> 
>>> A vmalloc_ro_exec() by definition would imply a text_poke().
>>> 
>>> Can kprobes, ftrace and modules use it too? It would be nice
>>> so to not have to deal with the loose semantics on the user to
>>> have to use set_vm_flush_reset_perms() on ro+x later, but
>>> I think this can be addressed separately on a case by case basis.
>> 
>> I am pretty confident that kprobe and ftrace can share huge pages with 
>> BPF programs.
> 
> Then wonderful, we know where to go in terms of a new API then as it
> can be shared in the future for sure and there are gains.
> 
>> I haven't looked into all the details with modules, but 
>> given CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC, I think it is also 
>> possible.
> 
> Sure.
> 
>> Once this is done, a regular system (without huge BPF program or huge
>> modules) will just use 1x 2MB page for text from module, ftrace, kprobe, 
>> and bpf programs. 
> 
> That would be nice, if possible, however modules will require likely its
> own thing, on my system I see about 57 MiB used on coresize alone.
> 
> lsmod | grep -v Module | cut -f1 -d ' ' | \
> 	xargs sudo modinfo | grep filename | \
> 	grep -o '/.*' | xargs stat -c "%s - %n" | \
> 	awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
> 60001272
> 
> And so perhaps we need such a pool size to be configurable.
> 
>>> But a vmalloc_ro_exec() with a respective free can remove the
>>> requirement to do set_vm_flush_reset_perms().
>> 
>> Removing the requirement to set_vm_flush_reset_perms() is the other
>> reason to go directly to vmalloc_ro_exec(). 
> 
> Yes fantastic.
> 
>> My current version looks like this:
>> 
>> void *vmalloc_exec(unsigned long size);
>> void vfree_exec(void *ptr, unsigned int size);
>> 
>> ro is eliminated as there is no rw version of the API. 
> 
> Alright.
> 
> I am not sure if 2 MiB will suffice given what I mentioned above, and
> what to do to ensure this grows at a reasonable pace. Then, at least for
> usage for all architectures since not all will support text_poke() we
> will want to consider a way to make it easy to users to use non huge
> page fallbacks, but that would be up to those users, so we can wait for
> that.

We are not limited to 2MiB total. The logic is like: 

1. Anything bigger than 2MiB gets its own allocation.
2. We maintain a list of 2MiB pages, and bitmaps showing which parts of 
   these pages are in use. 
3. For objects smaller than 2MiB, we will try to fit it in one of these
   pages. 
   3. a) If there isn't a page with big enough continuous free space, we
        will allocate a new 2MiB page. 

(For system with n NUMA nodes, multiple 2MiB above by n). 

So, if we have 100 kernel modules using 1MiB each, they will share 50x
2MiB pages. 

Thanks,
Song