Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Thu, 01 Dec 2022 23:34:57 +0100

Mike!

On Thu, Dec 01 2022 at 22:23, Mike Rapoport wrote:
> On Thu, Dec 01, 2022 at 10:08:18AM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
>> The symptom is iTLB pressure. The root cause is the way module
>> memory is allocated, which in turn causes the fragmentation into
>> 4k PTEs. That's the same problem for anything which uses module_alloc()
>> to allocate space for text, e.g. kprobes, tracing....
>
> There's also dTLB pressure caused by the fragmentation of the direct map.
> The memory allocated with module_alloc() is a priori mapped with 4k PTEs,
> but setting RO in the vmalloc address space also updates the direct map
> alias and this causes splits of large pages.
>
> It's not clear what causes more performance improvement: avoiding splits of
> large pages in the direct map or reducing iTLB pressure by backing text
> memory with 2M pages.

From our experiments when doing the first version of the SKX retbleed
mitigation, the main improvement came from reducing iTLB pressure simply
because the iTLB cache is really small.

The kernel text placement is way beyond suboptimal. If you really do a
hotpath analysis and (manually) place all hot code into one or two 2M
pages, then you can achieve massive performance improvements way above
the 10% range.

We currently have a master's student investigating this, but it will take
some time until usable results materialize.

> If the major improvement comes from keeping the direct map intact, it
> might be possible to mix data and text in the same 2M page.

No. That can't work.

    text = RX
    data = RW or RO

If you mix this, then you end up with RWX for the whole 2M page. Not an
option really as you lose _all_ protections in one go.
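
To make the permission argument concrete, here is a minimal userspace
sketch, not kernel code. The permission bits and both helper names are
made up for illustration. It only models the fact that one large mapping
carries one set of permissions, so it has to grant the union of what
every object inside it needs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical permission bits, for illustration only. */
enum perm {
	PERM_R = 1u << 0,
	PERM_W = 1u << 1,
	PERM_X = 1u << 2,
};

/*
 * A single large mapping has a single set of permissions. It therefore
 * has to grant the union of what all objects placed in it require.
 */
static unsigned int large_page_perms(const unsigned int *needs, size_t n)
{
	unsigned int perms = 0;

	assert(needs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		perms |= needs[i];
	return perms;
}

/* True if the mapping ends up both writable and executable. */
static bool is_wx(unsigned int perms)
{
	return (perms & PERM_W) && (perms & PERM_X);
}
```

Feeding it one RX object and one RW object yields R|W|X for the shared
page, while a page holding only text stays RX.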

That's why I said:

>>      As a logical next step we make that three blocks and allocate text,
>>      data and rodata separately, which will preserve the large mappings for
>>      text and data. rodata still needs to be split because we need a space to
>>      accommodate ro_after_init data.

The point is that rodata and ro_after_init data are a pretty small
portion of modules, as far as my limited analysis of a distro build
shows.

The bulk is in text and data. So if we preserve 2M pages for text and
for RW data and bite the bullet to split one 2M page for
ro[_after_init_]data, we get the maximum benefit for the least
complexity.
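
As a rough sketch of the three-block split, assuming one pool per
mapping type: the enum, the size macros and pool_map_size() below are
invented names, not an existing kernel interface. The only thing it
encodes is which pools keep their 2M mappings and which one gets split:

```c
#include <assert.h>

#define SZ_4K	(4096UL)
#define SZ_2M	(2UL * 1024 * 1024)

/* Hypothetical mapping types, one pool each. */
enum mem_type {
	MEM_TEXT,	/* RX */
	MEM_DATA,	/* RW */
	MEM_RODATA,	/* RO, includes ro_after_init data */
};

/* Mapping granularity each pool ends up with in this scheme. */
static unsigned long pool_map_size(enum mem_type type)
{
	switch (type) {
	case MEM_TEXT:
	case MEM_DATA:
		/* Uniform permissions, so the large mapping survives. */
		return SZ_2M;
	case MEM_RODATA:
		/* ro_after_init data flips RW -> RO later: needs a split. */
		return SZ_4K;
	}
	assert(0 && "unknown memory type");
	return 0;
}
```

Only the rodata pool pays the price of 4k mappings. Text and RW data
never share a page, so neither of them forces a split on the other.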

>> But at the end we want an allocation mechanism which:
>> 
>>   - preserves large mappings
>>   - handles a distinct address range
>>   - is mapping type aware
>> 
>> That solves _all_ the issues of modules, kprobes, tracing, bpf in one
>> go. See?
>
> There is also
>
>     - handles kaslr
>
> and at least for arm and powerpc we'd also need 
>
>     - handles architecture specific range restrictions and fallbacks

Good points.

kaslr should be fairly trivial.

The architecture specific restrictions and fallbacks are not really hard
to solve either. If done right then the allocator just falls back to 4k
maps during initialization in early boot which brings it back to the
status quo. But we can provide consistent semantics for the three types
which are required for modules and the text only usage for kprobes,
tracing, bpf...

Thanks,

        tglx