Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs

Song!

On Wed, Nov 30 2022 at 08:18, Song Liu wrote:
> On Tue, Nov 29, 2022 at 3:56 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> You are not making anything easier. You are violating the basic
>> engineering principle of "Fix the root cause, not the symptom".
>>
>
> I am not sure what is the root cause and the symptom here.

The symptom is iTLB pressure. The root cause is the way module memory
is allocated, which in turn causes the fragmentation into 4k PTEs. The
same problem affects anything which uses module_alloc() to get space
for text, e.g. kprobes, tracing....

A module consists of:

  - text sections
  - data sections

Except for PPC32, which has the module data in vmalloc space, all others
allocate text and data sections in one lump.

This en-bloc allocation is one reason for the 4k splits:

   - text is RX
   - data is RW or RO

Truly vmalloc'ed module data is not an option for 64-bit architectures
which use PC-relative addressing, as vmalloc does not guarantee that the
data ends up within the limited displacement range (s32 on x86_64).
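
To make the displacement limit concrete, here is a minimal userspace
sketch. fits_rel32() is a hypothetical helper, not a kernel function. It
only checks whether a target is reachable from a given instruction
pointer with a signed 32-bit displacement.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: can a PC-relative instruction
 * whose following instruction starts at @next_ip reach @target with a
 * signed 32-bit displacement?
 */
static bool fits_rel32(uint64_t next_ip, uint64_t target)
{
	/* Unsigned subtraction wraps; reinterpret the result as signed. */
	int64_t disp = (int64_t)(target - next_ip);

	return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

Text and data which both live in the module address range are a few
megabytes apart at most and pass this check. Data placed at an arbitrary
vmalloc address can be terabytes away from the text and fails it.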

This made me look at your allocator again:

> +#if defined(CONFIG_MODULES) && defined(MODULES_VADDR)
> +#define EXEC_MEM_START MODULES_VADDR
> +#define EXEC_MEM_END MODULES_END
> +#else
> +#define EXEC_MEM_START VMALLOC_START
> +#define EXEC_MEM_END VMALLOC_END
> +#endif

The #else part is completely broken on x86_64 and any other
architecture which has a restricted PC-relative displacement.

Even if modules are disabled in Kconfig the only safe place to allocate
executable kernel text from (on these architectures) is the modules
address space. The ISA restrictions do not go magically away when
modules are disabled.

In the early version of the SKX retbleed mitigation work I had

  https://lore.kernel.org/all/20220716230953.442937066@xxxxxxxxxxxxx

exactly to handle this correctly for the !MODULE case. It went nowhere
as we did not need the trampolines in the final version.

This is why Peter suggested to 'split' the module address range into a
top down and bottom up part:

  https://lore.kernel.org/bpf/Ys6cWUMHO8XwyYgr@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
  
That obviously separates text and data, but keeps everything within the
defined working range.

It immediately solves the text problem for _all_ module_alloc() users
and still leaves the data split into 4k pages due to RO/RW sections.

But after staring at it for a while I think this top down and bottom up
dance is too much effort for not much gain. The module address space is
sized generously, so the straightforward solution is to split that
space into two blocks and use them to allocate text and data separately.
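
As a rough illustration of the two-block split, a userspace sketch
follows. The range constants, the even split point and all names
(blk_alloc(), BLK_TEXT, ...) are assumptions made up for the example.
It is a plain bump allocator without free and without locking, so it
shows the address layout only, not a usable allocator.

```c
#include <stdint.h>

/* Illustrative range only; the real values are architecture specific. */
#define RANGE_START	0xffffffffa0000000ULL
#define RANGE_END	0xffffffffff000000ULL
#define RANGE_SPLIT	(RANGE_START + (RANGE_END - RANGE_START) / 2)

enum blk_type { BLK_TEXT, BLK_DATA };

struct blk {
	uint64_t next;	/* first unused address in the block */
	uint64_t end;	/* one past the last address of the block */
};

static struct blk blocks[] = {
	[BLK_TEXT] = { RANGE_START, RANGE_SPLIT },
	[BLK_DATA] = { RANGE_SPLIT, RANGE_END },
};

/*
 * Allocate @size bytes from the block for @type. @align must be a
 * power of two. Returns 0 when the block is exhausted.
 */
static uint64_t blk_alloc(enum blk_type type, uint64_t size, uint64_t align)
{
	struct blk *b = &blocks[type];
	uint64_t addr = (b->next + align - 1) & ~(align - 1);

	if (addr < b->next || addr > b->end || size > b->end - addr)
		return 0;

	b->next = addr + size;
	return addr;
}
```

Text allocations never share a page with data, so the text block can
stay RX in one piece, and both blocks remain inside the range which is
reachable with a restricted displacement.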

The rest of Peter's suggestions on how to migrate there still apply.

The init sections of a module are obviously separate as they are freed
after the module is initialized, but they are not really special either.
Today they leave holes in the address range. With the new scheme these
holes will be in the memory backed large mapping, but I don't see a real
issue with that, especially as those holes at least in text can be
reused for small allocations (kprobes, trace, bpf).

As a logical next step we make that three blocks and allocate text,
data and rodata separately, which will preserve the large mappings for
text and data. rodata still needs to be split because we need a space to
accommodate ro_after_init data.

Alternatively, instead of splitting the module address space, the
allocation mechanism can keep track of the types (text, data, rodata)
and manage large mapping blocks per type. There are pros and cons for
both approaches, so that needs some thought.
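
The per-type variant could look roughly like the following userspace
sketch. Everything in it is an assumption for illustration: the names
(typed_alloc(), MAP_TEXT, ...), the 2M chunk size and the range
constants. Each mapping type carves small allocations out of its own
chunk and grabs a fresh chunk from the shared range when the current one
is full. The unused tail of a full chunk is simply abandoned here, and
there is no free and no locking.

```c
#include <stdint.h>

#define CHUNK_SIZE	(2ULL << 20)		/* one 2M large mapping */
#define RANGE_START	0xffffffffa0000000ULL	/* illustrative only */
#define RANGE_END	0xffffffffff000000ULL	/* illustrative only */

enum map_type { MAP_TEXT, MAP_DATA, MAP_RODATA, MAP_TYPES };

struct chunk {
	uint64_t next;	/* first unused address in the current chunk */
	uint64_t end;	/* one past the end of the current chunk */
};

static uint64_t range_next = RANGE_START;	/* next unassigned chunk */
static struct chunk cur[MAP_TYPES];		/* zeroed: no chunk yet */

/* Returns an address inside a chunk owned by @type, or 0 on failure. */
static uint64_t typed_alloc(enum map_type type, uint64_t size)
{
	struct chunk *c = &cur[type];
	uint64_t addr;

	if (size == 0 || size > CHUNK_SIZE)
		return 0;

	if (c->end - c->next < size) {
		/* Current chunk is full or missing: take a fresh one. */
		if (RANGE_END - range_next < CHUNK_SIZE)
			return 0;
		c->next = range_next;
		c->end = range_next + CHUNK_SIZE;
		range_next += CHUNK_SIZE;
	}

	addr = c->next;
	c->next += size;
	return addr;
}
```

A chunk only ever holds one mapping type, so its permissions can be set
once for the whole large mapping, and the types interleave freely within
the one address range instead of each getting a fixed share.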

But at the end we want an allocation mechanism which:

  - preserves large mappings
  - handles a distinct address range
  - is mapping type aware

That solves _all_ the issues of modules, kprobes, tracing, bpf in one
go. See?

Thanks,

        tglx



