Re: [LSF/MM/BPF TOPIC] reducing direct map fragmentation

On Wed, Feb 01, 2023 at 08:06:37PM +0200, Mike Rapoport wrote:
> Hi all,

Hi Mike, I'm interested in this topic and hope to discuss it with you
at LSF/MM/BPF.
 
> There are use-cases that need to remove pages from the direct map or at least
> map them at PTE level. These use-cases include vfree, module loading, ftrace,
> kprobe, BPF, secretmem and generally any caller of set_memory/set_direct_map
> APIs.
> 
> Remapping pages at PTE level causes split of the PUD and PMD sized mappings
> in the direct map which leads to performance degradation.
>
> To reduce the performance hit caused by the fragmentation of the direct
> map, it makes sense to group and/or cache the base pages removed from the
> direct map so that most of the base pages created during a split of a large
> page will be consumed by users requiring PTE level mappings.

How much of a performance difference did you see in your tests when the
direct map was fragmented, or is there a way to measure this difference?

> Last year the proposal to use a new migrate type for such cache received
> strong pushback and the suggested alternative was to try to use slab
> instead.
> 
> I've been thinking about it (yeah, it took me a while) and I believe slab
> is not appropriate because use cases require at least page size allocations
> and some would really benefit from higher order allocations, and in most
> cases the code that allocates memory excluded from the direct map
> needs the struct page/folio.
>
> For example, caching allocations of text in 2M pages would benefit from
> reduced iTLB pressure and doing kmalloc() from vmalloc() will be way more
> intrusive than using some variant of __alloc_pages().
>
> Secretmem and potentially PKS protected page tables also need struct
> page/folio.
> 
> My current proposal is to have a cache of 2M pages close to the page
> allocator and use a GFP flag to make allocation request use that cache. On
> the free() path, the pages that are mapped at PTE level will be put into
> that cache.

I would like to discuss not only having a cache layer for such pages, but
also how the direct map could be merged back correctly and efficiently.

I vaguely recall that Aaron Lu sent an RFC series about this, and Kirill A.
Shutemov's feedback was to batch the merge operations. [1]

Also, a CPA API called by the cache layer could merge fragmented mappings,
which would work for merging 4K pages back to 2M [2], but not for merging
2M mappings into 1G mappings.

At that time I didn't follow the later discussions (e.g. execmem_alloc()),
so maybe I'm missing some points.

[1] https://lore.kernel.org/linux-mm/20220809100408.rm6ofiewtty6rvcl@box

[2] https://lore.kernel.org/linux-mm/YvfLxuflw2ctHFWF@xxxxxxxxxx
 
> The cache is internally implemented as a buddy allocator so it can satisfy
> high order allocations, and there will be a shrinker to release free pages
> from that cache to the page allocator.
> 
> I hope to have a first prototype posted Really Soon.

Looking forward to that!
I wonder how it will be shaped.

> 
> -- 
> Sincerely yours,
> Mike.




