[LSF/MM/BPF TOPIC] reducing direct map fragmentation

Mike Rapoport <rppt@xxxxxxxxxx> · Wed, 1 Feb 2023 20:06:37 +0200

Hi all,

There are use-cases that need to remove pages from the direct map or at least
map them at PTE level. These use-cases include vfree, module loading, ftrace,
kprobe, BPF, secretmem and generally any caller of the
set_memory/set_direct_map APIs.

Remapping pages at PTE level splits the PUD- and PMD-sized mappings in the
direct map, which leads to performance degradation.
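
To put rough numbers on what a split costs, here is a small sketch. It
assumes x86-64 with 4 KiB base pages, 2 MiB PMD mappings and 1 GiB PUD
mappings; the macro and function names are made up for illustration and are
not kernel identifiers.

```c
#include <assert.h>

/* Assumed x86-64 mapping sizes; other architectures differ. */
#define BASE_PAGE_SIZE (4UL << 10)   /* 4 KiB PTE mapping */
#define PMD_MAP_SIZE   (2UL << 20)   /* 2 MiB PMD mapping */
#define PUD_MAP_SIZE   (1UL << 30)   /* 1 GiB PUD mapping */

/*
 * Number of smaller mappings needed to cover one larger mapping once it
 * has been split (hypothetical helper, illustration only).
 */
static unsigned long entries_after_split(unsigned long large,
                                         unsigned long small)
{
    assert(small != 0 && large % small == 0);
    return large / small;
}
```

Under these assumptions, entries_after_split(PMD_MAP_SIZE, BASE_PAGE_SIZE)
is 512, and a 1 GiB mapping split all the way down to PTE level needs
262144 entries, so a single 4 KiB permission change can replace one TLB
entry with hundreds.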

To reduce the performance hit caused by fragmentation of the direct map, it
makes sense to group and/or cache the base pages removed from the direct map,
so that most of the base pages created when a large page is split are
consumed by users that require PTE-level mappings.

Last year, the proposal to use a new migrate type for such a cache received
strong pushback, and the suggested alternative was to try to use slab
instead.

I've been thinking about it (yeah, it took me a while) and I believe slab
is not appropriate because the use cases require at least page-size
allocations, some would really benefit from higher-order allocations, and in
most cases the code that allocates memory excluded from the direct map
needs the struct page/folio.

For example, allocations of text would benefit from the reduced iTLB
pressure of being cached in 2M pages, and doing kmalloc() from vmalloc()
would be way more intrusive than using some variant of __alloc_pages().

Secretmem and potentially PKS protected page tables also need struct
page/folio.

My current proposal is to have a cache of 2M pages close to the page
allocator and use a GFP flag to make an allocation request use that cache.
On the free() path, pages that are mapped at PTE level will be put into
that cache.
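
A toy userspace model of the alloc/free routing described above might look
like the sketch below. Everything in it is hypothetical: TOY_GFP_UNMAPPED
stands in for the proposed GFP flag (no name has been settled), the refill
policy of taking one 2M page whenever the cache runs dry is my assumption,
and the counters only model bookkeeping, not real page tables.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_GFP_UNMAPPED 0x1u  /* hypothetical flag, not a real GFP bit */
#define PAGES_PER_2M     512   /* 2 MiB / 4 KiB base pages */

struct toy_state {
    unsigned long cache_free;    /* base pages currently sitting in the cache */
    unsigned long large_splits;  /* 2M direct map mappings split so far */
};

struct toy_page {
    bool pte_mapped;             /* page lives in a PTE-mapped 2M chunk */
};

static struct toy_page toy_alloc_page(struct toy_state *s, unsigned int gfp)
{
    struct toy_page p = { .pte_mapped = false };

    if (!(gfp & TOY_GFP_UNMAPPED))
        return p;                /* ordinary request: direct map untouched */

    if (s->cache_free == 0) {
        /* Assumed refill: grab one 2M page and split its mapping once. */
        s->cache_free = PAGES_PER_2M;
        s->large_splits++;
    }
    s->cache_free--;
    p.pte_mapped = true;
    return p;
}

static void toy_free_page(struct toy_state *s, struct toy_page p)
{
    /* Pages already mapped at PTE level go back to the cache for reuse. */
    if (p.pte_mapped)
        s->cache_free++;
}
```

In this model, 600 flagged allocations cost two 2M splits, and once those
pages are freed back, further flagged allocations are served from the cache
without splitting anything else.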

The cache is internally implemented as a buddy allocator so it can satisfy
high-order allocations, and there will be a shrinker to release free pages
from that cache to the page allocator.
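
To make the buddy idea concrete, here is a minimal userspace sketch over a
single 2M chunk (512 base pages, orders 0 to 9). It is not the prototype:
the names are invented, the linear scan is chosen for brevity rather than
speed, and the "shrinker" is reduced to a check for whether the whole chunk
has coalesced and could be handed back.

```c
#include <assert.h>

#define MAX_ORDER 9                 /* 2^9 * 4 KiB = 2 MiB */
#define NR_PAGES  (1 << MAX_ORDER)  /* 512 base pages per cached chunk */
#define NOT_FREE  (-1)

/* Hypothetical toy cache: one 2 MiB chunk already mapped at PTE level. */
struct toy_cache {
    /* Order of the free block starting at page i, or NOT_FREE. */
    int free_order[NR_PAGES];
};

static void toy_cache_init(struct toy_cache *c)
{
    for (int i = 0; i < NR_PAGES; i++)
        c->free_order[i] = NOT_FREE;
    c->free_order[0] = MAX_ORDER;   /* the whole chunk is one free block */
}

/* Returns the first page index of the block, or -1 if nothing fits. */
static int toy_cache_alloc(struct toy_cache *c, int order)
{
    assert(order >= 0 && order <= MAX_ORDER);
    for (int o = order; o <= MAX_ORDER; o++) {
        for (int i = 0; i < NR_PAGES; i += 1 << o) {
            if (c->free_order[i] != o)
                continue;
            c->free_order[i] = NOT_FREE;
            /* Split down; each upper half becomes a free buddy. */
            while (o > order) {
                o--;
                c->free_order[i + (1 << o)] = o;
            }
            return i;
        }
    }
    return -1;
}

static void toy_cache_free(struct toy_cache *c, int idx, int order)
{
    assert(order >= 0 && order <= MAX_ORDER);
    assert(idx >= 0 && idx < NR_PAGES && (idx & ((1 << order) - 1)) == 0);
    /* Coalesce with the buddy for as long as it is free at the same order. */
    while (order < MAX_ORDER) {
        int buddy = idx ^ (1 << order);

        if (c->free_order[buddy] != order)
            break;
        c->free_order[buddy] = NOT_FREE;
        idx &= ~(1 << order);       /* merged block starts at the lower half */
        order++;
    }
    c->free_order[idx] = order;
}

/* A shrinker could return the chunk once it is entirely free again. */
static int toy_cache_chunk_is_free(const struct toy_cache *c)
{
    return c->free_order[0] == MAX_ORDER;
}
```

The point of the buddy layout is that an order-2 request and an order-0
request can both come out of the same already-split 2M chunk, and once
everything is freed the chunk coalesces back to order 9, which is when
releasing it to the page allocator makes sense.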

I hope to have a first prototype posted Really Soon.

-- 
Sincerely yours,
Mike.