On Wed, Feb 01, 2023 at 08:06:37PM +0200, Mike Rapoport wrote:
> Hi all,

Hi Mike,

I'm interested in this topic and hope to discuss it with you at
LSF/MM/BPF.

> There are use-cases that need to remove pages from the direct map or at
> least map them at PTE level. These use-cases include vfree, module
> loading, ftrace, kprobe, BPF, secretmem and generally any caller of
> set_memory/set_direct_map APIs.
>
> Remapping pages at PTE level causes split of the PUD and PMD sized
> mappings in the direct map which leads to performance degradation.
>
> To reduce the performance hit caused by the fragmentation of the direct
> map, it makes sense to group and/or cache the base pages removed from
> the direct map so that most of the base pages created during a split of
> a large page will be consumed by users requiring PTE level mappings.

How much of a performance difference did you see in your tests when the
direct map was fragmented, and is there a way to measure this difference?

> Last year the proposal to use a new migrate type for such a cache
> received strong pushback and the suggested alternative was to try to
> use slab instead.
>
> I've been thinking about it (yeah, it took me a while) and I believe
> slab is not appropriate because the use cases require at least
> page-size allocations, some would really benefit from higher-order
> allocations, and in most cases the code that allocates memory excluded
> from the direct map needs the struct page/folio.
>
> For example, caching allocations of text in 2M pages would benefit from
> reduced iTLB pressure, and doing kmalloc() from vmalloc() will be way
> more intrusive than using some variant of __alloc_pages().
>
> Secretmem and potentially PKS-protected page tables also need the
> struct page/folio.
>
> My current proposal is to have a cache of 2M pages close to the page
> allocator and use a GFP flag to make allocation requests use that
> cache. On the free() path, the pages that are mapped at PTE level will
> be put into that cache.

I would like to discuss not only having a cache layer of pages but also
how the direct map could be merged back correctly and efficiently. I
vaguely recall that Aaron Lu sent an RFC series about this, and Kirill
A. Shutemov's feedback was to batch the merge operations. [1]

Also, a CPA API called by the cache layer to merge fragmented mappings
would work for merging 4K pages into 2M mappings [2], but won't work
for merging 2M mappings into 1G mappings. At that time I didn't follow
the later discussions (e.g. execmem_alloc()), so maybe I'm missing some
points.

[1] https://lore.kernel.org/linux-mm/20220809100408.rm6ofiewtty6rvcl@box
[2] https://lore.kernel.org/linux-mm/YvfLxuflw2ctHFWF@xxxxxxxxxx

> The cache is internally implemented as a buddy allocator so it can
> satisfy high order allocations, and there will be a shrinker to release
> free pages from that cache to the page allocator.
>
> I hope to have a first prototype posted Really Soon.

Looking forward to that! I wonder how it will be shaped.

> --
> Sincerely yours,
> Mike.
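
To make the discussion concrete, here is how I imagine the allocation
side could look. This is only my reading of the proposal, not Mike's
actual patches; __GFP_UNMAPPED is a made-up flag name and the real
interface may well be different:

/*
 * Sketch of the allocation side (my guess at the interface).
 * __GFP_UNMAPPED is a hypothetical flag asking that the request be
 * served from the cache of 2M pages whose direct map entries were
 * already split, so the allocation causes no new PUD/PMD split.
 */
#include <linux/gfp.h>

static struct page *alloc_unmapped_pages(unsigned int order)
{
	return alloc_pages(GFP_KERNEL | __GFP_UNMAPPED, order);
}

The nice property of keeping this behind a GFP flag is that callers
keep using struct page/folio as they do today.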
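
On the free path, batching the merges as suggested in [1] could look
roughly like the sketch below, where add_page_to_cache_freelist(),
pmd_range_is_free() and merge_direct_map_pmd() are all hypothetical
helpers standing in for the cache internals:

/*
 * Only attempt to restore a PMD-sized direct map entry once all base
 * pages of that 2M range are free in the cache, so the CPA cost is
 * paid once per range instead of once per page.  A 2M -> 1G merge
 * would need a separate pass that the current CPA code lacks [2].
 */
#include <linux/mm.h>

void add_page_to_cache_freelist(struct page *page);	/* hypothetical */
bool pmd_range_is_free(unsigned long pmd_start);	/* hypothetical */
void merge_direct_map_pmd(unsigned long pmd_start);	/* hypothetical */

static void unmapped_cache_free_page(struct page *page)
{
	unsigned long addr = (unsigned long)page_address(page);
	unsigned long pmd_start = ALIGN_DOWN(addr, PMD_SIZE);

	add_page_to_cache_freelist(page);

	if (pmd_range_is_free(pmd_start))
		merge_direct_map_pmd(pmd_start);
}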
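
The shrinker part at least can use the standard shrinker API; something
like the sketch below, where unmapped_cache_release() is again a
hypothetical helper that merges the direct map entries back and returns
the pages to the page allocator:

#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t unmapped_cache_nr_free;	/* free pages in the cache */

unsigned long unmapped_cache_release(unsigned long nr);	/* hypothetical */

static unsigned long unmapped_cache_count(struct shrinker *sh,
					  struct shrink_control *sc)
{
	unsigned long nr = atomic_long_read(&unmapped_cache_nr_free);

	return nr ? nr : SHRINK_EMPTY;
}

static unsigned long unmapped_cache_scan(struct shrinker *sh,
					 struct shrink_control *sc)
{
	return unmapped_cache_release(sc->nr_to_scan);
}

static struct shrinker unmapped_cache_shrinker = {
	.count_objects	= unmapped_cache_count,
	.scan_objects	= unmapped_cache_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* registered at init time with:
 *	register_shrinker(&unmapped_cache_shrinker, "unmapped-cache");
 */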