On Mon, Aug 23, 2021 at 08:29:49PM +0000, Edgecombe, Rick P wrote: > On Mon, 2021-08-23 at 16:25 +0300, Mike Rapoport wrote: > > From: Mike Rapoport <rppt@xxxxxxxxxxxxx> > > > > When __GFP_PTE_MAPPED flag is passed to an allocation request of > > order 0, > > the allocated page will be mapped at PTE level in the direct map. > > > > To reduce the direct map fragmentation, maintain a cache of 4K pages > > that > > are already mapped at PTE level in the direct map. Whenever the cache > > should be replenished, try to allocate 2M page and split it to 4K > > pages > > to localize shutter of the direct map. If the allocation of 2M page > > fails, > > fallback to a single page allocation at expense of the direct map > > fragmentation. > > > > The cache registers a shrinker that releases free pages from the > > cache to > > the page allocator. > > > > The __GFP_PTE_MAPPED and caching of 4K pages are enabled only if an > > architecture selects ARCH_WANTS_PTE_MAPPED_CACHE in its Kconfig. > > > > [ > > cache management are mostly copied from > > > https://lore.kernel.org/lkml/20210505003032.489164-4-rick.p.edgecombe@xxxxxxxxx/ > > ] > > > > Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxx> > > --- > > arch/Kconfig | 8 + > > arch/x86/Kconfig | 1 + > > include/linux/gfp.h | 11 +- > > include/linux/mm.h | 2 + > > include/linux/pageblock-flags.h | 26 ++++ > > init/main.c | 1 + > > mm/internal.h | 3 +- > > mm/page_alloc.c | 261 +++++++++++++++++++++++++++++++- > > 8 files changed, 309 insertions(+), 4 deletions(-) ... > > +static void pte_mapped_cache_add_neighbour_pages(struct page *page) > > +{ > > +#if 0 > > + /* > > + * TODO: if pte_mapped_cache_replenish() had to fallback to > > order-0 > > + * allocation, the large page in the direct map will be split > > + * anyway and if there are free pages in the same pageblock > > they > > + * can be added to pte_mapped cache. > > + */ > > + unsigned int order = (1 << HUGETLB_PAGE_ORDER); > > + unsigned int nr_pages = (1 << order); > > + unsigned long pfn = page_to_pfn(page); > > + struct page *page_head = page - (pfn & (order - 1)); > > + > > + for (i = 0; i < nr_pages; i++) { > > + page = page_head + i; > > + if (is_free_buddy_page(page)) { > > + take_page_off_buddy(page); > > + pte_mapped_cache_add(&pte_mapped_cache, page); > > + } > > + } > > +#endif > > +} > > > This seems a nice benefit of doing this sort of stuff in the page > allocator if it can work. I didn't try enable it yet, but I don't see a fundamental reason why this won't work. > > +static struct page *alloc_page_pte_mapped(gfp_t gfp) > > > I'm a little disappointed building into the page allocator didn't > automatically make higher order allocations easy. It seems this mostly > bolts the grouped pages code on to the page allocator and splits out of > the allocation/free paths to call into it? > > I was thinking the main benefit of handling direct map permissions in > the page allocator would be re-using the buddy part to support high > order pages, etc. Did you try to build it in like that? If we can't get > that, what is the benefit to doing permission stuff in the pageallocator? The addition of grouped pages to page allocator the way I did is somewhat intermediate solution between keeping such cache entirely separate from page allocator vs making it really tightly integrated, e.g. using a new migratetype or doing more intrusive changes to page allocator. One of the reasons I did it this way is to present various trade-offs because, tbh, I'm not yet sure what's the best way to move forward. [The other reason being my laziness, dropping your grouped pages code into the page allocator was the simplest thing to do ;-)]. The immediate benefit of having this code close to the page allocator is the simplification of the free path. Otherwise we'd need a cache-specific free method or some information in struct page about how to free a grouped page. Besides, it is possible to put pages mapped as 4k into such cache at boot time when page allocator is initialized. Also, keeping a central cache for multiple users will improve memory utilization and I believe it would require less splits of the direct map. OTOH, keeping such caches per-user allows managing access policy per cache which could be better from the security POV. I'm also going to explore the possibilities of using a new migratetype or SL*B as Dave suggested. > > +{ > > + struct pte_mapped_cache *cache = &pte_mapped_cache; > > + struct page *page; > > + > > + page = pte_mapped_cache_get(cache); > > + if (page) { > > + prep_new_page(page, 0, gfp, 0); > > + goto out; > > + } > > + > > + page = pte_mapped_cache_replenish(cache, gfp); > > + > > +out: > > + return page; > > +} > > + > We probably want to exclude GFP_ATOMIC before calling into CPA unless > debug page alloc is on, because it may need to split and sleep for the > allocation. There is a page table allocation with GFP_ATOMIC passed actually. Looking at the callers of alloc_low_pages() it seems that GFP_ATOMIC there is stale... > In my next series of this I added support for GFP_ATOMIC to this code, > but that solution should only work for permission changing grouped page > allocators in the protected page tables case where the direct map > tables are handled differently. As a general solution though (that's > the long term intention right?), GFP_ATOMIC might deserve some > consideration. ... but for the general solution GFP_ATOMIC indeed deserves some consideration. > The other thing is we probably don't want to clean out the atomic > reserves and add them to a cache just for one page. I opted to just > convert one page in the GFP_ATOMIC case. Do you mean to allocate one page in GFP_ATOMIC case and bypass high order allocation? But the CPA split is still necessary here, isn't it? -- Sincerely yours, Mike.