David Hildenbrand <david@xxxxxxxxxx> writes:

> On 11.02.22 10:16, Aneesh Kumar K V wrote:
>> On 2/11/22 14:00, David Hildenbrand wrote:
>>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
>>>> introduced pageblock_order which will be used to group pages better.
>>>> The kernel now groups pages based on the value of HPAGE_SHIFT. Hence HPAGE_SHIFT
>>>> should be set before we call set_pageblock_order.
>>>>
>>>> set_pageblock_order happens early in the boot and default hugetlb page size
>>>> should be initialized before that to compute the right pageblock_order value.
>>>>
>>>> Currently, default hugetlbe page size is set via arch_initcalls which happens
>>>> late in the boot as shown via the below callstack:
>>>>
>>>> [c000000007383b10] [c000000001289328] hugetlbpage_init+0x2b8/0x2f8
>>>> [c000000007383bc0] [c0000000012749e4] do_one_initcall+0x14c/0x320
>>>> [c000000007383c90] [c00000000127505c] kernel_init_freeable+0x410/0x4e8
>>>> [c000000007383da0] [c000000000012664] kernel_init+0x30/0x15c
>>>> [c000000007383e10] [c00000000000cf14] ret_from_kernel_thread+0x5c/0x64
>>>>
>>>> and the pageblock_order initialization is done early during the boot.
>>>>
>>>> [c0000000018bfc80] [c0000000012ae120] set_pageblock_order+0x50/0x64
>>>> [c0000000018bfca0] [c0000000012b3d94] sparse_init+0x188/0x268
>>>> [c0000000018bfd60] [c000000001288bfc] initmem_init+0x28c/0x328
>>>> [c0000000018bfe50] [c00000000127b370] setup_arch+0x410/0x480
>>>> [c0000000018bfed0] [c00000000127401c] start_kernel+0xb8/0x934
>>>> [c0000000018bff90] [c00000000000d984] start_here_common+0x1c/0x98
>>>>
>>>> delaying default hugetlb page size initialization implies the kernel will
>>>> initialize pageblock_order to (MAX_ORDER - 1) which is not an optimal
>>>> value for mobility grouping. IIUC we always had this issue.
>>>> But it was not
>>>> a problem for hash translation mode because (MAX_ORDER - 1) is the same as
>>>> HUGETLB_PAGE_ORDER (8) in the case of hash (16MB). With radix,
>>>> HUGETLB_PAGE_ORDER will be 5 (2M size) and hence pageblock_order should be
>>>> 5 instead of 8.
>>>
>>>
>>> A related question: Can we on ppc still have pageblock_order > MAX_ORDER
>>> - 1? We have some code for that and I am not so sure if we really need that.
>>>
>>
>> I also have been wondering about the same. On book3s64 I don't think we
>> need that support for both 64K and 4K page size because with hash
>> hugetlb size is MAX_ORDER -1. (16MB hugepage size)
>>
>> I am not sure about the 256K page support. Christophe may be able to
>> answer that.
>>
>> For the gigantic hugepage support we depend on cma based allocation or
>> firmware reservation. So I am not sure why we ever considered pageblock
>> > MAX_ORDER -1 scenario. If you have pointers w.r.t why that was ever
>> needed, I could double-check whether ppc64 is still dependent on that.
>
> commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5
> Author: Michal Nazarewicz <mina86@xxxxxxxxxx>
> Date:   Wed Jul 2 15:22:35 2014 -0700
>
>     mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER
>
> indicates that at least arm64 used to have cases for that as well.
>
> However, nowadays with ARM64_64K_PAGES we have FORCE_MAX_ZONEORDER=14 as
> default, corresponding to 512MiB.
>
> So I'm not sure if this is something worth supporting. If you want
> somewhat reliable gigantic pages, use CMA or preallocate them during boot.
>
> --
> Thanks,
>
> David / dhildenb

I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
We need to disable THP for such a kernel to boot, because THP checks
for PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
platform, but then runtime gigantic page allocation
(gigantic_page_runtime_supported) is not supported on such a config with
hash translation.
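
For reference, the order arithmetic discussed above can be checked with a
small script. This is only a sketch: it assumes 64K base pages
(PAGE_SHIFT = 16), takes the huge page order to be HPAGE_SHIFT - PAGE_SHIFT,
and takes the largest buddy block to be 2^(MAX_ORDER - 1) base pages. The
helper names are made up for illustration and are not kernel functions.

```python
# Illustrative arithmetic only; helper names are hypothetical, not kernel APIs.
# Assumption: 64K base pages, i.e. PAGE_SHIFT = 16.
PAGE_SHIFT_64K = 16


def huge_page_order(hpage_shift: int, page_shift: int) -> int:
    """Order of a huge page in base pages: HPAGE_SHIFT - PAGE_SHIFT."""
    return hpage_shift - page_shift


def max_buddy_block_bytes(max_order: int, page_shift: int) -> int:
    """Largest buddy allocation: 2^(MAX_ORDER - 1) base pages, in bytes."""
    return (1 << (max_order - 1)) << page_shift


# Hash translation: 16MB huge pages (shift 24) give order 8.
hash_order = huge_page_order(24, PAGE_SHIFT_64K)

# Radix translation: 2MB huge pages (shift 21) give order 5.
radix_order = huge_page_order(21, PAGE_SHIFT_64K)

# arm64 with 64K pages and FORCE_MAX_ZONEORDER=14: 512MiB blocks.
arm64_block = max_buddy_block_bytes(14, PAGE_SHIFT_64K)

# The test kernel above: FORCE_MAX_ZONEORDER=8 means MAX_ORDER - 1 = 7,
# so a pageblock_order of 8 exceeds it.
test_max_order_minus_1 = 8 - 1
pageblock_exceeds_max = hash_order > test_max_order_minus_1

print(hash_order, radix_order, arm64_block >> 20, pageblock_exceeds_max)
```

With these assumptions the hash case lands on order 8 and the radix case on
order 5, matching the values quoted above.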

On a non-virtualized platform I am hitting crashes like the one below during boot.

[ 47.637865][ C42] =============================================================================
[ 47.637907][ C42] BUG pgtable-2^11 (Not tainted): Object already free
[ 47.637925][ C42] -----------------------------------------------------------------------------
[ 47.637925][ C42]
[ 47.637945][ C42] Allocated in __pud_alloc+0x84/0x2a0 age=278 cpu=40 pid=1409
[ 47.637974][ C42] __slab_alloc.isra.0+0x40/0x60
[ 47.637995][ C42] kmem_cache_alloc+0x1a8/0x510
[ 47.638010][ C42] __pud_alloc+0x84/0x2a0
[ 47.638024][ C42] copy_page_range+0x38c/0x1b90
[ 47.638040][ C42] dup_mm+0x548/0x880
[ 47.638058][ C42] copy_process+0xdc0/0x1e90
[ 47.638076][ C42] kernel_clone+0xd4/0x9d0
[ 47.638094][ C42] __do_sys_clone+0x88/0xe0
[ 47.638112][ C42] system_call_exception+0x368/0x3a0
[ 47.638128][ C42] system_call_common+0xec/0x250
[ 47.638147][ C42] Freed in __tlb_remove_table+0x1d4/0x200 age=263 cpu=57 pid=326
[ 47.638172][ C42] kmem_cache_free+0x44c/0x680
[ 47.638187][ C42] __tlb_remove_table+0x1d4/0x200
[ 47.638204][ C42] tlb_remove_table_rcu+0x54/0xa0
[ 47.638222][ C42] rcu_core+0xdd4/0x15d0
[ 47.638239][ C42] __do_softirq+0x360/0x69c
[ 47.638257][ C42] run_ksoftirqd+0x54/0xc0
[ 47.638273][ C42] smpboot_thread_fn+0x28c/0x2f0
[ 47.638290][ C42] kthread+0x1a4/0x1b0
[ 47.638305][ C42] ret_from_kernel_thread+0x5c/0x64
[ 47.638320][ C42] Slab 0xc00c00000000d600 objects=10 used=9 fp=0xc0000000035a8000 flags=0x7ffff000010201(locked|slab|head|node=0|zone=0|lastcpupid=0x7ffff)
[ 47.638352][ C42] Object 0xc0000000035a8000 @offset=163840 fp=0x0000000000000000
[ 47.638352][ C42]
[ 47.638373][ C42] Redzone c0000000035a4000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638394][ C42] Redzone c0000000035a4010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638414][ C42] Redzone c0000000035a4020: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638435][ C42] Redzone c0000000035a4030: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638455][ C42] Redzone c0000000035a4040: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638474][ C42] Redzone c0000000035a4050: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638494][ C42] Redzone c0000000035a4060: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638514][ C42] Redzone c0000000035a4070: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 47.638534][ C42] Redzone c0000000035a4080: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................