On 05.08.21 21:02, Zi Yan wrote:
From: Zi Yan <ziy@xxxxxxxxxx>

Hi all,

This patchset adds support for a kernel boot time adjustable MAX_ORDER, so that users can change the largest size of pages obtained from the buddy allocator. It also removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that the buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.

Motivation
===

This enables the kernel to allocate 1GB pages and is necessary for my ongoing work on adding support for 1GB PUD THP[1]. This is also the conclusion I reached after some discussion with David Hildenbrand on what methods should be used for allocating gigantic pages[2], since other approaches like using the CMA allocator or alloc_contig_pages() are regarded as suboptimal.

This also avoids having to increase SECTION_SIZE_BITS when increasing MAX_ORDER. Increasing SECTION_SIZE_BITS is not desirable, because the memory hotadd/hotremove chunk size would increase as well, making memory management harder for VMs.

In addition, making MAX_ORDER a kernel boot time parameter lets users adjust the buddy allocator for their own needs without recompiling the kernel, so that one can still have a small MAX_ORDER if they do not need to allocate gigantic pages like 1GB PUD THPs.

Background
===

At the moment, the kernel imposes the MAX_ORDER - 1 + PAGE_SHIFT < SECTION_SIZE_BITS restriction. This prevents the buddy allocator from merging pages across memory sections, as PFNs might not be contiguous and code like page++ would fail. But this is not an issue when SPARSEMEM_VMEMMAP is set, since all struct pages are virtually contiguous. In addition, as long as the buddy allocator checks PFN validity during buddy page merging (done in Patch 3), pages allocated from the buddy allocator can be manipulated by code like page++.

Description
===

I tested the patchset on both x86_64 and ARM64 at 4KB, 16KB, and 64KB base pages. The systems booted and the ltp mm test suite finished without issue.
Also, memory hotplug worked on x86_64 when I tested it. The patchset definitely needs more tests and reviews for other architectures.

Regarding the concern about performance degradation if MAX_ORDER is increased, I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on x86_64 machines and saw no performance difference[3].

Patch 1 excludes the MAX_ORDER check from 32bit vdso compilation. The check uses the irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The exclusion does not break the check in a 32bit kernel, since the check will still be performed when other kernel components are compiled.

Patch 2 gives FORCE_MAX_ZONEORDER a better name.

Patch 3 restores the pfn_valid_within() check for when the buddy allocator can merge pages across memory sections. The check was removed when ARM64 got rid of holes in zones, but holes can appear in zones again after this patchset.

Patch 4-11 convert uses of MAX_ORDER to SECTION_SIZE_BITS or its derivative constants, since these code places use MAX_ORDER as a boundary check for physically contiguous pages, where SECTION_SIZE_BITS should be used. Once MAX_ORDER can go beyond SECTION_SIZE_BITS after this patchset, such code can break. I separated the changes into different patches for easy review and can merge them into a single one if that works better.

Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.

Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic ones where possible and prepares for making MAX_ORDER a boot time parameter.

Patch 14 adds a new MIN_MAX_ORDER constant to replace the soon-to-be-dynamic MAX_ORDER in places where converting a static array to a dynamic one is a hassle and not necessary, i.e., ARM64 hypervisor page allocation and SLAB.

Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.

Any suggestion and/or comment is welcome. Thanks.

TODO
===

1. Redo the performance comparison tests using this patchset to understand the performance implication of changing MAX_ORDER.
2. Make alloc_contig_range() cleanly deal with pageblock_order instead of MAX_ORDER - 1, so as not to force the minimal CMA area size/alignment to be, e.g., 1 GiB instead of 4 MiB, and to keep virtio-mem working as expected.
For virtio-mem, the short-term solution would mean disallowing initialization when an incompatible setup (MAX_ORDER_NR_PAGES > SECTION_NR_PAGES) is detected and bailing out loudly, telling the admin to fix that on the command line.

I have optimizing alloc_contig_range() on my todo list, to get rid of the MAX_ORDER - 1 dependency in virtio-mem, but I have no idea when I'll have time to work on that.
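To make the numbers concrete, here is a rough user-space sketch of the comparison mentioned above. It is not virtio-mem or patchset code: all `sketch_*` names are hypothetical, and the constants assume x86_64 with 4 KiB base pages and 128 MiB memory sections.

```c
#include <stdbool.h>

/* Assumed x86_64 values for illustration only; not taken from the patchset. */
#define SKETCH_PAGE_SHIFT        12  /* 4 KiB base pages */
#define SKETCH_SECTION_SIZE_BITS 27  /* 128 MiB memory sections */

/* Pages in the largest buddy block; the largest order is max_order - 1. */
static unsigned long sketch_max_order_nr_pages(unsigned int max_order)
{
	return 1UL << (max_order - 1);
}

/* Pages in one memory section. */
static unsigned long sketch_section_nr_pages(void)
{
	return 1UL << (SKETCH_SECTION_SIZE_BITS - SKETCH_PAGE_SHIFT);
}

/* Size in bytes of the largest buddy block. */
static unsigned long long sketch_max_block_bytes(unsigned int max_order)
{
	return (unsigned long long)sketch_max_order_nr_pages(max_order)
		<< SKETCH_PAGE_SHIFT;
}

/*
 * Hypothetical check: the setup counts as compatible only if the largest
 * buddy block does not span more than one memory section.
 */
static bool sketch_setup_compatible(unsigned int max_order)
{
	return sketch_max_order_nr_pages(max_order) <= sketch_section_nr_pages();
}
```

Under these assumptions, MAX_ORDER=11 gives 4 MiB blocks and passes the check, while MAX_ORDER=19 gives 1 GiB blocks, which exceed a 128 MiB section and would be rejected.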
--
Thanks,

David / dhildenb