On Wed Jan 22, 2025 at 3:11 AM EST, David Hildenbrand wrote:
> On 22.01.25 03:24, Zi Yan wrote:
>> On Tue Jan 21, 2025 at 9:08 PM EST, Juan Yescas wrote:
>>> On Mon, Jan 20, 2025 at 9:59 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>>
>>>> On 20.01.25 16:29, Zi Yan wrote:
>>>>> On Mon Jan 20, 2025 at 3:14 AM EST, David Hildenbrand wrote:
>>>>>> On 20.01.25 01:39, Zi Yan wrote:
>>>>>>> On Sun Jan 19, 2025 at 6:55 PM EST, Barry Song wrote:
>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> However, with this workaround, we can't use transparent huge pages.
>>>>>>>>>>>>
>>>>>>>>>>>> Is the CMA_MIN_ALIGNMENT_BYTES requirement alignment only to support huge pages?
>>>>>>>>> No. CMA_MIN_ALIGNMENT_BYTES is limited by CMA_MIN_ALIGNMENT_PAGES, which
>>>>>>>>> is equal to pageblock size. Enabling THP just bumps the pageblock size.
>>>>>>>>
>>>
>>> Thanks, I can see the initialization in include/linux/pageblock-flags.h
>>>
>>> #define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
>>>
>>>>>>>> Currently, THP might be mTHP, which can have a significantly smaller
>>>>>>>> size than 32MB. For example, on arm64 systems with a 16KiB page size,
>>>>>>>> a 2MB CONT-PTE mTHP is possible.
>>>>>>>> Additionally, mTHP relies on the CONFIG_TRANSPARENT_HUGEPAGE configuration.
>>>>>>>>
>>>>>>>> I wonder if it's possible to enable CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>> without necessarily using 32MiB THP. If we use other sizes, such as
>>>>>>>> 64KiB, perhaps a large pageblock size wouldn't be necessary?
>>>
>>> Do you mean with mTHP? We haven't explored that option.
>>
>> Yes. Unless your applications have special demands for PMD THPs. 2MB
>> mTHP should work.
>>
>>>
>>>>>>>
>>>>>>> I think this should work by reducing MAX_PAGE_ORDER like Juan did for
>>>>>>> the experiment. But MAX_PAGE_ORDER is a macro right now, Kconfig needs
>>>>>>> to be changed and kernel needs to be recompiled. Not sure if it is OK
>>>>>>> for Juan's use case.
>>>>>>
>>>
>>> The main goal is to reserve only the necessary CMA memory for the
>>> drivers, which is usually the same for 4KiB and 16KiB page size kernels.
>>
>> Got it. Based on your experiment, you changed MAX_PAGE_ORDER to get the
>> minimal CMA alignment size. Can you deploy that kernel to production?
>> If yes, you can use mTHP instead of PMD THP and still get the CMA
>> alignment you want.
>>
>>>
>>>>>>
>>>>>> IIRC, we set pageblock size == THP size because this is the granularity
>>>>>> we want to optimize defragmentation for. ("try keep pageblock
>>>>>> granularity of the same memory type: movable vs. unmovable")
>>>>>
>>>>> Right. In the past, it was optimized for PMD THP. Now we have mTHP. If
>>>>> the user does not care about PMD THP (32MB in ARM64 16KB base page case)
>>>>> and mTHP (2MB mTHP here) is good enough, reducing pageblock size works.
>>>>>
>>>>>>
>>>>>> However, the buddy already supports having different pagetypes for large
>>>>>> allocations.
>>>>>
>>>>> Right. To be clear, only MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, and
>>>>> MIGRATE_MOVABLE can be merged.
>>>>
>>>> Yes! And a THP cannot span partial MIGRATE_CMA, which would be fine.
>>>>
>>>>>
>>>>>>
>>>>>> So we could leave MAX_ORDER alone and try adjusting the pageblock size
>>>>>> in these setups. pageblock size is already variable on some
>>>>>> architectures IIRC.
>>>>>
>>>
>>> Which values would work for the CMA_MIN_ALIGNMENT_BYTES macro? In the
>>> 16KiB page size kernel, I tried these 2 configurations:
>>>
>>> #define CMA_MIN_ALIGNMENT_BYTES (2048 * CMA_MIN_ALIGNMENT_PAGES)
>>>
>>> and
>>>
>>> #define CMA_MIN_ALIGNMENT_BYTES (4096 * CMA_MIN_ALIGNMENT_PAGES)
>>>
>>> with both of them, the kernel failed to boot.
>>
>> CMA_MIN_ALIGNMENT_BYTES needs to be PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES.
>> So you need to adjust CMA_MIN_ALIGNMENT_PAGES, which is set by pageblock
>> size. pageblock size is determined by pageblock order, which is
>> affected by MAX_PAGE_ORDER.
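
A rough sketch of the dependency chain described above, written as
standalone userspace C rather than kernel code. The helper names
pageblock_order_for() and cma_min_alignment_bytes() are hypothetical and
exist only for illustration; in the kernel these values are compile-time
macros, and this sketch assumes the non-variable pageblock_order
definition quoted earlier in the thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function. It mirrors the quoted
 *   #define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)
 * with both orders passed in as plain parameters.
 */
static inline unsigned int pageblock_order_for(unsigned int hugetlb_page_order,
                                               unsigned int max_page_order)
{
	return hugetlb_page_order < max_page_order ? hugetlb_page_order
	                                            : max_page_order;
}

/*
 * Hypothetical helper, not a kernel function. It computes
 *   PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES
 * where CMA_MIN_ALIGNMENT_PAGES is the number of pages in one pageblock.
 */
static inline uint64_t cma_min_alignment_bytes(unsigned int page_shift,
                                               unsigned int hugetlb_page_order,
                                               unsigned int max_page_order)
{
	uint64_t page_size = UINT64_C(1) << page_shift;
	uint64_t pageblock_pages =
		UINT64_C(1) << pageblock_order_for(hugetlb_page_order, max_page_order);

	return page_size * pageblock_pages;
}
```

Assuming 16KiB pages (page_shift 14) and a PMD order of 11, which matches
the 32MB PMD THP mentioned in this thread, cma_min_alignment_bytes(14, 11, 11)
evaluates to 32MiB, while capping MAX_PAGE_ORDER at 7 gives
cma_min_alignment_bytes(14, 11, 7) = 2MiB.
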
>
> Yes, most importantly we must not exceed MAX_PAGE_ORDER. Going smaller
> is the common case.
>
>>
>>>
>>>>> Making pageblock size a boot time variable? We might want to warn
>>>>> sysadmin/user that >pageblock_order THP/mTHP creation will suffer.
>>>>
>>>> Yes, some way to configure it.
>>>>
>>>>>
>>>>>>
>>>>>> We'd only have to check if all of the THP logic can deal with pageblock
>>>>>> size < THP size.
>>>>>
>>>
>>> The reason that THP was disabled in my experiment is that this
>>> assertion failed
>>>
>>> mm/huge_memory.c
>>>         /*
>>>          * hugepages can't be allocated by the buddy allocator
>>>          */
>>>         MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_PAGE_ORDER);
>>>
>>> when
>>>
>>> config ARCH_FORCE_MAX_ORDER
>>>         int
>>>         .....
>>>         default "8" if ARM64_16K_PAGES
>>>
>>
>> You can remove that BUILD_BUG_ON and turn on mTHP and see if mTHP works.
>>
>>>
>>>>> Probably yes, pageblock should be independent of THP logic, although
>>>>> compaction (used to create THPs) logic is based on pageblock.
>>>>
>>>> Right. As raised in the past, we need a higher level mechanism that
>>>> tries to group pageblocks together during compaction/conversion to limit
>>>> fragmentation on a higher level.
>>>>
>>>> I assume that many use cases would be fine with not using 32MB/512MB
>>>> THPs at all for now -- and instead using 2 MB ones. Of course, for very
>>>> large installations it might be different.
>>>>
>>>>>>
>>>>>> This issue is even more severe on arm64 with 64k (pageblock = 512MiB).
>>>>>
>>>
>>> I agree, and if ARCH_FORCE_MAX_ORDER is configured to the max value we get:
>>>
>>> PAGE_SIZE | max MAX_PAGE_ORDER | CMA_MIN_ALIGNMENT_BYTES
>>> 4KiB      | 15                 | 4KiB * 32Ki = 128MiB
>>> 16KiB     | 13                 | 16KiB * 8Ki = 128MiB
>>> 64KiB     | 13                 | 64KiB * 8Ki = 512MiB
>>>
>>>>> This is also good for virtio-mem, since the offline memory block size
>>>>> can also be reduced. I remember you complained about it before.
>>>>
>>>> Yes, yes, yes!
>>>> :)
>>>>
>>
>> David's proposal should work in general, but might take a non-trivial
>> amount of work:
>>
>> 1. keep pageblock size always at 4MB for all arch.
>
> My proposal was to leave it unchanged for most archs, but allow for
> overriding it on aarch64 as a first step.

Got it. Makes sense.

>
> s390x is happy with 1MiB, x86 with 2MiB. It's aarch64 that does
> questionable things :)
>
> CONFIG_HUGETLB_PAGE_SIZE_VARIABLE already allows for variable
> pageblock_order. That whole code likely needs some love, but most of it
> should already be there.
>
>
> In the future, I could imagine just going for a smaller pageblock size
> on aarch64, and handling fragmentation avoidance for larger THPs (512
> MiB really is close to 1 GiB on x86) differently, not using pageblocks.

Right. That is what I meant by "...compaction, to work on a different
range, independent of pageblock". But based on Juan's reply[1], 32MB PMD
THP is still needed for the deployment. That means "the future" needs to
be done to fully satisfy Juan's needs. :)

--
Best Regards,
Yan, Zi