Re: mm: CMA reservations require 32MiB alignment in 16KiB page size kernels instead of 8MiB in 4KiB page size kernel.

David Hildenbrand <david@xxxxxxxxxx> · Wed, 22 Jan 2025 09:04:21 +0100

On 22.01.25 07:52, Barry Song wrote:
On Wed, Jan 22, 2025 at 5:06 PM Juan Yescas <jyescas@xxxxxxxxxx> wrote:

On Tue, Jan 21, 2025 at 6:24 PM Zi Yan <ziy@xxxxxxxxxx> wrote:

On Tue Jan 21, 2025 at 9:08 PM EST, Juan Yescas wrote:
On Mon, Jan 20, 2025 at 9:59 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 20.01.25 16:29, Zi Yan wrote:
On Mon Jan 20, 2025 at 3:14 AM EST, David Hildenbrand wrote:
On 20.01.25 01:39, Zi Yan wrote:
On Sun Jan 19, 2025 at 6:55 PM EST, Barry Song wrote:
<snip>

However, with this workaround, we can't use transparent huge pages.

Is the CMA_MIN_ALIGNMENT_BYTES alignment requirement only there to support huge pages?
No. CMA_MIN_ALIGNMENT_BYTES is limited by CMA_MIN_ALIGNMENT_PAGES, which
is equal to pageblock size. Enabling THP just bumps the pageblock size.

Thanks, I can see the initialization in include/linux/pageblock-flags.h

#define pageblock_order MIN_T(unsigned int, HUGETLB_PAGE_ORDER, MAX_PAGE_ORDER)

Currently, THP might be mTHP, which can have a significantly smaller
size than 32MB. For example, on arm64 systems with a 16KiB page size,
a 2MB CONT-PTE mTHP is possible. Additionally, mTHP relies on the
CONFIG_TRANSPARENT_HUGEPAGE configuration.

I wonder if it's possible to enable CONFIG_TRANSPARENT_HUGEPAGE without
necessarily using 32MiB THP. If we use other sizes, such as 64KiB,
perhaps a large pageblock size wouldn't be necessary?

Do you mean with mTHP? We haven't explored that option.

Yes. Unless your applications have special demands for PMD THPs. 2MB
mTHP should work.

I think this should work by reducing MAX_PAGE_ORDER like Juan did for
the experiment. But MAX_PAGE_ORDER is a macro right now, Kconfig needs
to be changed and kernel needs to be recompiled. Not sure if it is OK
for Juan's use case.

The main goal is to reserve only the necessary CMA memory for the
drivers, which is usually the same for 4KiB and 16KiB page size kernels.

Got it. Based on your experiment, you changed MAX_PAGE_ORDER to get the
minimal CMA alignment size. Can you deploy that kernel to production?

We can't deploy that because many Android partners are using PMD THP instead
of mTHP.

If yes, you can use mTHP instead of PMD THP and still get the CMA
alignment you want.

IIRC, we set pageblock size == THP size because this is the granularity
we want to optimize defragmentation for. ("try keep pageblock
granularity of the same memory type: movable vs. unmovable")

Right. In the past, it was optimized for PMD THP. Now we have mTHP. If
the user does not care about PMD THP (32MB in the ARM64 16KB base page
case) and mTHP (2MB mTHP here) is good enough, reducing pageblock size
works.

However, the buddy already supports having different pagetypes for large
allocations.

Right. To be clear, only MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, and
MIGRATE_MOVABLE can be merged.

Yes! And a THP cannot span partial MIGRATE_CMA, which would be fine.

So we could leave MAX_ORDER alone and try adjusting the pageblock size
in these setups. pageblock size is already variable on some
architectures IIRC.

Which values would work for the CMA_MIN_ALIGNMENT_BYTES macro? In the
16KiB page size kernel, I tried these 2 configurations:

#define CMA_MIN_ALIGNMENT_BYTES (2048 * CMA_MIN_ALIGNMENT_PAGES)

and

#define CMA_MIN_ALIGNMENT_BYTES (4096 * CMA_MIN_ALIGNMENT_PAGES)

with both of them, the kernel failed to boot.

CMA_MIN_ALIGNMENT_BYTES needs to be PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES.
So you need to adjust CMA_MIN_ALIGNMENT_PAGES, which is set by pageblock
size. pageblock size is determined by pageblock order, which is
affected by MAX_PAGE_ORDER.
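
The dependency chain just described can be sketched in plain C. This is a
user-space illustration only, not kernel code: the sketch_* function names
are hypothetical, and the orders in the usage note are assumptions taken
from the sizes mentioned in this thread (order 11 for a 32MiB PMD THP with
16KiB pages, order 7 for 2MiB).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: mirrors the MIN_T() in the pageblock_order
 * definition quoted earlier in the thread.
 */
static unsigned int sketch_pageblock_order(unsigned int hugetlb_page_order,
                                           unsigned int max_page_order)
{
    return hugetlb_page_order < max_page_order ? hugetlb_page_order
                                               : max_page_order;
}

/*
 * Hypothetical helper: PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES, where
 * CMA_MIN_ALIGNMENT_PAGES is taken to be the pageblock size in pages.
 */
static uint64_t sketch_cma_min_alignment_bytes(unsigned int page_shift,
                                               unsigned int hugetlb_page_order,
                                               unsigned int max_page_order)
{
    unsigned int order = sketch_pageblock_order(hugetlb_page_order,
                                                max_page_order);

    /* Guard against shifting past the width of a 64-bit value. */
    assert(page_shift + order < 64);

    uint64_t page_size = UINT64_C(1) << page_shift;
    uint64_t pages_per_block = UINT64_C(1) << order;

    return page_size * pages_per_block;
}
```

With a 16KiB page (page_shift 14), passing order 11 for both arguments
gives 32MiB, while capping the second order at 7 gives 2MiB. Multiplying
by a constant other than PAGE_SIZE, as in the two attempts above, breaks
the relationship between the byte value and the page count.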

Making pageblock size a boot time variable? We might want to warn
sysadmin/user that >pageblock_order THP/mTHP creation will suffer.

Yes, some way to configure it.

We'd only have to check if all of the THP logic can deal with pageblock
size < THP size.

THP was disabled in my experiment because this assertion failed:

mm/huge_memory.c
/*
 * hugepages can't be allocated by the buddy allocator
 */
MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER > MAX_PAGE_ORDER);

when

     config ARCH_FORCE_MAX_ORDER
         int
         .....
         default "8" if ARM64_16K_PAGES

You can remove that BUILD_BUG_ON and turn on mTHP and see if mTHP works.

We'll do that and post the results.

Probably yes, pageblock should be independent of THP logic, although
compaction (used to create THPs) logic is based on pageblock.

Right. As raised in the past, we need a higher level mechanism that
tries to group pageblocks together during compaction/conversion to limit
fragmentation on a higher level.

I assume that many use cases would be fine with not using 32MB/512MB
THPs at all for now -- and instead using 2 MB ones. Of course, for very
large installations it might be different.

This issue is even more severe on arm64 with 64k (pageblock = 512MiB).

I agree, and if ARCH_FORCE_MAX_ORDER is configured to the max value we get:

PAGE_SIZE | max MAX_PAGE_ORDER | CMA_MIN_ALIGNMENT_BYTES
4KiB      | 15                 | 4KiB * 32KiB = 128MiB
16KiB     | 13                 | 16KiB * 8KiB = 128MiB
64KiB     | 13                 | 64KiB * 8KiB = 512MiB
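
The arithmetic behind these figures is the base page size multiplied by
2^order pages. A minimal sketch in user-space C, assuming the orders
listed for each page size; sketch_block_size_mib is a hypothetical name,
not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size in MiB of one block of the given order for a
 * base page of page_size_kib KiB, i.e. page size * 2^order pages.
 */
static uint64_t sketch_block_size_mib(uint64_t page_size_kib,
                                      unsigned int order)
{
    /* Keep the shift well inside the 64-bit range. */
    assert(order < 32 && page_size_kib < (UINT64_C(1) << 32));

    uint64_t size_kib = page_size_kib << order;

    return size_kib / 1024;
}
```

Calling it with (4, 15), (16, 13) and (64, 13) yields 128, 128 and 512
MiB respectively.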

This is also good for virtio-mem, since the offline memory block size
can also be reduced. I remember you complained about it before.

Yes, yes, yes! :)

David's proposal should work in general, but might take a non-trivial
amount of work:

1. keep pageblock size always at 4MB for all arch.
2. adjust existing pageblock users, like compaction, to work on a
different range, independent of pageblock.
     a. for anti-fragmentation mechanism, multiple pageblocks might have
     different migratetypes but would be compacted to generate huge
     pages, but how to align their migratetypes is TBD.
3. handle other corner cases.

The final question is that Barry mentioned that over-reserved CMA areas
can be used for movable page allocations. Why does it not work for you?

I need to run more experiments to see what type of page allocations in
the system is the dominant one (unmovable or movable). If it is movable,
over-reserved CMA areas should be fine.

My understanding is that over-reserving 28MiB is unlikely to cause
noticeable regression, given that we frequently handle allocations like
GFP_HIGHUSER_MOVABLE or similar, which are significantly larger
than 28MiB. However, David also mentioned a reservation of 512MiB
for a 64KiB page size. In that case, 512MiB might be large enough to
potentially impact the balance between movable and unmovable
allocations. For instance, if we still have 512MiB reserved in CMA
but are allocating unmovable folios (for example dma-buf), we could
fail an allocation even when there’s actually capacity. So, in any case,
there is still work to be done here.

By the way, is 512MiB truly a reasonable size for THP? 

No, it's absolutely stupid for most setups.

Just think of a small VM with 4 GiB: great you have 8 pageblocks and 
probably never get a single THP.

--
Cheers,

David / dhildenb