Re: [PATCH v6 0/9] variable-order, large folios for anonymous memory

David Hildenbrand <david@xxxxxxxxxx> · Fri, 6 Oct 2023 22:06:21 +0200

On 29.09.23 13:44, Ryan Roberts wrote:
> Hi All,

Let me highlight some core decisions on the things discussed in the 
previous alignment meetings, and comment on them.

> This is v6 of a series to implement variable order, large folios for anonymous
> memory. (previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO",
> "FLEXIBLE_THP", but now exposed as an extension to THP; "small-order THP"). The
> objective of this is to improve performance by allocating larger chunks of
> memory during anonymous page faults:

Change 1: Let's call these things THP.

Fine with me; I previously rooted for that but was told that end users 
could be confused. I think the important bit is that we don't mess up 
the stats, and when we talk about THP we default to "PMD-sized THP", 
unless we explicitly include the other ones.

I dislike exposing "orders" to the users, but I'm happy to be convinced 
that I am wrong and that it is a good idea.

So maybe "Small THP"/"Small-sized THP" is better. Or "Medium-sized THP" 
-- as said, I think FreeBSD tends to call it "Medium-sized superpages". 
But what's small/medium/large is debatable. "Small" implies at least 
that it's smaller than what we used to know, which is a fact.

Can we also now use the terminology consistently? (e.g., 
"variable-order, large folios for anonymous memory" -> "Small-sized 
anonymous THP", you can just point at the previous patch set name in the 
cover letter)

> 1) Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, reduced lru list, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
> 2) Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch).

> The major change in this revision is the addition of sysfs controls to allow
> this "small-order THP" to be enabled/disabled/configured independently of
> PMD-order THP. The approach I've taken differs a bit from previous discussions;
> instead of creating a whole new interface ("large_folio"), I'm extending THP. I
> personally think this makes things clearer and more extensible. See [6] for
> detailed rationale.

Change 2: sysfs interface.

If we call it THP, it shall go under 
"/sys/kernel/mm/transparent_hugepage/", I agree.

What we expose there, and how, is TBD. Again, I'm not a friend of "orders" 
and bitmaps at all. We can do better if we want to go down that path.

Maybe we should take a look at hugetlb, and how they added support for 
multiple sizes. What *might* make sense could be (depending on which 
values we actually support!)

/sys/kernel/mm/transparent_hugepage/hugepages-64kB/
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/

Each one would contain an "enabled" and "defrag" file. We want something 
minimal first? Start with the "enabled" option.
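
To make the naming concrete, here is a rough sketch of how such directory 
names could be derived from folio orders. It assumes 4 KiB base pages 
(other base page sizes give other values), and the helper names are made 
up for illustration; none of this is an existing interface.

```python
# Hypothetical helpers, for illustration only.
# Assumption: 4 KiB base pages, so order 4 is 64 KiB and order 9 is 2 MiB.
BASE_PAGE_KIB = 4
THP_SYSFS_ROOT = "/sys/kernel/mm/transparent_hugepage"


def thp_dir_name(order):
    """Return the hugetlb-style directory name for a folio of 2**order pages."""
    size_kib = BASE_PAGE_KIB << order
    return f"hugepages-{size_kib}kB"


def thp_dirs(min_order=4, pmd_order=9):
    """List the per-size directories from min_order up to the PMD order."""
    return [
        f"{THP_SYSFS_ROOT}/{thp_dir_name(order)}/"
        for order in range(min_order, pmd_order + 1)
    ]


if __name__ == "__main__":
    for path in thp_dirs():
        print(path)
```

The point is that users only ever see sizes; the order stays an internal 
detail.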

enabled: always [global] madvise never

Initially, we would set it for PMD-sized THP to "global" and for 
everything else to "never".

That sounds reasonable at least to me, and we would be using what we 
learned from THP (as John suggested).  That still gives reasonable 
flexibility without going too wild, and IMHO a better interface.
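
As a minimal sketch of how I'd expect user space to read such a file: the 
brackets mark the selected value, as with the existing top-level "enabled" 
file. That "global" simply defers to the top-level setting is my assumption 
about the semantics, and the function names are hypothetical.

```python
import re


def parse_enabled(text):
    """Return (selected, options) from e.g. 'always [global] madvise never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no selected value in {text!r}")
    options = [token.strip("[]") for token in text.split()]
    return match.group(1), options


def effective_policy(per_size, global_setting):
    """Resolve a per-size setting.

    Assumption: 'global' means "inherit the top-level enabled setting";
    any other value overrides it for this size only.
    """
    if per_size == "global":
        return global_setting
    return per_size


if __name__ == "__main__":
    selected, _ = parse_enabled("always [global] madvise never")
    print(effective_policy(selected, "madvise"))
```

With the proposed defaults, PMD-sized THP keeps following the top-level 
setting, and every other size stays off until someone enables it.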

I understand Yu's point about ABI discussions and "0 knobs". I'm happy 
as long as we can have something that won't hurt us later and still be 
able to use this in distributions within a reasonable timeframe. 
Enabling/disabling individual sizes does not sound too restrictive to 
me. And we could always add an "auto" setting later and default to that 
with a new kconfig knob.

If someone wants to configure it, why not. Let's just prepare a way 
to handle this "better" automatically in the future (if ever ...).

Change 3: Stats

> /proc/meminfo:
>   Introduce new "AnonHugePteMap" field, which reports the amount of
>   memory (in KiB) mapped from large folios globally (similar to
>   AnonHugePages field).

AnonHugePages is and remains "PMD-sized THP that is mapped using a PMD", 
I think we all agree on that. It should have been named "AnonPmdMapped" 
or "AnonHugePmdMapped", too bad, we can't change that.

"AnonHugePteMap" better be "AnonHugePteMapped".

But, I wonder if we want to expose this "PteMapped" to user space *at 
all*. Why should they care if it's PTE mapped? For PMD-sized THP it 
makes a bit of sense, because !PMD implied !performance, and one might 
have been able to troubleshoot that somehow. For PTE-mapped, it doesn't 
make much sense really, they are always PTE-mapped.

That also raises the question of how you would account a PTE-mapped THP. 
The whole thing? Only the parts that are mapped? Let's better not go down 
that path.
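
To illustrate why this gets ugly, here is a small sketch of the two 
accounting options for a partially mapped folio. It assumes 4 KiB base 
pages; the function names are hypothetical and do not correspond to 
anything in the series.

```python
BASE_PAGE_KIB = 4  # assumption: 4 KiB base pages


def account_whole_kib(order, mapped_pages):
    """Option A: count the entire folio as soon as any page of it is mapped."""
    return (BASE_PAGE_KIB << order) if mapped_pages > 0 else 0


def account_mapped_kib(order, mapped_pages):
    """Option B: count only the pages that are currently PTE-mapped."""
    if not 0 <= mapped_pages <= (1 << order):
        raise ValueError("mapped_pages out of range for this order")
    return BASE_PAGE_KIB * mapped_pages


if __name__ == "__main__":
    # A 64 KiB folio (order 4) with only 4 of its 16 pages still mapped.
    print(account_whole_kib(4, 4), account_mapped_kib(4, 4))
```

The two options report different numbers for the same folio, and neither 
tells the user anything obviously actionable.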

That leaves the question why we would want to include them here at all 
in a special PTE-mapped way?

Again, let's look at hugetlb: I prepared one 1 GiB page and one 2 MiB page.

HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         1050624 kB

-> Only the last one gives the sum, the other stats don't even mention 
the other ones. [how do we get their stats, if at all?]
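
The arithmetic behind the "Hugetlb" line, as a quick sketch (the helper 
name is made up):

```python
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024


def hugetlb_total_kib(pools):
    """Sum over all hugetlb sizes; pools is a list of (page_size_kib, nr_pages)."""
    return sum(size_kib * nr_pages for size_kib, nr_pages in pools)


# One 1 GiB page plus one 2 MiB page, as in the output above.
pools = [(1 * KIB_PER_GIB, 1), (2 * KIB_PER_MIB, 1)]
total_kib = hugetlb_total_kib(pools)

if __name__ == "__main__":
    print(f"Hugetlb: {total_kib} kB")
```

That is 1048576 kB + 2048 kB = 1050624 kB, whereas the HugePages_* lines 
only cover the default 2048 kB size.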

So maybe, we only want a summary of how many anon huge pages of any size 
are allocated (independent of the PTE vs. PMD mapping), and some other 
source to eventually inspect how the different sizes behave.

But note that for non-PMD-sized file THP we don't even have special 
counters! ... so maybe we should also defer any such stats and come up 
with something uniform for all types of non-PMD-sized THP.

Same discussion applies to all other stats.

> Because we now have runtime enable/disable control, I've removed the compile
> time Kconfig switch. It still defaults to runtime-disabled.

> NOTE: These changes should not be merged until the prerequisites are complete.
> These are in progress and tracked at [7].

We should probably list them here, and classify which ones we see as 
a strict requirement and which ones might be an optimization.

Now, these are just my thoughts, and I'm happy to hear other thoughts.

--
Cheers,

David / dhildenb