Re: [RFC 00/11] khugepaged: mTHP support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 21/01/25 3:18 pm, Ryan Roberts wrote:
On 20/01/2025 18:39, David Hildenbrand wrote:
On 20.01.25 17:27, Ryan Roberts wrote:
On 20/01/2025 13:56, David Hildenbrand wrote:
On 20.01.25 14:37, Ryan Roberts wrote:
On 20/01/2025 12:54, David Hildenbrand wrote:
I think the 1 problem that emerged during review of Dev's series, which we
don't
have a proper solution to yet, is the issue of "creep", where regions can be
collapsed to progressively higher orders through iterative scans. At each
collapse, the required thresholds (e.g. max_ptes_none) are met, and the
collapse
effectively adds more non-none ptes so the next scan will then collapse to
even
higher order. Does your solution suffer from this (theoretical/edge case)
issue?
If not, how did you solve?

Yes sadly it suffers from the same issue. bringing max_ptes_none much
lower as a default would "help".

Can we just keep it simple and only support max_ptes_none = 511 ("pagefault
behavior" -- PMD_NR_PAGES - 1) or max_ptes_none = 0 ("deferred behavior") and
document that the other weird configurations will make mTHP skip, because
"weird
and unexpetced" ? :)

nit: Rather than values of max_ptes_none other than 0 and max making mTHP skip,
perhaps it's better to say we round to closest of 0 and max?

Maybe. Rounding down always implies doing something not necessarily desired.

In any case, I assume most setups just have the default values here ... :)




That sounds like a great simplification in principle!

And certainly a much easier to start with :)

If we ever get the request to support something else, maybe that's also where we
can learn *why*, and what we would actually want to do with mTHP.

We would need to consider
the swap and shared tunables too though. Perhaps we can pull a similar trick
with those?

Swapped and shared are a bit more challenging, because they are set to "/ 2" or
"/ 8" heuristics.


One simple starting point here is of course to say "when collapsing mTHP, all
have to be unshared and all have to be swapped in", so to essentially ignore
both tunables (in a memory friendly way, as if they are set to 0) for mTHP
collapse and worry about that later, when really required.

For swap, if we assume we start with the whole VMA swapped out, I think setting
max_ptes_swap to 0 could still cause the "creep" problem if faulting pages back
in sequentially? I guess that's creep due to faulting pattern though, so at
least it's not due to collapse. Doesn't feel ideal though.
I'm not sure what the semantic of "shared" is? I'm guessing it's specifically
for private COWed pages, and khugepaged will trigger the COW on collapse?

Yes.

So
again depending on the pattern of writes we could still end up with creep in a
similar way to swap?

I think in regards of both "yes", so a simple starting point but not necessarily
what we want long term. The creep is at least "not wasting more memory", because
we don't collapse where PMD wouldn't have collapsed.

After all, right now we don't collapse mTHP, now we would collapse mTHP in many
scenarios, so we don't have to be perfect initially.

Deriving stuff for small THP sizes when configured for PMD THP sizes is not easy
to do right.



Two alternatives I discussed with Nico for these (not sure which is implemented
here) is to calculate it proportionally to the folio order we are collapsing:

You're only listing one option here... what's the other one you discussed?


Ah sorry, reshuffled it and then had to rush.

The other thing I had in mind is to scan the whole PMD range, and discard skip
the whole PMD range if it doesn't obey the max_ptes_* stuff. Not perfect, but
will mean that we behave just like PMD collapse would, unless I am missing
something.

Hmm that's an interesting idea; If I've understood, we would effectively test
the PMD for collapse as if we were collapsing to PMD-size, but then do the
actual collapse to the "highest allowed order" (dictated by what's enabled +
MADV_HUGEPAGE config).

I'm not so sure this is a good way to go; there would be no way to support VMAs
(or parts of VMAs) that don't span a full PMD. And I can imagine we might see
memory bloat; imagine you have 2M=madvise, 64K=always, max_ptes_none=511, and
let's say we have a 2M (aligned portion of a) VMA that does NOT have
MADV_HUGEPAGE set and has a single page populated. It passes the PMD-size test,
but we opt to collapse to 64K (since 2M=madvise). So now we end up with 32x 64K
folios, 31 of which are all zeros. We have spent the same amount of memory as if
2M=always. Perhaps that's a detail that could be solved by ignoring fully none
64K blocks when collapsing to 64K...

There are two ways a collapse can fail.
(1) We exceed one of max_ptes_* in the range.
(2) We fail in the function collapse_huge_page(), whose real bottleneck really is alloc_charge_folio(); i.e we fail to find physically contiguous memory for the corresponding scan order.

Now, we do not care whether we get a "creep" in case of (2), because khugepaged's end goal is to collapse to the highest order. We care if we get a creep from (1), because a smaller collapse leads us to come under the max_ptes_* constraint for a bigger order.

So I think, what David is suggesting is to ignore (1) for smaller orders. I'll try to formalize the proposed policy as follows: khugepaged's goal should be to collapse to the highest order possible, and therefore, we should bother with max_ptes_* only for the highest order, since that really is the end goal. So, the algorithm is: check max_ptes_* for PMD -> if success, collapse_huge_page(PMD_ORDER) -> if this fails, we drop down the order, and we ignore the local distribution of none, swap, shared PTEs -> collapse_huge_page(say, 64K) -> success. Now, this won't give us any creep since PMD collapse was eligible anyways. The only rule we should add is "collapse to a smaller order only when at least one PTE is filled in that range" since we do not want to collapse an empty range, as Ryan notes above. This should be easy to do; just maintain a local bitmap in hpage_collapse_scan_pmd(), the question we need to answer is "is the bitmap non-zero for a range?" which we can get in O(1). The issue of bothering with max_ptes_* will come here when we drop and reacquire locks, because the global distribution may change, so we will be trading with accuracy...

With regards to "there would be no way to support VMAs that don't span a full PMD", the policy should be "scale max_ptes_* to the highest order possible for the VMA". Basically, just get rid of the PMD-case and generalize this algorithm.


Personally, I think your "enforce simplicifation of the tunables for mTHP
collapse" idea is the best we have so far.

What I understood: keep max_ptes_none = 511 or 0, document that keeping it other than that may cause the creep, and consider max_ptes_swap = max_ptes_shared = 0 for mTHP. I will vote for this too...we won't have locking issues since we will *have* to scan ranges for smaller orders to check nothing is swap and shared. I have two justifications to support this policy:

(1) Systems in general may, and in particular, Android has a good percent of memory in swap, so we really don't want khugepaged to say "please swap-in memory for this range so I could collapse to 64K" when the real goal is to collapse to 2MB.

(2) The reason we dropped down to a lower order, is because we failed PMD-collapse. The reason for that is that we couldn't find 2MB physically contiguous memory, so, assume for the sake of argument that we are in memory pressure. We don't want khugepaged to say "please swapin memory and let me create even more memory pressure". The analogy runs for the shared case.

All in all, the essence of the policy is that mTHP collapse should be made stricter, since the fact that we failed PMD-collapse means we can't ask a lot of memory from the system just to do an mTHP collapse.

With regards to max_ptes_none, David says:
"If we ever get the request to support something else, maybe that's also where we can learn *why*, and what we would actually want to do with mTHP. "

Which sounds very reasonable, since we are solving a theoretical problem, we don't have a real-use-case justification as to why we should bother to solve this problem :)






[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux