Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

David Hildenbrand <david@xxxxxxxxxx> · Mon, 15 Apr 2024 21:19:25 +0200

We could have
* THP_DEFERRED_SPLIT_PAGE
* THP_UNDO_DEFERRED_SPLIT_PAGE
* THP_PERFORM_DEFERRED_SPLIT_PAGE

Maybe that would catch more cases (not sure if all, though). Then, you
could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.

That could give one a clearer picture how deferred split interacts with
actual splitting (possibly under memory pressure), the whole reason why
deferred splitting was added after all.

I'm not quite sure whether there is a solid usecase or not. If we
have, we could consider this. But a simpler counter may be more
preferred.

Yes.

It may be useful. However the counter is typically used to estimate
how many THP are partially unmapped during a period of time.

I'd say it's a bit of an abuse of that counter; well, or interpreting
something into the counter that that counter never reliably represented.

It was way more reliable than now.

Correct me if I am wrong: now that we only adjust the counter for 
PMD-sized THP, it is as (un)reliable as it always was.

Or was there another unintended change by some of my cleanups or 
previous patches?

I can easily write a program that keeps sending your counter to infinity
simply by triggering that behavior in a loop, so it's all a bit shaky.

I don't doubt that. But let's get back to reality. The counter used to
stay reasonable and reliable with most real life workloads before
mTHP. There may be over-counting, for example, when unmapping a
PTE-mapped THP which was not on a deferred split queue before. But
such a case is not common for real life workloads because the huge PMD
has to be split by partial unmap for most cases. And the partial unmap
will add the THP to deferred split queue.

But now a common workload, for example, just process exit, may
probably send the counter to infinity.

Agreed, that's stupid.

Something like Ryans script makes more sense, where you get a clearer
picture of what's mapped where and how. Because that information can be
much more valuable than just knowing if it's mapped fully or partially
(again, relevant for handling with memory waste).

Ryan's script is very helpful. But the counter has been existing and
used for years, and it is a quick indicator and much easier to monitor
in a large-scale fleet.

If we think the reliability of the counter is not worth fixing, why
don't we just remove it. No counter is better than a broken counter.

Again, is only counting the PMD-sized THPs "fixing" the old use cases? 
Then it should just stick around. And we can even optimize it for some 
more cases as proposed in this patch. But there is no easy way to "get 
it completely right" I'm afraid.

--
Cheers,

David / dhildenb