[...]
How can we use per-block tracking for reclaiming memory and what changes would
be needed? Or is per-block really a non-viable option?
The interesting thing is: with mTHP toggles it is opt-in -- like for
PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
about this problem right now.
Without respecting the size when allocating large folios, mTHP toggles would
over-allocate. My proposal added earlier to this thread is to combine the two
to avoid that case. Otherwise, shouldn't we try to find a solution for the
reclaiming path?
I think at some point we'll really have to do a better job at reclaiming
(memory over-allocation, a PUNCH_HOLE that couldn't split, but maybe
also pages that are now all-zero again and could be reclaimed).
Clearly, if per-block is a viable option, the shmem_fault() bug would need to
be fixed first. Any ideas on how to make it reproducible?
The alternatives discussed were sub-page refcounting and zeropage scanning.
Yeah, I don't think sub-page refcounting is a feasible (and certainly not
desired ;) ) option in the folio world.
The first one is not possible (IIUC) because of a refactor years back that
simplified the code; bringing it back would add extra complexity. The second
option would add overhead, as it involves scanning.
We'll likely need something similar (scanning, tracking?) for anonymous
memory as well. There was a proposal for a THP shrinker some time ago that
would solve part of the problem.
It's good to know we have the same problem in different places. I'm more
inclined to keep the information rather than adding extra overhead, unless
the complexity is really overwhelming. Considering the concerns here, I'm not
sure how hard we should try merging with iomap, as per Ritesh's suggestion.
As raised in the meeting, I do see value in maintaining the information;
but I also see why Hugh and Kirill think this might be unwarranted
complexity in shmem.c. Meaning: we might be able to achieve something
similar without it, and we don't have to solve the problem right now to
benefit from large folios.
Regarding the THP shrinker, could you please confirm whether it is the following thread?
https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@xxxxxx/
Yes, although there has been no follow-up so far. Possibly because, with the
current khugepaged approach, there would be a constant back-and-forth
between khugepaged collapsing memory (and wasting memory in the default
setting) and the THP shrinker reclaiming memory; that doesn't sound quite
effective :) It needs more thought IMHO.
[...]
To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
is not possible at fallocate() time, detecting zeropages later and
retrying to split+free might be an option, without per-block tracking.
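Something like the following could be the basis for such a deferred pass
(a hypothetical helper, just to illustrate the idea; not part of this
series):

/*
 * Hypothetical helper, for illustration only: detect whether a large
 * shmem folio became all-zero again, e.g., after a FALLOC_FL_PUNCH_HOLE
 * that could not split the folio at fallocate() time. Such a folio
 * could simply be dropped (holes read back as zeros); a partially-zero
 * folio would need a split_folio() first, freeing the zero sub-pages.
 */
static bool shmem_folio_all_zeroes(struct folio *folio)
{
	long i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool zeroes = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);
		if (!zeroes)
			return false;
	}
	return true;
}

The open question is when to run such a pass (at writeback time? from
khugepaged? from a dedicated shrinker?), which circles back to the THP
shrinker discussion above.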
(2) mTHP controls
By default, we should not be using large folios / mTHP for any shmem, just
like we did with THP via shmem_enabled. This is what this series currently
does, and it is part of the whole mTHP user-space interface design.
That was clear to me too. But what is the reason we want to boot in 'safe
mode'? What are the implications of not respecting that?
[...]
As I understood from the call, mTHP with sysfs knobs is preferred over
optimistic falloc/write allocation? But it is still unclear to me why the
former is preferred.
I think Hugh's point was that this should be an opt-in feature, just like
PMD-sized THP started out, and still is, an opt-in feature.
I'd be keen to understand the use case for this, even for the current THP
controls we have in tmpfs. I guess these are just scenarios with no swap
involved? Are these use cases the same for both tmpfs and anonymous shmem?
Systems without swap are one case, I think. The other reason for a toggle
in the past was being able to disable it to troubleshoot issues (Hugh
mentioned in the meeting that new code paths in shmem.c would only be
unlocked with a toggle).
Staring at my Fedora system:
$ cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
always within_size advise [never] deny force
Maybe because it uses tmpfs to mount /tmp (there is an interesting article on
lwn.net about that) and people are concerned about the side effects
(which can currently waste memory, or result in more reclaim work being
required when exceeding file sizes).
For VMs, I know that shmem+THP with memory ballooning is problematic and
not really recommended.
[...]
Also, we should properly fall back within the configured sizes, and not jump
"over" configured sizes, unless there is a good reason.
(3) khugepaged
khugepaged needs to handle large folios properly as well. Until that is
fixed, using smaller THP sizes as a fallback might prohibit collapsing into a
PMD-sized THP later. But really, khugepaged needs to be fixed to handle that.
(4) force/disable
These settings are rather testing artifacts from the old days. We should not
add them to the per-size toggles. We might "inherit" them from the global
toggle, though.
"within_size" might have value, and especially for consistency, we should
have them per size.
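For reference, the existing SHMEM_HUGE_WITHIN_SIZE check in shmem_is_huge()
generalizes naturally to other orders; roughly (the helper name is
hypothetical):

/*
 * Sketch, modeled on the SHMEM_HUGE_WITHIN_SIZE case in shmem_is_huge():
 * only allow a folio of the given order if it would not extend past the
 * rounded-up i_size.
 */
static bool shmem_within_size(struct inode *inode, pgoff_t index, int order)
{
	pgoff_t aligned_index = round_up(index + 1, 1UL << order);
	loff_t i_size = round_up(i_size_read(inode), PAGE_SIZE);

	return (i_size >> PAGE_SHIFT) >= aligned_index;
}

With per-size toggles, each enabled order would go through a check like this
before being considered for allocation.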
So, this series only tackles anonymous shmem, which is a good starting
point. Ideally, we'd get support for other shmem (especially during fault
time) soon afterwards, because we won't be adding separate toggles for that
from the interface POV, and having inconsistent behavior between kernel
versions would be a bit unfortunate.
@Baolin, this series likely does not consider (4) yet. And I suggest we take
a lot of the "anonymous THP" terminology out of this series, especially when
it comes to documentation.
@Daniel, Pankaj, what are your plans regarding that? It would be great if we
could get an understanding on the next steps on !anon shmem.
I realize I've raised so many questions, but it's essential for us to grasp the
mm concerns and usage scenarios. This understanding will provide clarity on the
direction regarding folios for !anon shmem.
If I understood correctly, Hugh had strong feelings against not respecting
mTHP toggles for shmem. Without per-block tracking, I agree with that.
My understanding was the same. But I have this follow-up question: should
we respect mTHP toggles without considering mapping constraints (size and
index)? Or perhaps we should use within_size where we can fit this intermediate
approach, as long as mTHP granularity is respected?
Likely we should do exactly the same as we would do with the existing
PMD-sized THPs.
I recall in the meeting that we discussed that always vs. within_size
seems to have some value, and that we should respect that setting like
we did for PMD-sized THP.
Which other mapping constraints could we have?
--
Cheers,
David / dhildenb