Re: [PATCH 0/8] add mTHP support for anonymous shmem

David Hildenbrand <david@xxxxxxxxxx> · Wed, 8 May 2024 19:03:57 +0200

On 08.05.24 16:28, Daniel Gomez wrote:
On Wed, May 08, 2024 at 01:58:19PM +0200, David Hildenbrand wrote:
On 08.05.24 13:39, Daniel Gomez wrote:
On Mon, May 06, 2024 at 04:46:24PM +0800, Baolin Wang wrote:
Anonymous pages already support multi-size THP (mTHP) allocation
through commit 19eaf44954df, which allows THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shared pages ignore the anonymous mTHP rule
configured through the sysfs interface and can only use PMD-mapped
THP, which is not reasonable. Many users implement anonymous page sharing
through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
scenarios, so they expect a unified mTHP strategy to apply to anonymous pages,
including anonymous shared pages, in order to enjoy the benefits of
mTHP: for example, lower latency than PMD-mapped THP, smaller memory bloat
than PMD-mapped THP, and contiguous PTEs on ARM architecture to reduce TLB misses.
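
For readers less familiar with the mapping type being discussed, here is a
minimal userspace sketch of anonymous shared memory. It is only an
illustration: the function name share_anon_page_with_child is made up for this
example, and nothing here is taken from the patch set. The parent maps
MAP_SHARED | MAP_ANONYMOUS memory, a forked child writes to it, and the parent
sees the write.

```python
import mmap
import os


def share_anon_page_with_child(length: int = mmap.PAGESIZE) -> bytes:
    """Map anonymous shared memory, let a forked child write to it,
    and return what the parent reads back afterwards."""
    # fd == -1 together with MAP_SHARED | MAP_ANONYMOUS requests
    # anonymous shared memory (no file path involved).
    buf = mmap.mmap(
        -1,
        length,
        flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    pid = os.fork()
    if pid == 0:
        # Child: the store goes through the page tables (page fault on
        # first touch) and is visible to the parent because the mapping
        # is shared rather than copy-on-write.
        buf[:5] = b"hello"
        os._exit(0)
    os.waitpid(pid, 0)
    data = bytes(buf[:5])
    buf.close()
    return data
```

Calling share_anon_page_with_child() in the parent returns the bytes the child
stored, which is the sharing behaviour the database-style users above rely on.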

The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus a new
"inherit" option. By default all sizes will be set to "never"
except PMD size, which is set to "inherit". This ensures backward compatibility
with the top-level shmem_enabled setting, while also allowing independent
control of shmem_enabled for each mTHP size.
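
The default rule described above can be modelled in a few lines. This is a
hypothetical userspace model for illustration, not the kernel implementation:
the names default_per_size_settings and effective_shmem_enabled and the
example sizes are assumptions made for this sketch.

```python
def default_per_size_settings(sizes_kb, pmd_size_kb):
    """Per-size defaults as described: every size is "never",
    except the PMD size, which is "inherit"."""
    return {
        kb: ("inherit" if kb == pmd_size_kb else "never")
        for kb in sizes_kb
    }


def effective_shmem_enabled(per_size_value: str, top_level_value: str) -> str:
    """Resolve one per-size setting against the top-level setting.

    "inherit" defers to the top-level shmem_enabled value; any other
    value is used as-is, independent of the top level.
    """
    if per_size_value == "inherit":
        return top_level_value
    return per_size_value
```

With the defaults, only the PMD size follows the top-level knob, so an existing
configuration keeps behaving as before, while any other size stays off until
it is set explicitly.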

I'm trying to understand the adoption of mTHP and how it fits into the adoption
of (large) folios that the kernel is moving towards. Can you, or anyone involved
here, explain this? How much do they overlap, and can we benefit from having
both? Is there any argument against the adoption of large folios here that I
might have missed?

mTHP are implemented using large folios, just like traditional PMD-sized THP
are. (you really should explore the history of mTHP and how it all works
internally)

I'll dig deeper into the code. By any chance are any of you going to be at
LSFMM this year? I have this session [1] scheduled for Wednesday and it would
be nice to get your feedback on it and to hear whether you see this working
together with mTHP/THP.

I'll be around and will attend that session! But note that I am still
scratching my head over what to do with "ordinary" shmem, especially because
of the weird way shmem behaves in contrast to real files (below). Some
input from Hugh might be very helpful.

Example: you write() to a shmem file and populate a 2M THP. Then, nobody
touches that file for a long time. There are certainly other mmap()
users that could better benefit from that THP ... and without swap that
THP will be trapped there, possibly for a long time (unless I am missing an
important piece of shmem THP design :) )? Sure, if we only have THPs
it's nice, that's just not the reality unfortunately. IIRC, that's one
of the reasons why THP for shmem can be enabled/disabled. But again,
still scratching my head ...

Note that this patch set only tackles anonymous shmem 
(MAP_SHARED|MAP_ANON), which is in 99.999% of all cases only accessed 
via page tables (memory allocated during page faults). I think there are 
ways to grab the fd (/proc/self/fd), but IIRC only corner cases 
read/write that.

So in that sense, anonymous shmem (this patch set) behaves mostly like 
ordinary anonymous memory, and likely there is not much overlap with 
other "allocate large folios during read/write/fallocate" as in [1]. 
swap might have an overlap.

The real confusion begins when we have ordinary shmem: some users never 
mmap it and only read/write, some users never read/write it and only 
mmap it and some (less common?) users do both.

And shmem really is special: it looks like "just another file", but 
memory-consumption and reclaim wise it behaves just like anonymous 
memory. It might be swappable ("usually very limited backing disk space 
available") or it might not.

In a subthread here we are discussing what to do with that special 
"shmem_enabled = force" mode ... and it's all complicated I think.

[1] https://lore.kernel.org/all/4ktpayu66noklllpdpspa3vm5gbmb5boxskcj2q6qn7md3pwwt@kvlu64pqwjzl/

The biggest challenge with memory that cannot be evicted and reclaimed under
memory pressure (in contrast to your ordinary files in the pagecache) is
memory waste, well, and placement of large chunks of memory in general,
during page faults.

In the worst case (no swap), you allocate a large chunk of memory once and
it will stick around until freed: no reclaim of that memory.

I can see that path being triggered by some fstests but only for THP (where we
can actually reclaim memory).

Is that when we punch-hole a partial THP and split it? I'd be interested 
in what that test does.

--
Cheers,

David / dhildenb