On 2024/4/25 16:57, David Hildenbrand wrote:
On 25.04.24 10:46, Ryan Roberts wrote:
On 25/04/2024 09:26, David Hildenbrand wrote:
On 25.04.24 10:17, Ryan Roberts wrote:
On 25/04/2024 07:20, Baolin Wang wrote:
On 2024/4/24 22:20, Ryan Roberts wrote:
On 24/04/2024 14:49, Baolin Wang wrote:
On 2024/4/24 18:01, Ryan Roberts wrote:
On 24/04/2024 10:55, Baolin Wang wrote:
On 2024/4/24 16:26, Ryan Roberts wrote:
On 24/04/2024 07:55, Baolin Wang wrote:
On 2024/4/23 18:41, Ryan Roberts wrote:
On 22/04/2024 08:02, Baolin Wang wrote:
Anonymous pages have already been supported for multi-size
(mTHP)
allocation
through commit 19eaf44954df, that can allow THP to be
configured
through the
sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, the anonymous shared pages will ignore the
anonymous mTHP rule
configured through the sysfs interface, and can only use
the PMD-mapped
THP, that is not reasonable. Many implement anonymous page
sharing
through
mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
scenarios,
therefore, users expect to apply an unified mTHP strategy
for anonymous
pages,
also including the anonymous shared pages, in order to
enjoy the
benefits of
mTHP. For example, lower latency than PMD-mapped THP,
smaller memory
bloat
than PMD-mapped THP, contiguous PTEs on ARM architecture to
reduce TLB
miss
etc.
This sounds like a very useful addition!
Out of interest, can you point me at any workloads (and
off-the-shelf
benchmarks
for those workloads) that predominantly use shared anon memory?
As far as I know, some database related workloads make
extensive use of
shared
anonymous page, such as PolarDB[1] in our Alibaba fleet, or
MySQL likely
also
uses shared anonymous memory. And I still need to do some
investigation to
measure the performance.
[1] https://github.com/ApsaraDB/PolarDB-for-PostgreSQL
Thanks for the pointer!
The primary strategy is that, the use of huge pages for
anonymous shared
pages
still follows the global control determined by the mount
option "huge="
parameter
or the sysfs interface at
'/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
The utilization of mTHP is allowed only when the global
'huge' switch is
enabled.
Subsequently, the mTHP sysfs interface
(/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled)
is checked to determine the mTHP size that can be used for
large folio
allocation
for these anonymous shared pages.
I'm not sure about this proposed control mechanism; won't it
break
compatibility? I could be wrong, but I don't think shmem's
use of THP
used to
depend upon the value of
/sys/kernel/mm/transparent_hugepage/enabled?
So it
Yes, I realized this after more testing.
doesn't make sense to me that we now depend upon the
/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled
values
(which by
default disables all sizes except 2M, which is set to
"inherit" from
/sys/kernel/mm/transparent_hugepage/enabled).
The other problem is that shmem_enabled has a different set
of options
(always/never/within_size/advise/deny/force) to enabled
(always/madvise/never)
Perhaps it would be cleaner to do the same trick we did for
enabled;
Introduce
/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled, which
can have all
the
same values as the top-level
/sys/kernel/mm/transparent_hugepage/shmem_enabled,
plus the additional "inherit" option. By default all sizes
will be set to
"never" except 2M, which is set to "inherit".
Sounds good to me. But I do not want to copy all same values
from
top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled':
always within_size advise never deny force
For mTHP's shmem_enabled interface, we can just keep below
values:
always within_size advise never
Cause when checking if mTHP can be used for anon shmem,
'deny' is equal to
'never', and 'force' is equal to 'always'.
I'll admit it wasn't completely clear to me after reading the
docs, but my
rough
understanding is:
- /sys/kernel/mm/transparent_hugepage/shmem_enabled
controls
mmap(SHARED|ANON) allocations (mostly; see rule 3)
- huge=... controls tmpfs allocations
- deny and force in shmem_enabled are equivalent to
never and
always for
mmap(SHARED|ANON) but additionally override all tmpfs
mounts so they
act as
if they were mounted with huge=never or huge=always
Is that correct? If so, then I think it still makes sense to
support
per-size
Correct.
deny/force. Certainly if a per-size control is set to
"inherit" and the
top-level control is set to deny or force, you would need that
to mean
something.
IMHO, the
'/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled' interface
should only control the anonymous shmem. And 'huge=' controls
tmpfs
allocation,
so we should not use anonymous control to override tmpfs
control, which
seems a
little mess?
I agree it would be cleaner to only handle mmap(SHARED|ANON)
here, and leave
the
tmpfs stuff for another time. But my point is that
/mm/transparent_hugepage/shmem_enabled already interferes with
tmpfs if the
value is deny or force. So if you have:
echo deny > /mm/transparent_hugepage/shmem_enabled
IIUC, this global control will cause shmem_is_huge() to always
return
false, so
no matter how
'/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled' is set,
anonymous shmem will not use mTHP. No?
No, that's not how
'/mm/transparent_hugepage/hugepage-xxxkB/enabled' works, and
I think '/mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled'
should follow
the established pattern.
For anon-private, each size is controlled by its
/mm/transparent_hugepage/hugepage-xxxkB/enabled value. Unless that
value is
"inherit", in which case the value in
/mm/transparent_hugepage/enabled is used
for that size.
That approach enables us to 1) maintain back-compat and 2) control
each size
independently
1) is met because the default is that all sizes are initially set
to "never",
except the PMD-size (e.g.
/mm/transparent_hugepage/hugepage-2048kB/enabled)
which is initially set to inherit. So any mTHP unaware SW can
still modify
/mm/transparent_hugepage/enabled and it will still only apply to
PMD size.
2) is met because mTHP aware SW can come along and e.g. enable the
64K size
(echo always > /mm/transparent_hugepage/hugepage-64kB/enabled)
without
having to
modify the value in /mm/transparent_hugepage/enabled.
Thanks for explanation. Initially, I want to make
‘/mm/transparent_hugepage/shmem_enabled’ be a global control for
huge page, but
I think it should follow the same strategy as anon mTHP as you said.
echo inherit > /mm/transparent_hugepage/hugepage-64kB/shmem_enabled
What does that mean?
So I think /mm/transparent_hugepage/hugepage-xxxkB/shmem_enabled
will need to
support the deny and force values. When applied to non-PMD sizes,
"deny" can
just be a noop for now, because there was no way to configure a
tmpfs mount for
non-PMD size THP in the first place. But I'm not sure what to do
with "force"?
OK. And I also prefer that "force" should be a noop too, since anon
shmem
control should not configure tmpfs huge page allocation.
I guess technically they won't be noops, but (for the non-PMD-sizes)
"force"
will be an alias for "always" and "deny" will be an alias for "never"?
I was just a bit concerned about later changing that behavior to
also impact
tmpfs once tmpfs supports mTHP; could that cause breaks? But
thinking about it,
I don't see that as a problem.
Is the question what should happen if we "inherit" "force" or if someone
specifies "force" for a mTP size explicitly?
Well I think it amounts to the same thing; there isn't much point in
forbidding
"force" to be set directly because it can still be set indirectly through
"inherit". We can't forbid indirectly setting it, because "inherit"
could be set
first, then the top-level shmem_enabled changed to "force" after - and we
wouldn't want to fail that.
The default for PMD should be "inherit", for the other mTHP sizes it
should be "never".
So we should fail if:
* Setting top-level to "force" when any non-PMD size is "inherit"
* Setting "inherit" of a non-PMD size when the top-level is force
IMO, for tmpfs this is true, but for anon shmem, this 2 cases should not
fail.
So I think we should allow this configuration, but for tmpfs huge page
allocation, we will not check the mTHP.
Both will only happen if someone messes with the mTHP configuration
manually.
And we should only offer "force" as an option for PMD-sized mTHP as long
as the others are not supported. See below.
So I think the question is just 'what should happen when "force" is
configured
for a non-PMD-sized mTHP'?
We should hide it and not offer a configuration toggle that is inactive.
If someone wants to sense support for other mTHP "force" settings in the
future, they can just parse if the "shmem_enabled" toggle offers "force"
as an option. Then they know that it can actually be enabled and will
also do what is promised.
Sounds good to me.