Re: [PATCH 5/8] mm: shmem: add multi-size THP sysfs interface for anonymous shmem

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 08.05.24 14:54, Ryan Roberts wrote:
On 08/05/2024 13:45, David Hildenbrand wrote:
On 08.05.24 14:43, Ryan Roberts wrote:
On 08/05/2024 13:10, David Hildenbrand wrote:
On 08.05.24 14:02, David Hildenbrand wrote:
On 08.05.24 11:02, Ryan Roberts wrote:
On 08/05/2024 08:12, David Hildenbrand wrote:
On 08.05.24 09:08, David Hildenbrand wrote:
On 08.05.24 06:45, Baolin Wang wrote:


On 2024/5/7 18:52, Ryan Roberts wrote:
On 06/05/2024 09:46, Baolin Wang wrote:
To support the use of mTHP with anonymous shmem, add a new sysfs
interface
'shmem_enabled' in the
'/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
directory for each mTHP to control whether shmem is enabled for that
mTHP,
with a value similar to the top level 'shmem_enabled', which can be
set to:
"always", "inherit (to inherit the top level setting)", "within_size",
"advise",
"never", "deny", "force". These values follow the same semantics as
the top
level, except the 'deny' is equivalent to 'never', and 'force' is
equivalent
to 'always' to keep compatibility.

We decided at [1] to not allow 'force' for non-PMD-sizes.

[1]
https://lore.kernel.org/linux-mm/533f37e9-81bf-4fa2-9b72-12cdcb1edb3f@xxxxxxxxxx/

However, thinking about this a bit more, I wonder if the decision we
made to
allow all hugepages-xxkB/enabled controls to take "inherit" was the wrong
one.
Perhaps we should have only allowed the PMD-sized enable=inherit (this is
just
for legacy back compat after all, I don't think there is any use case
where
changing multiple mTHP size controls atomically is actually useful).
Applying

Agree. This is also our usage of 'inherit'.

Missed that one: there might be use cases in the future once we would start
defaulting to "inherit" for all knobs (a distro might default to that) and
default-enable THP in the global knob. Then, it would be easy to disable any
THP
by disabling the global knob. (I think that's the future we're heading to,
where
we'd have an "auto" mode that can be set on the global toggle).

But I am just making up use cases ;) I think it will be valuable and just
doing
it consistently now might be cleaner.

I agree that consistency between enabled and shmem_enabled is top priority.
And
yes, I had forgotten about the glorious "auto" future. So probably continuing
all sizes to select "inherit" is best.

But for shmem_enabled, that means we need the following error checking:

     - It is an error to set "force" for any size except PMD-size

     - It is an error to set "force" for the global control if any size except
PMD-
       size is set to "inherit"

     - It is an error to set "inherit" for any size except PMD-size if the
global
       control is set to "force".

Certainly not too difficult to code and prove to be correct, but not the
nicest
UX from the user's point of view when they start seeing errors.

I think we previously said this would likely be temporary, and if/when tmpfs
gets mTHP support, we could simplify and allow all sizes to be set to "force".
But I wonder if tmpfs would ever need explicit mTHP control? Maybe it would be
more suited to the approach the page cache takes to transparently ramp up the
folio size as it faults more in. (Just saying there is a chance that this
error
checking becomes permanent).

Note that with shmem you're inherently facing the same memory waste
issues etc as you would with anonymous memory. (sometimes even worse, if
you're running shmem that's configured to be unswappable!).

Also noting that memory waste is not really a problem when a write to a shmem
file allocates a large folio that stays within boundaries of that write; issues
only pop up if you end up over-allocating, especially, during page faults where
you have not that much clue about what to do (single address, no real range
provided).

There is the other issue that wasting large chunks of contiguous memory on stuff
that barely benefits from it. With memory that maybe never gets evicted, there
is no automatic "handing back" of that memory to the system to be used by
something else. With ordinary files, that's a bit different. But I did not look
closer into that issue yet, it's one of the reasons MADV_HUGEPAGE was added
IIRC.

OK understood. Although, with tmpfs you're not going to mmap it then randomly
extend the file through page faults - mmap doesn't permit that, I don't think?
So presumably the user must explicitly set the size of the file first? Are you
suggesting there are a lot of use cases where a large tmpfs file is created,
mmaped then only accessed sparsely?

I don't know about "a lot of use cases", but for VMs that's certainly how it's
used.


There are more details around that and the sparsity (memory ballooning, virtio-mem, free page reporting), but it might distract here :) I'll note that shmem+THP is known to be problematic with memory ballooning.

Gottya, thanks. And out of curiosity, what's the benefit of using tmpfs rather
than private (or shared) anonymous memory for VMs?

The primary use case I know of is sharing VM memory with other processes (usually not child processes): DPDK/SPDK and other vhost-user variants (such as virtiofs) mmap() all guest memory to access it directly (some sort of multi-process hypervisors). They either use real-file-based shmem or memfd (essentially the same without a named file) for that.

Then, there is live-hypervisor upgrade, whereby you start a second hypervisor process that will take over. People use shmem for that, so you can minimize downtime by migrating guest memory simply by mmap'ing the shmem file into the new hypervisor.

Shared anonymous memory is basically never used (I only know of one corner case in QEMU).

I would assume that there are also DBs making use of rather sparse shmem? But no expert on that.

--
Cheers,

David / dhildenb





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux