Re: [RFC] mm: userspace hugepage collapse: file/shmem semantics

Hey All,

There are still a couple of interface topics (capabilities for
process_madvise(2), errnos) to iron out, but for the most part the behavior and
semantics of MADV_COLLAPSE on anonymous memory seem to be settled. Thanks,
everyone, for the time and effort you have put into that.

Looking forward, I'd like to align on the semantics of file/shmem so we can
settle MADV_COLLAPSE behavior. This is what I'd propose for an initial
man-page-like description of MADV_COLLAPSE for madvise(2), to paint the full
picture:

---8<---
Perform a best-effort synchronous collapse of the native pages mapped by the
memory range into Transparent Hugepages (THPs). MADV_COLLAPSE operates on the
current state of memory for the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or faulted in
the future. However, for file/shmem memory, other mappings of this file extent
may be queued and processed later by khugepaged to attempt to update their
pagetables to map the hugepage by a PMD.

If the ranges provided span multiple VMAs, the semantics of the collapse over
each VMA is independent from the others. This implies a hugepage cannot cross a
VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the
operation may continue to attempt collapsing the remainder of the specified
memory.

All non-resident pages covered by the range will first be swapped/faulted-in,
before being copied onto a freshly allocated hugepage. If the native pages
compose the same PTE-mapped hugepage, and are suitably aligned, the collapse
may happen in-place. Unmapped pages will have their data directly initialized
to 0 in the new hugepage. However, for every eligible hugepage-aligned/sized
region to be collapsed, at least one page must currently be backed by memory.

MADV_COLLAPSE is independent of any THP sysfs setting, both in terms of
determining THP eligibility, and allocation semantics. The VMA must not be
marked VM_NOHUGEPAGE, VM_HUGETLB**, VM_IO, VM_DONTEXPAND, VM_MIXEDMAP, or
VM_PFNMAP, nor can it be stack memory or DAX-backed. The process must not have
PR_SET_THP_DISABLE set. For file-backed memory, the file must either be (1) not
open for write, and the mapping must be executable, or (2) the backing
filesystem must support large pages. Allocation for the new hugepage may enter
direct reclaim and/or compaction, regardless of VMA flags.  When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing the
most native pages.

If all hugepage-sized/aligned regions covered by the provided range were either
successfully collapsed, or were already PMD-mapped THPs, this operation will be
deemed successful. On successful return, all hugepage-aligned/sized memory
regions provided will be mapped by PMDs. Note that this doesn't guarantee
anything about other possible mappings of the memory. On failure, note that
collapse of more than one region might have failed, since the operation may
continue after collapse of a single hugepage-sized/aligned region fails.

MADV_COLLAPSE is only available if the kernel was configured with
CONFIG_TRANSPARENT_HUGEPAGE, and file/shmem support additionally requires
CONFIG_READ_ONLY_THP_FOR_FS and CONFIG_SHMEM.
---8<---

** Might change with HugeTLB high-granularity mappings[1].
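
To make the "hugepage-aligned/sized region" wording above concrete, here is a
rough userspace sketch. It only illustrates which part of a range consists of
whole hugepage-aligned regions, and then passes that part to madvise(2). The
helper names (hpage_subrange, try_collapse) are made up for this example, the
hugepage size is a caller-supplied parameter, and the fallback value for
MADV_COLLAPSE is an assumption; take the real value from the uapi headers of
whichever kernel carries the feature.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* assumption: use your kernel's uapi value */
#endif

/*
 * Hypothetical helper: shrink [start, start + len) to the sub-range made of
 * whole hugepage-aligned regions. Stores the sub-range start in *out and
 * returns its length in bytes (0 if no whole hugepage fits).
 */
size_t hpage_subrange(uintptr_t start, size_t len, size_t hpage, uintptr_t *out)
{
	assert(hpage != 0 && (hpage & (hpage - 1)) == 0); /* power of two */

	uintptr_t mask = ~(uintptr_t)(hpage - 1);
	uintptr_t first = (start + hpage - 1) & mask; /* round start up */
	uintptr_t last = (start + len) & mask;        /* round end down */

	*out = first;
	return last > first ? (size_t)(last - first) : 0;
}

/*
 * Hypothetical wrapper: request a best-effort collapse of the whole
 * hugepages inside [addr, addr + len). Returns 0 on success (or when there
 * is nothing hugepage-sized to collapse), -1 with errno set otherwise.
 */
int try_collapse(void *addr, size_t len, size_t hpage)
{
	uintptr_t first;
	size_t n = hpage_subrange((uintptr_t)addr, len, hpage, &first);

	if (n == 0)
		return 0;
	return madvise((void *)first, n, MADV_COLLAPSE);
}
```

A caller on a system with 2MiB hugepages might use try_collapse(buf, buf_len,
2UL << 20) after populating the buffer, and treat a -1 return as "at least one
region was not collapsed".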


There are a few new items of note here:

1) PMD-mapped on success

MADV_COLLAPSE ultimately wants memory mapped by PMDs, and so I propose we
should always try to actually do the page table updates. For file/shmem, this
means two things: (a) adding support to handle compound pages (both pte-mapped
hugepages and non-HPAGE_PMD_ORDER compound pages), and (b) doing a final PMD
install before returning, and not relying on subsequent fault. This makes the
semantics of file/shmem the same as anonymous. I call out (a), since there was
an existing debate about this, and so I want to ensure we are aligned[1]. Note
that (b), along with presenting a consistent interface to users, also has
real-world use cases where relying on fault is difficult (for example,
shmem + UFFDIO_REGISTER_MODE_MINOR-managed memory). Also note that for (b), I'm
proposing to only do the synchronous PMD install for the memory range provided
- the page table collapse of other mappings of the memory can be deferred until
later (by khugepaged).
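
As a toy model of the return semantics I'm proposing (success only if every
hugepage-sized/aligned region ends up PMD-mapped, while still continuing past
individual failures), something like the following. This is not kernel code:
the enum and function names are invented for illustration, and the per-region
outcomes are simply passed in as an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-region outcomes; these are not kernel identifiers. */
enum region_result {
	REGION_COLLAPSED,   /* collapsed and PMD-mapped by this call */
	REGION_ALREADY_PMD, /* was already a PMD-mapped THP */
	REGION_FAILED       /* collapse of this region failed */
};

/*
 * Walk every region's outcome without stopping at the first failure.
 * Returns 0 only if no region failed, -1 otherwise. If nr_failed is
 * non-NULL, it receives the number of failed regions.
 */
int range_result(const enum region_result *results, size_t nr_regions,
		 size_t *nr_failed)
{
	size_t failed = 0;

	assert(results != NULL || nr_regions == 0);

	for (size_t i = 0; i < nr_regions; i++) {
		if (results[i] == REGION_FAILED)
			failed++; /* keep going: later regions are still tried */
	}

	if (nr_failed)
		*nr_failed = failed;
	return failed == 0 ? 0 : -1;
}
```

The point of the model is that a failing return says nothing about how many
regions failed or which ones, which is why userspace can't assume the
remainder of the range was left untouched.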

2) folio timing && file non-writable, executable mapping

I just want to align on some timing due to ongoing folio work. Currently, the
requirement to be able to collapse file/shmem memory is that the file not be
opened for write anywhere, and that the mapping is executable, but we'd
eventually like to support filesystems that claim
mapping_large_folio_support()[2]. Is it acceptable that future MADV_COLLAPSE
works for either mapping_large_folio_support() or the old conditions?
Alternatively, should MADV_COLLAPSE only support mapping_large_folio_support()
filesystems from the outset? (I believe shmem and xfs are the only current
users.)
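
To spell out the two options in (2) as a predicate, here is a small sketch.
The struct, its field names, and the allow_old_conditions switch are all
hypothetical and exist only to contrast the alternatives; only
mapping_large_folio_support() is a real kernel helper, and it is represented
here by a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the facts the eligibility check depends on. */
struct file_map_info {
	bool open_for_write;     /* file is open for write anywhere */
	bool mapping_executable; /* the mapping is executable */
	bool fs_large_folios;    /* stands in for mapping_large_folio_support() */
};

/*
 * allow_old_conditions == true models "either large folio support or the
 * old conditions"; false models "large folio support only".
 */
bool file_collapse_eligible(const struct file_map_info *m,
			    bool allow_old_conditions)
{
	assert(m != NULL);

	bool old_conditions = !m->open_for_write && m->mapping_executable;

	if (m->fs_large_folios)
		return true;
	return allow_old_conditions && old_conditions;
}
```

Under the first option a read-only executable mapping on a filesystem without
large folio support stays eligible; under the second it does not.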

3) (shmem) sysfs settings and huge= tmpfs mount

Should we ignore /sys/kernel/mm/transparent_hugepage/shmem_enabled, similar to
how we ignore /sys/kernel/mm/transparent_hugepage/enabled for anon/file? Does
that include "deny"? This choice is (partially) coupled with tmpfs huge= mount
option. I think today, things work if we ignore this. However, I don't want to
back us into a corner if we ever want to allow MADV_COLLAPSE to work on
writeable shmem mappings one day (or any other incompatibility I'm unaware of).
One option: if in (2) we choose to allow the old conditions, we could ignore
shmem_enabled in the non-writable, executable case, and otherwise defer to
"if the filesystem supports it", where we would then respect huge=.
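
For anyone who wants to experiment with how the current shmem_enabled setting
interacts with this, the sysfs file marks the active policy with brackets
(e.g. "always within_size advise [never] deny force"). A minimal sketch of
pulling that token out of an already-read line follows; the function name is
made up, and reading the file itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) policy from a THP sysfs line into
 * out. Returns 0 on success, -1 if no bracketed token is found or it does
 * not fit in outlen bytes including the terminating NUL.
 */
int thp_selected_policy(const char *line, char *out, size_t outlen)
{
	assert(line != NULL && out != NULL);

	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;

	if (!close)
		return -1;

	size_t n = (size_t)(close - open - 1); /* token length, no brackets */

	if (n + 1 > outlen)
		return -1;

	memcpy(out, open + 1, n);
	out[n] = '\0';
	return 0;
}
```

A test program could compare the result against "deny" or "never" before and
after calling MADV_COLLAPSE on a shmem mapping to see which settings the
collapse actually honors.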

I think those are the important points. Am I missing anything?

Thanks again everyone for taking the time to read and discuss,

Best,
Zach


[1] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/YpGbnbi44JqtRg+n@xxxxxxxxxxxxxxxxxxxx/







