On 2025-03-03 15:49, David Hildenbrand wrote:
On 03.03.25 21:01, Mathieu Desnoyers wrote:
On 2025-02-28 17:32, Peter Xu wrote:
On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
On 2025-02-28 11:32, Peter Xu wrote:
On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
For the VM use-case, I wonder if we could just add a userfaultfd
"COW" event that would notify userspace when a COW happens ?
I don't know what's best for KSM or how well this will work, but we
have had such an event for years. See UFFDIO_REGISTER_MODE_WP:
https://man7.org/linux/man-pages/man2/userfaultfd.2.html
userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
resulting from a mmap mapping, but returns EINVAL if I pass a
page-aligned address which sits within a private file mapping
(e.g. executable data).
Yes, so far sync traps only support RAM-based file systems or
anonymous memory.
Generic private file mappings (which store executables and libraries)
are not yet supported.
Also, I notice that do_wp_page() only calls handle_userfault() with
VM_UFFD_WP when the vm_fault flags do not have FAULT_FLAG_UNSHARE
set.
AFAICT that's expected, unshare should only be set on reads, never
writes.
So uffd-wp shouldn't trap any of those.
AFAIU, as it stands now userfaultfd would not help tracking COW faults
caused by stores to private file mappings. Am I missing something ?
I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should
work on most mappings. That one is async, though, so more like
soft-dirty. It might be doable to try making it sync too without a lot
of changes based on how async tracking works.
I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
be a good fit. Here is what I have in mind to replace the ksmd scanning
thread for the VM use-case with purely user-space driven scanning:
Within qemu or similar user-space process:
1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC
   feature and UFFDIO_REGISTER_MODE_WP mode.
2) Protect user-space memory with the PAGEMAP_SCAN ioctl
   PM_SCAN_WP_MATCHING flag to detect memory which stays invariant
   for a long time.
3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which
   pages are written to. Keep track of memory which is frequently
   modified, so it can be left alone and neither write-protected nor
   merged anymore.
4) Whenever pages stay invariant for a given lapse of time, merge them
   with the new madvise(2) KSM_MERGE behavior.
Let me know if that makes sense.
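
The bookkeeping implied by steps 2 to 4 could be sketched roughly as
below. This is only an illustration of the policy, not of the kernel
interfaces: the `written` flag stands in for what a PAGEMAP_SCAN query
with PAGE_IS_WRITTEN would report for a page, and the struct, enum,
function name and both thresholds are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state kept by the user-space scanner. */
struct page_track {
    unsigned clean_scans;   /* consecutive scans with no write seen */
    unsigned write_events;  /* scans in which a write was seen */
    bool merged;            /* page was handed to the merge madvise */
    bool left_alone;        /* too hot: no more write-protect/merge */
};

enum page_action { ACT_NONE, ACT_MERGE, ACT_LEAVE_ALONE };

#define STABLE_SCANS 3  /* assumed: clean scans required before merging */
#define HOT_WRITES   4  /* assumed: write events before giving up */

/*
 * Feed one scan result for one page and return what the scanner
 * should do with that page next.
 */
enum page_action page_track_update(struct page_track *p, bool written)
{
    if (p->left_alone)
        return ACT_LEAVE_ALONE;

    if (written) {
        p->clean_scans = 0;
        p->merged = false;  /* a store to a merged page COWs it away */
        if (++p->write_events >= HOT_WRITES) {
            p->left_alone = true;
            return ACT_LEAVE_ALONE;
        }
        return ACT_NONE;
    }

    if (!p->merged && ++p->clean_scans >= STABLE_SCANS) {
        p->merged = true;
        return ACT_MERGE;   /* caller would issue the merge madvise */
    }
    return ACT_NONE;
}
```

A scanner would call page_track_update() once per page per scan
interval and only issue the merge request when it returns ACT_MERGE.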
Note that one of the strengths of ksm in the kernel right now is that we
write-protect + try-deduplicate only when we are fairly sure that we can
deduplicate (unstable tree), and that the interaction with THPs / large
folios is fairly well thought-through.
Also note that, just because data hasn't been written in some time
interval, doesn't mean that it should be deduplicated and result in CoW
on next write access.
Right. This tracking of address range access pattern would have to be
implemented in user-space.
One probably would have to mimic what the KSM implementation in the
kernel does, and build something like the unstable tree, to find
candidates where we can actually deduplicate. Then, have a way to not
deduplicate if the content changed.
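
One piece of that could look like the sketch below: a page is only
treated as a merge candidate once its checksum is unchanged between two
consecutive scans, which is similar in spirit to the check KSM applies
before using its unstable tree. The struct and function names are
hypothetical, and FNV-1a is used only as a simple stand-in checksum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a over a page; a stand-in for the real checksum. */
uint32_t page_checksum(const unsigned char *page, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= page[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical per-page record remembered across scans. */
struct candidate {
    uint32_t last_sum;  /* checksum seen at the previous scan */
    bool seen;          /* false until the page was scanned once */
};

/*
 * Returns true when the page content is unchanged since the previous
 * scan, i.e. the page is worth comparing against other pages.
 */
bool candidate_is_stable(struct candidate *c,
                         const unsigned char *page, size_t len)
{
    uint32_t sum = page_checksum(page, len);
    bool stable = c->seen && c->last_sum == sum;

    c->last_sum = sum;
    c->seen = true;
    return stable;
}
```

A page that fails this check is simply skipped for the current scan, so
content that changed is never offered for deduplication.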
With madvise MADV_MERGE, there is no need to "unmerge". The merge
write-protects the page and merges its content at the time of the
MADV_MERGE with exact duplicates, and keeps that write protected page in
a global hash table indexed by checksum.
However, unlike KSM, it won't track that range on an ongoing basis.
"Unmerging" the page is done naturally by writing to the merged address
range. Because it is write-protected, this will trigger COW, and will
therefore provide a new anonymous page to the process, thus "unmerging"
that page.
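
The COW behaviour this relies on can be observed from user space with
an ordinary private file mapping. The sketch below does not use the
proposed madvise behavior at all; it only shows that a store to a
MAP_PRIVATE mapping lands in a private copy and never reaches the
backing file. The function name and the temporary file path are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a store to a MAP_PRIVATE file mapping stayed private
 * (the file keeps its old content), 0 if it did not, -1 on setup error.
 */
int private_write_is_cow(void)
{
    char path[] = "/tmp/cow-demo-XXXXXX";
    char buf[4] = {0};
    char *map;
    int result = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                   /* file disappears once fd is closed */

    if (write(fd, "AAAA", 4) != 4)
        goto out;

    map = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    memcpy(map, "BBBB", 4);         /* store triggers COW on the page */

    /* Read the file itself: it must still hold the original bytes. */
    if (pread(fd, buf, 4, 0) == 4)
        result = memcmp(buf, "AAAA", 4) == 0 &&
                 memcmp(map, "BBBB", 4) == 0;

    munmap(map, 4);
out:
    close(fd);
    return result;
}
```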
It's really just up to userspace to track COW faults and figure out
that it really should not try to merge that range anymore, based on
the access pattern monitored through write-protection faults.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com