For uffd-wp in its current form, it would certainly be the way to go I
think. AFAIKT, the idea of special swap entries isn't new, just that it's
limited to anonymous memory for now, which makes things like fork and new
mappings a lot cheaper.
Thanks for reviewing this series separately; yes I definitely wanted to get
comments on both sides: one on the pte marker idea, the other is whether it's
applicable to this swap+shmem use case.
Exactly.
Here I really want to make the pte marker be flexible - it can be strict when
necessary (it will be 100% strict with uffd-wp), then it can also be a hint
just like what we have with available ptes on soft-dirty, idle, accessed bits.
Here the swap bit I wanted to make it that kind, so we add zero overhead to
fork() and we still solve problems.
Same thing to "newly mapped shmem". Do we have a use case for that? If that's
a hint bit, can we ignore it?
I am really not a fan of taking an already broken feature (broken
presence information for shmem in pagemap) and instead of fixing it
properly, turning it less broken, crossing fingers that nobody will
notice the remaining (undocumented) cases.
[...]
As already expressed, we should try storing as little information in page
tables as possible if we're dealing with shared memory. The features we
design around this all seem to over-complicate the actual users,
over-complicate fork, over-complicate handling on new mappings.
I'll skip the last two "over-complicated" items, because as we've discussed I
don't think we need to take care of them so far. We can revisit when they
become some kind of requirement.
To me having PM_SWAP 99% right on shmem is still a progress comparing to
completely missing it, even if it's not 100% right. It's used for performance
reasons on PAGEOUT and doing finer-grained memory control from userspace, it's
not a strict requirement.
So I still cannot strictly follow why storing information in pte is so bad for
file-backed, which I can see you really don't like it. Could you share some
detailed example?
I am not a fan of your approach of "hints", because while it might work
for some use cases, it might not work for others (see below for CRIU); I
would rather like to avoid such "inconsistent caches" where it's not
really required. But again, this is just my opinion and there might be
plenty other people that will most probably disagree.
Storing hints in page tables isn't actually that bad, because we can
drop hints whenever we like (well, there are side effects, and once we
drop hints too often people might complain again) -- for example, when
reclaiming "empty" but actually "filled with hints" page tables. When we
rely on consistent values, fork() and mmap() are a problem. Well, and
page tables will have to stick around. At least for uffd-wp, mmap() is
not an issue, and we don't expect too many uffd-wp users such that page
table consumption would matter -- my guess.
So I repeat, for uffd-wp in its current form, it sounds like the right
thing to do.
But I guess I'm biased at this point because the main users of these
features actually want to query/set such properties for all sharers, not
individual processes; so the opinion of others would be appreciated.
Known Issues/Concerns
=====================
About THP
---------
Currently we don't need to worry about THP because paged out shmem pages will
be split when shrinking, IOW we only need to consider PTE, and the markers will
only be applied to a shmem pte not pmd or bigger.
About PM_SWAP Accuracy
----------------------
This is not an "accurate" solution to provide PM_SWAP bit. Two exmaples:
- When process A & B both map shmem page P somewhere, it can happen that only
one of these ptes got marked with the pte marker. Imagine below sequence:
0. Process A & B both map shmem page P somewhere
1. Process A zap pte of page P for some reason (e.g. thp split)
2. System decides to recycle page P
3. System replace process B's pte (pointed to P) by PTE marker
4. System _didn't_ replace process A's pte because it was none pte, and
it'll continue to be none pte
5. Only process B's relevant pte has the PTE marker after P swapped out
- When fork, we don't copy shmem vma ptes, including the pte markers. So
even if page P was swapped out, only the parent process has the pte marker
installed, in child it'll be none pte if fork() happened after pageout.
Conclusion: just like it used to be, the PM_SWAP is best-effort. But it should
work in 99.99% cases and it should already start to solve problems.
At least I don't like these semantics at all. PM_SWAP is a cached value
which might be under-represented and consequently wrong.
Please have a look at current pagemap impl in pte_to_pagemap_entry(). It's not
accurate from the 1st day, imho. E.g., when a page is being migrated from numa
node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case. We can
make it more accurate, but I think it's fine, because it's a hint.
That inconsistency doesn't really matter as you can determine if
something is present and worth dumping if it's either swapped or
present. As long as it's one of both but not simply nothing.
I will shamelessly reference
tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that
checks exactly for that (the test case uses only private anonymous memory).
Take CRIU as an example, it has to be correct even if a process would remap a
memory region, fork() and unmap in the parent as far as I understand, ...
Are you talking about dirty bit or swap bit? I'm a bit confused on why swap
bit needs to be accurate. Maybe you mean the dirty bit?
https://criu.org/Shared_memory
"Dumping present pages"
"... CRIU does not dump all of the data. Instead, it determines which
pages contain it, and only dumps those pages. This is done similarly to
how regular memory dumping and restoring works, i.e. by looking for
PRESENT or SWAPPED bits in owners' pagemap entries."
-> Neither PRESENT nor SWAPPED results in memory not getting dumped,
which makes perfect sense.
1) Process A sets up shared memory and writes data to it.
2) System swaps out memory, hints are setup.
3) Process A forks Process B, hints are not copied.
4) Process A unmaps shared memory, hints are dropped.
5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in
pagemap.
6) Process B uses memory in shared memory region. Pages were not migrated.
Just one example; feel free to correct me.
There is notion of the mincore() systemcall:
"There is one particular feature of shared memory dumps worth
mentioning. Sometimes, a shared memory page can exist in the kernel, but
it is not mapped to any process. CRIU detects such pages by calling
mincore() on the shmem segment, which reports back the page in-memory
status. The mincore bitmap is when ANDed with the per-process ones. "
Not sure if they actually mean ORed, because otherwise they'd be losing
pages that have been swapped out. "mincore() returns a vector that
indicates whether pages of the calling process's virtual memory are
resident in core (RAM)"
--
Thanks,
David / dhildenb