> On Dec 18, 2021, at 2:53 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <namit@xxxxxxxxxx> wrote: >> >> Yes, I guess that you pin the pages early for RDMA registration, which >> is also something you may do for IO-uring buffers. This would render >> userfaultfd unusable. > > I think this is all on usefaultfd. > > That code literally stole two of the bits from the page table layout - > bits that we could have used for better things. > > And guess what? Because it required those two bits in the page tables, > and because that's non-portable, it turns out that UFFD_WP can only be > enabled and only works on x86-64 in the first place. > > So UFFS_WP is fundamentally non-portable. Don't use it. I have always felt that the PTE software-bits limit is very artificial. We can just allocate two adjacent pages when needed, one for PTEs and one for extra software bits. A software bit in the PTE can indicate “extra software bits” are relevant (to save cache-misses), and a bit in the PTEs' page-struct indicate whether there is adjacent “extra software bits” page. I’ve done it something similar once in a research project. It is rather similar to what is done for PTI in the PGD level. > > Anyway, the good news is that I think that exactly because uffd_wp > stole two bits from the page table layout, it already has all the > knowledge it needs to handle this entirely on its own. It's just too > lazy to do so now. > > In particular, it has that special UFFD_WP bit that basically says > "this page is actually writable, but I've made it read-only just to > get the fault for soft-dirty". > > And the hint here is that if the page truly *was* writable, then COW > just shouldn't happen, and all that the page fault code should do is > set soft-dirty and return with the page set writable. > > And if the page was *not* writable, then UFFD_WP wasn't actually > needed in the first place, but the uffd code just sets it blindly. I don’t think that I am following. The write-protection of UFFD means that the userspace wants to intervene before anything else (including COW). UFFD_WP indications are recorded per PTE (i.e., not VMA). So if userspace wants to intervene on write, it must use UFFD_WP even if the page is write-protected. The kernel then has to keep the UFFD_WP indication to call userspace upon a write. > > Notice? It _should_ be just an operation based purely on the page > table contents, never even looking at the page AT ALL. Not even the > page count, much less some mapcount thing. > > Done right, that soft-dirty thing could work even with no page backing > at all, I think. > > But as far as I know, we've actually never seen a workload that does > all this, so.. Does anybody even have a test-case? > > Because I do think that UFFD_WP really should never really look at the > page, and this issue is actually independent of the "page_count() vs > page_mapcount()" discussion. I can think of two examples for reasonable flows of UFFD: [ M = Monitor thread; F = Faulting thread ] (A) Userspace page-fault tracking (e.g., for memory migration): 1. M: WP memory. 2. F: WP page-fault: provide UFFD notification. 3. M: Unprotect the page. 4. M: Wake the faulting thread (usually as part of the unprotect) 5. F: Retry the page-fault (and handle COW). (B) Userspace memory snapshots: 1. M: Write-protect memory. 2. M: Copy the memory to disk. 3. M: Write-unprotect memory (e.g., speculatively as you expect a page to be written to and do not want to pay the #PF price). [ notice that the un-protection is logical, not really in the PTEs] 4. F: Get a page-fault (later) and handle it (because it might or might not need COW) There may be “crazier” flows (e.g., wake the faulting thread and emulate the instruction that triggered the write with ptrace), but let’s put those aside. IIUC the solution you propose, it tries to address flows such as (A). I am not sure whether the proposal is to change the write-protection API to only provide notifications (i.e., not block to after page-fault as done today), but I do not see how it addresses (B). I am not saying it is impossible, but I think that the solution would complicate the code by making UFFD a special case.