On 10.01.21 01:44, Andrea Arcangeli wrote: > Hello Andrew and everyone, > > Once we agree that COW page reuse requires full accuracy, the next > step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to > return going in that direction. After stumbling over the heated discussion around this, I wanted to understand the details and the different opinions. I tried to summarize in my simple words (bear with me) what happened and how I think we can proceed from here. Maybe that helps. ==== What happened: 1) We simplified handling of faults on write-protected pages (page table entries): we changed the logic when we can reuse a page ("simply unprotecting it"), and when we have to copy it instead (COW). The essence of the simplification is, that we only reuse a page if we are the only single user of the page, meaning page_count(page) == 1, and the page is mapped into a single process (page_mapcount(page) == 1); otherwise we copy it. Simple. 2) The old code was complicated and there are GUP (e.g., RDMA, VFIO) cases that were broken in various ways in the old code already: most prominently fork(). As one example, it would have been possible for mprotect(READ) memory to still get modified by GUP users like RDMA. Write protection (AFAIU via any mechanism) after GUP pinned a page was not effective; the page was not copied. 3) Speculative pagecache reference can temporarily bump up the page_count(page), resulting in false positives. We could see page_count(page) > 1, although we're the single instance that actually uses a page. In the simplified code, we might copy a page although not necessary (I cannot tell how often that actually happens). 4) clear_refs(4) ("measure approximately how much memory a process is using"), uffd-wp (let's call it "lightweight write-protection, handling the actual fault in user space"), and mprotect(READ) all write-protect page table entries to generate faults on next write access. With the simplified code, we will COW whenever we find the page_count(page) > 1. The simplification seemed to regress clear_refs and uffdio-wp code (AFAIU in case of uffd-wp, it results in memory corruption). But looks like we can mostly fix it by adding more extensive locking. 5) Mechanisms like GUP (AFAIU including Direct I/O) also takes references on pages, increasing page_count(). With the simplification, we might now end up copying a page, although there is "somewhat" only a single user/"process" involved. One example is RDMA: if we read memory using RDMA and mprotect(READ) such memory, we might end up copying the underlying page on the next write: suddenly, RDMA is disconnected and will no longer read what is getting written. Not to mention, we consume more memory. AFAIU, other examples include direct I/O (e.g., write() with O_DIRECT). AFAIU, a more extreme case is probably VFIO: A VM with VFIO (e.g., passthrough of a PCI device) can essentially be corrupted by "echo 4 > /proc/[pid]/clear_refs". 6) While some people think it is okay to break GUP further, as it is already broken in various other ways, other people think this is changing something that used to work (AFAIU a user-visible change) with little benefit. 7) There is no easy way to detect if a page really was pinned: we might have false positives. Further, there is no way to distinguish if it was pinned with FOLL_WRITE or not (R vs R/W). To perform reliable tracking we most probably would need more counters, which we cannot fit into struct page. (AFAIU, for huge pages it's easier). However, AFAIU, even being able to detect if (and how) a page was pinned would not completely help to solve the puzzle. 8) We have a vmsplice security issue that has to be fixed by touching the code in question. A forked child process can read memory content of its parent, which was modified by the parent after fork. AFAIU, the fix will further lock us in into the direction of the code we are heading. 9) The simplification is part of v5.10, which is a LTS release. AFAIU, that one needs fixing, too. I see the following possible directions we can head A) Keep the simplification. Try fixing the fallout. Keep the GUP cases broken or make mprotect() fail when detecting such a scenario; AFAIU, both are user-visible changes. B) Keep the simplification. Try fixing the fallout. Fix GUP cases that used to work; AFAIU fixing this is the hard/impossible part, and is undesired by some people.. C) Revert the simplification for now. Go back to the drawing board and use what we learned to come up with a simplification that (all? ) people are happy with. D) Revert the simplification: turns out the code could not get simplified to this extend. We learned a lot, though. ====== Please let me know in case I messed up anything and/or missed important points. -- Thanks, David / dhildenb