On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > COW is about "I'm about to write to this page, and that means I need > an _exclusive_ page so that I don't write to a page that somebody else > is using". So this kind of fundamentally explains why I hate the games we used to play wrt page_mapcount(): they were fundamentally fragile. I _much_ prefer just having the rule that we use page_count(), which the above simple and straightforward single sentence explains 100%. This gets back to the fact that especially considering how we've had subtle bugs here (the "wrong-way COW" issue has existed since literally the first GUP ever, so it goes back decades), I want the core VM rules to be things that can be explained basically from simple "first principles". And the reason I argue for the current direction that I'm pushing, is exactly that the above is a very simple "first principle" for why COW exists. If the rule for COW is simply "I will always COW if it's not clear that I'm the exclusive user", then COW itself is very simple to think about. The other rule I want to stress is that COW is common, and that argues against the model we used to have of "let's lock the page to make sure that everything else is stable". That model was garbage anyway, since page locking doesn't even guarantee any stability wrt exclusive use in the first place (ie GUP being another example), but it's why I truly detested the old model that depended so much on the page lock to serialize things. So if you start off with the rule that "I will always COW unless I can trivially see I'm the only owner", then I think we have really made for a really clear and unambiguous rule. And remember: COW is only an issue for private mappings. So pretty much BY DEFINITION, doing a COW is always safe for all normal circumstances. Now, this is where it does get subtle: that "all normal circumstances" part. The one special case is a cache-coherent GUP. It's arguable whether "pinned" should matter or not, and it would obviously be better if "pinned" simply didn't matter at all (and the only issue with any long-term pinning would simply be about resource counting). The current approach I'm advocating is "coherency means that it must have been writable", and then the way to solve the whole "Oh, it's shared with something else" is to simply never accept making it read-only, because BY DEFINITION it's not _really_ read-only (because we know we've created that other alias of the virtual address that is *not* controlled by the page table protection bits). Notice how this is all both conceptually fairly simple (ie I can explain the rules in plain English without really making any complex argument) and it is arguably internally fairly self-consistent (ie the whole notion of "oh, there's another thing that has write access that page but doesn't go through the page table, so trying to make it read-only in the page tables is a nonsensical operation"). Are the end results wrt something like soft-dirty a bit odd? Not really. If you do soft-dirty, such a GUP-shared page would simply always show up as dirty. That's still consistent with the rules. If somebody else may be writing to it because of GUP, that page really *isn't* clean, and us marking it read-only would be just lying about things. I'm admittedly not very happy about mprotect() itself, though. It's actually ok to do the mprotect(PROT_READ) and turn the page read-only: that will also disable COW itself (because a page fault will now be a SIGSEGV, not a COW). But if you then make it writable again with mprotect(PROT_WRITE), you *have* lost the WP bit, and you'll COW on a write, and lose the coherency. Now, I'm willing to just say: "if you do page pinning, and then do mprotect(PROT_READ), and then do mprotect(PROT_WRITE) and then write to the page, you really do get to keep both broken pieces". IOW, I'm perfectly happy to just say you get what you deserve. But I'd also be perfectly happy to make the whole "I'm the exclusive user" logic a bit more extensive. Right now it's basically _purely_ page_count(), and the other part of "I'm the exclusive owner" is that the RW bit in the page table is simply not clear. That makes things really easy for COW: it just won't happen in the first place if you broke the "obviously exclusive" rule with GUP. But we _could_ do something slightly smarter. But "page_mapcount()" is not that "slightly smarter" thing, because we already know it's broken wrt lots of other uses (GUP, page cache, whatever). Just having a bit in the page flags for "I already made this exclusive, and fork() will keep it so" is I feel the best option. In a way, "page is writable" right now _is_ that bit. By definition, if you have a writable page in an anonymous mapping, that is an exclusive user. But because "writable" has these interactions with other operations, it would be better if it was a harder bit than that "maybe_pinned()", though. It would be lovely if a regular non-pinning write-GUP just always set it, for example. "maybe_pinned()" is good enough for the fork() case, which is the one that matters for long-term pinning. But it's admittedly not perfect. Linus