On Mon, Sep 28, 2020 at 10:23 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > Yes... Actually I am also thinking about the complete solution to cover > read-only fast-gups too, but now I start to doubt this, at least for the fork() > path. E.g. if we'd finally like to use pte_protnone() to replace the current > pte_wrprotect(), we'll be able to also block the read gups, but we'll suffer > the same degradation on normal fork()s, or even more. Seems unacceptable. So I think the real question about pinned read gups is what semantics they should have. Because honestly, I think we have two options: - the current "it gets a shared copy from the page tables" - the "this is an exclusive pin, and it _will_ follow the source VM changes, and never break" because honestly, if we get a shared copy at the time of the pinning (like we do now), then "fork()" is entirely immaterial. The fork() can have happened ages ago, that page is shared with other processes, and anybody process writing to it - including very much the pinning one - will cause a copy-on-write and get a copy of the page. IOW, the current - and past - semantics for read pinning is that you get a copy of the page, but any changes made by the pinning process may OR MAY NOT show up in your pinned copy. Again: doing a concurrent fork() is entirely immaterial, because the page can have been made a read-only COW page by _previous_ fork() calls (or KSM logic or whatever). In other words: read pinning gets a page efficiently, but there is zero guarantee of any future coherence with the process doing subsequent writes. That has always been the semantics, and FOLL_PIN didn't change that at all. You may have had things that worked almost by accident (ie you had made the page private by writing to it after the fork, so the read pinning _effectively_ gave you a page that was coherent), but even that was always accidental rather than anything else. Afaik it could easily be broken by KSM, for example. In other words, a read pin isn't really any different from a read GUP. You get a reference to a page that is valid at the time of the page lookup, and absolutely nothing more. Now, the alternative is to make a read pin have the same guarantees as a write pin, and say "this will stay attached to this MM until unmap or unpin". But honestly, that is largely going to _be_ the same as a write pin, because it absolutely needs to do a page COW at the time of the pinning to get that initial exclusive guarantee in the first place. Without that initial exclusivity, you cannot avoid future COW events breaking the wrong way. So I think the "you get a reference to the page at the time of the pin, and the page _may_ or may not change under you if the original process writes to it" are really the only relevant semantics. Because if you need those exclusive semantics, you might as well just use a write pin. The downside of a write pin is that it not only makes that page exclusive, it also (a) marks it dirty and (b) requires write access. That can matter particularly for shared mappings. So if you know you're doing the pin on a shared mmap, then a read pin is the right thing, because the page will stay around - not because of the VM it happens in, but because of the underlying file mapping! See the difference? > The other question is, whether we should emphasize and document somewhere that > MADV_DONTFORK is still (and should always be) the preferred way, because > changes like this series can potentially encourage the other way. I really suspect that the concurrent fork() case is fundamentally hard to handle. Is it impossible? No. Even without any real locking, we could change the code to do a seqcount_t, for example. The fastgup code wouldn't take a lock, but it would just fail and fall back to the slow code if the sequence count fails. So the copy_page_range() code would do a write count around the copy: write_seqcount_begin(&mm->seq); .. do the copy .. write_seqcount_end(&mm->seq); and the fast-gup code would do a seq = raw_read_seqcount(&mm->seq); if (seq & 1) return -EAGAIN; at the top, and do a if (__read_seqcount_t_retry(&mm->seq, seq) { .. Uhhuh, that failed, drop the ref to the page again .. return -EAGAIN; } after getting the pin reference. We could make this conditional on FOLL_PIN, or maybe even a new flag ("FOLL_FORK_CONSISTENT"). So I think we can serialize with fork() without serializing each and every PTE. If we want to and really need to. Hmm? Linus