On Thu, Sep 24, 2020 at 2:30 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > > With the extra mprotect(!WRITE), I think we'll see a !pte_write() entry. Then > > > it'll not go into maybe_dma_pinned() at all since cow==false. > > > > Hum that seems like a problem in this patch, we still need to do the > > DMA pinned logic even if the pte is already write protected. > > Yes I agree. I'll take care of that in the next version too. You people seem to be worrying too much about crazy use cases. The fact is, if people do pinning, they had better be careful afterwards. I agree that marking things MADV_DONTFORK may not be great, and there may be apps that do it. But honestly, if people then do mprotect() to make a VM non-writable after pinning a page for writing (and before the IO has completed), such an app only has itself to blame. So I don't think this issue is even worth worrying about. At some point, when apps do broken things, the kernel says "you broke it, you get to keep both pieces". Not "Oh, you're doing unreasonable things, let me help you". This has dragged out a lot longer than I hoped it would, and I think it's been over-complicated. In fact, looking at this all, I'm starting to think that we don't actually even need the mm_struct.has_pinned logic, because we can work with something much simpler: the page mapping count. A pinned page will have the page count increased by GUP_PIN_COUNTING_BIAS, and my worry was that this would be ambiguous with the traditional "fork a lot" UNIX style behavior. And that traditional case is obviously one of the cases we very much don't want to slow down. But a pinned page has _another_ thing that is special about it: the pinning action broke COW. So I think we can simply add a if (page_mapcount(page) != 1) return false; to page_maybe_dma_pinned(), and that very naturally protects against the "is the page count perhaps elevated due to a lot of forking?" Because pinning forces the mapcount to 1, and while it is pinned, nothing else should possibly increase it - since the only thing that would increase it is fork, and the whole point is that we won't be doing that "page_dup_rmap()" for this page (which is what increases the mapcount). So we actually already have a very nice flag for "this page isn't duplicated by forking". And if we keep the existing early "ptep_set_wrprotect()", we also know that we cannot be racing with another thread that is pinning at the same time, because the fast-gup code won't be touching a read-only pte. So we'll just have to mark it writable again before we release the page table lock, and we avoid that race too. And honestly, since this is all getting fairly late in the rc, and it took longer than I thought, I think we should do the GFP_ATOMIC approach for now - not great, but since it only triggers for this case that really should never happen anyway, I think it's probably the best thing for 5.9, and we can improve on things later. Linus